MC0077


Assignment Set 1 (60 marks)

1) Describe the following:
o Dimensional Model
o Object Database Models
o Post-Relational Database Models

Ans: Dimensional Model

The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in a way that data can be easily summarized using OLAP queries. In the dimensional model, a database consists of a single large table of facts that are described using dimensions and measures. A dimension provides the context of a fact (such as who participated, when and where it happened, and its type) and is used in queries to group related facts together. Dimensions tend to be discrete and are often hierarchical; for example, the location might include the building, state, and country. A measure is a quantity describing the fact, such as revenue. It's important that measures can be meaningfully aggregated – for example, the revenue from different locations can be added together.

In an OLAP query, dimensions are chosen and the facts are grouped and added together to create a summary.

The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions. Particularly complicated dimensions might be represented using multiple tables, resulting in a snowflake schema.

A data warehouse can contain multiple star schemas that share dimension tables, allowing them to be used together. Coming up with a standard set of dimensions is an important part of dimensional modeling.
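A minimal sketch of such a star schema and an OLAP-style summary query in SQL (the table and column names are illustrative, not from the text):

-- Dimension tables (illustrative)
CREATE TABLE Dim_Date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month         VARCHAR(20),
    year          INTEGER
);

CREATE TABLE Dim_Location (
    location_key  INTEGER PRIMARY KEY,
    building      VARCHAR(50),
    state         VARCHAR(50),
    country       VARCHAR(50)
);

-- Fact table: one row per sale, with a measure (revenue) and foreign keys to the dimensions
CREATE TABLE Fact_Sales (
    date_key      INTEGER REFERENCES Dim_Date(date_key),
    location_key  INTEGER REFERENCES Dim_Location(location_key),
    revenue       DECIMAL(12,2)
);

-- OLAP-style query: choose dimensions, group the facts and aggregate the measure
SELECT d.year, l.country, SUM(f.revenue) AS total_revenue
FROM   Fact_Sales f
JOIN   Dim_Date d     ON f.date_key = d.date_key
JOIN   Dim_Location l ON f.location_key = l.location_key
GROUP BY d.year, l.country;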

Object Database Models

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.

A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.

Object databases suffered because of a lack of standardization: although standards were defined by ODMG, they were never implemented well enough to ensure interoperability between products. Nevertheless, object databases have been used successfully in many applications: usually specialized applications such as engineering databases or molecular biology databases rather than mainstream commercial data processing. However, object database ideas were picked up by the relational vendors and influenced extensions made to these products and indeed to the SQL language.

Post-Relational Database Models

Several products have been identified as post-relational because their data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model; such a model might be characterized as a directed graph with trees on the nodes.

Post-relational databases could be considered a sub-set of object databases as there is no need for object-relational mapping when using a post-relational data model. In spite of many attacks on this class of data models, with designations of being hierarchical or legacy, the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays below the relational database radar.

Examples of models that could be classified as post-relational are PICK (aka MultiValue) and MUMPS (aka M).

2) Describe the following with respect to Database Management Systems:
o Information & Data Retrieval
o Image Retrieval Systems
o Multiple Media Information Retrieval Systems, MIRS

Ans: Information & Data Retrieval

The terms information and data are often used interchangeably in the data management literature, causing some confusion in interpretation of the goals of different data management system types. It is important to remember that despite the name of a data management system type, it can only manage data. These data are representations of information. However, historically (since the late 1950's) a distinction has been made between:

· Data Retrieval, as retrieval of 'facts', commonly represented as atomic data about some entity of interest, for example a person's name, and

· Information Retrieval, as the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest.

Both retrieval types match the query specifications to database values. However, while data retrieval only retrieves items that match the query specification exactly, information retrieval systems return items that are deemed (by the retrieval system) to be relevant or similar to the query terms. In the latter case, the information requester must select the items that are actually relevant to his/her request. Quick examples include the request for the balance of a bank account vs. selecting relevant links from a google.com result list.

User requests for data are typically formed as "retrieval-by-content", i.e. the user asks for data related to some desired property or information characteristic. These requests or queries must be specified using one of the query languages supported by the DMS query processing subsystem. A query language is tailored to the data type(s) of the data collection. Figure 3.3 models a multiple media database and illustrates 2 query types:

Query language examples

1. A Data Retrieval Query expressed in SQL, based on attribute-value matches. In this case, a request for titles of documents containing the term "database" and authored by "Joan Nordbotten", and

2. A Document Retrieval Query. In this case, the documents requested should contain the search terms (keywords): database, management, sql3 or msql.

The query in Figure 3.3a is stated in standard SQL2, while the query in Figure 3.3b is an example of a content query and is typical of those used with information retrieval systems.

Note that 2 different query languages are needed to retrieve data from the structured and unstructured (document) data in the DB.

· The SQL query is based on specification of attribute values and requires that the user knows the attribute names used to describe the database that are stored in the DB schema, while

· The Document query assumes that the system knows the location of Document.Body and is able to perform a keyword search and a similarity evaluation.
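Since Figure 3.3 is not reproduced here, the following is only a sketch of what the two query types might look like, assuming a hypothetical Document table with Title, Author and Body columns:

-- (a) Data retrieval: exact attribute-value matching in SQL
SELECT Title
FROM   Document
WHERE  Author = 'Joan Nordbotten'
AND    Body LIKE '%database%';

-- (b) Document (content) retrieval: a keyword query as accepted by a typical
--     information retrieval system (illustrative pseudo-syntax, not standard SQL):
--     FIND Document.Body ABOUT 'database management sql3 msql' RANK BY SIMILARITY

The first query returns only exact matches; the second returns documents ranked by how similar their content is to the keywords.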

Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. Development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made their digital gallery, with over 480,000 scanned images, available to the Internet public.

Maintaining a large image collection leads necessarily to a need for an effective system for image indexing and retrieval. Image data collections have a structure similar to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from those used for text documents. Therefore, current image retrieval systems use 2 quite different approaches for image retrieval (not necessarily within the same system).

Digital image document structure

1. Retrieval based on metadata, generated manually, that describe the content, meaning/interpretation and/or context for each image, and/or

2. Retrieval based on automatically selected, low-level features, such as color and texture distribution and identifiable shapes. This approach is frequently called CBIR or content-based image retrieval.

Most of the metadata attributes used for digitized images, such as those listed in Figure 3.6, can be stored as either regular structured attributes or text items. Once collected, metadata can be used to retrieve images using either exact match on attribute values or text-search on text descriptive fields. Most image retrieval systems utilize this approach. For example, a Google search for images about Humpback whales listed over 15,000 links to images based on the text - captions, titles, file names - accompanying the images (July 26th 2006).
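A sketch of such metadata-based retrieval, assuming a hypothetical Image table whose descriptive metadata has been stored as regular columns (table and column names are illustrative):

-- Exact match on a structured attribute combined with text search on descriptive fields
SELECT image_id, title
FROM   Image
WHERE  collection = 'Digital Gallery'
AND    (caption LIKE '%humpback whale%' OR keywords LIKE '%humpback whale%');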

As noted earlier, images are strings of pixels with no other explicit relationship to the following pixel(s) than their serial position. Unlike text documents, there is no image vocabulary that can be used to index the semantic content. Instead, image pixel analysis routines extract dominant low-level features, such as the distribution of the colors and texture(s) used, and location(s) of identifiable shapes. This data is used to generate a signature for each image that can be indexed and used to match a similar signature generated for a visual query, i.e. a query based on an image example. Unfortunately, using low-level features does not necessarily give a good 'semantic' result for image retrieval.

Multiple Media Information Retrieval Systems, MIRS

Today, many organizations maintain separate digital collections of text, images, audio, and video data in addition to their basic administrative database systems. Increasingly, these organizations need to integrate their data collections, or at least give seamless access across these collections in order to answer such questions as "What information do we have about this <service/topic>?", for example about a particular kind of medical operation or all of the information from an archeological site.

This gives rise to a requirement for multiple media information retrieval systems, i.e. systems capable of integrating all kinds of media data: tabular/administrative, text, image, spatial, temporal, audio, and/or video data. A Multimedia Information Retrieval System, MIRS can be defined as:

A system for the management (storage, retrieval and manipulation) of multiple types of media data.

In practice, an MIRS is a composite system that can be modeled as shown in Figure 3.7. As indicated in the figure, the principal data retrieval sub-systems, located in the connector 'dots' on the connection lines of Figure 3.7, can be adapted from known technology used in current specialized media retrieval systems. The actual placement of these components within a specific MIRS may vary, depending on the anticipated co-location of the media data.

MIRS architecture

The major vendors of Object-Relational, O-R systems, such as IBM's DB2, Informix, and Oracle, have included data management subsystems for such media types as text documents, images, spatial data, audio and video. These 'new' (to SQL) data types have been defined using the user defined type functionality available in SQL3. Management functions/methods, based on those developed for the media types in specialized systems, have been implemented as user defined functions. The result is an extension to the standard for SQL3 with system dependent implementations for the management of multimedia data.

The intent of this book is to explore how OR-DBMS technology can be utilized to create generalized MIRS that can support databases containing any combination of media data.

3) Describe the following:
o New Features in SQL3
o Query Optimization

Ans: New Features in SQL3

SQL3 was accepted as the new standard for SQL in 1999, after more than 7 years of debate. Basically, SQL3 includes data definition and management techniques from Object-Oriented DBMS, while maintaining the relational DBMS platform. Based on this merger of concepts and techniques, DBMSs that support SQL3 are called Object-Relational or ORDBMS.

The most central data modeling notions included in SQL3 are illustrated in Figure 5.2 and support specification of:

· Classification hierarchies,
· Embedded structures that support composite attributes,
· Collection data-types (sets, lists/arrays, and multi-sets) that can be used for multi-valued attribute types,
· Large OBject types, LOBs, within the DB, as opposed to requiring external storage, and
· User defined data-types and functions (UDT/UDF) that can be used to define complex structures and derived attribute value calculations, among many other function extensions.

Query formulation in SQL3 remains based in the structured, relational model, though several functional additions have been made to support access to the new structures and data types.
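Since Figures 5.2 and 5.3a are not reproduced here, the following is only a sketch, in SQL:1999 (SQL3) style syntax, of how such declarations might look; the type names, the address array and the birth_date attribute are assumptions made for illustration:

-- Composite attribute as a user defined structured type
CREATE TYPE address_t AS (
    street   VARCHAR(40),
    city     VARCHAR(30),
    country  VARCHAR(30)
) NOT FINAL;

-- Root entity-type with a LOB attribute and a multi-valued (array) attribute
CREATE TYPE person_t AS (
    name       VARCHAR(50),
    picture    BLOB,
    birth_date DATE,
    address    address_t ARRAY[3]
) NOT FINAL;

-- Sub-type in a classification hierarchy
CREATE TYPE student_t UNDER person_t AS (
    level INTEGER
) NOT FINAL;

-- Typed tables: Student inherits name, picture, birth_date and address from Person
CREATE TABLE Person  OF person_t;
CREATE TABLE Student OF student_t UNDER Person;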

Accessing Hierarchical Structures

Hierarchic structures can be used at 2 levels, illustrated in Figure 5.2 for:

1. Distinguishing roles between entity-types, and
2. Detailing attribute components.

DMS support for complex data-types

A cascaded dot notation has been added to the SQL3 syntax to support specification of access paths within these structures. For example, the following statement selects the names and pictures of students from Bergen, Norway, using the OR DB specification given by the SQL3 declarations in Figure 5.3a.

Figure 5.3: Entity and relationship specification in SQL3

SELECT name, picture FROM Student

WHERE address.city = 'Bergen'

AND address.country = 'Norway';

The SQL3 query processor recognizes that Student is a sub-type of Person and that the attributes name, picture and address are inherited from Person, making it unnecessary for the user to:

· specify the Person table in the FROM clause,
· use the dot notation to specify the parent entity-type Person in the SELECT or WHERE clauses, or
· specify an explicit join between the levels in the entity-type hierarchy, here Student to Person.

Accessing Multi-Valued Structures

SQL3 supports multi-valued (MV) attributes using a number of different implementation techniques. Basically, MV attribute structures can be defined as ordered or unordered sets and implemented as lists, arrays or tables either embedded in the parent table or 'normalized' to a linked table.

In our example in Figure 5.1a, Person.address is a multi-valued complex attribute, defined as a set of addresses. In execution of the previous query the query processor must search each City and Country combination for the result. If the query intent is to locate students with a home address in Bergen, Norway and we assume that the address set has been implemented as an ordered array in which the 1st address is the home address, the query should be specified as:

SELECT name, picture FROM Student

WHERE address[1].city = 'Bergen'

AND address[1].country = 'Norway';

Utilizing User Defined Data Types (UDT)

User defined functions can be used in either the SELECT or WHERE clauses, as shown in the following example, again based on the DB specification given in Figure 5.3a.

SELECT Avg(age) FROM Student
WHERE Level > 4
AND age > 22;

In this query age is calculated by the function defined for Person.age. The SQL3 processor must calculate the relevant student.age for each graduate student (assuming that Level represents the number of years of higher education) and then calculate the average age of this group.
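The age function itself is not shown in the text; a minimal sketch of how such a derived-attribute UDF might be declared, assuming the birth_date attribute introduced in the declaration sketch above (simplified: it ignores whether the birthday has passed this year):

CREATE FUNCTION age (birth_date DATE)
    RETURNS INTEGER
    LANGUAGE SQL
    RETURN EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date);

The query above would then invoke it as Avg(age(birth_date)), or through the dot/method notation if age is instead defined as a method of person_t.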

Accessing Large Objects

SQL3 has added data-types and storage support for unstructured binary and character large objects, BLOB and CLOB respectively, that can be used to store multimedia documents. However, no new query functionality has been added to access the content of these LOB data, though most SQL3 implementations have extended the LIKE operator so that it can also search through CLOB data. Thus, access to BLOB/CLOB data must be based on search conditions in the metadata of formatted columns or on use of the LIKE operator. Some ORDBMS implementations have extended other character string operators to operate on CLOB data, such as

· LOCATE, which returns the position of the first character or bit string within a LOB that matches the search string and

· Concatenation, substring, and length calculation.

Note that LIKE, concatenation, substring and length are original SQL operators that have been extended to function with LOBs, while LOCATE is a new SQL3 operator. An example of using the LIKE operator, based on the MDB defined in Figure 5.3a, is

SELECT Description FROM Course
WHERE Description LIKE '%data management%'
OR Description LIKE '%information management%';

Note that the LIKE operator does not make use of any index, rather it searches serially through the CLOB for the pattern given in the query specification.

Result Presentation

While there are no new presentation operators in SQL3, both complex and derived attributes can be used as presentation criteria in the standard clauses "group by, having, and order by". However, large objects, LOBs, cannot be used, since 2 LOBs are unlikely to be identical and have no logical order. SQL3 expands embedded attributes, displaying them in 1 'column' or as multiple rows.

Depending on ORDBMS implementation, the result set is presented either totally, the first 'n' rows or one tuple at a time. If an attribute of a relation in the result set is defined as a large object, LOB, its presentation may fill one or more screens/pages for each tuple.

SQL3, as a relational language using exact match selection criteria, has no concept of degrees of relevance and thus no support for ranking the tuples in the result set by semantic nearness to the query. Providing this functionality will require user defined output functions, or specialized document processing subsystems as provided by some OR-DBMS vendors.

Query Optimization

The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.

2. Execute join operations to further reduce the result set.

3. Execute operations on media data last, since these can be very time consuming.

4. Prepare the result set for presentation.

Using the example from the query in Figure 5.5, a near optimal execution plan would be to execute the statements in the following order:

1. Clauses 4, 6 and 7 in any order. Each of these statements reduces the number of rows in their respective tables.

2. Clause 3. The join will further reduce the number of course tuples that satisfy the age and time constraints. This will be a reasonably quick operation if:

- There are indexes on TakenBy.Sid and TakenBy.Cid so that an index join can be performed, and

- The Course.Description clob has been stored outside of the Course table and is represented by a link to its location.

3. Clause 5 will now search only course descriptions that meet all other selection criteria. This will still be a time consuming serial search.

4. Finally, clause 8 will order the result set for presentation through the layout specified in clause 1.

4) Describe the following with suitable real-time examples:
o Data Storage Methods
o Data Dredging

Ans: Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules. Codd defines 5 increasingly stringent rules of normalization, and typically OLTP systems achieve third normal form. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables and result in very fast insert/update performance because only a little bit of data is affected in each relational transaction.

OLTP databases are efficient because they are typically only dealing with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes. All of these factors, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed packages and legacy systems.

Designing the data warehouse data architecture is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level because this provides for the most useful and flexible basis for use in reporting and information analysis. However, because of different focus on specific requirements, there can be alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very useful in data mart design, it can result in a rat's nest of long term data integration and abstraction complications when used in a data warehouse.

In the "dimensional" approach, transaction data is partitioned into either a measured "facts", which are generally numeric data that captures specific values or "dimensions" which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of

Page 14: MC0077

products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database – the primary disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.

Data Dredging

Data Dredging or Data Fishing are terms one may use to criticize someone's data mining efforts when it is felt the patterns or causal relationships discovered are unfounded. In this case the pattern suffers from overfitting on the training data.

Data Dredging is the scanning of the data for any relationships and then, when one is found, coming up with an interesting explanation. The conclusions may be suspect because data sets with large numbers of variables have, by chance, some "interesting" relationships. Fred Schwed said:

"There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."

Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels. Some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear. Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.

When data sets contain a big set of variables, the level of statistical significance should be adjusted for the number of patterns that were tested. For example, if we test 100 random patterns, it is expected that one of them will be "interesting" with a statistical significance at the 0.01 level.
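The arithmetic behind this example (added for clarity, assuming the 100 tests are independent): at a significance level of 0.01, the expected number of spuriously "significant" patterns is 100 x 0.01 = 1, and the probability of finding at least one is 1 - (1 - 0.01)^100, or roughly 0.63.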

Cross Validation is a common approach to evaluating the fitness of a model generated via data mining, where the data is divided into a training subset and a test subset to respectively build and then test the model. Common cross validation techniques include the holdout method, k-fold cross validation, and the leave-one-out method.

5) Describe the following with respect to Fuzzy querying to relational databases:
o Proposed Model
o Meta knowledge
o Implementation

Ans: The Proposed Model

The easiest way of introducing fuzziness in the database model is to use classical relational databases and formulate a front end to it that shall allow fuzzy querying to the database. A limitation imposed on the system is that because we are not extending the database model nor are we defining a new model in any way, the underlying database model is crisp and hence the fuzziness can only be incorporated in the query.

To incorporate fuzziness we introduce fuzzy sets / linguistic terms on the attribute domains / linguistic variables; e.g. on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD, each defined by a membership function over the domain.

For this we take the example of a student database which has a table STUDENTS with the attributes shown in the accompanying figure.

A snapshot of the data existing in the database

Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS, with the following structure (a SQL sketch follows the column descriptions below):

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

· Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set.

· Column_Name: Stores the linguistic variable associated with the given linguistic term.

· Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set.
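A minimal SQL sketch of this meta-knowledge table, with one illustrative row for the fuzzy set YOUNG on AGE (the column sizes and the membership range values are assumptions):

CREATE TABLE LABELS (
    Label       VARCHAR2(20) PRIMARY KEY,  -- linguistic term, e.g. YOUNG
    Column_Name VARCHAR2(30),              -- linguistic variable, e.g. AGE
    Alpha       NUMBER,                    -- the four values delimit the fuzzy set's range
    Beta        NUMBER,
    Gamma       NUMBER,
    Delta       NUMBER
);

-- Illustrative entry: membership in YOUNG rises from 17, is full between 19 and 23,
-- and falls to zero at 26 (values assumed for the example)
INSERT INTO LABELS VALUES ('YOUNG', 'AGE', 17, 19, 23, 26);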

Implementation

The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query will not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts (a sketch of the resulting rewrite follows the list):

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.
2. Result Attributes: The attributes that are to be displayed; used only in the case of the SELECT query.
3. Source Tables: The tables on which the query is to be applied.
4. Conditions: The conditions that have to be specified before the operation is performed. Each condition is further sub-divided into Query Attributes (i.e. the attributes on which the query is to be applied) and the linguistic term. If the condition is not fuzzy, i.e. it does not contain a linguistic term, then it need not be subdivided.
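As a sketch of how the front end might then rewrite a parsed fuzzy condition into a crisp one, assuming the LABELS entry above, a Name attribute in STUDENTS, and the simple interpretation that any tuple with non-zero membership qualifies:

-- User's fuzzy query:
--   SELECT Name FROM STUDENTS WHERE AGE = YOUNG;
-- After looking up ('YOUNG', 'AGE') in LABELS, the front end submits the crisp query:
SELECT Name
FROM   STUDENTS
WHERE  AGE BETWEEN 17 AND 26;   -- Alpha .. Delta range of YOUNG (assumed values)

A stricter interpretation could instead compute a membership degree for each tuple and return only those above a user-supplied threshold.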

6) Describe the Data Replication concepts

Ans: Data Replication

Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Changes applied at one site are captured and stored locally before being forwarded and applied at each of the remote locations. Advanced Replication is a fully integrated feature of the Oracle server; it is not a separate server.

Replication uses distributed database technology to share data between multiple sites, but a replicated database and a distributed database are not the same. In a distributed database, data is available at many locations, but a particular table resides at only one location. For example, the employees table resides at only the loc1.world database in a distributed database system that also includes the loc2.world and loc3.world databases. Replication means that the same data is available at multiple locations. For example, the employees table is available at loc1.world, loc2.world, and loc3.world. Some of the most common reasons for using replication are described as follows:

Availability

Replication improves the availability of applications because it provides alternative data access options. If one site becomes unavailable, users can continue to query, or even update, the data at other replicated sites.

Performance

Replication provides fast, local access to shared data because it balances activity over multiple sites. Some users can access one server while other users access different servers, thereby reducing the load at all servers. Also, users can access data from the replication site that has the lowest access cost, which is typically the site that is geographically closest to them.

Disconnected Computing

A Materialized View is a complete or partial copy (replica) of a target table from a single point in time. Materialized views enable users to work on a subset of a database while disconnected from the central database server. Later, when a connection is established, users can synchronize (refresh) materialized views on demand. When users refresh materialized views, they update the central database with all of their changes, and they receive any changes that may have happened while they were disconnected.
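A hedged, Oracle-style sketch of a read-only materialized view created at a remote site and refreshed on demand after reconnecting; the object and database link names follow the employees/loc1.world example above, and updatable materialized views would need additional replication setup:

-- At the master site (loc1.world): allow fast refresh of replicas
CREATE MATERIALIZED VIEW LOG ON employees;

-- At the disconnected/remote site: create a local replica over a database link
CREATE MATERIALIZED VIEW employees_mv
    REFRESH FAST ON DEMAND
    AS SELECT * FROM employees@loc1.world;

-- Later, once a connection is re-established, synchronize on demand
EXECUTE DBMS_MVIEW.REFRESH('employees_mv');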

Network Load Reduction

Replication can be used to distribute data over multiple regional locations. Then, applications can access various regional servers instead of accessing one central server. This configuration can reduce network load dramatically.

Mass Deployment

Replication also supports mass deployment: a materialized view environment, for example the subset of data needed on each salesperson's laptop, can be rolled out to a large number of users, with each local copy kept synchronized with the central server.

Assignment Set 2 (60 marks)

1) Describe the following with suitable examples:
o Cost Estimation
o Measuring Index Selectivity

Ans: Cost Estimation

One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivity through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimation of selectivity of individual predicates. However many queries have conjunctions of predicates such as select count(*) from R, S where R.make='Honda' and R.model='Accord'. Query predicates are often highly correlated (for example, model='Accord' implies make='Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a DBA should regularly update the database statistics, especially after major data loads/unloads.

The Cardinality of a set is a measure of the "number of elements of the set". There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers.

Measuring Index Selectivity

B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table. As a general guideline, we should create indexes on tables that are often queried for less than 15% of the table's rows. This value may be higher in situations where all data can be retrieved from an index, or where the indexed columns can be used for joining to other tables.

The ratio of the number of distinct values in the indexed column / columns to the number of records in the table represents the selectivity of an index. The ideal selectivity is 1. Such selectivity can be reached only by unique indexes on NOT NULL columns.

Example with good Selectivity

If a table has 100'000 records and one of its indexed columns has 88'000 distinct values, then the selectivity of this index is 88'000 / 100'000 = 0.88.

Oracle implicitly creates indexes on the columns of all unique and primary keys that you define with integrity constraints. These indexes are the most selective and the most effective in optimizing performance. The selectivity of an index is the percentage of rows in a table having the same value for the indexed column. An index's selectivity is good if few rows have the same value.

Example with Bad Selectivity

If an index on a table of 100'000 records had only 500 distinct values, then the index's selectivity is 500 / 100'000 = 0.005 and in this case a query which uses the limitation of such an index will return 100'000 / 500 = 200 records for each distinct value. It is evident that a full table scan is more efficient than using such an index, since much more I/O is needed to repeatedly scan the index and the table.

Measuring Index Selectivity

Manually measuring index selectivity

The ratio of the number of distinct values to the total number of rows is the selectivity of the columns. This method is useful to estimate the selectivity of an index before creating it.

select count (distinct job) "Distinct Values" from emp;

select count(*) "Total Number Rows" from emp;

Selectivity = Distinct Values / Total Number Rows = 5 / 14 = 0.35
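The same figure can also be computed in a single statement (a small sketch against the emp table used above):

SELECT COUNT(DISTINCT job) / COUNT(*) AS selectivity
FROM   emp;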

Automatically measuring index selectivity

We can determine the selectivity of an index by dividing the number of distinct indexed values by the number of rows in the table.

create index idx_emp_job on emp(job);
analyze table emp compute statistics;

select distinct_keys from user_indexes where table_name = 'EMP' and index_name = 'IDX_EMP_JOB';

select num_rows from user_tables where table_name = 'EMP';

Selectivity = DISTINCT_KEYS / NUM_ROWS = 0.35

Selectivity of each individual Column

Assuming that the table has been analyzed it is also possible to query USER_TAB_COLUMNS to investigate the selectivity of each column individually.

select column_name, num_distinct from user_tab_columns where table_name = 'EMP';

2) Describe the following with suitable examples:
o Graphic vs. Declarative Data Models
o Structural Semantic Data Model – SSM

Ans: Graphic vs. Declarative Data Models

A data model is a tool used to specify the structure and (some) semantics of the information to be represented in a database. Depending on the model type used, a data model can be expressed in diverse formats, including:

· Graphic, as used in most semantic data model types, such as ER and extended/enhanced ER (EER) models.

· Lists of declarative statements, as used in
  - The relational model for relation definitions,
  - AI/deductive systems for specification of facts and rules,
  - Metadata standards such as Dublin Core for specification of descriptive attribute-value pairs,
  - Data definition languages (DDL).
· Tabular, as used to present the content of a DB schema.

Even the implemented and populated DB is only a model of the real world as represented by the data. Studies of the utility of different model forms indicate that

· Graphic models are easier for a human reader to interpret and check for completeness and correctness than list models, while

· List formed models are more readily converted to a set of data definition statements for compilation and construction of a DB schema.

These observations support the common practice of using 2 model types, a graphic model type for requirements analysis and a list model type - relational, functional, or OO - for implementation. Translation of an ER-based graphic model to list form, or directly to a set of DDL (data definition language) statements, is so straightforward that most CASE (computer aided software engineering) tools include support for the translation. The problem with automated translations is that the designer may not be sufficiently aware of the semantics lost in the translation.
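As an illustration of how direct this translation is, a hypothetical ER fragment with a DEPARTMENT entity, a STUDENT entity and a 1:N relationship between them might be emitted by a CASE tool roughly as follows (all names are invented for the example):

CREATE TABLE Department (
    Did    INTEGER PRIMARY KEY,
    Dname  VARCHAR(40)
);

CREATE TABLE Student (
    Sid    INTEGER PRIMARY KEY,
    Name   VARCHAR(50),
    Did    INTEGER REFERENCES Department(Did)   -- the 1:N relationship becomes a foreign key
);

Note how graphic detail such as participation constraints or composite attributes is flattened away here, which is exactly the kind of semantic loss the designer needs to watch for.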

Structural Semantic Data Model – SSM

The Structural Semantic Model, SSM, first described in Nordbotten (1993a & b), is an extension and graphic simplification of the EER modeling tool 1st presented in the '89 edition of (Elmasri & Navathe, 2003). SSM was developed as a teaching tool and has been and can continue to be modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects.

SSM Concepts

The current version of SSM belongs to the class of Semantic Data Model types extended with concepts for specification of user defined data types and functions, UDT and UDF. It supports the modeling concepts defined in Table 4.4 and compared in Table 4. Figure 4.2 shows the concepts and graphic syntax of SSM, which include:

Table : Data Modeling Concepts

1. Three types of entity specifications: base (root), subclass, and weak,
2. Four types of inter-entity relationships: n-ary associative, and 3 types of classification hierarchies,
3. Four attribute types: atomic, multi-valued, composite, and derived,
4. Domain type specifications in the graphic model, including standard data types, Binary large objects (blob, text, image, ...), user-defined types (UDT) and functions (UDF),
5. Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types, and
6. Data value constraints.

SSM Entity Relationships - hierarchical and associative

SSM Attribute and Data Types

3) Discuss the following:
o Query Processing in Object-Oriented Database Systems
o Query Processing Architecture

Ans: Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first generation (network and hierarchical) DBMSs as object-oriented. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OO-DBMS. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well e.g. O2 and ObjectStore.

In this Section we discuss the issues related to the optimization and execution of OODBMS query languages (which we collectively call query processing). Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System

Relational query languages operate on a simple type system consisting of a single aggregate type: relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list) which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.

Encapsulation

Relational query optimization depends on knowledge of the physical storage of data (access paths) which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly. Others propose a mechanism whereby objects “reveal” their costs as part of their interface.

Complex Objects and Inheritance

Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages. We discuss this issue in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.

Object Models

OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others. This diversity of approaches is likely to prevail for some time, therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.

Query Processing Architecture

In this section we focus on two architectural issues: the query processing methodology and the query optimizer architecture.

Query Processing Methodology

A query processing methodology similar to relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. Figure 6.1 depicts such a methodology proposed in the literature.

The steps of the methodology are as follows:

1. Queries are expressed in a declarative language.
2. It requires no user knowledge of object implementations, access paths or processing strategies.
3. The calculus expression is first derived from the query; it is then subject to:
4. Calculus optimization,
5. Calculus to algebra transformation,
6. Type check,
7. Algebra optimization,
8. Execution plan generation, and
9. Execution.

Object Query Processing Methodology

4) Describe the following:
o Data Mining Functions
o Data Mining Techniques

Ans: Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described in this section.

Classification

Data Mining tools have to infer a model from the database, and in the case of Supervised Learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple and these are known as predicted attributes whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

Once classes are defined, the system should infer rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should only refer to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where LHS is true, RHS is also true or very probable. The categories of rules are:

· Exact Rule – permits no exceptions so each object of LHS must be an element of RHS

· Strong Rule – allows some exceptions, but the exceptions have a given limit

· Probabilistic Rule – relates the conditional probability P(RHS|LHS) to the probability P(RHS).

Other types of rules are classification rules, where LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.

Associations

Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
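A hedged sketch of how the support counts behind such a rule ({A, B, C} => {D, E}) could be computed directly in SQL, assuming a hypothetical Purchases(record_id, item) table:

-- Records containing all of A, B and C
WITH abc AS (
    SELECT record_id
    FROM   Purchases
    WHERE  item IN ('A', 'B', 'C')
    GROUP BY record_id
    HAVING COUNT(DISTINCT item) = 3
),
-- Of those, records that also contain both D and E
abcde AS (
    SELECT p.record_id
    FROM   Purchases p JOIN abc ON p.record_id = abc.record_id
    WHERE  p.item IN ('D', 'E')
    GROUP BY p.record_id
    HAVING COUNT(DISTINCT p.item) = 2
)
SELECT (SELECT COUNT(*) FROM abcde) / (SELECT COUNT(*) FROM abc) AS confidence
FROM   dual;

A confidence of 0.72 would correspond to the "72%" rule quoted above.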

Sequential/Temporal patterns

Sequential/temporal pattern functions analyze a collection of records over a period of time for example to identify trends. Where the identity of a customer who made a purchase is known an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who did the repeated purchases. Such a situation is typical of a direct mail application where for example a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover for example the set of purchases that frequently precedes the purchase of a microwave oven.

Clustering/Segmentation

Clustering and Segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A Cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.

IBM – Market Basket Analysis example

IBM have used segmentation techniques in their Market Basket Analysis on POS transactions where they separate a set of untagged input records into reasonable groups according to product revenue by market basket i.e. the market baskets were segmented based on the number and type of products in the individual baskets.

Each segment reports total revenue and number of baskets; using a neural network, 275,000 transaction records were divided into 16 segments. The following types of analysis were also available:

1. Revenue by segment
2. Baskets by segment
3. Average revenue by segment, etc.

Data Mining Techniques

Cluster Analysis

In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database as shown in the following diagram. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc., which describe each of these subsets.

Induction

A database is a store of information but more important is the information which can be inferred from it. There are two main inference techniques available i.e. deduction and induction.

· Deduction is a technique to infer information that is a logical consequence of the information in the database e.g. the join operator applied to two relational tables where the first concerns employees and departments and the second departments and managers infers a relation between employee and managers.

· Induction has been described earlier as the technique to infer information that is generalised from the database as in the example mentioned above to infer that each employee has a manager. This is higher level information or knowledge in that it is a general statement about objects in the database. The database is searched for patterns or regularities.

Induction has been used in the following ways within data mining.

Decision Trees

Decision Trees are a simple knowledge representation that classifies examples into a finite number of classes. The nodes are labeled with attribute names, the edges are labeled with possible values for the attribute, and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, taking the edges corresponding to the values of the attributes in an object.

The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity etc. Some objects are positive examples, denoted by P, and others are negative, i.e. N. Classification is in this case the construction of a tree structure, illustrated in the following diagram, which can be used to classify all the objects correctly.

Decision Tree Structure

Neural Networks

Neural Networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural Networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained Neural Network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

Neural Networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

· Sales Forecasting
· Industrial Process Control
· Customer Research
· Data Validation
· Risk Management
· Target Marketing, etc.

Neural Networks use a set of processing elements (or nodes) analogous to Neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs that simply follow instructions in a fixed sequential order.

The structure of a neural network looks something like the following:

Structure of a neural network

The bottom layer represents the input layer, in this case with 5 inputs labeled X1 through X5. In the middle is the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing output values we are trying to determine from the inputs, for example predicting sales (output) based on past sales, price and season (inputs).

Each node in the hidden layer is fully connected to the inputs, which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. The following diagram provides some detail into what goes on inside a hidden node.

Simply speaking, a weighted sum is performed: X1 times W1, plus X2 times W2, and so on through X5 times W5. This weighted sum is computed for each hidden node and each output node, and it is how interactions are represented in the network.
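A minimal sketch of that computation in Python, with invented weights for a 5-input, 3-hidden-node, 2-output network; a real network would learn these weights during training rather than have them fixed by hand.

import math

def forward(inputs, hidden_weights, output_weights):
    # Weighted sum at each hidden node, passed through a sigmoid activation.
    hidden = [1.0 / (1.0 + math.exp(-sum(x * w for x, w in zip(inputs, ws))))
              for ws in hidden_weights]
    # Weighted sum of the hidden activations at each output node.
    return [sum(h * w for h, w in zip(hidden, ws)) for ws in output_weights]

# Hypothetical network: inputs X1..X5, 3 hidden nodes, outputs Z1 and Z2.
x = [0.2, 0.5, 0.1, 0.9, 0.4]
hidden_w = [[0.1, -0.2, 0.4, 0.3, 0.0],
            [0.5, 0.1, -0.3, 0.2, 0.2],
            [-0.1, 0.4, 0.2, 0.1, 0.3]]
output_w = [[0.3, -0.5, 0.2],
            [0.1, 0.4, -0.2]]
print(forward(x, hidden_w, output_w))   # -> [Z1, Z2]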

The issue of where the network gets its weights from is important, but suffice it to say that the network learns to reduce the error in its prediction of events already known (i.e. past history).

The problems of using neural networks have been summed up by Arun Swami of Silicon Graphics Computer Systems. Neural networks have been used successfully for classification, but suffer somewhat in that the resulting network is viewed as a black box and no explanation of the results is given. This lack of explanation inhibits confidence, acceptance and application of the results. He also notes as a problem the fact that neural networks suffer from long learning times, which become worse as the volume of data grows.

The Clementine User Guide has the following simple diagram 7.6 to summarize a Neural Net trained to identify the risk of cancer from a number of factors.

On-Line Analytical Processing

A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers which are optimized for handling specific data management problems. Until recently, organizations have tried to target Relational Database Management Systems (RDBMSs) for the complete spectrum of database applications. It is however apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an Object-Oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of On-Line Analytical Processing (OLAP). OLAP was a term coined by E. F. Codd (1993) and was defined by him as “the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data”.

Codd has developed rules or requirements for an OLAP system:
· Multidimensional Conceptual View
· Transparency
· Accessibility
· Consistent Reporting Performance
· Client/Server Architecture
· Generic Dimensionality
· Dynamic Sparse Matrix Handling
· Multi-User Support
· Unrestricted Cross-Dimensional Operations
· Intuitive Data Manipulation
· Flexible Reporting
· Unlimited Dimensions and Aggregation Levels
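The multidimensional conceptual view can be sketched quite simply: choose dimensions, then aggregate a measure across them. The following Python fragment (with an invented fact table and the pandas library) shows the essence of such an OLAP-style summary; it is only an illustration, not how an OLAP server is implemented.

import pandas as pd

# Invented fact table: each row is a sale described by two dimensions
# (region, product) and one measure (revenue).
facts = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "west"],
    "product": ["tea", "coffee", "tea", "coffee", "coffee"],
    "revenue": [100, 150, 80, 120, 60],
})

# OLAP-style summary: group the facts by the chosen dimensions
# and aggregate the measure.
cube = pd.pivot_table(facts, values="revenue",
                      index="region", columns="product", aggfunc="sum")
print(cube)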

5) Describe the following: o Statements and Transactions in a Distributed Database o Heterogeneous Distributed Database Systems

Ans:Statements and Transactions in a Distributed Database

The following sections introduce the terminology used when discussing statements and transactions in a distributed database environment.

Remote and Distributed Statements

A Remote Query is a query that selects information from one or more remote tables, all of which reside at the same remote node.

A Remote Update is an update that modifies data in one or more tables, all of which are located at the same remote node.

Note: A remote update may include a sub-query that retrieves data from one or more remote nodes, but because the update is performed at only a single remote node, the statement is classified as a remote update.

A Distributed Query retrieves information from two or more nodes. A distributed update modifies data on two or more nodes. A distributed update is possible using a program unit, such as a procedure or a trigger, that includes two or more remote updates that access data on different nodes. Statements in the program unit are sent to the remote nodes, and the execution of the program succeeds or fails as a unit.
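As a rough illustration of the distinction, the SQL text below is issued through a standard Python DB-API cursor against a local Oracle database; the database link names (sales_link, hr_link) and table names are invented, and the gateway/link setup is assumed to exist already.

def run(cursor):
    # Remote query: all referenced tables live at ONE remote node.
    cursor.execute("SELECT * FROM orders@sales_link WHERE amount > 100")

    # Distributed query: data is combined from TWO different nodes.
    cursor.execute("""
        SELECT o.order_id, e.name
        FROM   orders@sales_link o
        JOIN   employees@hr_link e ON o.emp_id = e.emp_id
    """)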

Remote and Distributed Transactions

A Remote Transaction is a transaction that contains one or more remote statements, all of which reference the same remote node. A Distributed Transaction is any transaction that includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. If all statements of a transaction reference only a single remote node, the transaction is remote, not distributed.

Heterogeneous Distributed Database Systems

The Oracle distributed database architecture allows the mix of different versions of Oracle along with database products from other companies to create a heterogeneous distributed database system.

The Mechanics of a Heterogeneous Distributed Database

In a distributed database, any application directly connected to a database can issue a SQL statement that accesses remote data in the following ways (For the sake of explanation we have taken Oracle as a base):

· Data in another database is available, no matter what version. Databases at other physical locations are connected through a network and maintain communication.

· Data in a non-compatible database (such as an IBM DB2 database) is available, assuming that the non-compatible database is supported by the application's gateway architecture, say SQL*Connect in the case of Oracle. One can connect the Oracle and non-Oracle databases with a network and use SQL*Net to maintain communication.

Heterogeneous Distributed Database Systems

When connections from an Oracle node to a remote node (Oracle or non-Oracle) initially are established, the connecting Oracle node records the capabilities of each remote system and the associated gateways. SQL statement execution proceeds. However, in heterogeneous distributed systems, SQL statements issued from an Oracle database to a non-Oracle remote database server are limited by the capabilities of the remote database server and associated gateway. For example, if a remote or distributed query includes an Oracle extended SQL function (for example, an outer join), the function may have to be performed by the local Oracle database. Extended SQL functions in remote updates (for example, an outer join in a sub-query) are not supported by all gateways.

6) Discuss the following with respect to Distributed Database Systems: o Problem Areas of Distributed Databases o Transaction Processing Framework o Models Of Failure

Ans:Problem Areas of Distributed Databases

The following are the crucial areas in a Distributed Database environment that need to be looked into carefully in order to make it successful. We shall discuss these in much more detail in the following sections:

· Distributed Database Design
· Distributed Query Processing
· Distributed Directory Management
· Distributed Concurrency Control
· Distributed Deadlock Management
· Reliability in Distributed DBMS
· Operating System Support
· Heterogeneous Databases

Transaction Processing Framework

A transaction is always part of an application. At some time after its invocation by the user, the application issues a begin_transaction primitive; from this moment, all actions performed by the application, until a commit or abort primitive is issued, are considered part of the same transaction. Alternatively, the beginning of a transaction is implicitly associated with the beginning of the application, and a commit/abort primitive ends a transaction and automatically begins a new one, so that an explicit begin_transaction primitive is not necessary.

In order to perform functions at different sites, a distributed application has to execute several processes at these sites. Let us call these processes the agents of the application. An agent is therefore a local process which performs some actions on behalf of an application.

Any transaction must satisfy the following four properties:

Atomicity: Either all or none of the transaction's operations are performed. In other words if a transaction is interrupted by a failure, its partial results are undone.

Consistency Preservation: A transaction is consistency preserving if its complete execution takes the database from one consistent state to another.

Isolation: Execution of a transaction should not be interfered with by any other transactions executing concurrently. It should appear that a transaction is being executed in isolation from other transactions. An incomplete transaction cannot reveal its results to other transactions before its commitment. This property is needed in order to avoid the problem of cascading aborts.

Durability (Permanency): Once a transaction has committed, the system must guarantee that the results of its operations will never be lost, independent of subsequent failures.

Since the results of a transaction, which must be preserved by the system, are stored in the database, the activity of providing the transaction's durability is called database recovery.
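At a single site these properties are what the commit/rollback primitives provide. A minimal local sketch using Python's built-in sqlite3 module (with an invented ACCOUNT table) shows the pattern that the distributed machinery below has to preserve across several sites:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (number TEXT PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO account VALUES ('A', 100.0), ('B', 50.0)")
conn.commit()

try:
    # Atomicity: both updates succeed or neither does.
    conn.execute("UPDATE account SET amount = amount - 30 WHERE number = 'A'")
    conn.execute("UPDATE account SET amount = amount + 30 WHERE number = 'B'")
    conn.commit()          # durability: the results survive later failures
except sqlite3.Error:
    conn.rollback()        # partial results are undone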

Goals of Transaction Management in a Distributed Database: Efficient, reliable and concurrent execution of transactions. These three goals are strongly interrelated; moreover, there is a trade-off between them.

In order to cooperate in the execution of the global operation required by the application, the agents have to communicate. As they are resident at different sites, the communication between agents is performed through messages. Assume that

1) There exists a root agent which starts the whole transaction, so that when the user requests the execution of an application, the root agent is started; the site of the root agent is called the site of origin of the transaction.

2) The root agent has the responsibility of issuing the begin_transaction, commit and abort primitives.

3) Only the root agent can request the creation of a new agent.

In order to build a distributed transaction manager (DTM) which implements the global primitives begin_transaction, commit and abort, it is convenient to assume that we have at each site a local transaction manager (LTM) which is capable of implementing local transactions.

Let us take the example of a fund transfer to demonstrate the application of the above reference model.

FUND_TRANSFER:
    Read(terminal, $AMOUNT, $FROM_ACC, $TO_ACC);
    Begin_transaction;
    Select AMOUNT into $FROM_AMOUNT
        from ACCOUNT
        where ACCOUNT_NUMBER = $FROM_ACC;
    if $FROM_AMOUNT - $AMOUNT < 0 then abort
    else begin
        Update ACCOUNT
            set AMOUNT = AMOUNT - $AMOUNT
            where ACCOUNT_NUMBER = $FROM_ACC;
        Update ACCOUNT
            set AMOUNT = AMOUNT + $AMOUNT
            where ACCOUNT_NUMBER = $TO_ACC;
        Commit
    end

a) The FUND_TRANSFER transaction at the global level

Note: The above reference model is a conceptual model for understanding at which level an operation belongs and is not necessarily an implementation structure.

ROOT_AGENT:
    Read(terminal, $AMOUNT, $FROM_ACC, $TO_ACC);
    Begin_transaction;
    Select AMOUNT into $FROM_AMOUNT
        from ACCOUNT
        where ACCOUNT_NUMBER = $FROM_ACC;
    if $FROM_AMOUNT - $AMOUNT < 0 then abort
    else begin
        Update ACCOUNT
            set AMOUNT = AMOUNT - $AMOUNT
            where ACCOUNT_NUMBER = $FROM_ACC;
        Create AGENT1;
        Send to AGENT1($AMOUNT, $TO_ACC);
        Commit
    end

AGENT1:
    Receive from ROOT_AGENT($AMOUNT, $TO_ACC);
    Update ACCOUNT
        set AMOUNT = AMOUNT + $AMOUNT
        where ACCOUNT_NUMBER = $TO_ACC;

b) The FUND_TRANSFER transaction constituted by two agents

When a begin_transaction is issued by the root agent, the DTM will have to issue a local_begin primitive to the LTM at the site of origin and at all the sites at which there are already active agents of the same application, thus transforming all agents into sub-transactions. From this time on, the activation of a new agent by the same distributed transaction requires that a local_begin be issued to the LTM where the agent is activated, so that the new agent is created as a sub-transaction.
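The structure just described can be sketched in a few lines of Python; the class and method names below are invented for illustration only and ignore the atomic commitment protocol a real DTM would need.

class LTM:
    """Local transaction manager at one site (sketch only)."""
    def local_begin(self, tid):
        print(f"LTM: begin sub-transaction {tid}")
    def local_commit(self, tid):
        print(f"LTM: commit sub-transaction {tid}")

class DTM:
    """Distributed transaction manager (sketch only)."""
    def __init__(self, ltms_by_site):
        self.ltms = ltms_by_site          # site name -> LTM at that site
        self.active_sites = set()

    def begin_transaction(self, tid, site_of_origin):
        # Issue local_begin at the site of origin (the root agent's site).
        self.active_sites = {site_of_origin}
        self.ltms[site_of_origin].local_begin(tid)

    def create_agent(self, tid, site):
        # Activating a new agent creates it as a sub-transaction at its site.
        self.active_sites.add(site)
        self.ltms[site].local_begin(tid)

    def commit(self, tid):
        # A real DTM would run an atomic commitment protocol (e.g. 2PC) here.
        for site in self.active_sites:
            self.ltms[site].local_commit(tid)

dtm = DTM({"site1": LTM(), "site2": LTM()})
dtm.begin_transaction("T1", "site1")   # root agent at the site of origin
dtm.create_agent("T1", "site2")        # AGENT1 becomes a sub-transaction
dtm.commit("T1")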

Models of Failures

Failures can be classified as follows:

1) Transaction Failures

a) Error in transaction due to incorrect data input.

b) Present or potential deadlock.

c) 'Abort' of transactions due to non-availability of resources or deadlock.

2) Site Failures: From the recovery point of view, a failure has to be judged from the viewpoint of loss of memory. So these failures can be classified as

a) Failure with Loss of Volatile Storage: In these failures the content of main memory is lost; however, the information recorded on disk is not affected by the failure. Typical failures of this kind are system crashes.

b) Media Failures (Failures with loss of Nonvolatile Storage): In these failures the content of disk storage is lost. Failures of this type can be reduced by replicating the information on several disks having 'independent failure modes'.

Stable storage is the most resilient storage medium available in the system. It is implemented by replicating the same information on several disks with (i) independent failure modes, and (ii) the so-called careful replacement strategy: at every update operation, first one copy of the information is updated, then the correctness of the update is verified, and finally the second copy is updated.
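A minimal sketch of the careful replacement idea in Python; the file names and the checksum used for verification are invented for illustration. The point is simply that one copy is written and verified before the second copy is touched, so at least one copy is always intact.

import hashlib, os

def careful_replace(data: bytes, copy_a="stable_a.dat", copy_b="stable_b.dat"):
    expected = hashlib.sha256(data).hexdigest()

    def write_and_verify(path):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())           # force the write out to disk
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                raise IOError(f"verification failed for {path}")

    write_and_verify(copy_a)   # first copy updated and verified ...
    write_and_verify(copy_b)   # ... only then is the second copy updated

careful_replace(b"account A = 70, account B = 80")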

3) Communication Failures: There are two basic types of possible communication errors: lost messages and partitions.

When a site X does not receive an acknowledgment of a message from a site Y within a predefined time interval, X is uncertain about the following things:

i) Did a failure occur at all, or is the system simply slow?

ii) If a failure occurred, was it a communication failure, or a crash of site Y?

iii) Has the message been delivered at Y or not? (as the communication failure or the crash can happen before or after the delivery of the message.)
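This uncertainty is easy to reproduce: a timed-out send tells site X only that no acknowledgment arrived in time, not which of the three situations occurred. A rough Python sketch, where the host name and port are invented:

import socket

def send_with_timeout(message: bytes, host="site-y.example", port=9000, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(message)
            s.settimeout(timeout)
            return s.recv(1024)            # wait for the acknowledgment from Y
    except (socket.timeout, OSError):
        # X cannot tell: slow system? lost message? crash of Y? a partition?
        return None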

Network Partition

Thus all failures can be regrouped as

i) Failure of a site

ii) Loss of message(s), with or without site failures but no partitions.

iii) Network Partition: Dealing with network partitions is a harder problem than dealing with site crashes or lost messages.