granularity in the data warehouse

Granularity in the Data Warehouse

Chapter 4

Raw Estimates• The single most important design issue facing

the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse.

• Granularity is also important to the warehouse architect because it affects all the environments that depend on the warehouse for data.

• Granularity affects how efficiently data can be shipped to the different environments and determines the types of analysis that can be done.

• The primary issue of granularity is that of getting it at the right level. The level of granularity needs to be neither too high nor too low.

Raw Estimates• The starting point for determining the appropriate

level of granularity is to do a raw estimate of the number of rows of data and the DASD (direct access storage device) that will be in the data warehouse.

• The raw estimate of the number of rows of data that will reside in the data warehouse tells the architect a great deal. – If there are only 10,000 rows, almost any level of granularity

will do. – If there are 10 million rows, a low level of granularity is

possible.– If there are 10 billion rows, not only is a higher level of

granularity needed, but a major portion of the data will probably go into overflow storage.

• Figure shows an algorithmic path to calculate the space occupied by a data warehouse.

Input to the Planning Process

The estimate of rows and DASD then serves as input to the planning process, as shown in Figure

Data in Overflow

• Once the raw estimate as to the size of the data warehouse is made, the next step is to compare the total number of rows in the warehouse environment to the charts shown in Figure.

• Depending on how many total rows will be in the warehouse environment, different approaches to design, development, and storage are necessary.

Data in Overflow

Data in Overflow

On the five-year horizon, the totals shift by about an order of magnitude or perhaps even more. The theory is that after five years, these factors will be in place:

■■There will be more expertise available in managing the data warehouse volumes of data.

■■Hardware costs will have dropped to some extent.

■■More powerful software tools will be available.

■■The end user will be more sophisticated.

Overflow Storage• Data in the data warehouse environment grows

at a rate never before seen by IT professionals. The combination of historical data and detailed data produces a growth rate that is phenomenal.

• As data grows large, a natural subdivision of data occurs between actively used data and inactively used data. Inactive data is sometimes called dormant data or infrequently used data.

• At some point in the life of the data warehouse, the vast majority of the data in the warehouse becomes stale and unused. At this point, it makes sense to start separating the data onto different storage media.

Overflow Storage• Figure shows that a data monitor is needed to

determine the usage of data. The data monitor tells where to place data by determining what data is and is not being used in the data warehouse.

• The movement between disk storage and near-line storage is controlled by means of software called a cross-media storage manager (CMSM).

• The data in alternate storage or near-line storage can be accessed directly by means of software that has the intelligence to know where data is located in near-line storage.

• These three software components are the minimum required for alternate storage or near-line storage to be used effectively.

What the Levels of Granularity Will Be

Some Feedback Loop Techniques

• Build the first parts of the data warehouse in very small, very fast steps, and carefully listen to the end users’ comments at the end of each step of development. Be prepared to make adjustments quickly.

• If available, use prototyping and allow the feedback loop to function using observations gleaned from the prototype.

• Look at how other people have built their levels of granularity and learn from their experience.

• Go through the feedback process with an experienced user who is aware of the process occurring.

• Under no circumstances should you keep your users in the dark as to the dynamics of the feedback loop.

• Look at whatever the organization has now that appears to be working, and use those functional requirements as a guideline.

• Execute joint application design (JAD) sessions and simulate the output to achieve the desired feedback.

Some Feedback Loop Techniques

Granularity of data can be raised in many ways, such as the following:

• Summarize data from the source as it goes into the target.

• Average or otherwise calculate data as it goes into the target.

• Push highest and/or lowest set values into the target.• Push only data that is obviously needed into the target.• Use conditional logic to select only a subset of records to

go into the target.The ways that data may be summarized or aggregated are

limitless. When building a data warehouse, keep one important point in mind: In classical requirements systems development, it is unwise to proceed until the vast majority of the requirements are identified. But in building the data warehouse, it is unwise not to proceed if at least half of the requirements for the data warehouse are identified.

Levels of Granularity — Banking Environment

The Data Warehouseand Technology

Chapter 5

Some basic requirement of technology supporting Data warehousing

Managing Large Amounts of Data

• In the ideal case, the data warehouse developer builds a data warehouse under the assumption that the technology that houses the data warehouse can handle the volumes required.

• When the designer has to go to extraordinary lengths in design and implementation to map the technology to the data warehouse, then there is a problem with the underlying technology.

• When technology is an issue, it is normal to engage more than one technology.

• The ability to participate in moving dormant data to overflow storage is perhaps the most strategic capability that a technology can have.

Managing Multiple Media

• In conjunction with managing large amounts of data efficiently and cost effectively, the technology underlying the data warehouse must handle multiple storage media. It is insufficient to manage a mature data warehouse on DASD alone.

• Following is a hierarchy of storage of data in terms of speed of access and cost of storage:

Indexing and Monitoring Data

• Of course, the designer uses many practices to make data as flexible as possible, such as spreading data across different storage media and partitioning data. But the technology that houses the data must be able to support easy indexing as well.

• Unlike the monitoring of transaction processing, where the transactions themselves are monitored, data warehouse activity monitoring determines what data has and has not been used.

Monitoring data warehouse data determines such factors as the following:– If a reorganization needs to be done– If an index is poorly structured– If too much or not enough data is in overflow– The statistical composition of the access of the data– Available remaining space

Interfaces to Many Technologies

granularity in the data warehouse

Documents