0071610391_chap01-libre.pdf

42
CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1 1 Introduction to Data Warehousing I nformationassetsareimmenselyvaluabletoanyenterprise,andbecauseofthis, theseassetsmustbeproperlystoredandreadilyaccessiblewhentheyareneeded. However,theavailabilityoftoomuchdatamakestheextractionofthemost importantinformationdifficult,ifnotimpossible.ViewresultsfromanyGooglesearch, andyou’llseethatthedata=informationequationisnotalwayscorrect—thatis,too muchdataissimplytoomuch. Datawarehousingisaphenomenonthatgrewfromthehugeamountofelectronicdata storedinrecentyearsandfromtheurgentneedtousethatdatatoaccomplishgoalsthatgo beyondtheroutinetaskslinkedtodailyprocessing.Inatypicalscenario,alargecorporation hasmanybranches,andseniormanagersneedtoquantifyandevaluatehoweachbranch contributestotheglobalbusinessperformance.Thecorporatedatabasestoresdetaileddata onthetasksperformedbybranches.Tomeetthemanagers’needs,tailor-madequeriescan beissuedtoretrievetherequireddata.Inorderforthisprocesstowork,database administratorsmustfirstformulatethedesiredquery(typicallyanaggregateSQLquery) aftercloselystudyingdatabasecatalogs.Thenthequeryisprocessed.Thiscantakeafew hoursbecauseofthehugeamountofdata,thequerycomplexity,andtheconcurrenteffects ofotherregularworkloadqueriesondata.Finally,areportisgeneratedandpassedto seniormanagersintheformofaspreadsheet. Manyyearsago,databasedesignersrealizedthatsuchanapproachishardlyfeasible, becauseitisverydemandingintermsoftimeandresources,anditdoesnotalwaysachieve thedesiredresults.Moreover,amixofanalyticalquerieswithtransactionalroutinequeries inevitablyslowsdownthesystem,andthisdoesnotmeettheneedsofusersofeithertype ofquery.Today’sadvanceddatawarehousingprocessesseparateonlineanalyticalprocessing (OLAP)fromonlinetransactionalprocessing(OLTP)bycreatinganewinformationrepository thatintegratesbasicdatafromvarioussources,properlyarrangesdataformats,andthen makesdataavailableforanalysisandevaluationaimedatplanninganddecision-making processes(Lechtenbörger,2001). 1 CHAPTER ch01.indd1 4/21/093:23:27PM

Upload: pragativbora

Post on 18-Jul-2016

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

1Introduction to Data

Warehousing

Informationassetsareimmenselyvaluabletoanyenterprise,andbecauseofthis,theseassetsmustbeproperlystoredandreadilyaccessiblewhentheyareneeded.However,theavailabilityoftoomuchdatamakestheextractionofthemost

importantinformationdifficult,ifnotimpossible.ViewresultsfromanyGooglesearch,andyou’llseethatthedata=informationequationisnotalwayscorrect—thatis,toomuchdataissimplytoomuch.

Datawarehousingisaphenomenonthatgrewfromthehugeamountofelectronicdatastoredinrecentyearsandfromtheurgentneedtousethatdatatoaccomplishgoalsthatgobeyondtheroutinetaskslinkedtodailyprocessing.Inatypicalscenario,alargecorporationhasmanybranches,andseniormanagersneedtoquantifyandevaluatehoweachbranchcontributestotheglobalbusinessperformance.Thecorporatedatabasestoresdetaileddataonthetasksperformedbybranches.Tomeetthemanagers’needs,tailor-madequeriescanbeissuedtoretrievetherequireddata.Inorderforthisprocesstowork,databaseadministratorsmustfirstformulatethedesiredquery(typicallyanaggregateSQLquery)aftercloselystudyingdatabasecatalogs.Thenthequeryisprocessed.Thiscantakeafewhoursbecauseofthehugeamountofdata,thequerycomplexity,andtheconcurrenteffectsofotherregularworkloadqueriesondata.Finally,areportisgeneratedandpassedtoseniormanagersintheformofaspreadsheet.

Manyyearsago,databasedesignersrealizedthatsuchanapproachishardlyfeasible,becauseitisverydemandingintermsoftimeandresources,anditdoesnotalwaysachievethedesiredresults.Moreover,amixofanalyticalquerieswithtransactionalroutinequeriesinevitablyslowsdownthesystem,andthisdoesnotmeettheneedsofusersofeithertypeofquery.Today’sadvanceddatawarehousingprocessesseparateonlineanalyticalprocessing(OLAP)fromonlinetransactionalprocessing(OLTP)bycreatinganewinformationrepositorythatintegratesbasicdatafromvarioussources,properlyarrangesdataformats,andthenmakesdataavailableforanalysisandevaluationaimedatplanninganddecision-makingprocesses(Lechtenbörger,2001).

1

CHAPTER

ch01.indd1 4/21/093:23:27PM

Page 2: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

2 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 2 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Let’sreviewsomefieldsofapplicationforwhichdatawarehousetechnologiesaresuccessfullyused:

• Trade Salesandclaimsanalyses,shipmentandinventorycontrol,customercareandpublicrelations

• Craftsmanship Productioncostcontrol,supplierandordersupport

• Financialservices Riskanalysisandcreditcards,frauddetection

• Transportindustry Vehiclemanagement

• Telecommunicationservices Callflowanalysisandcustomerprofileanalysis

• Healthcareservice Patientadmissionanddischargeanalysisandbookkeepinginaccountsdepartments

Thefieldofapplicationofdatawarehousesystemsisnotonlyrestrictedtoenterprises,butitalsorangesfromepidemiologytodemography,fromnaturalsciencetoeducation.ApropertythatiscommontoallfieldsistheneedforstorageandquerytoolstoretrieveinformationsummarieseasilyandquicklyfromthehugeamountofdatastoredindatabasesormadeavailablebytheInternet.Thiskindofinformationallowsustostudybusinessphenomena,learnaboutmeaningfulcorrelations,andgainusefulknowledgetosupportdecision-makingprocesses.

1.1 Decision Support SystemsUntilthemid-1980s,enterprisedatabasesstoredonlyoperationaldata—datacreatedbybusinessoperationsinvolvedindailymanagementprocesses,suchaspurchasemanagement,salesmanagement,andinvoicing.However,everyenterprisemusthavequick,comprehensiveaccesstotheinformationrequiredbydecision-makingprocesses.ThisstrategicinformationisextractedmainlyfromthehugeamountofoperationaldatastoredinenterprisedatabasesbymeansofaprogressiveselectionandaggregationprocessshowninFigure1-1.

FIGURE 1-1 Information value as a function of quantity

ch01.indd2 4/21/093:23:28PM

Page 3: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 3

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 3

Anexponentialincreaseinoperationaldatahasmadecomputerstheonlytoolssuitableforprovidingdatafordecision-makingperformedbybusinessmanagers.Thisfacthasdramaticallyaffectedtheroleofenterprisedatabasesandfosteredtheintroductionofdecisionsupportsystems.Theconceptofdecisionsupportsystemsmainlyevolvedfromtworesearchfields:theoreticalstudiesondecision-makingprocessesfororganizationsandtechnicalresearchoninteractiveITsystems.However,thedecisionsupportsystemconceptisbasedonseveraldisciplines,suchasdatabases,artificialintelligence,man-machineinteraction,andsimulation.Decisionsupportsystemsbecamearesearchfieldinthemid-’70sandbecamemorepopularinthe’80s.

Inpractice,aDSSisanITsystemthathelpsmanagersmakedecisionsorchooseamongdifferentalternatives.Thesystemprovidesvalueestimatesforeachalternative,allowingthemanagertocriticallyreviewtheresults.Table1-1showsapossibleclassificationofDSSsonthebasisoftheirfunctions(Power,2002).

Fromthearchitecturalviewpoint,aDSStypicallyincludesamodel-basedmanagementsystemconnectedtoaknowledgeengineand,ofcourse,aninteractivegraphicaluserinterface(SpragueandCarlson,1982).Datawarehousesystemshavebeenmanagingthedataback-endsofDSSssincethe1990s.Theymustretrieveusefulinformationfromahugeamountofdatastoredonheterogeneousplatforms.Inthisway,decision-makerscanformulatetheirqueriesandconductcomplexanalysesonrelevantinformationwithoutslowingdownoperationalsystems.

TABLE 1-1 Classification of Decision Support Systems

System Description

Passive DSS Supports decision-making processes, but it does not offer explicit suggestions on decisions or solutions.

Active DSS Offers suggestions and solutions.

Collaborative DSS Operates interactively and allows decision-makers to modify, integrate, or refine suggestions given by the system. Suggestions are sent back to the system for validation.

Model-driven DSS Enhances management of statistical, financial, optimization, and simulation models.

Communication-driven DSS Supports a group of people working on a common task.

Data-driven DSS Enhances the access and management of time series of corporate and external data.

Document-driven DSS Manages and processes nonstructured data in many formats.

Knowledge-driven DSS Provides problem-solving features in the form of facts, rules, and procedures.

Decision Support SystemAdecisionsupportsystem(DSS)isasetofexpandable,interactiveITtechniquesandtoolsdesignedforprocessingandanalyzingdataandforsupportingmanagersindecisionmaking.Todothis,thesystemmatchesindividualresourcesofmanagerswithcomputerresourcestoimprovethequalityofthedecisionsmade.

ch01.indd3 4/21/093:23:28PM

Page 4: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

4 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 4 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

1.2 Data WarehousingDatawarehousesystemsareprobablythesystemstowhichacademiccommunitiesandindustrialbodieshavebeenpayingthegreatestattentionamongalltheDSSs.Datawarehousingcanbeinformallydefinedasfollows:

Thedefinitionofdatawarehousingpresentedhereisintentionallygeneric;itgivesyouanideaoftheprocessbutdoesnotincludespecificfeaturesoftheprocess.Tounderstandtheroleandtheusefulpropertiesofdatawarehousingcompletely,youmustfirstunderstandtheneedsthatbroughtitintobeing.In1996,R.Kimballefficientlysummedupafewclaimsfrequentlysubmittedbyendusersofclassicinformationsystems:

• “Wehaveheapsofdata,butwecannotaccessit!”Thisshowsthefrustrationofthosewhoareresponsibleforthefutureoftheirenterprisesbuthavenotechnicaltoolstohelpthemextracttherequiredinformationinaproperformat.

• “Howcanpeopleplayingthesameroleachievesubstantiallydifferentresults?”Inmidsizetolargeenterprises,manydatabasesareusuallyavailable,eachdevotedtoaspecificbusinessarea.Theyareoftenstoredondifferentlogicalandphysicalmediathatarenotconceptuallyintegrated.Forthisreason,theresultsachievedineverybusinessareaarelikelytobeinconsistent.

• “Wewanttoselect,group,andmanipulatedataineverypossibleway!”Decision-makingprocessescannotalwaysbeplannedbeforethedecisionsaremade.Endusersneedatoolthatisuser-friendlyandflexibleenoughtoconductadhocanalyses.Theywanttochoosewhichnewcorrelationstheyneedtosearchforinrealtimeastheyanalyzetheinformationretrieved.

• “Showmejustwhatmatters!”Examiningdataatthemaximumlevelofdetailisnotonlyuselessfordecision-makingprocesses,butisalsoself-defeating,becauseitdoesnotallowuserstofocustheirattentiononmeaningfulinformation.

• “Everyoneknowsthatsomedataiswrong!”Thisisanothersorepoint.Anappreciablepercentageoftransactionaldataisnotcorrect—oritisunavailable.Itisclearthatyoucannotachievegoodresultsifyoubaseyouranalysesonincorrectorincompletedata.

Wecanusethepreviouslistofproblemsanddifficultiestoextractalistofkeywordsthatbecomedistinguishingmarksandessentialrequirementsforadatawarehouseprocess,asetoftasksthatallowustoturnoperationaldataintodecision-makingsupportinformation:

• accessibilitytousersnotveryfamiliarwithITanddatastructures;

• integrationofdataonthebasisofastandardenterprisemodel;

• queryflexibilitytomaximizetheadvantagesobtainedfromtheexistinginformation;

Data WarehousingDatawarehousingisacollectionofmethods,techniques,andtoolsusedtosupportknowledgeworkers—seniormanagers,directors,managers,andanalysts—toconductdataanalysesthathelpwithperformingdecision-makingprocessesandimprovinginformationresources.

ch01.indd4 4/21/093:23:28PM

Page 5: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 5

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 5

• informationconcisenessallowingfortarget-orientedandeffectiveanalyses;

• multidimensionalrepresentationgivingusersanintuitiveandmanageableviewofinformation;

• correctnessandcompletenessofintegrateddata.

Datawarehousesareplacedrightinthemiddleofthisprocessandactasrepositoriesfordata.Theymakesurethattherequirementssetcanbefulfilled.

Datawarehousesaresubject-orientedbecausetheyhingeonenterprise-specificconcepts,suchascustomers,products,sales,andorders.Onthecontrary,operationaldatabaseshingeonmanydifferententerprise-specificapplications.

Weputemphasisonintegrationandconsistencybecausedatawarehousestakeadvantageofmultipledatasources,suchasdataextractedfromproductionandthenstoredtoenterprisedatabases,orevendatafromathirdparty’sinformationsystems.Adatawarehouseshouldprovideaunifiedviewofallthedata.Generallyspeaking,wecanstatethatcreatingadatawarehousesystemdoesnotrequirethatnewinformationbeadded;rather,existinginformationneedsrearranging.Thisimplicitlymeansthataninformationsystemshouldbepreviouslyavailable.

Operationaldatausuallycoversashortperiodoftime,becausemosttransactionsinvolvethelatestdata.Adatawarehouseshouldenableanalysesthatinsteadcoverafewyears.Forthisreason,datawarehousesareregularlyupdatedfromoperationaldataandkeepongrowing.Ifdatawerevisuallyrepresented,itmightprogresslikeso:Aphotographofoperationaldatawouldbemadeatregularintervals.Thesequenceofphotographswouldbestoredtoadatawarehouse,andresultswouldbeshowninamoviethatrevealsthestatusofanenterprisefromitsfoundationuntilpresent.

Fundamentally,dataisneverdeletedfromdatawarehousesandupdatesarenormallycarriedoutwhendatawarehousesareoffline.Thismeansthatdatawarehousescanbeessentiallyviewedasread-onlydatabases.Thissatisfiestheusers’needforashortanalysisqueryresponsetimeandhasotherimportanteffects.First,itaffectsdatawarehouse–specificdatabasemanagementsystem(DBMS)technologies,becausethereisnoneedforadvancedtransactionmanagementtechniquesrequiredbyoperationalapplications.Second,datawarehousesoperateinread-onlymode,sodatawarehouse–specificlogicaldesignsolutionsarecompletelydifferentfromthoseusedforoperationaldatabases.Forinstance,themostobviousfeatureofdatawarehouserelationalimplementationsisthattablenormalizationcanbegivenuptopartiallydenormalizetablesandimproveperformance.

Otherdifferencesbetweenoperationaldatabasesanddatawarehousesareconnectedwithquerytypes.Operationalqueriesexecutetransactionsthatgenerallyread/writea

Data WarehouseAdatawarehouseisacollectionofdatathatsupportsdecision-makingprocesses.Itprovidesthefollowingfeatures(Inmon,2005):

• Itissubject-oriented.

• Itisintegratedandconsistent.

• Itshowsitsevolutionovertimeanditisnotvolatile.

ch01.indd5 4/21/093:23:29PM

Page 6: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

6 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 6 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

smallnumberoftuplesfrom/tomanytablesconnectedbysimplerelations.Forexample,thisappliesifyousearchforthedataofacustomerinordertoinsertanewcustomerorder.ThiskindofqueryisanOLTPquery.Onthecontrary,thetypeofqueryrequiredindatawarehousesisOLAP.Itfeaturesdynamic,multidimensionalanalysesthatneedtoscanahugeamountofrecordstoprocessasetofnumericdatasumminguptheperformanceofanenterprise.ItisimportanttonotethatOLTPsystemshaveanessentialworkloadcore“frozen”inapplicationprograms,andadhocdataqueriesareoccasionallyrunfordatamaintenance.Conversely,datawarehouseinteractivityisanessentialpropertyforanalysissessions,sotheactualworkloadconstantlychangesastimegoesby.

ThedistinctivefeaturesofOLAPqueriessuggestadoptionofamultidimensionalrepresentationfordatawarehousedata.Basically,dataisviewedaspointsinspace,whosedimensionscorrespondtomanypossibleanalysisdimensions.Eachpointrepresentsaneventthatoccursinanenterpriseandisdescribedbyasetofmeasuresrelevanttodecision-makingprocesses.Section1.5givesadetaileddescriptionofthemultidimensionalmodelyouabsolutelyneedtobefamiliarwithtounderstandhowtomodelconceptualandlogicallevelsofadatawarehouseandhowtoquerydatawarehouses.

Table1-2summarizesthemaindifferencesbetweenoperationaldatabasesanddatawarehouses.

NOTE Forfurtherdetailsonthedifferentissuesrelatedtothedatawarehouseprocess,refertoChaudhuriandDayal,1997;Inmon,2005;Jarkeetal.,2000;Kelly,1997;Kimball,1996;Mattison,2006;andWrembelandKoncilia,2007.

TABLE 1-2 Differences Between Operational Databases and Data Warehouses (Kelly, 1997)

Feature Operational Databases Data Warehouses

Users Thousands Hundreds

Workload Preset transactions Specific analysis queries

Access To hundreds of records, write and read mode

To millions of records, mainly read-only mode

Goal Depends on applications Decision-making support

Data Detailed, both numeric and alphanumeric

Summed up, mainly numeric

Data integration Application-based Subject-based

Quality In terms of integrity In terms of consistency

Time coverage Current data only Current and historical data

Updates Continuous Periodical

Model Normalized Denormalized, multidimensional

Optimization For OLTP access to a database part

For OLAP access to most of the database

ch01.indd6 4/21/093:23:29PM

Page 7: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 7

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 7

1.3 Data Warehouse ArchitecturesThefollowingarchitecturepropertiesareessentialforadatawarehousesystem(Kelly,1997):

• Separation Analyticalandtransactionalprocessingshouldbekeptapartasmuchaspossible.

• Scalability Hardwareandsoftwarearchitecturesshouldbeeasytoupgradeasthedatavolume,whichhastobemanagedandprocessed,andthenumberofusers’requirements,whichhavetobemet,progressivelyincrease.

• Extensibility Thearchitectureshouldbeabletohostnewapplicationsandtechnologieswithoutredesigningthewholesystem.

• Security Monitoringaccessesisessentialbecauseofthestrategicdatastoredindatawarehouses.

• Administerability Datawarehousemanagementshouldnotbeoverlydifficult.

Twodifferentclassificationsarecommonlyadoptedfordatawarehousearchitectures.Thefirstclassification,describedinsections1.3.1,1.3.2,and1.3.3,isastructure-orientedonethatdependsonthenumberoflayersusedbythearchitecture.Thesecondclassification,describedinsection1.3.4,dependsonhowthedifferentlayersareemployedtocreateenterprise-orientedordepartment-orientedviewsofdatawarehouses.

1.3.1 Single-Layer ArchitectureAsingle-layerarchitectureisnotfrequentlyusedinpractice.Itsgoalistominimizetheamountofdatastored;toreachthisgoal,itremovesdataredundancies.Figure1-2showstheonlylayerphysicallyavailable:thesourcelayer.Inthiscase,datawarehousesarevirtual.

FIGURE 1-2 Single-layer architecture for a data warehouse system

ch01.indd7 4/21/093:23:29PM

Page 8: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

8 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 8 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Thismeansthatadatawarehouseisimplementedasamultidimensionalviewofoperationaldatacreatedbyspecificmiddleware,oranintermediateprocessinglayer(Devlin,1997).

Theweaknessofthisarchitectureliesinitsfailuretomeettherequirementforseparationbetweenanalyticalandtransactionalprocessing.Analysisqueriesaresubmittedtooperationaldataafterthemiddlewareinterpretsthem.Itthisway,thequeriesaffectregulartransactionalworkloads.Inaddition,althoughthisarchitecturecanmeettherequirementforintegrationandcorrectnessofdata,itcannotlogmoredatathansourcesdo.Forthesereasons,avirtualapproachtodatawarehousescanbesuccessfulonlyifanalysisneedsareparticularlyrestrictedandthedatavolumetoanalyzeishuge.

1.3.2 Two-Layer ArchitectureTherequirementforseparationplaysafundamentalroleindefiningthetypicalarchitectureforadatawarehousesystem,asshowninFigure1-3.Althoughitistypicallycalledatwo-layerarchitecturetohighlightaseparationbetweenphysicallyavailablesourcesanddatawarehouses,itactuallyconsistsoffoursubsequentdataflowstages(Lechtenbörger,2001):

1. Sourcelayer Adatawarehousesystemusesheterogeneoussourcesofdata.Thatdataisoriginallystoredtocorporaterelationaldatabasesorlegacy1databases,oritmaycomefrominformationsystemsoutsidethecorporatewalls.

2. Datastaging Thedatastoredtosourcesshouldbeextracted,cleansedtoremoveinconsistenciesandfillgaps,andintegratedtomergeheterogeneoussourcesintoonecommonschema.Theso-calledExtraction,Transformation,andLoadingtools(ETL)canmergeheterogeneousschemata,extract,transform,cleanse,validate,filter,andloadsourcedataintoadatawarehouse(Jarkeetal.,2000).Technologicallyspeaking,thisstagedealswithproblemsthataretypicalfordistributedinformationsystems,suchasinconsistentdatamanagementandincompatibledatastructures(Zhugeetal.,1996).Section1.4dealswithafewpointsthatarerelevanttodatastaging.

3. Datawarehouselayer Informationisstoredtoonelogicallycentralizedsinglerepository:adatawarehouse.Thedatawarehousecanbedirectlyaccessed,butitcanalsobeusedasasourceforcreatingdatamarts,whichpartiallyreplicatedatawarehousecontentsandaredesignedforspecificenterprisedepartments.Meta-datarepositories(section1.6)storeinformationonsources,accessprocedures,datastaging,users,datamartschemata,andsoon.

4. Analysis Inthislayer,integrateddataisefficientlyandflexiblyaccessedtoissuereports,dynamicallyanalyzeinformation,andsimulatehypotheticalbusinessscenarios.Technologicallyspeaking,itshouldfeatureaggregatedatanavigators,complexqueryoptimizers,anduser-friendlyGUIs.Section1.7dealswithdifferenttypesofdecision-makingsupportanalyses.

Thearchitecturaldifferencebetweendatawarehousesanddatamartsneedstobestudiedcloser.ThecomponentmarkedasadatawarehouseinFigure1-3isalsooftencalledtheprimarydatawarehouseorcorporatedatawarehouse.Itactsasacentralizedstoragesystemfor

1Thetermlegacysystemdenotescorporateapplications,typicallyrunningonmainframesorminicomputers,that are currently used for operational tasks but do not meet modern architectural principles and currentstandards.Forthisreason,accessinglegacysystemsandintegratingthemwithmorerecentapplicationsisacomplextask.Allapplicationsthatuseanonrelationaldatabaseareexamplesoflegacysystems.

ch01.indd8 4/21/093:23:30PM

Page 9: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 9

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 9

allthedatabeingsummedup.Datamartscanbeviewedassmall,localdatawarehousesreplicating(andsummingupasmuchaspossible)thepartofaprimarydatawarehouserequiredforaspecificapplicationdomain.

Thedatamartspopulatedfromaprimarydatawarehouseareoftencalleddependent.Althoughdatamartsarenotstrictlynecessary,theyareveryusefulfordatawarehousesystemsinmidsizetolargeenterprisesbecause

• theyareusedasbuildingblockswhileincrementallydevelopingdatawarehouses;

• theymarkouttheinformationrequiredbyaspecificgroupofuserstosolvequeries;

• theycandeliverbetterperformancebecausetheyaresmallerthanprimarydatawarehouses.

Data MartsAdatamartisasubsetoranaggregationofthedatastoredtoaprimarydatawarehouse.Itincludesasetofinformationpiecesrelevanttoaspecificbusinessarea,corporatedepartment,orcategoryofusers.

FIGURE 1-3 Two-layer architecture for a data warehouse system

ch01.indd9 4/21/093:23:30PM

Page 10: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

10 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 10 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Sometimes,mainlyfororganizationandpolicypurposes,youshoulduseadifferentarchitectureinwhichsourcesareusedtodirectlypopulatedatamarts.Thesedatamartsarecalledindependent(seesection1.3.4).Ifthereisnoprimarydatawarehouse,thisstreamlinesthedesignprocess,butitleadstotheriskofinconsistenciesbetweendatamarts.Toavoidtheseproblems,youcancreateaprimarydatawarehouseandstillhaveindependentdatamarts.Incomparisonwiththestandardtwo-layerarchitectureofFigure1-3,therolesofdatamartsanddatawarehousesareactuallyinverted.Inthiscase,thedatawarehouseispopulatedfromitsdatamarts,anditcanbedirectlyqueriedtomakeaccesspatternsaseasyaspossible.

Thefollowinglistsumsupallthebenefitsofatwo-layerarchitecture,inwhichadatawarehouseseparatessourcesfromanalysisapplications(Jarkeetal.,2000;Lechtenbörger,2001):

• Indatawarehousesystems,goodqualityinformationisalwaysavailable,evenwhenaccesstosourcesisdeniedtemporarilyfortechnicalororganizationalreasons.

• Datawarehouseanalysisqueriesdonotaffectthemanagementoftransactions,thereliabilityofwhichisvitalforenterprisestoworkproperlyatanoperationallevel.

• Datawarehousesarelogicallystructuredaccordingtothemultidimensionalmodel,whileoperationalsourcesaregenerallybasedonrelationalorsemi-structuredmodels.

• AmismatchintermsoftimeandgranularityoccursbetweenOLTPsystems,whichmanagecurrentdataatamaximumlevelofdetail,andOLAPsystems,whichmanagehistoricalandsummarizeddata.

• Datawarehousescanusespecificdesignsolutionsaimedatperformanceoptimizationofanalysisandreportapplications.

NOTE Afewauthorsusethesameterminologytodefinedifferentconcepts.Inparticular,thoseauthorsconsideradatawarehouseasarepositoryofintegratedandconsistent,yetoperational,data,whiletheyuseamultidimensionalrepresentationofdataonlyindatamarts.Accordingtoourterminology,this“operationalview”ofdatawarehousesessentiallycorrespondstothereconcileddatalayerinthree-layerarchitectures.

1.3.3 Three-Layer ArchitectureInthisarchitecture,thethirdlayeristhereconcileddatalayeroroperationaldatastore.Thislayermaterializesoperationaldataobtainedafterintegratingandcleansingsourcedata.Asaresult,thosedataareintegrated,consistent,correct,current,anddetailed.Figure1-4showsadatawarehousethatisnotpopulatedfromitssourcesdirectly,butfromreconcileddata.

Themainadvantageofthereconcileddatalayeristhatitcreatesacommonreferencedatamodelforawholeenterprise.Atthesametime,itsharplyseparatestheproblemsofsourcedataextractionandintegrationfromthoseofdatawarehousepopulation.Remarkably,insomecases,thereconciledlayerisalsodirectlyusedtobetteraccomplishsomeoperationaltasks,suchasproducingdailyreportsthatcannotbesatisfactorilypreparedusingthecorporateapplications,orgeneratingdataflowstofeedexternalprocessesperiodicallysoastobenefitfromcleaningandintegration.However,reconcileddataleadstomoreredundancyofoperationalsourcedata.Notethatwemayassumethateventwo-layerarchitecturescanhaveareconciledlayerthatisnotspecificallymaterialized,butonlyvirtual,becauseitisdefinedasaconsistentintegratedviewofoperationalsourcedata.

ch01.indd10 4/21/093:23:31PM

Page 11: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 11

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 11

Finally,let’sconsiderasupplementaryarchitecturalapproach,whichprovidesacomprehensivepicture.Thisapproachcanbedescribedasahybridsolutionbetweenthesingle-layerarchitectureandthetwo/three-layerarchitecture.Thisapproachassumesthatalthoughadatawarehouseisavailable,itisunabletosolveallthequeriesformulated.Thismeansthatusersmaybeinterestedindirectlyaccessingsourcedatafromaggregatedata(drill-through).Toreachthisgoal,somequerieshavetoberewrittenonthebasisofsourcedata(orreconcileddataifitisavailable).ThistypeofarchitectureisimplementedinaprototypebyCuiandWidom,2000,anditneedstobeabletogodynamicallybacktothesourcedatarequiredforqueriestobesolved(lineage).

FIGURE 1-4 Three-layer architecture for a data warehouse system

ch01.indd11 4/21/093:23:31PM

Page 12: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

12 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 12 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

NOTE Gupta,1997a;HullandZhou,1996;andYangetal.,1997discusstheimplicationsofthisapproachfromtheviewpointofperformanceoptimization,andinparticularviewmaterialization.

1.3.4 An Additional Architecture ClassificationThescientificliteratureoftendistinguishesfivetypesofarchitecturefordatawarehousesystems,inwhichthesamebasiclayersmentionedintheprecedingparagraphsarecombinedindifferentways(Rizzi,2008).

Inindependentdatamartsarchitecture,differentdatamartsareseparatelydesignedandbuiltinanonintegratedfashion(Figure1-5).Thisarchitecturecanbeinitiallyadoptedintheabsenceofastrongsponsorshiptowardanenterprise-widewarehousingproject,orwhentheorganizationaldivisionsthatmakeupthecompanyarelooselycoupled.However,ittendstobesoonreplacedbyotherarchitecturesthatbetterachievedataintegrationandcross-reporting.

Thebusarchitecture,recommendedbyRalphKimball,isapparentlysimilartotheprecedingarchitecture,withoneimportantdifference.Abasicsetofconformeddimensions(thatis,analysisdimensionsthatpreservethesamemeaningthroughoutallthefactstheybelongto),derivedbyacarefulanalysisofthemainenterpriseprocesses,isadoptedandsharedasacommondesignguideline.Thisensureslogicalintegrationofdatamartsandanenterprise-wideviewofinformation.

Inthehub-and-spokearchitecture,oneofthemostusedinmediumtolargecontexts,thereismuchattentiontoscalabilityandextensibility,andtoachievinganenterprise-wideview

FIGURE 1-5 Independent data marts architecture

ch01.indd12 4/21/093:23:32PM

Page 13: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 13

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 13

ofinformation.Atomic,normalizeddataisstoredinareconciledlayerthatfeedsasetofdatamartscontainingsummarizeddatainmultidimensionalform(Figure1-6).Usersmainlyaccessthedatamarts,buttheymayoccasionallyquerythereconciledlayer.

Thecentralizedarchitecture,recommendedbyBillInmon,canbeseenasaparticularimplementationofthehub-and-spokearchitecture,wherethereconciledlayerandthedatamartsarecollapsedintoasinglephysicalrepository.

Thefederatedarchitectureissometimesadoptedindynamiccontextswherepreexistingdatawarehouses/datamartsaretobenoninvasivelyintegratedtoprovideasingle,cross-organizationdecisionsupportenvironment(forinstance,inthecaseofmergersandacquisitions).Eachdatawarehouse/datamartiseithervirtuallyorphysicallyintegratedwiththeothers,leaningonavarietyofadvancedtechniquessuchasdistributedquerying,ontologies,andmeta-datainteroperability(Figure1-7).

FIGURE 1-6 Hub-and-spoke architecture

ch01.indd13 4/21/093:23:32PM

Page 14: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

14 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 14 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Thefollowinglistincludesthefactorsthatareparticularlyinfluentialwhenitcomestochoosingoneofthesearchitectures:

• Theamountofinterdependentinformationexchangedbetweenorganizationalunitsinanenterpriseandtheorganizationalroleplayedbythedatawarehouseprojectsponsormayleadtotheimplementationofenterprise-widearchitectures,suchasbusarchitectures,ordepartment-specificarchitectures,suchasindependentdatamarts.

• Anurgentneedforadatawarehouseproject,restrictionsoneconomicandhumanresources,aswellaspoorITstaffskillsmaysuggestthatatypeof“quick”architecture,suchasindependentdatamarts,shouldbeimplemented.

• Theminorroleplayedbyadatawarehouseprojectinenterprisestrategiescanmakeyoupreferanarchitecturetypebasedonindependentdatamartsoverahub-and-spokearchitecturetype.

• Thefrequentneedforintegratingpreexistingdatawarehouses,possiblydeployedonheterogeneousplatforms,andthepressingdemandforuniformlyaccessingtheirdatacanrequireafederatedarchitecturetype.

FIGURE 1-7 Federated architecture

ch01.indd14 4/21/093:23:32PM

Page 15: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 15

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 15

1.4 Data Staging and ETLNowlet’scloselystudysomebasicfeaturesofthedifferentarchitecturelayers.Wewillstartwiththedatastaginglayer.

ThedatastaginglayerhoststheETLprocessesthatextract,integrate,andcleandatafromoperationalsourcestofeedthedatawarehouselayer.Inathree-layerarchitecture,ETLprocessesactuallyfeedthereconcileddatalayer—asingle,detailed,comprehensive,top-qualitydatasource—thatinitsturnfeedsthedatawarehouse.Forthisreason,theETLprocessoperationsasawholeareoftendefinedasreconciliation.Thesearealsothemostcomplexandtechnicallychallengingamongallthedatawarehouseprocessphases.

ETLtakesplaceoncewhenadatawarehouseispopulatedforthefirsttime,thenitoccurseverytimethedatawarehouseisregularlyupdated.Figure1-8showsthatETLconsistsoffourseparatephases:extraction(orcapture),cleansing(orcleaningorscrubbing),transformation,andloading.Inthefollowingsections,weofferbriefdescriptionsofthesephases.

NOTE RefertoJarkeetal.,2000;Hofferetal.,2005;KimballandCaserta,2004;andEnglish,1999formoredetailsonETL.

Thescientificliteratureshowsthattheboundariesbetweencleansingandtransformingareoftenblurredfromtheterminologicalviewpoint.Forthisreason,aspecificoperationisnotalwaysclearlyassignedtooneofthesephases.Thisisobviouslyaformalproblem,butnotasubstantialone.WewilladopttheapproachusedbyHofferandothers(2005)tomakeourexplanationsasclearaspossible.Theirapproachstatesthatcleansingisessentiallyaimedatrectifyingdatavalues,andtransformationmorespecificallymanagesdataformats.

Chapter10discussesallthedetailsofthedata-stagingdesignphase.Chapter3dealswithanearlydatawarehousedesignphase:integration.Thisphaseisnecessaryifthereareheterogeneoussourcestodefineaschemaforthereconcileddatalayer,andtospecificallytransformoperationaldatainthedata-stagingphase.

1.4.1 ExtractionRelevantdataisobtainedfromsourcesintheextractionphase.Youcanusestaticextractionwhenadatawarehouseneedspopulatingforthefirsttime.Conceptuallyspeaking,thislookslikeasnapshotofoperationaldata.Incrementalextraction,usedtoupdatedatawarehousesregularly,seizesthechangesappliedtosourcedatasincethelatestextraction.IncrementalextractionisoftenbasedonthelogmaintainedbytheoperationalDBMS.Ifatimestampisassociatedwithoperationaldatatorecordexactlywhenthedataischangedoradded,itcanbeusedtostreamlinetheextractionprocess.Extractioncanalsobesource-drivenifyoucanrewriteoperationalapplicationstoasynchronouslynotifyofthechangesbeingapplied,orifyouroperationaldatabasecanimplementtriggersassociatedwithchangetransactionsforrelevantdata.

Thedatatobeextractedismainlyselectedonthebasisofitsquality(English,1999).Inparticular,thisdependsonhowcomprehensiveandaccuratetheconstraintsimplementedinsourcesare,howsuitablethedataformatsare,andhowcleartheschemataare.

ch01.indd15 4/21/093:23:33PM

Page 16: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

16 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 16 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

1.4.2 CleansingThecleansingphaseiscrucialinadatawarehousesystembecauseitissupposedtoimprovedataquality—normallyquitepoorinsources(Galhardasetal.,2001).Thefollowinglistincludesthemostfrequentmistakesandinconsistenciesthatmakedata“dirty”:

• Duplicatedata Forexample,apatientisrecordedmanytimesinahospitalpatientmanagementsystem

FIGURE 1-8 Extraction, transformation, and loading

ch01.indd16 4/21/093:23:33PM

Page 17: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 17

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 17

• Inconsistentvaluesthatarelogicallyassociated SuchasaddressesandZIPcodes

• Missingdata Suchasacustomer’sjob

• Unexpecteduseoffields Forexample,asocialSecurityNumberfieldcouldbeusedimproperlytostoreofficephonenumbers

• Impossibleorwrongvalues Suchas2/30/2009

• Inconsistentvaluesforasingleentitybecausedifferentpracticeswereused Forexample,tospecifyacountry,youcanuseaninternationalcountryabbreviation(I)orafullcountryname(Italy);similarproblemsarisewithaddresses(HamletRd.andHamletRoad)

• Inconsistentvaluesforoneindividualentitybecauseoftypingmistakes SuchasHametRoadinsteadofHamletRoad

Inparticular,notethatthelasttwotypesofmistakesareveryfrequentwhenyouaremanagingmultiplesourcesandareenteringdatamanually.

ThemaindatacleansingfeaturesfoundinETLtoolsarerectificationandhomogenization.Theyusespecificdictionariestorectifytypingmistakesandtorecognizesynonyms,aswellasrule-basedcleansingtoenforcedomain-specificrulesanddefineappropriateassociationsbetweenvalues.Seesection10.2formoredetailsonthesepoints.

1.4.3 TransformationTransformationisthecoreofthereconciliationphase.Itconvertsdatafromitsoperationalsourceformatintoaspecificdatawarehouseformat.Ifyouimplementathree-layerarchitecture,thisphaseoutputsyourreconcileddatalayer.Independentlyofthepresenceofareconcileddatalayer,establishingamappingbetweenthesourcedatalayerandthedatawarehouselayerisgenerallymadedifficultbythepresenceofmanydifferent,heterogeneoussources.Ifthisisthecase,acomplexintegrationphaseisrequiredwhendesigningyourdatawarehouse.SeeChapter3formoredetails.

Thefollowingpointsmustberectifiedinthisphase:

• Loosetextsmayhidevaluableinformation.Forexample,BigDealLtDdoesnotexplicitlyshowthatthisisaLimitedPartnershipcompany.

• Differentformatscanbeusedforindividualdata.Forexample,adatecanbesavedasastringorasthreeintegers.

Followingarethemaintransformationprocessesaimedatpopulatingthereconcileddatalayer:

• Conversionandnormalizationthatoperateonbothstorageformatsandunitsofmeasuretomakedatauniform

• Matchingthatassociatesequivalentfieldsindifferentsources

• Selectionthatreducesthenumberofsourcefieldsandrecords

Whenpopulatingadatawarehouse,normalizationisreplacedbydenormalizationbecausedatawarehousedataaretypicallydenormalized,andyouneedaggregationtosumupdataproperly.

ch01.indd17 4/21/093:23:33PM

Page 18: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

18 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 18 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

CleansingandtransformationprocessesareoftencloselyconnectedinETLtools.Figure1-9showsanexampleofcleansingandtransformationofcustomerdata:afield-basedstructureisextractedfromaloosetext,thenafewvaluesarestandardizedsoastoremoveabbreviations,andeventuallythosevaluesthatarelogicallyassociatedcanberectified.

1.4.4 LoadingLoadingintoadatawarehouseisthelaststeptotake.Loadingcanbecarriedoutintwoways:

• Refresh Datawarehousedataiscompletelyrewritten.Thismeansthatolderdataisreplaced.Refreshisnormallyusedincombinationwithstaticextractiontoinitiallypopulateadatawarehouse.

• Update Onlythosechangesappliedtosourcedataareaddedtothedatawarehouse.Updateistypicallycarriedoutwithoutdeletingormodifyingpreexistingdata.Thistechniqueisusedincombinationwithincrementalextractiontoupdatedatawarehousesregularly.

1.5 Multidimensional ModelThedatawarehouselayerisavitallyimportantpartofthisbook.Here,weintroduceadatawarehousekeyword:multidimensional.Youneedtobecomefamiliarwiththeconceptsandterminologyusedheretounderstandtheinformationpresentedthroughoutthisbook,particularlyinformationregardingconceptualandlogicalmodelinganddesigning.

FIGURE 1-9 Example of cleansing and transforming customer data

ch01.indd18 4/21/093:23:34PM

Page 19: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 19

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 19

Overthelastfewyears,multidimensionaldatabaseshavegeneratedmuchresearchandmarketinterestbecausetheyarefundamentalformanydecision-makingsupportapplications,suchasdatawarehousesystems.ThereasonwhythemultidimensionalmodelisusedasaparadigmofdatawarehousedatarepresentationisfundamentallyconnectedtoitseaseofuseandintuitivenessevenforITnewbies.Themultidimensionalmodel’ssuccessisalsolinkedtothewidespreaduseofproductivitytools,suchasspreadsheets,thatadoptthemultidimensionalmodelasavisualizationparadigm.

Perhapsthebeststartingpointtoapproachthemultidimensionalmodeleffectivelyisadefinitionofthetypesofqueriesforwhichthismodelisbestsuited.Section1.7offersmoredetailsontypicaldecision-makingqueriessuchasthoselistedhere(Jarkeetal.,2000):

• “Whatisthetotalamountofreceiptsrecordedlastyearperstateandperproductcategory?”

• “WhatistherelationshipbetweenthetrendofPCmanufacturers’sharesandquartergainsoverthelastfiveyears?”

• “Whichordersmaximizereceipts?”

• “Whichoneoftwonewtreatmentswillresultinadecreaseintheaverageperiodofadmission?”

• “Whatistherelationshipbetweenprofitgainedbytheshipmentsconsistingoflessthan10itemsandtheprofitgainedbytheshipmentsofmorethan10items?”

Itisclearthatusingtraditionallanguages,suchasSQL,toexpressthesetypesofqueriescanbeaverydifficulttaskforinexperiencedusers.Itisalsoclearthatrunningthesetypesofqueriesagainstoperationaldatabaseswouldresultinanunacceptablylongresponsetime.

Themultidimensionalmodelbeginswiththeobservationthatthefactorsaffectingdecision-makingprocessesareenterprise-specificfacts,suchassales,shipments,hospitaladmissions,surgeries,andsoon.Instancesofafactcorrespondtoeventsthatoccurred.Forexample,everysinglesaleorshipmentcarriedoutisanevent.Eachfactisdescribedbythevaluesofasetofrelevantmeasuresthatprovideaquantitativedescriptionofevents.Forexample,salesreceipts,amountsshipped,hospitaladmissioncosts,andsurgerytimearemeasures.

Obviously,ahugenumberofeventsoccurintypicalenterprises—toomanytoanalyzeonebyone.Imagineplacingthemallintoann-dimensionalspacetohelpusquicklyselectandsortthemout.Then-dimensionalspaceaxesarecalledanalysisdimensions,andtheydefinedifferentperspectivestosingleoutevents.Forexample,thesalesinastorechaincanberepresentedinathree-dimensionalspacewhosedimensionsareproducts,stores,anddates.Asfarasshipmentsareconcerned,products,shipmentdates,orders,destinations,andterms&conditionscanbeusedasdimensions.Hospitaladmissionscanbedefinedbythedepartment-date-patientcombination,andyouwouldneedtoaddthetypeofoperationtoclassifysurgeryoperations.

Theconceptofdimensiongavelifetothebroadlyusedmetaphorofcubestorepresentmultidimensionaldata.Accordingtothismetaphor,eventsareassociatedwithcubecellsandcubeedgesstandforanalysisdimensions.Ifmorethanthreedimensionsexist,thecubeiscalledahypercube.Eachcubecellisgivenavalueforeachmeasure.Figure1-10showsanintuitiverepresentationofacubeinwhichthefactisasaleinastorechain.Itsanalysisdimensionsarestore,productanddate.Aneventstandsforaspecificitemsoldinaspecificstoreonaspecificdate,anditisdescribedbytwomeasures:thequantitysoldandthereceipts.Thisfigurehighlightsthatthecubeissparse—thismeansthatmanyeventsdidnotactuallytakeplace.Ofcourse,youcannotselleveryitemeverydayineverystore.

ch01.indd19 4/21/093:23:34PM

Page 20: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

20 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 20 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Ifyouwanttousetherelationalmodeltorepresentthiscube,youcouldusethefollowingrelationalschema:

SALES(store,product,date,quantity,receipts)

Here,theunderlinedattributesmakeuptheprimarykeyandeventsareassociatedwithtuples,suchas<'EverMore','Shiny','04/05/08',10,25>.Theconstraintexpressedbythisprimarykeyspecifiesthattwoeventscannotbeassociatedwithanindividualstore,product,anddatevaluecombination,andthateveryvaluecombinationfunctionallydeterminesauniquevalueforquantityandauniquevalueforreceipts.Thismeansthatthefollowingfunctionaldependency2holds:

store,product,date→quantity,receipts

Toavoidanymisunderstandingofthetermevent,youshouldrealizethatthegroupofdimensionsselectedforafactrepresentationsinglesoutauniqueeventinthemultidimensionalmodel,butthegroupdoesnotnecessarilysingleoutauniqueeventintheapplicationdomain.Tomakethisstatementclearer,consideronceagainthesalesexample.Intheapplicationdomain,onesinglesaleseventissupposedtobeacustomer’spurchaseofasetofproductsfromastoreonaspecificdate.Inpractice,thiscorrespondstoasalesreceipt.Fromtheviewpointofthemultidimensionalmodel,ifthesalesfacthastheproduct,store,anddatedimensions,aneventwillbethedailytotalamountofanitemsoldinastore.Itisclearthatthedifferencebetweenbothinterpretationsdependsonsales

FIGURE 1-10 The three-dimensional cube modeling sales in a store chain: 10 packs of Shiny were sold on 4/5/2008 in the EverMore store, totaling $25.

2The definition of functional dependency belongs to relational theory. Given relation schema R and twoattributesetsX={a1...,an}andY={b1...,bm},XissaidtofunctionallydetermineY(X→Y)ifandonlyif,foreverylegalinstancerofRandforeachpairoftuplest1,t2inr,t1[X]= t2[X]impliest1[Y]= t2[Y].Heret[X/Y]denotesthevaluestakenintfromtheattributesinX/Y.Byextension,wesaythatafunctionaldependencyholdsbetweentwoattributesetsXandYwheneachvaluesetofXalwayscorrespondstoasinglevaluesetofY.Tosimplifythenotation,whenwedenotetheattributesineachset,wedropthebraces.

ch01.indd20 4/21/093:23:35PM

Page 21: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 21

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 21

receiptsthatgenerallyincludevariousitems,andonindividualitemsthataregenerallysoldmanytimeseverydayinastore.Inthefollowingsections,weusethetermseventandfacttomakereferencetothegranularitytakenbyeventsandfactsinthemultidimensionalmodel.

Normally,eachdimensionisassociatedwithahierarchyofaggregationlevels,oftencalledroll-uphierarchy.Roll-uphierarchiesgroupaggregationlevelvaluesindifferentways.Hierarchiesconsistoflevelscalleddimensionalattributes.Figure1-11showsasimpleexampleofhierarchiesbuiltontheproductandstoredimensions:productsareclassifiedintotypes,andarethenfurtherclassifiedintocategories.Storesarelocatedincitiesbelongingtostates.Ontopofeachhierarchyisafakelevelthatincludesallthedimension-relatedvalues.Fromtheviewpointofrelationaltheory,youcanuseasetoffunctionaldependenciesbetweendimensionalattributestoexpressahierarchy:

product→type→categorystore→city→state

Insummary,amultidimensionalcubehingesonafactrelevanttodecision-making.Itshowsasetofeventsforwhichnumericmeasuresprovideaquantitativedescription.Eachcubeaxisshowsapossibleanalysisdimension.Eachdimensioncanbeanalyzedatdifferentdetaillevelsspecifiedbyhierarchicallystructuredattributes.

Thescientificliteratureshowsmanyformalexpressionsofthemultidimensionalmodel,whichcanbemoreorlesscomplexandcomprehensive.We’llbrieflymentionalternativetermsusedforthemultidimensionalmodelinthescientificliteratureandincommercialtools.

FIGURE 1-11 Aggregation hierarchies built on the product and store dimensions

ch01.indd21 4/21/093:23:35PM

Page 22: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

22 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 22 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Thefactandcubetermsareofteninterchangeablyused.Essentially,everyoneagreesontheuseofthetermdimensionstospecifythecoordinatesthatclassifyandidentifyfactoccurrences.However,entirehierarchiesaresometimescalleddimensions.Forexample,thetermtimedimensioncanbeusedfortheentirehierarchybuiltonthedateattribute.Measuresaresometimescalledvariables,metrics,properties,attributes,orindicators.Insomemodels,dimensionalattributesofhierarchiesarecalledlevelsorparameters.

NOTE ThemainformalexpressionsofthemultidimensionalmodelintheliteraturewereproposedbyAgrawaletal.,1995;GyssensandLakshmanan,1997;DattaandThomas,1997;Vassiliadis,1998;andCabibboandTorlone,1998.

Theinformationinamultidimensionalcubeisverydifficultforuserstomanagebecauseofitsquantity,evenifitisaconciseversionoftheinformationstoredtooperationaldatabases.If,forexample,astorechainincludes50storesselling1000items,andaspecificdatawarehousecoversthree-year-longtransactions(approximately1000days),thenumberofpotentialeventstotals50×1000×1000=5×107.Assumingthateachstorecansellonly10percentofalltheavailableitemsperday,thenumberofeventstotals5×106.Thisisstilltoomuchdatatobeanalyzedbyuserswithoutrelyingonautomatictools.

Youhaveessentiallytwowaystoreducethequantityofdataandobtainusefulinformation:restrictionandaggregation.Thecubemetaphoroffersaneasy-to-useandintuitivewaytounderstandbothofthesemethods,aswewilldiscussinthefollowingparagraphs.

1.5.1 RestrictionRestrictingdatameansseparatingpartofthedatafromacubetomarkoutananalysisfield.Inrelationalalgebraterminology,thisiscalledmakingselectionsand/orprojections.

Thesimplesttypeofselectionisdataslicing,showninFigure1-12.Whenyouslicedata,youdecreasecubedimensionalitybysettingoneormoredimensionstoaspecificvalue.Forexample,ifyousetoneofthesalescubedimensionstoavalue,suchasstore='EverMore',thisresultsinthesetofeventsassociatedwiththeitemssoldintheEverMorestore.Accordingtothecubemetaphor,thisissimplyaplaneofcells—thatis,adataslicethatcanbeeasilydisplayedinspreadsheets.Inthestorechainexamplegivenearlier,approximately105eventsstillappearinyourresult.Ifyousettwodimensionstoavalue,suchasstore='EverMore'anddate='4/5/2008',thiswillresultinallthedifferentitemssoldintheEverMorestoreonApril5(approximately100events).Graphicallyspeaking,thisinformationisstoredattheintersectionoftwoperpendicularplanesresultinginaline.Ifyousetallthedimensionstoaparticularvalue,youwilldefinejustoneeventthatcorrespondstoapointinthethree-dimensionalspaceofsales.

Dicingisageneralizationofslicing.Itposessomeconstraintsondimensionalattributestoscaledownthesizeofacube.Forexample,youcanselectonlythedailysalesofthefooditemsinApril2008inFlorida(Figure1-12).Inthisway,iffivestoresarelocatedinFloridaand50foodproductsaresold,thenumberofeventstoexaminechangesto5×50×30=7500.

Finally,aprojectioncanbereferredtoasachoicetokeepjustonesubgroupofmeasuresforeveryeventandrejectothermeasures.

ch01.indd22 4/21/093:23:35PM

Page 23: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 23

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 23

1.5.2 AggregationAggregationplaysafundamentalroleinmultidimensionaldatabases.Assume,forexample,thatyouwanttoanalyzetheitemssoldmonthlyforathreeyearperiod.Accordingtothecubemetaphor,thismeansthatyouneedtosortallthecellsrelatedtothedaysofeachmonthbyproductandstore,andthenmergethemintoonesinglemacrocell.Intheaggregatecubeobtainedinthisway,thetotalnumberofevents(thatis,thenumberofmacrocells)is50×1000×36.Thisisbecausethegranularityofthetimedimensionsdoesnotdependondaysanylonger,butnowdependsonmonths,and36isthenumberofmonthsinthreeyears.Everyaggregateeventwillthensumupthedataavailableintheeventsitaggregates.Inthisexample,thetotalamountofitemssoldpermonthandthetotalreceiptsarecalculatedbysummingeverysinglevalueoftheirmeasures(Figure1-13).Ifyoufurtheraggregatealongtime,youcanachievejustthreeeventsforeverystore-productcombination:oneforeveryyear.Whenyoucompletelyaggregatealongthetimedimension,eachstore-productcombinationcorrespondstoonesingleevent,whichshowsthetotalamountofitemssoldinastoreoverthreeyearsandthetotalamountofreceipts.

FIGURE 1-12 Slicing and dicing a three-dimensional cube

ch01.indd23 4/21/093:23:36PM

Page 24: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

24 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 24 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

FIGURE 1-13 Time hierarchy aggregation of the quantity of items sold per product in three stores. A dash shows that an event did not occur because no item was sold.

EverMore EvenMore SmartMart

1/1/2007 – – –

1/2/2007 10 15 5

1/3/2007 20 – 5

.......... .......... .......... ..........

1/1/2008 – – –

1/2/2008 15 10 20

1/3/2008 20 20 25

.......... .......... .......... ..........

1/1/2009 – – –

1/2/2009 20 8 25

1/3/2009 20 12 20

.......... .......... .......... ..........

EverMore EvenMore SmartMart

January 2007 200 180 150

February 2007 180 150 120

March 2007 220 180 160

.......... .......... .......... ..........

January 2008 350 220 200

February 2008 300 200 250

March 2008 310 180 300

.......... .......... .......... ..........

January 2009 380 200 220

February 2009 310 200 250

March 2009 300 160 280

.......... .......... .......... ..........

EverMore EvenMore SmartMart

2,400 2,000 1,600

3,200 2,300 3,000

2007

2008

2009 3,400 2,200 3,200

EverMore EvenMore SmartMart

Total 9,000 6,500 7,800

ch01.indd24 4/21/093:23:37PM

Page 25: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 25

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 25

Youcanaggregatealongvariousdimensionsatthesametime.Forexample,Figure1-14showsthatyoucangroupsalesbymonth,producttype,andstorecity,andbymonthandproducttype.Moreover,selectionsandaggregationscanbecombinedtocarryoutananalysisprocesstargetedexactlytousers’needs.

1.6 Meta-dataThetermmeta-datacanbeappliedtothedatausedtodefineotherdata.Inthescopeofdatawarehousing,meta-dataplaysanessentialrolebecauseitspecifiessource,values,usage,andfeaturesofdatawarehousedataanddefineshowdatacanbechangedandprocessedateveryarchitecturelayer.Figures1-3and1-4showthatthemeta-datarepositoryiscloselyconnectedtothedatawarehouse.Applicationsuseitintensivelytocarryoutdata-stagingandanalysistasks.

AccordingtoKelly’sapproach,youcanclassifymeta-dataintotwopartiallyoverlappingcategories.Thisclassificationisbasedonthewayssystemadministratorsandendusersexploitmeta-data.Systemadministratorsareinterestedininternalmeta-databecauseitdefinesdatasources,transformationprocesses,populationpolicies,logicalandphysicalschemata,constraints,anduserprofiles.Externalmeta-dataisrelevanttoendusers.Forexample,itisaboutdefinitions,qualitystandards,unitsofmeasure,relevantaggregations.

FIGURE 1-14 Two cube aggregation levels. Every macro-event measure value is a sum of its component event values.

ch01.indd25 4/21/093:23:37PM

Page 26: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

26 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 26 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Meta-dataisstoredinameta-datarepositorywhichalltheotherarchitecturecomponentscanaccess.AccordingtoKelly,atoolformeta-datamanagementshould

• allowadministratorstoperformsystemadministrationoperations,andinparticularmanagesecurity;

• allowenduserstonavigateandquerymeta-data;

• useaGUI;

• allowenduserstoextendmeta-data;

• allowmeta-datatobeimported/exportedinto/fromotherstandardtoolsandformats.

Asfarasrepresentationformatsareconcerned,ObjectManagementGroup(OMG,2000)releasedastandardcalledCommonWarehouseMetamodel(CWM)thatreliesonthreefamousstandards:UnifiedModelingLanguage(UML),eXtensibleMarkupLanguage(XML),andXMLMetadataInterchange(XMI).Partners,suchasIBM,Unisys,NCR,andOracle,inacommoneffort,createdthenewstandardformatthatspecifieshowmeta-datacanbeexchangedamongthetechnologiesrelatedtodatawarehouses,businessintelligence,knowledgemanagement,andwebportals.

Figure1-15showsanexampleofadialogboxdisplayingexternalmeta-datarelatedtohierarchiesinMicroStrategyDesktopoftheMicroStrategy8toolsuite.Inparticular,

FIGURE 1-15 Accessing hierarchy meta-data in MicroStrategy

ch01.indd26 4/21/093:23:38PM

Page 27: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 27

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 27

thisdialogboxdisplaystheCallingCenterattributeparentattributes.Specifically,itstatesthatacallingcenterreferstoadistributioncenter,belongstoaregion,andismanagedbyamanager.

NOTE SeeBarquinandEdelstein,1996;Jarkeetal.,2000;Jennings,2004;andTozer,1999,foracomprehensivediscussiononmeta-datarepresentationandmanagement.

1.7 Accessing Data WarehousesAnalysisisthelastlevelcommontoalldatawarehousearchitecturetypes.Aftercleansing,integrating,andtransformingdata,youshoulddeterminehowtogetthebestoutofitintermsofinformation.Thefollowingsectionsshowthebestapproachesforenduserstoquerydatawarehouses:reports,OLAP,anddashboards.Endusersoftenusetheinformationstoredtoadatawarehouseasastartingpointforadditionalbusinessintelligenceapplications,suchaswhat-ifanalysesanddatamining.SeeChapter15formoredetailsontheseadvancedapplications.

1.7.1 ReportsThisapproachisorientedtothoseuserswhoneedtohaveregularaccesstotheinformationinanalmoststaticway.Forexample,supposealocalhealthauthoritymustsendtoitsstateofficesmonthlyreportssummingupinformationonpatientadmissioncosts.Thelayoutofthosereportshasbeenpredeterminedandmayvaryonlyifchangesareappliedtocurrentlawsandregulations.Designersissuethequeriestocreatereportswiththedesiredlayoutand“freeze”allthoseinanapplication.Inthisway,enduserscanquerycurrentdatawhenevertheyneedto.

Areportisdefinedbyaqueryandalayout.Aquerygenerallyimpliesarestrictionandanaggregationofmultidimensionaldata.Forexample,youcanlookforthemonthlyreceiptsduringthelastquarterforeveryproductcategory.Alayoutcanlooklikeatableorachart(diagrams,histograms,pies,andsoon).Figure1-16showsafewexamplesoflayoutsforthereceiptsquery.

Areportingtoolshouldbeevaluatednotonlyonthebasisofcomprehensivereportlayouts,butalsoonthebasisofflexiblereportdeliverysystems.Areportcanbeexplicitlyrunbyusersorautomaticallyandregularlysenttoregisteredendusers.Forexample,itcanbesentviae-mail.

Keepinmindthatreportsexistedlongbeforedatawarehousesystemscametobe.Reportshavealwaysbeenthemaintoolusedbymanagersforevaluatingandplanningtaskssincetheinventionofdatabases.However,addingdatawarehousestothemixisbeneficialtoreportsfortwomainreasons:First,theytakeadvantageofreliableandcorrect

ch01.indd27 4/21/093:23:38PM

Page 28: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

28 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 28 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

FIGURE 1-16 Report layouts: table (top), line graph (middle), 3-D pie graphs (bottom)

ch01.indd28 4/21/093:23:38PM

Page 29: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 29

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 29

resultsbecausethedatasummedupinreportsisconsistentandintegrated.Inaddition,datawarehousesexpeditethereportingprocessbecausethearchitecturalseparationbetweentransactionprocessingandanalysessignificantlyimprovesperformance.

1.7.2 OLAPOLAPmightbethemainwaytoexploitinformationinadatawarehouse.Surelyitisthemostpopularone,anditgivesendusers,whoseanalysisneedsarenoteasytodefinebeforehand,theopportunitytoanalyzeandexploredatainteractivelyonthebasisofthemultidimensionalmodel.Whileusersofreportingtoolsessentiallyplayapassiverole,OLAPusersareabletostartacomplexanalysissessionactively,whereeachstepistheresultoftheoutcomeofprecedingsteps.Real-timepropertiesofOLAPsessions,requiredin-depthknowledgeofdata,complexqueriesthatcanbeissued,anddesignforusersnotfamiliarwithITmakethetoolsinuseplayacrucialrole.TheGUIofthesetoolsmustbeflexible,easy-to-use,andeffective.

AnOLAPsessionconsistsofanavigationpaththatcorrespondstoananalysisprocessforfactsaccordingtodifferentviewpointsandatdifferentdetaillevels.Thispathisturnedintoasequenceofqueries,whichareoftennotissueddirectly,butdifferentiallyexpressedwithreferencetothepreviousquery.Theresultsofqueriesaremultidimensional.Becausewehumanshaveadifficulttimedecipheringdiagramsofmorethanthreedimensions,OLAPtoolstypicallyusetablestodisplaydata,withmultipleheaders,colors,andotherfeaturestohighlightdatadimensions.

EverystepofananalysissessionischaracterizedbyanOLAPoperatorthatturnsthelatestqueryintoanewone.Themostcommonoperatorsareroll-up,drill-down,slice-and-dice,pivot,drill-across,anddrill-through.Thefiguresincludedhereshowdifferentoperators,andweregeneratedusingtheMicroStrategyDesktopfront-endapplicationintheMicroStrategy8toolsuite.TheyarebasedontheV-Mallexample,inwhichalargevirtualmallsellsitemsfromitscatalogviaphoneandtheInternet.Figure1-17showstheattributehierarchiesrelevanttothesalesfactinV-Mall.

Theroll-upoperatorcausesanincreaseindataaggregationandremovesadetaillevelfromahierarchy.Forexample,Figure1-18showsaqueryposedbyauserthatdisplays

FIGURE 1-17 Attribute hierarchies in V-Mall; arrows show functional dependencies

ch01.indd29 4/21/093:23:38PM

Page 30: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

30 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 30 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

monthlyrevenuesin2005and2006foreverycustomerregion.Ifyou“rollitup,”youremovethemonthdetailtodisplayquarterlytotalrevenuesperregion.Rolling-upcanalsoreducethenumberofdimensionsinyourresultsifyouremoveallthehierarchydetails.IfyouapplythisprincipletoFigure1-19,youcanremoveinformationoncustomersanddisplayyearlytotalrevenuesperproductcategoryasyouturnthethree-dimensionaltable

FIGURE 1-18 Time hierarchy roll-up

ch01.indd30 4/21/093:23:39PM

Page 31: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 31

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 31

FIGURE 1-19 Roll-up removing customer hierarchy

FIGURE 1-20 Rolling-up (left) and drilling-down (right) a cube

intoatwo-dimensionalone.Figure1-20usesthecubemetaphortosketcharoll-upoperationwithandwithoutadecreaseindimensions.

Thedrill-downoperatoristhecomplementtotheroll-upoperator.Figure1-20showsthatitreducesdataaggregationandaddsanewdetailleveltoahierarchy.Figure1-21showsanexamplebasedonabidimensionaltable.Thistableshowsthattheaggregationbasedoncustomerregionsshiftstoanewfine-grainedaggregationbasedoncustomercities.

ch01.indd31 4/21/093:23:39PM

Page 32: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

32 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 32 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

FIGURE 1-21 Drilling-down customer hierarchy

FIGURE 1-22 Drilling-down and adding a dimension

InFigure1-22,thedrill-downoperatorcausesanincreaseinthenumberoftabledimensionsafteraddingcustomerregiondetails.

Slice-and-diceisoneofthemostabusedtermsindatawarehouseliteraturebecauseitcanhavemanydifferentmeanings.AfewauthorsuseitgenerallytodefinethewholeOLAPnavigationprocess.Otherauthorsuseittodefineselectionandprojectionoperationsbasedondata.Incompliancewithsection1.5.1,wedefineslicingasanoperationthatreducesthe

ch01.indd32 4/21/093:23:40PM

Page 33: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 33

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 33

FIGURE 1-23 Slicing (above) and dicing (below) a cube

FIGURE 1-24 Slicing based on the Year='2006' predicate

numberofcubedimensionsaftersettingoneofthedimensionstoaspecificvalue.Dicingisanoperationthatreducesthesetofdatabeinganalyzedbyaselectioncriterion(Figure1-23).Figures1-24and1-25showafewexamplesofslicinganddicing.

Thepivotoperatorimpliesachangeinlayouts.Itaimsatanalyzinganindividualgroupofinformationfromadifferentviewpoint.Accordingtothemultidimensionalmetaphor,ifyoupivotdata,yourotateyourcubesothatyoucanrearrangecellsonthe

ch01.indd33 4/21/093:23:40PM

Page 34: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

34 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 34 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

FIGURE 1-25 Selection based on a complex predicate

FIGURE 1-26 Pivoting a cube

basisofanewperspective.Inpractice,youcanhighlightadifferentcombinationofdimensions(Figure1-26).Figures1-27and1-28showafewexamplesofpivotedtwo-dimensionalandthree-dimensionaltables.

Thetermdrill-acrossstandsfortheopportunitytocreatealinkbetweentwoormoreinterrelatedcubesinordertocomparetheirdata.Forexample,thisappliesifyoucalculate

ch01.indd34 4/21/093:23:41PM

Page 35: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 35

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 35

FIGURE 1-27 Pivoting a two-dimensional table

FIGURE 1-28 Pivoting a three-dimensional table

anexpressioninvolvingmeasuresfromtwocubes(Figure1-29).Figure1-30showsanexampleinwhichasalescubeisdrilled-acrossapromotionscubeinordertocomparerevenuesanddiscountsperquarterandproductcategory.

MostOLAPtoolscanperformdrill-throughoperations,thoughwithvaryingeffectiveness.Thisoperationswitchesfrommultidimensionalaggregatedataindatamartstooperationaldatainsourcesorinthereconciledlayer.

Inmanyapplications,anintermediateapproachbetweenstaticreportingandOLAPisbroadlyused.Thisintermediateapproachiscalledsemi-staticreporting.Evenifasemi-staticreportfocusesonagroupofinformationpreviouslyset,itgivesuserssomemarginoffreedom.Thankstothismargin,userscanfollowalimitedsetofnavigationpaths.Forexample,thisapplieswhenyoucanrollupjusttoafewhierarchyattributes.Thissolutioniscommon,becauseitprovidessomeunquestionableadvantages.First,usersneedlessskilltousedatamodelsandanalysistoolsthantheyneedforOLAP.Second,thisavoidstheriskthatoccursinOLAPofachievinginconsistentanalysisresultsorincorrectonesbecauseofanymisuseofaggregationoperators.Third,ifyouposeconstraintsontheanalysesallowed,youwillpreventusersfromunwillinglyslowingdownyoursystemwhenevertheyformulatedemandingqueries.

ch01.indd35 4/21/093:23:41PM

Page 36: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

36 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 36 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

1.7.3 DashboardsDashboardsareanothermethodusedfordisplayinginformationstoredtoadatawarehouse.ThetermdashboardreferstoaGUIthatdisplaysalimitedamountofrelevantdatainabriefandeasy-to-readformat.Dashboardscanprovideareal-timeoverviewofthetrendsforaspecificphenomenonorformanyphenomenathatarestrictlyconnectedwitheachother.Thetermisavisualmetaphor:thegroupofindicatorsintheGUIaredisplayedlikeacardashboard.Dashboardsareoftenusedbyseniormanagerswhoneedaquickwaytoviewinformation.However,toconductanddisplayverycomplexanalysesofphenomena,dashboardsmustbematchedwithanalysistools.

Today,mostsoftwarevendorsofferdashboardsforreportcreationanddisplay.Figure1-31showsadashboardcreatedwithMicroStrategyDynamicEnterprise.Theliteraturerelatedtodashboardgraphicdesignhasalsoproventobeveryrich,inparticularinthescopeofenterprises(Few,2006).

FIGURE 1-29 Drilling across two cubes

FIGURE 1-30 Drilling across the sales cube (Revenue measure) and the promotions cube (Discount measure)

ch01.indd36 4/21/093:23:42PM

Page 37: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 37

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 37

Keepinmind,however,thatdashboardsarenothingbutperformanceindicatorsbehindGUIs.Theireffectivenessisduetoacarefulselectionoftherelevantmeasures,whileusingdatawarehouseinformationqualitystandards.Forthisreason,dashboardsshouldbeviewedasasophisticatedeffectiveadd-ontodatawarehousesystems,butnotastheprimarygoalofdatawarehousesystems.Infact,theprimarygoalofdatawarehousesystemsshouldalwaysbetoproperlydefineaprocesstotransformdataintoinformation.

1.8 ROLAP, MOLAP, and HOLAPThesethreeacronymsconcealthreemajorapproachestoimplementingdatawarehouses,andtheyarerelatedtothelogicalmodelusedtorepresentdata:

• ROLAPstandsforRelationalOLAP,animplementationbasedonrelationalDBMSs.

• MOLAPstandsforMultidimensionalOLAP,animplementationbasedonmultidimensionalDBMSs.

• HOLAPstandsforHybridOLAP,animplementationusingbothrelationalandmultidimensionaltechniques.

FIGURE 1-31 An example of dashboards

ch01.indd37 4/21/093:23:42PM

Page 38: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

38 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 38 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Theideaofadoptingtherelationaltechnologytostoredatatoadatawarehousehasasolidfoundationifyouconsiderthehugeamountofliteraturewrittenabouttherelationalmodel,thebroadlyavailablecorporateexperiencewithrelationaldatabaseusageandmanagement,andthetopperformanceandflexibilitystandardsofrelationalDBMSs(RDBMSs).Theexpressivepoweroftherelationalmodel,however,doesnotincludetheconceptsofdimension,measure,andhierarchy,soyoumustcreatespecifictypesofschematasothatyoucanrepresentthemultidimensionalmodelintermsofbasicrelationalelementssuchasattributes,relations,andintegrityconstraints.Thistaskismainlyperformedbythewell-knownstarschema.SeeChapter8formoredetailsonstarschemataandstarschemavariants.

ThemainproblemwithROLAPimplementationsresultsfromtheperformancehitcausedbycostlyjoinoperationsbetweenlargetables.Toreducethenumberofjoins,oneofthekeyconceptsofROLAPisdenormalization—aconsciousbreachinthethirdnormalformorientedtoperformancemaximization.Tominimizeexecutioncosts,theotherkeywordisredundancy,whichistheresultofthematerializationofsomederivedtables(views)thatstoreaggregatedatausedfortypicalOLAPqueries.

Fromanarchitecturalviewpoint,adoptingROLAPrequiresspecializedmiddleware,alsocalledamultidimensionalengine,betweenrelationalback-endserversandfront-endcomponents,asshowninFigure1-32.ThemiddlewarereceivesOLAPqueriesformulatedbyusersinafront-endtoolandturnsthemintoSQLinstructionsforarelationalback-endapplicationwiththesupportofmeta-data.Theso-calledaggregatenavigatorisaparticularlyimportantcomponentinthisphase.Incaseofaggregateviews,thiscomponentselectsaviewfromamongallthealternativestosolveaspecificqueryattheminimumaccesscost.

Incommercialproducts,differentfront-endmodules,suchasOLAP,reports,anddashboards,aregenerallystrictlyconnectedtoamultidimensionalengine.Multidimensionalenginesarethemaincomponentsandcanbeconnectedtoanyrelationalserver.Opensourcesolutionshavebeenrecentlyreleased.Theirmultidimensionalengines(Mondrian,2009)aredisconnectedfromfront-endmodules(JPivot,2009).Forthisreason,theycanbemoreflexible

FIGURE 1-32 ROLAP architecture

ch01.indd38 4/21/093:23:43PM

Page 39: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 39

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 39

thancommercialsolutionswhenyouhavetocreatethearchitecture(ThomsenandPedersen,2005).AfewcommercialRDBMSsnativelysupportfeaturestypicalformultidimensionalenginestomaximizequeryoptimizationandincreasemeta-datareusability.Forexample,sinceits8iversionwasmadeavailable,Oracle’sRDBMSgivesuserstheopportunitytodefinehierarchiesandmaterializedviews.Moreover,itoffersanavigatorthatcanusemeta-dataandrewritequerieswithoutanyneedforamultidimensionalenginetobeinvolved.

DifferentfromaROLAPsystem,aMOLAPsystemisbasedonanadhoclogicalmodelthatcanbeusedtorepresentmultidimensionaldataandoperationsdirectly.Theunderlyingmultidimensionaldatabasephysicallystoresdataasarraysandtheaccesstoitispositional(GaedeandGünther,1998).Grid-files(Nievergeltetal.,1984;WhangandKrishnamurthy,1991),R*-trees(Beckmannetal.,1990)andUB-trees(Markletal.,2001)areamongthetechniquesusedforthispurpose.

ThegreatestadvantageofMOLAPsystemsincomparisonwithROLAPisthatmultidimensionaloperationscanbeperformedinaneasy,naturalwaywithMOLAPwithoutanyneedforcomplexjoinoperations.Forthisreason,MOLAPsystemperformanceisexcellent.However,MOLAPsystemimplementationshaveverylittleincommon,becausenomultidimensionallogicalmodelstandardhasyetbeenset.Generally,theysimplysharetheusageofoptimizationtechniquesspecificallydesignedforsparsitymanagement.Thelackofacommonstandardisaproblembeingprogressivelysolved.ThismeansthatMOLAPtoolsarebecomingmoreandmoresuccessfulaftertheirlimitedimplementationformanyyears.Thissuccessisalsoprovenbytheinvestmentsinthistechnologybymajorvendors,suchasMicrosoft(AnalysisServices)andOracle(Hyperion).

Theintermediatearchitecturetype,HOLAP,aimsatmixingtheadvantagesofbothbasicsolutions.IttakesadvantageofthestandardizationlevelandtheabilitytomanagelargeamountsofdatafromROLAPimplementations,andthequeryspeedtypicalofMOLAPsystems.HOLAPimpliesthatthelargestamountofdatashouldbestoredinanRDBMStoavoidtheproblemscausedbysparsity,andthatamultidimensionalsystemstoresonlytheinformationusersmostfrequentlyneedtoaccess.Ifthatinformationisnotenoughtosolvequeries,thesystemwilltransparentlyaccessthepartofthedatamanagedbytherelationalsystem.Overthelastfewyears,importantmarketactorssuchasMicroStrategyhaveadoptedHOLAPsolutionstoimprovetheirplatformperformance,joiningothervendorsalreadyusingthissolution,suchasBusinessObjects.

1.9 Additional IssuesTheissuesthatfollowcanplayafundamentalroleintuningupadatawarehousesystem.Thesepointsinvolveverywide-rangingproblemsandarementionedheretogiveyouthemostcomprehensivepicturepossible.

1.9.1 QualityIngeneral,wecansaythatthequalityofaprocessstandsforthewayaprocessmeetsusers’goals.Indatawarehousesystems,qualityisnotonlyusefulforthelevelofdata,butaboveallforthewholeintegratedsystem,becauseofthegoalsandusageofdatawarehouses.Astrictqualitystandardmustbeensuredfromthefirstphasesofthedatawarehouseproject.

ch01.indd39 4/21/093:23:43PM

Page 40: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

40 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s 40 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

Defining,measuring,andmaximizingthequalityofadatawarehousesystemcanbeverycomplexproblems.Forthisreason,wementiononlyafewpropertiescharacterizingdataqualityhere:

• Accuracy Storedvaluesshouldbecompliantwithreal-worldones.

• Freshness Datashouldnotbeold.

• Completeness Thereshouldbenolackofinformation.

• Consistency Datarepresentationshouldbeuniform.

• Availability Usersshouldhaveeasyaccesstodata.

• Traceability Datacaneasilybetraceddatabacktoitssources.

• Clearness Datacanbeeasilyunderstood.

Technically,checkingfordataqualityrequiresappropriatesetsofmetrics(Abellóetal.,2006).Inthefollowingsections,weprovideanexampleofthemetricsforafewofthequalitypropertiesmentioned:

• Accuracyandcompleteness ReferstothepercentageoftuplesnotloadedbyanETLprocessandcategorizedonthebasisofthetypesofproblemarising.Thispropertyshowsthepercentageofmissing,invalid,andnonstandardvaluesofeveryattribute.

• Freshness Definesthetimeelapsedbetweenthedatewhenaneventtakesplaceandthedatewhenuserscanaccessit.

• Consistency Definesthepercentageoftuplesthatmeetbusinessrulesthatcanbesetformeasuresofanindividualcubeormanycubesandthepercentageoftuplesmeetingstructuralconstraintsimposedbythedatamodel(forexample,uniquenessofprimarykeys,referentialintegrity,andcardinalityconstraintcompliance).

Notethatcorporateorganizationplaysafundamentalroleinreachingdataqualitygoals.Thisrolecanbeeffectivelyplayedonlybycreatinganappropriateandaccuratecertificationsystemthatdefinesalimitedgroupofusersinchargeofdata.Forthisreason,designersmustraiseseniormanagers’awarenessofthistopic.Designersmustalsomotivatemanagementtocreateanaccuratecertificationprocedurespecificallydifferentiatedforeveryenterprisearea.Aboardofcorporatemanagerspromotingdataqualitymaytriggeravirtuouscyclethatismorepowerfulandlesscostlythananydatacleansingsolution.Forexample,youcanachieveawesomeresultsifyouconnectacorporatedepartmentbudgettoaspecificdataqualitythresholdtobereached.

Anadditionaltopicconnectedtothequalityofadatawarehouseprojectisrelatedtodocumentation.Todaymostdocumentationisstillnonstandardized.Itisoftenissuedattheendoftheentiredatawarehouseproject.Designersandimplementersconsiderdocumentationawasteoftime,anddatawarehouseprojectcustomersconsideritanextracostitem.Softwareengineeringteachesthatastandardsystemfordocumentsshouldbeissued,managed,andvalidatedincompliancewithprojectdeadlines.Thissystemcanensurethatdifferentdatawarehouseprojectphasesarecorrectlycarriedoutandthatallanalysisandimplementationpointsareproperlyexaminedandunderstood.Inthemediumandlongterm,correctdocumentsincreasethechancesofreusingdatawarehouseprojectsandensureprojectknow-howmaintenance.

ch01.indd40 4/21/093:23:43PM

Page 41: 0071610391_chap01-libre.pdf

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 41

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

C h a p t e r 1 : I n t r o d u c t i o n t o D a t a W a r e h o u s i n g 41

NOTE Jarkeetal.,2000havecloselystudieddataquality.Theirstudiesprovideusefuldiscussionsontheimpactofdataqualityproblemsfromthemethodologicalpointofview.Kelly,1997describesqualitygoalsstrictlyconnectedtotheviewpointofbusinessorganizations.Serranoetal.,2004,2007;Lechtenbörger,2001;andBouzeghoubandKedad,2000focusonqualitystandardsrespectivelyforconceptual,logical,andphysicaldatawarehouseschemata.

1.9.2 SecurityInformationsecurityisgenerallyafundamentalrequirementforasystem,anditshouldbecarefullyconsideredinsoftwareengineeringateveryprojectdevelopmentstagefromrequirementanalysisthroughimplementationtomaintenance.Securityisparticularlyrelevanttodatawarehouseprojects,becausedatawarehousesareusedtomanageinformationcrucialforstrategicdecision-makingprocesses.Furthermore,multidimensionalpropertiesandaggregationcauseadditionalsecurityproblemssimilartothosethatgenerallyariseinstatisticdatabases,becausetheyimplicitlyoffertheopportunitytoinferinformationfromdata.Finally,thehugeamountofinformationexchangethattakesplaceindatawarehousesinthedata-stagingphasecausesspecificproblemsrelatedtonetworksecurity.

Appropriatemanagementandauditingcontrolsystemsareimportantfordatawarehouses.Managementcontrolsystemscanbeimplementedinfront-endtoolsorcanexploitoperatingsystemservices.Asfarasauditingisconcerned,thetechniquesprovidedbyDBMSserversarenotgenerallyappropriateforthisscope.Forthisreason,youmusttakeadvantageofthesystemsimplementedbyOLAPengines.Fromtheviewpointofusersprofile–baseddataaccess,basicrequirementsarerelatedtohidingwholecubes,specificcubeslices,andspecificcubemeasures.Sometimesyoualsohavetohidecubedatabeyondagivendetaillevel.

NOTE Inthescientificliteraturethereareafewworksspecificallydealingwithsecurityindatawarehousesystems(Kirkgözeetal.,1997;PriebeandPernul,2000;RosenthalandSciore,2000;Katicetal.,1998).Inparticular,PriebeandPernulproposeacomparativestudyonsecuritypropertiesofafewcommercialplatforms.Ferrandez-Medinaetal.,2004andSoleretal.,2008discussanapproachthatcouldbemoreinterestingfordesigners.TheyuseaUMLextensiontomodelspecificsecurityrequirementsfordatawarehousesintheconceptualdesignandrequirementanalysisphases,respectively.

1.9.3 EvolutionManymaturedatawarehouseimplementationsarecurrentlyrunninginmidsizeandlargecompanies.Theunstoppableevolutionofapplicationdomainshighlightsdynamicfeaturesofdatawarehousesconnectedtothewayinformationchangesattwodifferentlevelsastimegoesby:

• Datalevel Evenifmeasureddataisnaturallyloggedindatawarehousesthankstotemporaldimensionsmarkingevents,themultidimensionalmodelimplicitlyassumesthathierarchiesarecompletelystatic.Itisclearthatthisassumptionisnotveryrealistic.Forexample,acompanycanaddnewproductcategoriestoitscatalogandremoveothers,oritcanchangethecategorytowhichanexistingproductbelongsinordertomeetnewmarketingstrategies.

ch01.indd41 4/21/093:23:44PM

Page 42: 0071610391_chap01-libre.pdf

CompRef8 / Data Warehouse Design: Modern Principles and Methodologies / Golfarelli & Rizzi / 039-1

42 D a t a W a r e h o u s e D e s i g n : M o d e r n P r i n c i p l e s a n d M e t h o d o l o g i e s

• Schemalevel Adatawarehouseschemacanvarytomeetnewbusinessdomainstandards,newusers’requirements,orchangesindatasources.Newattributesandmeasurescanbecomenecessary.Forexample,youcanaddasubcategorytoaproducthierarchytomakeanalysesricherindetail.Youshouldalsoconsiderthatthesetoffactdimensionscanvaryastimegoesby.

Temporalproblemsareevenmorechallengingindatawarehousesthaninoperationaldatabases,becausequeriesoftencoverlongerperiodsoftime.Forthisreason,datawarehousequeriesfrequentlydealwithdifferentdataand/orschemaversions.Moreover,thispointisparticularlycriticalfordatawarehousesthatrunforalongtime,becauseeveryevolutionnotcompletelycontrolledcausesagrowinggapbetweentherealworldanditsdatabaserepresentation,eventuallymakingthedatawarehousesobsoleteanduseless.

Asfaraschangesindatavaluesareconcerned,differentapproacheshavebeendocumentedinscientificliterature.Somecommercialsystemsalsomakeitpossibletotrackchangesandquerycubesonthebasisofdifferenttemporalscenarios.Seesection8.4formoredetailsondynamichierarchies.Ontheotherhand,managingchangesindataschematahasbeenexploredonlypartiallytodate.Nocommercialtooliscurrentlyavailableonthemarkettosupportapproachestodataschemachangemanagement.

Theapproachestodatawarehouseschemachangemanagementcanbeclassifiedintwocategories:evolution(Quix,1999;Vaismanetal.,2002;Blaschka,2000)andversioning(Ederetal.,2002;Golfarellietal.,2006a).Bothcategoriesmakeitpossibletoalterdataschemata,butonlyversioningcantrackpreviousschemareleases.Afewapproachestoversioningcancreatenotonly“true”versionsgeneratedbychangesinapplicationdomains,butalsoalternativeversionstouseforwhat-ifanalyses(Bebeletal.,2004).

Themainproblemthathasnotbeensolvedinthisfieldisthecreationoftechniquesforversioninganddatamigrationbetweenversionsthatcanflexiblysupportqueriesrelatedtomoreschemaversions.Furthermore,weneedsystemsthatcansemiautomaticallyadjustETLprocedurestochangesinsourceschemata.Inthisdirection,someOLAPtoolsalreadyusetheirmeta-datatosupportanimpactanalysisaimedatidentifyingthefullconsequencesofanychangesinsourceschemata.

ch01.indd42 4/21/093:23:44PM