Data Warehouses and Big Data
BIG Data Warehouses
A Big Data Perspective on Data Warehouses
Seminar for the course Sistemi di Elaborazione di Grandi Quantità di Dati (Large-Scale Data Processing Systems)

Big Data Warehouse
[Figure: the traditional data warehouse (what the enterprise has) and the Big Data scenario (what the enterprise wants to become), separated by a technological and implementation gap. The hybrid data warehouse architecture is the Big Data Warehouse (BDW).]

Presentation outline
1. Data Warehouse: the traditional business intelligence approach
   - Introduction to data warehousing, DFM conceptual model and ROLAP logical design
2. The arrival of Big Data: the need for scalability in DW architecture
   - Types of data in Big Data, business relevance and Big Data architectural requirements
3. Data Vault: an approach to enhance DWs to deal with Big Data challenges
   - Introduction to the Data Vault 2.0 model and architecture

Time(out)line
The role of IT from passive to active
• 1970, transactional databases. Goal: reliability, make sure no data is lost. Model: RDBMS and relational DB model.
• 1985, Business Intelligence and DW. Goal: analyze data under different perspectives to make decisions. Model: Data Warehouse and ROLAP star schema.
• 2000, Big Data and NoSQL. Goal: data for the masses, everyone has access to everything. Model: Data Vault Model (RDBMS + NoSQL).

Data ≠ Information
An explanatory (business intelligence) example
Problem: sales of lollipops have gone down in the last 6 months.
Data: sales records, customer data, social network data, market analysis. Data records are grouped by time, region, customer age.
Information: lollipops are bought by females older than 25 to be eaten by people younger than 10.
Knowledge: mothers believe that lollipops are bad for children's teeth.
Value: hire a dentist to advertise lollipops.

Data Warehouse
Definition
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
• Subject-oriented: analysis of subject areas.
• Integrated: data comes from multiple sources.
• Time-variant: historical data are collected.
• Non-volatile: no data modification/removal.
“A data warehouse is a copy of transaction data specifically structured for query and analysis” - Ralph Kimball, major DW technology contributor

Data Warehouse
OLTP vs OLAP systems
On-Line Transaction Processing (OLTP): data processing system facilitating the management of transaction-oriented software. Large number of short on-line transactions (INSERT, UPDATE, DELETE). Data is detailed, not historical and highly normalized; joins do not perform well.
On-Line Analytical Processing (OLAP): data processing system enabling the analysis of multidimensional data, interactively and from multiple perspectives. DATA WAREHOUSES are designed to support OLAP operations. Queries are often very complex and involve aggregations. Data is historical, denormalized and redundant; joins are easier and faster.

Data Warehouse
Architectural and use requirements
DW use goals:
• Correctness and completeness of integrated data. Single version of the truth.
• Accessibility to users with limited knowledge of computing.
• Data is summarized/aggregated for flexible querying and an intuitive view.
Requirements for a DW architecture:
• Scalability: hardware and software architecture must be easily scaled.
• Extensibility: it must be possible to add new applications.
• Security: access control is required because of the nature of the data stored. The DW holds strategic data.

Data Warehouse
3-Tier Architecture
1. Sources: operational data sources, flat files.
2. Reconciled data level: Extraction-Transformation-Loading (ETL) tools are used to obtain a reconciled data level before feeding the DW.
3. Data warehouse level: central data warehouse, data marts and meta-data repository.
Presentation (front-end): analysis and visualization OLAP tools, data-mining tools, reporting tools, what-if analysis tools.
[Figure: operational data sources; staging level (ETL, reconciled data); data warehouse level (data marts, meta-data); presentation tools.]
Data marts are a subset or an aggregation of the data stored in the primary DW, targeted towards a particular functional area or user group.
Meta-data is “data about data”. Business meta-data describes semantics, business rules and constraints. Technical meta-data describes how data is stored and how it should be manipulated.

Data Warehouse Architecture, another view
[Figure: architecture diagram; the only surviving label is "Structured data sources".]

Data Warehouse
Extraction-Transformation-Loading (ETL) tools
ETL tools feed a single data repository, detailed, comprehensive and of high quality, which may in turn feed the DW (reconciliation process: reconciled data level).
• Offline, carried out when the DW is not in use (at night?).
• Batch processing.
• A subset of data, identified by business goals, is obtained: GIVE A SINGLE VERSION OF THE TRUTH.
Extraction: data are gathered from sources.
• INTERNAL transactional systems, flat files.
• EXTERNAL sources. ODBC, JDBC.
Transformation: data is put into the warehouse format. Business rules are used to define both the presentation/visualization of data and its persistence characteristics.
• Cleaning: removes errors and inconsistencies, and converts data into a standardized format.
• Integration: data is reconciled, both at schema and data level.
• Aggregation: data is summarized according to the DW level of detail.
Loading: the DW is fed with cleaned and transformed data, offline. Initial load (first DW population) or refreshment.

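The three ETL phases can be sketched in a few lines of Python. This is only an illustrative sketch: the sample records, the field names (`product`, `country`, `qty`) and the in-memory dictionary standing in for the warehouse are assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical raw records, as if gathered from an internal and an external source.
internal_sales = [
    {"product": " Lollipop ", "country": "italy", "qty": "3"},
    {"product": "lollipop", "country": "Italy", "qty": "2"},
]
external_sales = [
    {"product": "Candy", "country": "ITALY", "qty": "x"},  # erroneous quantity
    {"product": "candy", "country": "France", "qty": "4"},
]

def extract():
    """Extraction: gather records from all sources."""
    return internal_sales + external_sales

def clean(record):
    """Cleaning: standardize formats; return None for erroneous records."""
    try:
        qty = int(record["qty"])
    except ValueError:
        return None
    return {
        "product": record["product"].strip().lower(),
        "country": record["country"].strip().title(),
        "qty": qty,
    }

def transform(records):
    """Transformation: clean, then aggregate to the DW level of detail."""
    totals = defaultdict(int)
    for record in records:
        cleaned = clean(record)
        if cleaned is not None:
            totals[(cleaned["product"], cleaned["country"])] += cleaned["qty"]
    return dict(totals)

def load(warehouse, aggregated):
    """Loading: feed the warehouse (here a plain dict) with transformed data."""
    for key, qty in aggregated.items():
        warehouse[key] = warehouse.get(key, 0) + qty
    return warehouse

warehouse = load({}, transform(extract()))
```

Calling `load` again on the same `warehouse` with a new batch would correspond to a refreshment rather than an initial load.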
Data Warehouse: data modeling
Multidimensional model: DFM (Dimensional Fact Model), the DATA CUBE
Each cell of the cube is a FACT of interest, quantified by numerical measures.
Each axis represents a dimension of interest for the analysis.
Hierarchy of attributes:
• Product
• Category (Home)
• Sub-category (Bedroom)
[Figure: data cube with axes Date, Product and Customer; example coordinates Home, Italy, 15/01/2015.]

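As a rough illustration of the cube idea, the cells can be held in a dictionary keyed by their coordinates and then aggregated along chosen axes. The measure values, the extra coordinates and the helper names `roll_up` and `slice_cube` are invented for this sketch.

```python
from collections import defaultdict

# Each cell: (date, product category, customer country) -> numerical measure.
# All values are made up for illustration.
cube = {
    ("15/01/2015", "Home", "Italy"): 10,
    ("15/01/2015", "Home", "France"): 4,
    ("16/01/2015", "Home", "Italy"): 6,
    ("16/01/2015", "Garden", "Italy"): 3,
}
AXES = ("date", "product", "customer")

def roll_up(cells, keep):
    """Aggregate the measure over every axis not listed in `keep`."""
    positions = [AXES.index(axis) for axis in keep]
    result = defaultdict(int)
    for coords, measure in cells.items():
        result[tuple(coords[p] for p in positions)] += measure
    return dict(result)

def slice_cube(cells, axis, value):
    """Keep only the cells whose coordinate on `axis` equals `value`."""
    position = AXES.index(axis)
    return {c: m for c, m in cells.items() if c[position] == value}

by_product = roll_up(cube, keep=("product",))
italy_by_date = roll_up(slice_cube(cube, "customer", "Italy"), keep=("date",))
```

Here `by_product` sums the measure per product category, while `italy_by_date` first restricts the cube to Italian customers and then sums per date.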
Data Warehouse: data analysis
Analyses on the data cube
• Reporting: periodical access to structured information.
• OLAP: analysis of one or more facts of interest at different levels of detail, by a sequence of queries that give a multidimensional result.
• Data mining: extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.

Data Warehouse: logical model
STAR schema: relational OLAP data model
Most data cubes are built on the relational model. A star schema is composed of:
• A central relation FT, the Fact Table, representing the fact of interest.
• A set of relations called dimension tables, each of them corresponding to one dimension of the analysis (cube axis).
Every dimension table is characterized by:
• a primary key
• a set of attributes that describe the dimensions of analysis at different levels of aggregation.
The FT holds a foreign key to each dimension table and also contains an attribute for each measure.

Data Warehouse: logical model
STAR schema: example
Fact table:
• order_fact: Date_PK, Customer_PK, Product_PK; measures: Quantity_ordered, Unit_price, Total_price
Dimension tables (each with its hierarchy of attributes):
• date_dim: Date_PK, Day, Month, Year
• product_dim: Product_PK, Description, Category, Sub-category
• customer_dim: Customer_PK, Name, City, Country

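The star schema of this example can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the column types, the sample rows and the renaming of `Sub-category` to `Sub_category` (to obtain a plain SQL identifier) are not in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (Date_PK INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE product_dim (Product_PK INTEGER PRIMARY KEY, Description TEXT, Category TEXT, Sub_category TEXT);
CREATE TABLE customer_dim (Customer_PK INTEGER PRIMARY KEY, Name TEXT, City TEXT, Country TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE order_fact (
    Date_PK INTEGER REFERENCES date_dim(Date_PK),
    Customer_PK INTEGER REFERENCES customer_dim(Customer_PK),
    Product_PK INTEGER REFERENCES product_dim(Product_PK),
    Quantity_ordered INTEGER, Unit_price REAL, Total_price REAL
);
-- Made-up sample rows.
INSERT INTO date_dim VALUES (1, 15, 1, 2015), (2, 16, 1, 2015);
INSERT INTO product_dim VALUES (1, 'Bed', 'Home', 'Bedroom'), (2, 'Lamp', 'Home', 'Lighting');
INSERT INTO customer_dim VALUES (1, 'Anna', 'Rome', 'Italy'), (2, 'Marc', 'Paris', 'France');
INSERT INTO order_fact VALUES
    (1, 1, 1, 2, 100.0, 200.0),
    (2, 1, 2, 1, 30.0, 30.0),
    (2, 2, 1, 1, 100.0, 100.0);
""")

# Typical OLAP-style query: join the fact table with two dimensions
# and aggregate a measure at the Country / Category level.
rows = conn.execute("""
    SELECT c.Country, p.Category, SUM(f.Total_price)
    FROM order_fact AS f
    JOIN customer_dim AS c ON c.Customer_PK = f.Customer_PK
    JOIN product_dim AS p ON p.Product_PK = f.Product_PK
    GROUP BY c.Country, p.Category
    ORDER BY c.Country
""").fetchall()
```

Grouping by `City` or `Sub_category` instead would answer the same question at a finer level of the attribute hierarchies.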
Data Warehouse
The arrival of Big Data: can a traditional DWH handle all this data?
How can Big Data technologies, such as Hadoop, Hive and HDFS, be used in a DW/BI context?

Big Data Warehouse systems
The arrival of Big Data: can a traditional DW handle all this data?
In Business Intelligence, Big Data technologies can be used:
1. Standalone: with their own query/DB tools and query languages.
   • Analytical goals: business requirements are not previously defined.
2. As a complement and support to enhance existing DW technologies: hybrid systems called BIG DATA WAREHOUSES.
   • Synergy among Big Data technologies and the existing DW: wider range of data (facts are integrated with unstructured and multidimensional data).
   • Improve scalability and reduce costs of current DW systems.

Big Data Warehouse systems
Hybrid approaches
(A) Components: small and Big Data sources, Hadoop, RDBMS, BI tools. Hadoop is used only for data ingestion/staging.
(B) Components: Big Data sources, small data sources, Hadoop, BI tools, RDBMS. Big Data are kept separate from structured data: Hadoop is used as a data management platform in parallel to the RDBMS. Both platforms are used in conjunction for presentation purposes.
(C) Components: small data sources, Big Data sources, RDBMS, Hadoop, BI tools. Hadoop enhances the RDBMS as a data ingestion/staging tool, but also as a data management and data presentation platform. The “best” of both technologies is exploited.

Big Data Warehouse systems
Some questions
Before addressing Big Data Warehouses we should answer some questions:
• BI on Big Data?
• What kind of data are found in Big Data?
• Can a traditional ETL technology handle Big Data?
• If feasible, does ETL on Big Data make sense?

Corporate Data
The role of Big Data in the corporation
Corporate data is the totality of data found in a corporation.
Examples of corporate data are: analog information, telephone records, e-mails, market research data, call center records, payments, sales, transactions, measurements, interviews, social networks...
One way of classifying the totality of corporate data is to distinguish between STRUCTURED and UNSTRUCTURED DATA.

Corporate Data
Structured data
Structured data has a predictable and regularly occurring format (defined length, defined format). Typically:
• it is managed by a Database Management System (DBMS);
• it consists of records or files, attributes, keys and indexes;
• a fixed number of fields is defined.
Examples of structured data are those contained in a relational DB: a data model is clearly defined for data representation, storing, processing, accessing and querying (ACID compliant).
Traditional data warehouses manage structured data!

Corporate Data
Unstructured and semi-structured data: BIG DATA
Unstructured data is unpredictable and usually does not have an easily computer-recognizable format. Long strings have to be searched (parsed) in order to find a unit of data!
Examples: free text, images, videos, web pages, web server logs, …
Semi-structured data has tags/markers that help in discerning different data elements, but it lacks a strict data model. Examples of semi-structured data are: RSS feeds, metadata. Formats: XML, JSON, ...

Corporate Data
Repetitiveness in Big Data
Unstructured data can be divided into:
- REPETITIVE unstructured data: it occurs many times, often in the same embodiment. Typically, this kind of record comes from machine interactions.
  Processing and analysis: Hadoop-centric data.
  Examples: analog processing, telephone call records.
- NON-REPETITIVE unstructured data: records are substantially different from each other in form and content.
  Processing and analysis: NLP, textual disambiguation: data is put into context and reformatted for standard BI analysis.
  Examples: e-mails, healthcare records, market research, meteorological records.

Corporate Data
Repetitiveness measures Business Relevance
Business relevance measures the capability of data to provide information that is of interest for a specific business context. Business-relevant information is used to support decision making, solution generation and cost optimization.
REPETITIVE BIG DATA are hardly ever business relevant: out of millions of phone call records, only a few are relevant for governmental purposes.

Corporate Data
A complete picture of corporate data
[Figure: classification tree. CORPORATE DATA splits into structured data and unstructured/semi-structured data; the latter splits into repetitive and non-repetitive data. The leaves are labelled, in order: business RELEVANT, business IRRELEVANT, POTENTIALLY business relevant, business RELEVANT, business IRRELEVANT.]
So… is all this data useful for supporting decision making?
BIG DATA: a plural version of the truth

Big Data call for DW improvements
The need for an ecosystem to integrate Hadoop and NoSQL technologies
Big Data require a different approach to data warehousing:
• Volume: storage and processing must be parallelized.
  • Huge workload, concurrent users and data volumes require optimization of both logical and physical design.
  • The ETL phase is a bottleneck and “nonsense” for Big Data: the goal of Big Data is to gather data to be used in ways that have not been planned.
    • Discover/extract new insights in data: exploratory approach.
    • Processing is on data: lineage and meta-data are required.
    • Traditional ETL does not work well on unstructured data. Manual coding for data integration.
  • Raw data persist in the warehouse: lineage through soft business rules can be postponed according to analysis needs. Big Data call for ELT.

Big Data call for DW improvements (II)
The need for an ecosystem to integrate Hadoop and NoSQL technologies
• Data complexity increases:
  • Variety of data requires specific processing techniques:
    • Textual disambiguation, parsing, machine-generated data analysis.
  • Velocity of data requires almost real-time analysis capabilities:
    • Real-time data should feed directly into the DW: On-Line Transactional Processing can in part be carried out in the warehouse. Real-time data cannot undergo ETL!
  • Veracity of data requires strong integration and traceability.
• Analytical complexity:
  • To be analyzed, Big Data may have to be in a format not foreseen by the DW developers.

Big Data call for DW improvements (III)
The need for an ecosystem to integrate Hadoop and NoSQL technologies
• Query complexity: temporal analysis and OLAP analysis on cubes are not feasible on Big Data. OLAP is optimized for relational models.
• DW availability: the addition of new data sources might compromise the availability of the overall system. It has to be carried out offline.
  • Parallelization of loading is one solution, but it must be embedded in the system.
Vertical scaling: move to larger computers.
+ Horizontal scaling:
  - Functional scaling = organize similar data groups and spread them across DBs.
  - Sharding = split data within the areas of functionality across multiple DBs.

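The two forms of horizontal scaling can be sketched as a routing function. The database names, the record kinds and the use of an MD5-based modulo to pick a shard are assumptions made for illustration; real systems often use other partitioning schemes.

```python
import hashlib

def functional_db(record_kind):
    """Functional scaling: each group of similar data gets its own database."""
    return {"customer": "customers_db", "order": "orders_db"}[record_kind]

def shard_index(key, shard_count):
    """Sharding: split one functional area across several databases by key hash.

    A stable digest is used because Python's built-in hash() for strings
    changes from one process to the next.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def route(record_kind, key, shard_count=4):
    """Return the (hypothetical) name of the database that stores the record."""
    return f"{functional_db(record_kind)}_{shard_index(key, shard_count)}"
```

For example, `route("customer", "C-1001")` always resolves to the same one of the four `customers_db_*` shards, so reads and writes for that key meet on the same node.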
Data Vault 2.0 (DV2)
Common Foundational Warehouse Architecture
“The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the traditional star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise”
Goal: provide and present information, extracted from data that has been aggregated, summarized, consolidated and put into context.

Data Vault 2.0 (DV2)
Aspects
1. Data model: changes to the model for performance and scalability. Raw data (structured + Big Data) are integrated by business keys.
2. Methodology: Scrum and Agile best practices: two- to three-week sprint cycles with adaptations and optimizations.
3. Architecture: inclusion of NoSQL and Big Data systems for unstructured data handling and Big Data integration. Separation of business rules.
4. Implementation: guidelines define how to implement the DV2 parts.

Data Vault 2.0 (DV2)
Architecture
Based on the 3-tier DW architecture: (1) staging layer, (2) enterprise data warehouse (EDW) layer and (3) information delivery layer.
Additional components:
1. Hadoop or NoSQL handle Big Data (design rules on where and how).
2. Real-time information flows in/out of the EDW. Operational vault.
3. Hard and soft business rules are split.
   • Data interpretation is postponed. Big Data principle: data first, schema later!
The staging layer is losing importance as raw data persist in the EDW!

Data Vault 2.0 (DV2)
Architecture preview
[Figure: architecture diagram with the labels operational data sources, SOA/ESB, real-time, batch, staging, Hadoop, HARD BUSINESS RULES, Enterprise Data Warehouse, SOFT BUSINESS RULES, and information delivery (data marts, report mart, OLAP tools / star schema).]

DV2 Architecture
Where do NoSQL platforms fit in DV2?
Most common NoSQL platforms are based on Hadoop and HDFS.
- Staging: Hadoop is mostly used for data ingestion and staging for ANY DATA (structured and unstructured) that can proceed into the EDW.
- EDW: NoSQL DBs are used to store unstructured data.
- Information delivery: Hadoop is used to perform data mining. Mining results are structured data sets that can be copied into relational database engines for ad hoc querying.
The DV2 model allows NoSQL technologies to feed all 3 levels!

DV2 Architecture
How does Hadoop technology enhance DW capabilities?
• Cheap hardware for the storage of all kinds of data.
• Local storage (preferred to Storage Area Networks).
• Allows processing directly on data and based on the kind of data:
  • Some Big Data might have a complex structure (web logs, complex sensors).
• Raw data persist in Hadoop: transformation can be redone without the need for extraction. History is easily maintained.
  • Raw data can be re-used to add context or constraints.
• Data mining models extracted with Hadoop can be used as reliable semantic meta-data.

DV2 Architecture
Business logic: separation of soft and hard business rules
Business rules are requirements translated into code. In a traditional DW, business rules are applied before the loading phase.
DV2 IDEA: separate data interpretation (done after loading data into the EDW) from data storage and alignment rules. Inside the EDW, raw data is preserved!
Hard rules: do not change the content of individual fields. Examples: type alignment, split by record structure, denormalization.
Soft rules: change and interpret data. Examples: standardizing names and addresses, coalescing, concatenating name fields.
In a traditional DW, ALL business logic is applied through ETL tools!

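The split between the two kinds of rules can be illustrated with a small sketch. The sample row, its field names and the two helper functions are hypothetical; they only mirror the examples listed above (type alignment as a hard rule, standardizing and concatenating name fields as soft rules).

```python
# Hypothetical source row, exactly as delivered by an operational system.
raw_row = {"customer_no": "0042", "first_name": " anna ", "last_name": "ROSSI", "age": "31"}

def apply_hard_rules(row):
    """Hard rule: type alignment only; the delivered values are not reinterpreted."""
    aligned = dict(row)
    aligned["age"] = int(row["age"])
    return aligned

def apply_soft_rules(row):
    """Soft rules: standardize and concatenate name fields (data interpretation)."""
    first = row["first_name"].strip().title()
    last = row["last_name"].strip().title()
    return {"customer_no": row["customer_no"], "full_name": f"{first} {last}"}

edw_row = apply_hard_rules(raw_row)    # what is stored in the EDW
mart_row = apply_soft_rules(edw_row)   # derived later, for information delivery
```

Because `edw_row` still carries the names as delivered, a different soft rule can be applied later without going back to the source system.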
DV2 Model
Business keys
A business key identifies a key concept in the business. Business keys have a business meaning!
They are unique and have a very low propensity to change. Business keys change only when the business changes!
Examples of business keys are: customer numbers, barcodes, ISBN codes, ISSN codes, e-mail addresses, credit card numbers...
Smart keys are composed of different parts which are given business meaning through position and format.
Business keys and associations are the skeleton of the Data Vault model, which is function-oriented, not subject-oriented.

DV2 Model
Characteristics and components
DV2 model basic entities:
• Hubs: main business concepts, represented by business keys.
• Links: relationships between hubs, thus between business keys.
• Satellites: context of hubs and links (attributes and time). Real data warehousing components: non-volatile data are stored over time.

DV2 Model
Hubs
Each hub represents a business key, which is valuable for the overall system and might be different from the single keys found in the operational sources.
P1: Business keys are separated by grain and semantic meaning.
Business key: one or more keys, identifying the object.
Hash key: generated surrogate key to ease lookup.
Metadata: LoadDate indicates when the business key first arrived in the EDW; RecordSource keeps track of the source.

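A hub load can be sketched as follows. The choice of MD5, the trimming and upper-casing of the business key before hashing, and the column name `Customer_BK` are assumptions made for illustration; the slides only state that the hash key is a generated surrogate key.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_key_parts):
    """Derive a deterministic surrogate key from one or more business key parts."""
    normalized = ";".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def load_hub(hub, business_key, record_source):
    """Insert the business key once; later arrivals keep the first LoadDate."""
    key = hash_key(business_key)
    if key not in hub:
        hub[key] = {
            "Customer_BK": business_key,
            "LoadDate": datetime.now(timezone.utc),
            "RecordSource": record_source,
        }
    return key

customer_hub = {}
k1 = load_hub(customer_hub, "C-1001", "crm")
k2 = load_hub(customer_hub, " c-1001 ", "webshop")  # same key, different source
```

Both calls resolve to the same hash key, so the hub keeps a single row whose `LoadDate` and `RecordSource` describe the first arrival.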
DV2 Model
Links
Links model transactions, associations, hierarchies and redefinitions of business terms. Links capture past, present and future relations among hubs.
P2: Intersections across two or more business keys are placed into link structures.
P3: Links have no begin or end dates. They are the expression of the relationship at the time the data arrived in the EDW.

DV2 Model
Links: structure
• Hash key: generated surrogate key to ease lookup.
• Hash keys of the hubs connected by the link.
• Metadata.

DV2 Model
Links
Links are many-to-many relationships among two or more hubs. They absorb data changes.
Flexibility: a change in business rules does not require link reengineering.
Example:
Business rule: “one carrier handles several airports, but each airport must be handled by one and only one carrier” -> weak entity model.
Let's say that, a few years later, any airport can be handled by more than one carrier. With the weak entity model, this would require redesigning the existing structures!
The granularity of links is defined by the number of connected hubs.

DV2 Model
Satellites
Satellites store all data that describes a business object, relationship or transaction. They add CONTEXT at a given time over a given hub/link.
P4: Satellites are separated by type of data, classification and rate of change.
Each satellite is attached to only one hub or link.
A satellite is identified by the parent's hash key and the timestamp of the change. (Recall the historical data of the traditional DW!)
In addition, it contains the attributes that describe the context of the business object.
Satellites track change!

DV2 Model
Satellites: structure
• Parent object hash key: hash key of the parent hub/link.
• LoadDate attribute: timestamp of the satellite.
• LoadEndDate: timestamp that determines the end of the SAT's validity.
• RecordSource: keeps track of the source.
• Hash difference: hash value of all the descriptive data in a satellite.
• Name and attributes.

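The role of the hash difference in tracking change can be sketched like this. The list-of-dicts satellite, the `Parent_HK` column name, the use of MD5 and the rule of end-dating the previous version are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(attributes):
    """Hash all descriptive attributes in a fixed (sorted) column order."""
    payload = ";".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(satellite, parent_hk, attributes, record_source, now=None):
    """Append a new version only when the descriptive data changed."""
    now = now or datetime.now(timezone.utc)
    new_diff = hash_diff(attributes)
    current = next(
        (r for r in satellite
         if r["Parent_HK"] == parent_hk and r["LoadEndDate"] is None),
        None,
    )
    if current is not None:
        if current["HashDiff"] == new_diff:
            return False                 # nothing changed, nothing stored
        current["LoadEndDate"] = now     # close the validity of the old version
    satellite.append({
        "Parent_HK": parent_hk, "LoadDate": now, "LoadEndDate": None,
        "RecordSource": record_source, "HashDiff": new_diff, **attributes,
    })
    return True

customer_sat = []
first = load_satellite(customer_sat, "hk1", {"Name": "Anna", "City": "Rome"}, "crm")
repeat = load_satellite(customer_sat, "hk1", {"Name": "Anna", "City": "Rome"}, "crm")
moved = load_satellite(customer_sat, "hk1", {"Name": "Anna", "City": "Milan"}, "crm")
```

Comparing a single hash value avoids comparing every descriptive column when deciding whether a new satellite row is needed.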
DV2 Model
Heterogeneous satellites
[Figure: example of a logical foreign key between an RDBMS and a Hadoop-stored satellite.]
Hash keys allow cross-system joins to occur between RDBMS and NoSQL/Hadoop platforms.
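A cross-system join on hash keys might look like the sketch below. The dictionary stands in for a hub table in the RDBMS and the JSON lines for satellite documents stored in Hadoop; both stand-ins, the `clicks` attribute and the MD5-based `hash_key` helper are assumptions made for illustration.

```python
import hashlib
import json

def hash_key(business_key):
    """Deterministic key that any platform can compute from the business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Stand-in for a hub table in the RDBMS: hash key -> business key.
customer_hub = {hash_key(bk): bk for bk in ("C-1001", "C-1002")}

# Stand-in for satellite documents stored in Hadoop as JSON lines.
# Customer_HK is computed from the business key alone, with no RDBMS lookup.
json_lines = [
    json.dumps({"Customer_HK": hash_key("C-1001"), "clicks": 12}),
    json.dumps({"Customer_HK": hash_key("C-1002"), "clicks": 3}),
    json.dumps({"Customer_HK": hash_key("C-9999"), "clicks": 7}),  # no hub row
]

def join_on_hash_key(hub, lines):
    """Logical foreign key: match documents to hub rows by hash key."""
    joined = []
    for line in lines:
        doc = json.loads(line)
        business_key = hub.get(doc["Customer_HK"])
        if business_key is not None:
            joined.append((business_key, doc["clicks"]))
    return joined

result = join_on_hash_key(customer_hub, json_lines)
```

Since both sides derive the key independently, the two loads do not have to wait for each other.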

DV2 Model
Model example: Customer/Product
• Customer Hub: Customer_HK, Customer_BK, LoadDate, RecordSource
• Product Hub: Product_HK, Product_BK, LoadDate, RecordSource
• CustomerProductLink: CustProdLink_HK, LoadDate, RecordSource, Customer_HK, Product_HK
• Customer Satellite: Customer_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Name, City, Country
• Product Satellite: Product_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Description, Category, Sub-category
• CustomerProductSatellite: CustProdLink_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Quantity_ordered, Unit_price, Total_price

DV2 Model
Model example: Customer/Product
[Figure: the same Customer/Product model (Customer Hub, Product Hub, CustomerProductLink and the three satellites, with the same attributes), carrying the annotation "Out of production!".]

Data Vault 2.0 (DV2)
Modeling objectives
• Data integration is based on business keys.
  • Business keys are the keys to the information stored across multiple systems, used to locate and uniquely identify records or data.
• Data sets are traceable across multiple lines of business.
• Modeling can support unstructured and structured data:
  • Hash keys allow the connection between heterogeneous data environments, such as Hadoop and RDBMS, and remove the dependency on “loading”.
Parallelization of loads: removes dependencies in loading streams. Example of such a dependency: with sequence numbers, loading data into Hadoop, perhaps a JSON document, requires looking up the sequence number from a hub in a relational database.

Conclusions
When to use a DV2 model
• The DV2 model allows the split/merge of business keys and data entities:
  • Parallel/distributed systems, geographical reasons, security reasons.
  • It is designed to be resilient to environmental changes.
• It seamlessly integrates Big Data technologies with existing RDBMS technologies:
  • Hadoop, MongoDB and many other NoSQL options are easily added.
  • The data cleaning required by a star schema becomes unnecessary: all data is relevant.
Hadoop and RDBMS are side by side in Big Data Warehouses.

Big Data Warehouses vs traditional DW
A summarizing view over the two approaches
Design principle: business expectations
• Traditional Data Warehouse: fact based; pre-designed for specific reporting requirements; single source of the business truth.
• Big Data Warehouse: exploratory analysis; finding of new insights; veracity of results might be questionable.
Design principle: design methodology
• Traditional Data Warehouse: iterative and waterfall; integrated and consistent model.
• Big Data Warehouse: agile and iterative approach; no data model definition.
Design principle: data architecture
• Traditional Data Warehouse: not all data is managed and maintained in the EDW, the data sources are previously known; anything new has to go through a rigorous requirements gathering and validation process; scales, but at a potentially higher cost per byte.
• Big Data Warehouse: integrates all possible data structures; scales at relatively low cost; analyzes massive volumes of data without resorting to sampling mechanisms.
Design principle: data integrity and standards
• Traditional Data Warehouse: driven by RDBMS and ETL tools; centralized data.
• Big Data Warehouse: integration is loosely defined; data and data processing programs are highly distributed.

Thank you for your attention!
Q&A?

Quick References
Books:
• Data Architecture: A Primer for the Data Scientist. Big Data, Data Warehouse and Data Vault - W.H. Inmon, Daniel Linstedt
• Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics - S. Mohanty, M. Jagadeesh and H. Srivatsa
• Building a Scalable Data Warehouse with Data Vault 2.0 - Dan Linstedt, Michael Olschimke
• Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications - E. Malinowski and E. Zimányi
Other sources:
- Hadoop and the Data Warehouse: When to Use Which - Dr. Amr Awadallah, Founder and CTO, Cloudera; Dan Graham, General Manager, Enterprise Systems, Teradata Corporation
- Big Data in Big Companies - Thomas H. Davenport, Jill Dyche
Data Vault support: QUIPU - http://www.datawarehousemanagement.org