open data warehouse building a data warehouse with pentaho eng... · 2019. 2. 26. · itnoum best...
TRANSCRIPT
Open Data WarehouseBuilding a Data Warehouse with Pentaho
Best Practice
it-novum.com
1. Objectivesandbenefits 3
2. Management Summary 4
3. Buildingadatawarehousesystem 5
4. Datasources 6
5. Datacapture 7
5.1 DesigningtheETLprocess 8
5.1.1 BestpracticeapproachtoworkingwiththeLookupandJoinoperations 8
5.2 Practicalimplementation 11
5.3 Datastorage&management 17
5.3.1 Performanceoptimization 18
5.4 Dataanalysis 20
5.5 Datapresentation 21
6.Conclusionandoutlook 22
7. Appendix 23
Contents
2
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
1. Objectives and benefitsThisbestpracticeguideshowshowapowerful,high-performancedatawarehousesystemcan
bebuiltwithopensourceproducts.Thisbestpracticeguideisofinteresttoyouifyou
и arethinkingaboutintroducingadatawarehousesystemintoacompanyandarelookingfor
anappropriatesolution
и aremoreinterestedinsoftwareadaptationthaninvestingyourmoneyinsoftwarelicensing
и wantanopenandcustomizablesolutionthatcanquicklybeputinplaceandwillgrowwith
yourneeds
и alreadyhaveopensourcesolutionsinmind,butareunsurehowtoapproachsuchaproject
и havealimitedprojectbudget
Youwillbenefitfromreadingthisbestpracticeguide,becauseit
и clearlydemonstratesanddescribesanexamplearchitectureforanopensource-baseddata
warehouse
и givesyoutipsfordesigningandimplementingyoursystem
и providesaschemathatyoucanmodifytosuityourownindividualneeds
и containsrecommendationsforprovensoftwareproducts
3
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
2. Management SummaryCompaniesandtheenvironmentsinwhichtheyexistareproducingeverfasterandevergreater
amountsofdata.Theresultingdatacontainsanenormouseconomicpotential–butonlywhen
ithasbeenproperlyprocessedandanalyzed.Adatawarehousesolutionprovidestheidealbasis
forthisprocessingandanalysis.Settingupanddevelopingadatawarehouse,however,often
entailscostlylicensingandhardwareissues.Thisdetersmainlysmallandmedium-sizedcom-
paniesfromundertakingsuchaproject,eventhoughcost-effectiveopensourcesolutionshave
longbeenavailableinthebusinessintelligencesector.Arethesesolutionssuitable,however,
fordevelopingahigh-performancedatawarehousesystemoristherereallynoalternativeto
commercialsoftwareproductswhenitcomestoperformanceandfunctionality?
Thisbestpracticeguideaddresseswhetherandhowapowerfuldatawarehousecanbebuilt
usingopensourcesystems.
Forthispurpose,wewillbedesigningandbuildingaprototypicalexamplearchitectureusing
thePentahoandInfobrightsolutions.Wehaveimplementedthisarchitecture–withitsessential
features–inanumberofprojects,manyofwhichhavebeeninsuccessfuluseforyears.
Foreaseofunderstanding,thisbestpracticeguidehasbeenorganizedinthefollowingmanner:
и Generalconstructionofadatawarehouse
и Connectingdatasources
и DesigningtheETLprocess
и Datastorage&management
и Dataanalysis
и Datapresentation
и Conclusions
Intheprocess,allpertinentpointsassociatedwiththedesignandimplementationofadata
warehousewillbecovered.
4
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
3. Building a data warehouse systemAdatawarehousesystemconsistsoffivelevels:Datasources,dataacquisition,datamanage-
ment,dataanalysisanddatapresentation.Theindividuallayersarerealizedbymeansofvari-
ousconceptsandtoolsandtheirrespectiveimplementationsarebasedontherequirementsof
thegivencompany.
Thedatawarehousesystemusedforourprototypeisstructuredasshowninthefigurebelow.
Allcomponentsareopensource.TheMicrosoftdatabaseAdventureWorkswillserveasthedata
source.TheETLprocessatthedataacquisitionlevelwillberealizedwithPentahoDataIntegra-
tion.ThistoolispartoftheopensourcesolutionPentahoBusinessAnalyticsSuite,which,inour
example,willbeusedfordataanalysisandpresentation.
Theactualdatawarehouseusedwithinthedatamanagementsystemwillbetheanalytical
databasemanagementsystem,Infobright.PentahoMondrianwillbeusedastheOLAPserver.
TherequiredXMLschemawillbegeneratedwithPentahoSchemaWorkbench.Oncethese
componentshavebeenimplemented,thedataatthedatapresentationlevelcanbepreparedin
theformofanalyses,dashboards,andreportsusingavarietyofPentahotools.Inthefollowing
chapters,theindividuallayersandtheirassociatedtoolswillbedescribedindetail.
Structureofthesamplearchitecture
5
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
4. Data sourcesInourprototypesystem,wewillbeusingAdventureWorks2008R2(AWR2)asthebasisforour
sampledata.ThisOLTPdatabasefromMicrosoftcanbedownloadedasafreebackupfilehere:
http://msftdbprodsamples.codeplex.com/Thedatarepresentsafictitious,multinationalcompa-
nyintheproductionsectorthatspecializesinthemanufactureandsaleofbicyclesandrelated
accessories.Thedatamodelconsistsofmorethan70tablesdividedintofivecategories:Human
Resources,Person,Production,Purchasing,andSales.
Thedatamodelissimilartothedatabasestructuresofrealcompaniesintermsofscenariosand
complexityofdatabasestructureandis,therefore,wellsuitedfordemonstrationandtesting
purposes.Thebackupfile(.bak)providedcanbeeasilyintegratedintoMicrosoftSQLServer.To
dothis,openSQLServerManagementStudioand,clickingonthedatabasesymbol,openthe
restoredatabaseinterface(seeillustration).ThenewdatabasewillthenappearintheObject
Explorer.
MicrosoftSQLServerisnot,ofcourse,opensourcesoftware.Wehave,however,employeditfor
thisexamplesothatwecanusethetestdatasuppliedbyMicrosoft.Alternatively,thebackup
canbeconvertedintoCSVfiles.Thesefilescanthenbeloadedintoanyopensourcedatabase
viatheBulkLoadoption.
Importthebackupfile
6
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
5. Data captureData acquisition is concerned with the extraction, transformation and loading (ETL) of operational data into the
data warehouse. The goal is to achieve a uniform, high-quality base of data in the data warehouse that can be ana-
lyzed as needed. Planning the data model is also one of the tasks associated with data capture.
The data model then forms the basis for the later transformation of the data. Certain fundamental decisions have to
be made concerning the business process to be analyzed, the level of detail for the analysis as well as any possible
dimensions and values.
Wewillbeevaluatingsalesdataaspartofthepresentprototype.Nochangeswillbemadeinthe
levelofdetailforthedata,soallhierarchylevelswillberetained.Overall,fivedimensionsareto
becreatedwithrespecttotheOLTPdatabase.Thesearedate,customer,product,salesterri-
tory,andcurrency.Asforthemetrics,itshouldbepossibletoanalyzequantity,shippingcosts,
discountsgranted,taxes,unitcostsandrevenue.Thedatawillbepresentedintheformofastar
schema,becausethisisthebestmodelingvariantforachievingourgoals(seefigure).Itisalso
possible,however,touseasnowflakeorflatschema.
StarschemabasedontheAWR2database
7
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
DatabaseLookup
5.1 Designing the ETL processFortheimplementationoftheETLprocess,wewillusetheopensourcetoolPentahoData
Integration(PDI).Thistoolprovidespredefinedstepsforextracting,transforming,andloading
data.Thesecanbecreatedviaagraphicaldrag-and-dropinterfacecalledSpoon.UsingSpoon,
professionalETLroutinesintheformoftransformationsandparentjobscanbebuiltwithvery
littleeffort.Thus,theamountofmanualprogrammingrequiredisverysmall,makingthistool
idealforthenoviceuser.
5.1.1 Best practice approach to working with the Lookup and Join operations
ThedimensionsforthestarschemarequireinformationfromdifferenttablesintheOLTPda-
tabase.Tobeabletocompareandcombinethedata,PentahoDataIntegrationoffersLookup
andJoinoperations.Eachstepcanbechangedoraltered,soitisimportanttheyarehandled
correctly.Applyingtheminthewrongcontextcanleadtoperformanceproblems.Since,asa
rule,manytablesmustoftenbecombinedwithintheETLprocess,choosingtherightstepswill
significantlyincreaseprocessingspeed.Wewill,therefore,explainthemostimportantsteps
below.
Database LookupDatabaseLookupenablesthecomparisonofdatafromdifferenttablesinadatabase,whichal-
lowsfortheintegrationofnewattributesintotheexistingdatabasetable.DatabaseLookupwas
designedforsmalltomedium-sizeddatasets.Particularattentionneedstobepaidtohowthe
cacheisconfiguredhere,becauseusinganactivecachecansignificantlyshortentheprocessing
time.Afurtherincreaseinperformancecanalsobeachievedbyselectingthecheckbox„Load
AllDataFromTable“.Particularlywithlargerdatasets,cachingcan,however,overloadtheJava
stackspacecausingthetransformationtocollapse.
8
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
Stream LookupStreamLookupissimilartoDatabaseLookupwithactivecaching,butthedatabeingcompared
canbeextracteddirectlyfromthestream.Duetocontinuouscaching,StreamLookupisvery
resourceintensiveandshouldonlybeusedwithsmallerdatasetsthatcannotbeloadeddirectly
fromadatabase.Byselectingthecheckbox„PreserveMemory(CostsCPU)“,itispossibleto
reducethememoryfootprint,butattheexpenseoftheCPU.Whenthisfunctionisactivated,
theloadeddataisencodedduringsorting(hashing).Adisadvantageofthis,however,islonger
running times.
Merge JoinMergeJoincarriesoutatraditionaljoinoperationwithinthedataintegrationcomponent.A
distinctionhastobemadeherebetweenInner,FullOuter,LeftOuterandRightOuterJoins.
MergeJoinisparticularlysuitableforlargedatasetswithhighcardinality,thatis,manydifferent
attributevalueswithinacolumn.Fromthebeginning,thedatamustbesortedaccordingtothe
attributesbeingcompared.Wheneverpossible,thisshouldbeperformeddirectlyonthedataba-
sevia„OrderBy“.Alternatively,thiscanalsobedoneviaSortRows,however,thiswillimpacton
performance.
StreamLookup
StreamLookup
9
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
Database JoinUsingDatabaseJoinitisalsopossibletolinkdatabasetablesandcreateandexecutenative
queries.Theparametersusedintheprevioustransformationcanbeintegratedintothenative
query.Aquestionmarkisusedasaplaceholderandisreplacedbythecorrespondingparameter
whenthequeryisrun.Whenintegratingmultipleparameters,itiscrucialthatthesebeplacedin
theircorrectorderinthe„Theparameterstouse“area.
Dimension Lookup / UpdateTheDimensionLookup/UpdatestepcombinesthefunctionalityofInsert/Updatewiththat
ofDatabaseLookup.Itispossibletoswitchbetweenbothfunctionsbyselectingordeselecting
thecheckbox„UpdateTheDimension“.Inaddition,thisstepallowsfortheefficientintegration
oftheSlowlyCachingdimension.TheLookup/Updatedimensionwasdevelopedforspecial
applicationscenariosandis,therefore,veryprocessorintensive.Wementionithereonlyinthe
interestsofcompleteness.
DatabaseJoin
DimensionLookup/Update
10
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
Performance TestTohighlightthedifferencesinperformance,wehaveexaminedthevariousstepsinconnection
withtwodatasetsthatcorrespondtotheexpectedamountsofdatainourdemo.
Asdemonstratedinthetable,DatabaseJoinhighlightsmajorperformanceproblemswhen
increasingdatavolumesareinvolved.Fortheremainingsteps,however,thereishardlyany
increaseinprocessingtimedespitethesignificantlylargeramountsofdata.
Operation 40.000 Data Sets 152.500 Data Sets
DatabaseLookup 0,9sec. 1,5sec.
StreamLookup 1,6sec. 2,3sec.
MergeJoin 1,3sec. 2,6sec.
DatabaseJoin 8,2sec. 3min.
5.2 Practical implementationTheETLprocessesconformtothespecificationslaiddownforamulti-dimensionaldatamodel.
Atransformationiscreatedperdimensionorfacttable.Thisprovidesabetteroverviewand
errorscanbemoreeasilylocated.Inaddition,theriskofoverloadingtheserverisalsoreduced.
Thevarioustransformationsarecombinedandcontrolledviaacentraljob:
TransformationsofthePerformanceTest
PerformanceTestResults
11
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
Beforetheactualtransformationprocessesforpreparingandpopulatingthedatawarehouse
canbegin,thedatabaseconnectionstothedatasourceandthedatawarehousemustfirstbe
defined.NewdatabaseconnectionscanbecreatedinPentahoDataIntegrationunderthetab
View>DatabaseConnections.Theselectionofthecorrectdatabaseanddriveraswellasuser-
specificsettings,suchasdatabasenameoruser,isdecisive.
ItisimportanttoensurethattheJDBCdriversforbothdatabasesareinthelibdirectoryofthe
DataIntegrationTool.Bydefault,MySQLcanbeaccessedviaport1433andtheInfobrightdata-
baseviaport5029.
ParentJob
Databaseconnectiondetails
12
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
DimCurrencyThedimensionDimCurrencyisgeneratedbasedonthetableSales.Currency.Onlythenameand
theabbreviationfortherespectivecurrencyareapplied.Whenthishasbeendone,anidentifica-
tionnumber(CurrencyID)isgeneratedfortheindividualtuples,whichactsastheprimarykey.
Then,usingthestepTableDimCurrency,thetableiscreatedandpopulated.
DimCustomerThesecondtransformation,DimCustomer.ktralsodoesnotrequireajoinoperation.Thedi-
mensionisbasedonthetableSales.vIndividualCustomer.Thisonlyrequiresslightchangeswith
regardtothedimensiontableDimCustomer,whichneedstobecreated.Thefirstandlastname
willbecombinedinacommondatafieldforlateranalysis.Theprimarykeynameismodifiedto
suitthedesignationwithinthemulti-dimensionaldatamodel.
DimCurrency.ktr
DimCustomer.ktr
13
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
DimPersonThetransformationDimPerson.ktrisbasedonthetablePerson.Person.Thisisexpandedto
includePerson.PersonPhoneandPerson.EmailAdressusingDatabaseLookup.Inbothcases,
BusinessEntityIDisusedforcomparisonasitservesastheprimarykeyforallthreetables.
DimProductThedimensionDimProductisgeneratedfromthreetablesoftheOLTPdatabase.Thetransfor-
mationstartswiththetableProduction.Product.Asafirststep,wewillincludetheattributes
NameandProductCategoryIDfromthetableProduction.ProductSubcategory.Thesecondof
theseattributeswillserveasthekeyforthesecondLookupoperationperformedonthetable
Production.ProductCategory.Thenextstep,SelectProduct,isusedtocleanupandrename
datafieldsbeforepopulatingthetablewithdata.Whilethisisbeingdone,anSQLscriptisrun
inparalleltocreateanadditionalproductdatasetnamedFreight.Thiswillenableustofilteron
freightcostsforindividualshipmentsduringourlateranalyses.
DimPerson.ktr
DimProduct.ktr
14
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
DimSalesTerritoryThe transformation DimSalesTerritory.ktr essentially corresponds to the previous proces-ses. It is based on the table StateProvince. During the Lookup operation, the attributes Name and Group from the tables CountryRegion and SalesTerritory are added to this table. After this has been successfully completed, the dimension is linked to the table Address using a Merge Join. Prior to this, the address lines within the step Contact Fields are combined, which will prove useful for later analysis. Merge Join is realized as a Right Outer Join, ensuring the address information is only updated for the dimension‘s data sets. In this transformation, Merge Join is, therefore, the most efficient solution.
DimDateThe dimension DimDate cannot be extracted from the OLTP database, so it must be created manually. The basis of this transformation is a CSV document created using Excel. The file contains a unique identification number, the specific date, the corresponding day of the week as well as day, month and year – all separated. The dimension also covers the period from 2003 to 2013. The corresponding CSV file can be integrated into Pentaho Data Integration using the step CSV Input File. The data can then be processed and / or loaded into the target database (as depicted in the figure).
DimSalesTerritory.ktr
DimDate.ktr
15
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
FactSalesThisusesthetableSales.SalesOrderHeader.Italreadycontainsmanyimportantkeysandvalues
suchasFreightandTaxAmt.Theconnectionsare,however,rarelydesignedtoworkquickly.This
iswhyOrderDateandShipDatearereplacedbyDimDateinthefirstsection,whichwasgenera-
tedwithinthedimension.Afterthis,theattributeToCurrencyCodefromthetableSales.Curren-
cyRateisintegratedusingaLookup.Indoingso,CurrencyRateIDisusedasthekey.Sincefor
manyordersnocurrencyconversionwillbeperformed(transactionswithinthedollarzone),the
columnvalueisoftennull.ThenullvaluesarereplacedinthenextstepbytheabbreviationUSD
fordollars.Then,theattributeisreplacedbythebetterperformingCurrencyIDfromthedimen-
sionDimCurrency.
Now,thetableSalesOrderDetailcanbeintegratedusingaLeftOuterJoin.Thistableprovides
detailedinformationontheindividualorders.Thisjoincombinesdatasetsofdifferentaggre-
gationlevels,sothesewillhavetobesubsequentlymodified.Thedatastreamisthensplitfor
furtherprocessing:Theaimhereistocreateanadditionaldatasetperorderthatincludesthe
aggregatedvaluesFreightandTaxAmt.Thisisimplementedintheupperpath.Thelowerpart,
however,processesthedatasetsfortheindividualorderitems.Asanalternative,itisalso
possibletodistributethevalueoftheitemsFreightandTaxAmtequallyamongtheindividual
positionsofeachrespectiveorder.
Inthelowersection,onlythetwoparentvaluesFreightandTaxAmtaresetto0.Inparallelto
this,thedatasetsintheupperpartarefilteredbydistinctstatement.Then,variousattributes
suchasProductIDorUnitPricearesetto0.Now,anewvalueiscreatedfromthesumofthe
taxesandfreightcosts.ThisiscorrespondinglydesignatedasLineTotal.Afterfurthertermshave
beenmodified,thetwodatastreamsarejoinedagainbeforeafinalselectionoftheattributesis
carriedout.Thefinishedfacttablenowcontainssevenkeysandsixvalues.
16
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
5.3 Data storage & managementDependingonthetypeofdataandtheobjectivespursued,anumberofdifferentdatabasema-
nagementsystems(DBMS)canbeintegratedatthedatamanagementlevel.Here,adistinction
mustbemadebetweenrelationalandnon-relationalsystems.Non-relationalsystems,which
arealsoreferredtoasNoSQLdatabases,includeMongoDB,HBaseandCouchDB.Therelational
DBMSincludetraditional,row-orientedproductssuchasMySQLandPostgresaswellasanalyti-
cal,column-orientedsystemssuchasInfobright,InfiniDBandMonetDB.
WithintheexamplearchitecturewehaveusedInfobrightasthedatabasemanagementsystem
forourdatawarehouse.InfobrightisananalyticaldatabasebasedonMySQL.Itis,therefore,
compatiblewithavarietyofwell-knownMySQLtools,butalsoofferssignificantlybetterper-
formancewhendealingwithlargeamountsofdata.AtthecoreofInfobrightistheBrighthouse
Engine,whichhasbeenspeciallydevelopedforanalyzinglargedatasets.Infobrightisavailable
inbothafreeCommunityEditionandapaidEnterpriseEdition.TheEnterpriseEditionoffers
significantadvantagesintermsoffunctionalityandperformance,whichiswhywewillbeusing
itforourprototypedatawarehouse.
TheinstallationandintegrationofInfobrightiseasyduetoitshighdegreeofautomation.Large
partsoftheconfigurationandmanual„tweaking“ofthesystem,includingindexing,areomit-
ted.Thecorrespondingsettingsarecarriedoutbythesystemautomatically,butcanalsobe
FactSales.ktr
17
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
modifiedretrospectively.Theconfigurationfilesmy-ib.iniandbrighthouse.iniarelocatedinthe
installationdirectory.
5.3.1 Performance optimizationInfobright‘salreadyhighperformanceandcompressionratecanbeenhancedevenfurther.This
isimportant,forexample,forthedatatypesChar/Varchar,becausetheyweakensystemperfor-
manceandsignificantlyincreasememoryrequirements.Infobrightoffersseveralsolutionshere,
includingLookupColumnsandDomainExpert.Bothoftheseareusedinourprototypesystem.
Lookup ColumnsTheprinciplebehindLookupColumnsissimilartothatofacompressedglossaryordictiona-
ry.WhenusingLookupColumns,allvaluesaretransformedintocorrespondingindicesofthe
typeIntegerwithinacolumn.Inparallel,adirectoryiscreatedthatstorestheindexesandthe
correspondingChar/Varcharvalues.Comparabletermsarerepresentedbythesameindex.This
improvesperformanceandreducesthememoryrequirementsofthedatabase.
LookupColumnsaredefinedwhenthetableiscreatedwithintheCreatestatement.Subse-
quentintegrationusingtheAlterstatementisnotpossible.Therespectivecolumniscompleted
viatheadditional„comment‚lookup‘“.CardinalityplaysacrucialrolewhencreatingLookup
Columns.Forratiosupto10to1,theyaredefinitelyworthusing.Thismeansthatforacolumn
with100,000values,amaximumof10,000differentinstancescanexist.Thus,forhighercardina-
lities,anypossiblegainsinperformancemustbecarefullyweighedupagainstalongerprocess
loadingtime.
18
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
DomainExpertInfobrightoffersanothertypeofoptimization:DomainExpert.Thistechnologyenablestheuser
tocreateanattribute‘sstructureintheformofarule.Byusingthistechniquethewaydatais
processed,stored,andretrievedissustainablyimproved.Accordingtothedeveloper,queries
willrunupto50%fasterandsignificantlybetterdatacompressioncanbeachieved.
DomainExpertcanonlybeusedwithcolumnswhosevalueshaveaparticularpatternoracer-
tainstructure.Forinstance,theseinclude,IPaddresses(Number.Number.Number.Number)or
emailaddresses([email protected]):
Onlyoneofthetwomethodsdescribedcanbeappliedpercolumn.Ingeneral,LookupColumns
shouldbepreferred.DomainExpertismerelyanalternativethatcanbeusedifacolumnhasa
fixedpattern,butduetoatoohighcardinalityLookupColumnscannotbeused.
Table Column Pattern
DimCustomer EmailAddress %s@%s.%s
DimDate Date %d.%d.%d
Examplepattern
DomainExpertrulesinourexamplearchitecture
19
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
5.4 Data analysisData analysis within a data warehouse architecture is based on the principle of Online Analytical Processing (OLAP).
OLAP makes it possible to carry out dynamic analysis in multi-dimensional spaces, regardless of the physical data
storage used. OLAP has a number of variations: relational (ROLAP), multi-dimensional (MOLAP) and hybrid (HOLAP)
OLAP.
Withinourprototypearchitecture,wehaveimplementedthedataanalysislevelusingtheopen
sourceserverMondrian.MondrianisbasedontheROLAPorrelationalapproach.Usingadata
cube,ittransformsMDXqueriesintomultipleSQLqueriesand,conversely,processesrelational
resultsmulti-dimensionally.Inaddition,MondrianusestheinformationprovidedbyPentaho
SchemaWorkbenchtogeneratetheXMLschema.Thistoolletsyouestablishadirectconnection
tothedatabase,whichcanbeusedtostoreattributesintheschemaviaadrop-downlist.The
connectionisestablishedinthesamewayasPentahoDataIntegration.
Theschemaiscreatedaccordingtoafixedstructure:Thebasishereisthecube,whichiscreated
byclickingonthecubesymbol.Next,thefacttableFactSalesneedstobedefined,afterwhich
theindividualdimensionscanbesetup.Inadditiontothehierarchiesandtheirassociated
levels,theintegrationoftherespectivedimensiontablemustalsobeperformed.
ItmaybeadvisabletoseparatemultiplehierarchiesofadimensiontableintheXMLschema
intotheirowndimensions.Thisallowsthehierarchiestobeintegratedontothedifferentaxes
inparallel.WeusedthisprincipleinourexamplearchitectureforthedimensionsShipDateand
OrderDate.Specialattentionshouldbepaidtothespecificorderofthehierarchyleveland/or
attributes.
Afterthishasbeendone,thevariouskeyfigurescanbeintegratedintotheXMLschema.Choo-
singtherightaggregatorhereiscrucial.ThevaluesTurnover,TaxAmt,OrderQty,Freightare
addedtogetherusingsum.AddingtogetherthevaluesUnitPriceandUnitPriceDiscountwould
notbemeaningful.Instead,wederivetheiraverageusingavg.
20
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
Whensettinguptheschema,specialattentionmustbepaidtothespecificcharacteristicsof
therespectiveanalysistools.Forad-hocanalyses,SaikuorPivot4Jcanbeusedinsteadofthe
PentahoAnalyzer.PentahoAnalyzerreliesonthedefinedhierarchiesforlabelingdimensions
ratherthantheactualdimensionsthemselves.
5.5 Data presentationThe data presentation level is the interface between the system and the end user. Various graphi-
cal tools can be used to select, analyze, and visualize the data processed by the data warehouse.
Included among these are ad-hoc analyses, dashboards, and reports.
Forouropensourcedatawarehouse,wehavechosentousePentahoAnalyzerandAnalyzer
Reportforthedatapresentationandvisualizationlayer.AnalyzerReportprovidesagraphical
draganddropinterfacewithwhichindividualfieldscaneasilybeaddedtoananalysis.Using
variouscombinationsofattributesandvaluesaswellasfilters,differentviewsofthedatacanbe
created.Butbeforethiscanbedone,theschemaaswellasthecorrespondingdatabaseconnec-
tionmustfirstbesetupwithintheUserConsole.ThisisdoneviathemenuitemManageData
Sources>NewDataSource.Oncethishasbeendone,itispossibletoproducecomprehensive
analysesofthedatastoredinthedatawarehouse.
Excerptsfromtheschema
21
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
6. Conclusion and outlookThearchitectureillustratedinthisdocumentprovesthatitispossibletoconstructahigh-per-
formance,user-friendlydatawarehousesystemwithopensourcesoftware.Neitherduringthe
constructionnortheoperationofthedatawarehousewereweabletofindanyfaults,deficienci-
esorlossofperformanceforouropensourcesolutionsincomparisonwithproductsofferedby
majormanufacturers.
Wethereforerecommendthatyoutakeaseriouslookatopensourcebusinessanalyticssoft-
wareproducts.TherearenowanumberofexcellentopensourcetoolsfordataanalysisandBig
Datascenariosthatmorethancompetewiththeclosedsource„bigboys“.Smallandmedium-
sizedbusinessescandefinitelybenefithere,becauseforthem,themajorityofsolutionsprovi-
dedbylargesoftwarevendorsaretoopowerfulandtooexpensivetouse.Largecorporateusers,
bycontrast,willbenefitfromtheopenarchitectures,interfaces,andextremeflexibilitythatan
opensourcedatawarehousesystemcanoffer.
22
it-novum Best Practice | Open Data Warehouse — Building a Data Warehouse with Pentaho
7. AppendixTable Column
• DimDate • w_day
• month
• DimCustomer • Titel
• City
• StateProvinceName
• PostalCode
• CountryRegionName
• DimPerson • PersonType
• Titel
• FirstName
• LastName
• DimProduct • ProductCategoryName
• ProductSubcategoryName
• Color
• ProductLine
• CLASS
• Style
• DimSalesTerritory • City
• StateProvinceCode
• StateProvinceName
• CountryRegionCode
• LocationName
• GROUP
LookupColumnswithintheexamplearchitecture
23
it-novum Profile
Your contact person for Business Intelligence and Big Data: Stefan Müller [email protected]+49(0)661103942
Leading in Business Open Source solutions and consultingit-novumistheleadingITconsultancyforBusinessOpenSourceintheGerman-speakingmarket.Foundedin2001
it-novumtodayisasubsidiaryofthepublicly-heldKAPBeteiligungs-AG.
Weoperatewith85employeesfromourmainofficeinFuldaandbranchofficesinDüsseldorf,Dortmund,Vienna
andZurichtoservelargeSMEenterprisesaswellasbigcompaniesintheGerman-speakingmarkets.
it-novumisacertifiedSAPBusinessPartnerandlongtimeaccreditedpartnerofawiderangeofOpenSourcepro-
ducts.WemainlyfocusontheintegrationofOpenSourcewithClosedSourceandthedevelopmentofcombined
OpenSourcesolutionsandplatforms.
DuetotheISO9001certificationit-novumbelongstooneofthefewOpenSourcespecialistswho
canprovethebusinesssuitabilityoftheirsolutions,provenbyinternationalqualitystandards.
More than 15 years of Open Source project experience ▶ OurportfoliocontainsawiderangeofOpenSourcesolutionswithintheapplicationsandinfrastructureareaas
wellasownproductdevelopmentswhicharewell-establishedinthemarket.
▶ As an IT consulting companywith a profound technical know-howwithin the BusinessOpen Source areawe
differentiateourselves from thebig solutionproviders’ standardofferings.Becauseour solutionsarenotonly
scalableandflexiblebutalsointegrateseamlesslyinyourexistingITinfrastructure.
▶ Wecanassemlemultidisciplinaryprojectteams,consistingofengineers,consultantsandbusinessdataproces-
singspecialists.Thuswecombinebusinessknow-howwithtechnologicalexcellencetobuildsustainablebusiness
processes.
▶ Ourtargetistoprovideyouwithahigh-qualitylevelofconsultingduringallprojectphases–fromtheanalysisand
conceptionuptotheimplementationandsupport.
▶ Asadecision-makingbasispriortotheproject’sstartweofferyouaProof-of-Concept.Throughareal-casesimula-
tionandadevelopedprototypeyoucandecideonanewsoftwarewithouttakinganyrisks.Moreover,youbenefit
from:
и Securityandpredictability
и Clearprojectmethodology
и Sensiblecalculation
it-novum GmbH GermanyHeadquartersFulda:EdelzellerStraße44·36043FuldaPhone:+49(0)661103333BranchesinDüsseldorf&Dortmund
it-novum GmbH SwitzerlandHotelstrasse1·8058ZurichPhone:+41(0)445676207
it-novum branch AustriaAusstellungsstraße50/ZugangC·1020ViennaPhone:+4312057741041