Pentaho Data Integration Best Architecture Practices Matt Casters Pentaho Chief Architect of Data Integration, Hitachi Vantara

Post on 21-May-2020


Contents

• Introduction
• General advice
• Specific advice
• Practical examples
• Recap
• Q&A

Introduction: What is “Data Integration Architecture”?

Introduction

• What is “data integration architecture”?
  – High level view on a (potential) DI solution
  – Describes components and their relationships
  – Taking into account all parts
  – Avoiding details without skipping anything

Introduction

• Why do you need an architecture?
  – Solutions get very complex
  – Teams of engineers get large
  – Conscious decisions on use of solution components
  – Holistic views on security, quality, transparency, performance
  – Allows for validation of high level requirements
  – Allows for the creation and validation of scenarios
  – Clearly defines stakeholders

General Advice: Some Pointers in Setting up Solid Architectures for Solid Solutions

General Advice – Don’t Forget the Details…

• Learn the basics of the building blocks…
  – PDI Best Practices #PWorld14
    • Standards, naming, …
  – PDI Best Governance Practices #PWorld15
    • PM, CI, VCS, Testing, …
  – Get expertise for all software components you use

General Advice – Whiteboarding

• Whiteboarding
  – Is done with interested stakeholders
  – Tries to combine knowledge from various parties
  – Allows for quick high level design
  – It is just a starting point!
  – Needs to be followed up and validated against scenarios
  – Forget conviction: be ready to change your mind

General Advice – Scalability

• Parallelize on a high level
  – Aggressive low level parallelization can get you into trouble
• Remember to allow data to flow in swim lanes
  – Parallelization of as much as possible
  – “Sharding” and so on should be architected in
• Identify time window early on, assess HW needs
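
The swim-lane idea can be sketched outside PDI as follows. This is a minimal illustration, not a PDI feature: the function names, the `customer_id` field and the lane count of 4 are all assumptions. It shows a stable hash that sends every record with the same key to the same lane, so each lane can be processed independently.

```python
import zlib
from collections import defaultdict


def lane_for(key: str, lanes: int) -> int:
    """Map a key to a lane with a stable hash (CRC32).

    Python's built-in hash() is randomized per process for strings,
    so a stable hash is used to keep the mapping repeatable across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % lanes


def split_into_lanes(records, key_field, lanes):
    """Group records into swim lanes based on one key field."""
    result = defaultdict(list)
    for record in records:
        result[lane_for(str(record[key_field]), lanes)].append(record)
    return dict(result)


# Hypothetical input: 100 records spread over 10 customers.
records = [{"customer_id": i % 10, "amount": i} for i in range(100)]
shards = split_into_lanes(records, "customer_id", lanes=4)

for lane, rows in sorted(shards.items()):
    print(f"lane {lane}: {len(rows)} rows")
```

Because the key decides the lane, work that needs all rows of one customer (aggregation, deduplication) stays inside a single lane and needs no coordination with the others.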

General Advice – Transparency

• Great complexity requires transparency
  – Something will always go wrong
  – At the worst possible time
• As a rule:
  – Always trace data moving between parts of architecture
  – When in doubt: add more logging, tracking and tracing
• Use components in architecture that allow for monitoring
  – Prefer servers that allow you to see what’s going on
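
One way to trace data moving between parts of an architecture is to log row counts and timing at every hand-over. The sketch below is an assumption-laden illustration in plain Python: `traced`, `drop_nulls` and the logger name `di.trace` are hypothetical and are not part of PDI.

```python
import logging
import time

logger = logging.getLogger("di.trace")  # hypothetical logger name


def traced(stage_name, func):
    """Wrap a stage so that rows in, rows out and duration are logged."""
    def wrapper(rows):
        started = time.perf_counter()
        rows = list(rows)
        result = list(func(rows))
        elapsed = time.perf_counter() - started
        logger.info(
            "stage=%s rows_in=%d rows_out=%d seconds=%.3f",
            stage_name, len(rows), len(result), elapsed,
        )
        return result
    return wrapper


def drop_nulls(rows):
    """Example stage: discard rows without a value."""
    return [row for row in rows if row.get("value") is not None]


logging.basicConfig(level=logging.INFO)
clean = traced("drop_nulls", drop_nulls)
output = clean([{"value": 1}, {"value": None}, {"value": 3}])
```

Comparing `rows_out` of one stage with `rows_in` of the next makes it visible where data went missing between two components.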

General Advice – Predictability

• Enormous workloads, such as batch jobs, put systems under stress
• Batches tend to grow bigger over time, causing more stress
• As a rule:
  – If you can in any way, use micro-batching
  – Chop up 1 large nightly workload into hundreds of small ones throughout the day
• Advantages:
  – More frequent updates
  – Predictable workload
  – Fail early scenario: problems are detected earlier
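
Micro-batching can be illustrated with a small generator that chops one large workload into fixed-size pieces. This is a sketch under assumptions: the names `micro_batches` and `process_all` and the batch size are made up for the example, and a real setup would trigger the batches throughout the day rather than in one loop.

```python
from itertools import islice


def micro_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_all(rows, batch_size, handler):
    """Process rows batch by batch and stop at the first failing batch."""
    processed = 0
    for number, batch in enumerate(micro_batches(rows, batch_size), start=1):
        try:
            handler(batch)
        except Exception as exc:
            # Fail early: report which batch broke and how far we got.
            raise RuntimeError(
                f"batch {number} failed after {processed} rows"
            ) from exc
        processed += len(batch)
    return processed


total = process_all(range(10), batch_size=4, handler=lambda batch: None)
print(total)
```

A failure surfaces in the batch where it occurs, with the rows processed so far already known, instead of at the end of one large nightly run.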

Specific Advice: Advice for IoT and Others

Specific Advice – Hadoop

• Hadoop has itself become an ecosystem of software
• Select the software in the ecosystem to fit your ideal architecture
• Only select properly supported components, avoid bleeding edge
• Combat lack of transparency with extensive logging
• Follow the right sizing for your architecture, balance correctly
• Use it as a scalable part, not just as a “Database”

Specific Advice – IoT

• IoT is Messy
  – Data Quality varying
  – Data Connectivity problems
  – Late arriving data
  – Flash-floods of data (low predictability)
  – High complexity
  – Varying data formats and versions
  – Number of different devices can be high

Hitachi Vantara IoT Offerings

[Diagram. Labels: Connected Things; Operational Insights; Asset Intelligence; Maintenance Optimization; Manufacturing Optimization; Edge; Asset Avatar State; Core; Analytics; Foundry; Data Collection; Asset Management; Asset Avatar; Artificial Intelligence; Batch/Stream/Analytics; Data Blending/Orchestration; Asset Integration; Edge Analytics; Data Filtering; Data Transformation; Dashboard; Alerts/Notifications; Application Enablement]

Specific Advice – IoT

• Plan ahead for failure
• Use modern techniques like Metadata Injection
• Make extensive use of queues in any format
• Assume that things will go wrong in every scenario
• Design the architecture to cope with failures
• Design the architecture to report on statistics
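
The combination of queues, coping with failures and reporting statistics can be sketched with Python's standard `queue` module. Everything specific here is an assumption for illustration: the message layout, the `parse_reading` handler, the limit of three attempts and the dead-letter list are hypothetical, and a real IoT setup would use a message broker rather than an in-process queue.

```python
import queue
from collections import Counter


def consume(inbox, handler, max_attempts=3):
    """Drain a queue, retry failing messages, and keep statistics.

    Items on the queue are (message, attempts_so_far) tuples.
    Messages that still fail on the last attempt go to a dead-letter list.
    """
    stats = Counter()
    dead_letters = []
    while True:
        try:
            message, attempts = inbox.get_nowait()
        except queue.Empty:
            break
        try:
            handler(message)
            stats["ok"] += 1
        except Exception:
            if attempts + 1 < max_attempts:
                inbox.put((message, attempts + 1))
                stats["retried"] += 1
            else:
                dead_letters.append(message)
                stats["dead"] += 1
    return stats, dead_letters


def parse_reading(message):
    """Hypothetical handler: fails on readings that are not numeric."""
    float(message["temperature"])


inbox = queue.Queue()
for reading in [{"temperature": "21.5"}, {"temperature": "n/a"}, {"temperature": "19"}]:
    inbox.put((reading, 0))

stats, dead = consume(inbox, parse_reading)
print(dict(stats), dead)
```

The bad reading is retried, then parked instead of blocking the good ones, and the counters give the numbers to report on.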

Practical Examples: War Stories from the Field

Examples – Large Services Vendor

• Moving large amounts of small data packets around
• Picked the right tools, didn’t pick an overall architecture
• Different teams “working together” in different countries
• Architecture became secondary to the overall solution
• Technology was selected, not architecture

Examples – Large Services Vendor

• Carte servers got hammered thousands of times per second
  – Use of a specific scheduler was mandated
  – Running out of sockets, HTTP server buckling under the load
• Complaints about PDI startup times
• Overall performance too low
• Services called in to solve “critical” issues in our software

Examples – Large Services Vendor

• Don’t allow internal organizational needs to drive the architecture
• Don’t allow technology choices to drive architecture
  – And if you do, handle the implications
• To scale and ramp up performance, always queue and intelligently handle queued tasks (not one at a time, for example)
• The performance of the whole is determined by the slowest link
  – Consider this up-front in the architecture
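
Handling queued tasks in groups rather than one at a time can be sketched as follows. This is an illustration only: `drain_batch`, `run` and the batch size of 500 are assumptions, and the handler stands in for whatever executes the work (for example one job run that processes many small packets instead of one run per packet).

```python
import queue


def drain_batch(tasks, max_items):
    """Take up to max_items tasks off the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(tasks.get_nowait())
        except queue.Empty:
            break
    return batch


def run(tasks, handle_batch, max_items=500):
    """Process the queue in batches; return the number of handler calls."""
    calls = 0
    while True:
        batch = drain_batch(tasks, max_items)
        if not batch:
            return calls
        handle_batch(batch)
        calls += 1


tasks = queue.Queue()
for packet_id in range(1200):
    tasks.put(packet_id)

sizes = []
calls = run(tasks, lambda batch: sizes.append(len(batch)))
print(calls, sizes)
```

With these assumed numbers, 1200 small tasks cost three handler invocations instead of 1200, which is the difference between paying startup and connection overhead per batch and paying it per packet.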

Examples – Handling TV Set-top Data

• Periodic in nature, handling clicks
• Reading from MQTT, dumping data into Oracle for analysis
• Reported PDI performance trouble, services called in
• Small scale test, predicted ten-fold increase in size, already in trouble

Examples – Handling TV Set-top Data

• MQTT: great for queuing and IoT
• Not always possible to read in parallel from queues!
• Oracle is an RDBMS, kills parallelism in architecture

Examples – Handling TV Set-top Data

• Consider partitioning large numbers of clients
• Consider data extraction for any data storage mechanism

Examples – Big Bank

• Processed a gazillion records every night
• Had a batch window of 2 hours
• Got a monster computer to do the job with 64 cores
• Ran complex data quality validations in PDI, hundreds of steps
• Got into a performance problem
• Needed extensive performance tuning

Examples – Big Bank

Pick 2:
• Good
• Fast
• Cheap

Examples – Big Bank

Pick 2:
• Lots of work
• In batch window
• On 1 server

Examples – Big Bank

• Consider up-front whether HW choices will pin you down later
• Weigh the importance of specific requirements into the architecture
  – time vs complexity vs hardware in this case
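
The time vs complexity vs hardware trade-off can be checked with back-of-the-envelope arithmetic before hardware is bought. The sketch below is hypothetical throughout: the row count, the per-core throughput and the 70% parallel efficiency are invented numbers, not figures from the bank in this example. Only the 2 hour window and the 64 cores come from the slides.

```python
import math


def cores_needed(rows, window_seconds, rows_per_core_per_second, efficiency=0.7):
    """Estimate the cores required to finish rows within the batch window.

    efficiency is an assumed factor for parallel overhead (1.0 = perfect scaling).
    """
    required_rate = rows / window_seconds
    usable_rate_per_core = rows_per_core_per_second * efficiency
    return math.ceil(required_rate / usable_rate_per_core)


# Assumed workload: 2 billion rows, 2 hour window, 5000 rows/s per core
# for a transformation with hundreds of validation steps.
estimate = cores_needed(
    rows=2_000_000_000,
    window_seconds=2 * 60 * 60,
    rows_per_core_per_second=5000,
)
print(estimate, "cores needed; available: 64")
```

With these assumed inputs the estimate exceeds 64 cores, so something has to give: a longer window, simpler validations, or more than one server.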

Recap: PDI Best Architecture Practices

Recap

• Make an architecture up-front, not as part of the documentation
• Be critical
• Be detailed
• Run scenarios against it
• Be ready to change your mind
• Get stakeholders involved
• Use PDI: Pessimistic Data Integration

Questions and Discussion