TechConnex Big Data Series - Big Data in Banking
TRANSCRIPT
Agenda
- Big Data at the Big 6
- RDARR Data Hubs
- Lessons Learned (so far)
- Technology Themes in 2016

An important note about this presentation: in order to respect the commercial interests and privacy of my clients, I have refrained from using specific company names unless the information is publicly available.
Big Data at the Big 6
RDARR Drives Big 6 Adoption
- RDARR is a mandatory regulatory project:
  - Regulatory response to the 2008 credit crisis
  - Requires a re-build of data gathering and regulatory reporting to implement measurable data quality, operational metadata and auditable data lineage
  - Regulatory enforcement starts in 2017
- Big 6 IT spend of ~$800MM over three years on RDARR:
  - Combined Big 6 IT spend on all Risk Systems projects is ~$400MM per year
  - RDARR spend has largely been incremental: other regulatory initiatives have continued to drive project spend separate from RDARR
- Hadoop data hub is a typical RDARR solution element

The investment spend by G-SIBs on RDARR is very significant, averaging US$230MM per bank. These investment costs are likely to increase.
Oliver Wyman, “BCBS 239: Learning from the Prime Movers”

All of Canada’s Big 6 banks were designated as Domestic Systemically Important Banks (D-SIBs) by OSFI, meaning they must fully comply with BCBS 239.
Big 6 Hadoop Risk Applications
- Many projects are underway, but relatively few are in production:
  - Plans for enhanced model building and analytics for retail banking following the 2016 RDARR deadline
  - Capital Markets has been the leading driver of Hadoop adoption for compute applications
- Risk Systems teams have started building Hadoop-based applications:
  - Volcker Rule Compliance Metrics (e.g. RENTD)
  - Portfolio Stress Testing
  - Market Risk VaR History
  - On-Demand Risk
- Trading Floor Risk Managers have installed stand-alone Hadoop instances:
  - Often cloud-based, used in specialized analysis of derivative sensitivities or historical market data
Importing US Risk Applications
- Expect to see more risk applications pioneered by leading US banks:
  - Trading Strategy Back Testing
  - Granular Capital, CVA and Market Risk Trending
  - Capital Markets Dealer Compliance
  - Credit Adjudication Models
  - Behavioral Models (often for Collections)
  - Fast-time Transactional Fraud Detection
  - AML
  - Commercial Credit Network Analysis
Big 6 Vendor Alignments
- Banks have each chosen a strategic Hadoop vendor:
  - TD, CIBC and NB use Cloudera
  - RBC and BNS use Hortonworks
  - BMO uses Pivotal (Hortonworks)
- “Land grab” among vendors:
  - Multi-year subscription deals at large discounts to lock in customers
- IBM struggling for share despite entrenched starting position:
  - Lack of SAS support was a showstopper
Forrester Wave, Q1 2014
Deployment Patterns
- Mix of virtual and physical server deployments:
  - Cisco UCS and VMware vSphere are leading infrastructure choices
- Many banks report using multiple grids aligned to business units*:
  - Tools to manage multi-tenancy on Hadoop are still nascent
  - Organizational issues (cost allocation, support team alignments) inhibit shared deployments
- Vendor community has invested heavily in cloud deployment tools:
  - One-click deployments of all major Hadoop distributions are available on public clouds
- Banks looking at “hub and sandbox” deployments on private clouds:
  - Popular pattern in established US deployments
  - Big 6 have all built an internal private cloud or have access to one through a major infrastructure provider
  - Notable S3/AWS deployment by US regulator FINRA sets the standard
* Hortonworks CAB
RDARR Data Hubs
Typical RDARR Data Hub
- RDARR focus drives Data Hub solution characteristics:
  - RDARR objective is auditable batch reporting, tied into central lineage and metadata solutions
  - Little consideration of unstructured or real-time data sources
  - Often characterized as a raw-data landing zone for otherwise inaccessible mainframe data
  - Resistance to fully adopting Hadoop as a data hub; often paired with legacy database hubs
- Retail data focus drives emphasis on security:
  - PIPEDA/GLBA compliance deemed critical despite little to no use of PII/PCI data in reports
  - SOX compliance mandatory
- Architecture teams hold the dominant view in data hub projects:
  - Business sponsor is often a newly established Data Management Office
  - Focus on cost and process optimization of data flows to downstream reporting solutions
- Internal build: low to no adoption of commercial hub solutions
RDARR Data Hub Challenges
- Hadoop Data Governance is early stage and poorly integrated:
  - No good Hadoop solution to data governance (yet)
  - Data lineage is at the file level in Hadoop, which is not suitable for RDARR critical data element traceability
  - Policy-based data access solutions still in development (e.g. Navigator, Atlas)
- Enterprise ETL tools not Hadoop enabled:
  - Many tools unable to push transformation work to Hadoop (or only as rudimentary Hive SQL)
  - Performance of established ETL tools often poor on Hadoop
- Early mover penalty: Hadoop 2.x included solutions to many early security and operational problems “in the box”:
  - Projects with 2013 start dates were based on Hadoop 1.x, and so are usually Cloudera-based
  - Established US banking shops are usually on Cloudera or MapR implementations for the same reason
Leaving Business Value on the Table
- Rudimentary governance and security tools produce a bias against self-serve access to data:
  - Transfers modelling and analytic users’ frustrations with existing data warehouse solutions to a new platform
  - PII/PCI data control solutions can prevent deployment of analytical tools
- Design for static regulatory reporting objectives ignores high-value interactive exploration and discovery uses:
  - Standardized reporting schemas (such as IBM BDW) have limited value to risk modelers and analysts
- Focus on meeting operational SLAs over sharing of grids

Banks are struggling to understand the concrete business impact associated with BCBS 239; nearly 70 percent of domestic systemically important banks (D-SIBs) and half of G-SIBs have not quantified the benefits.
Oliver Wyman, “BCBS 239: Learning from the Prime Movers”
Lessons Learned (so far)
Choosing a Hadoop Distribution
- Maximize your exposure to change:
  - Hadoop moves at a very fast pace: expect to deploy a meaningful update every 3-6 months
  - Avoid designs and products that try to encapsulate Hadoop: they fall behind faster than you can recover your investment
- Legacy tool compatibility is important:
  - SAS compatibility is critical (even though SAS doesn’t integrate well with Hadoop)
  - Does your organization have DB2 or PL/SQL skills to preserve?
- It’s not as easy to switch distributions as you think
- Wait for the features you like to become free:
  - Strong history of the open-source distribution incorporating features that were previously proprietary; newer vendors attack incumbents by producing open-source replacements for proprietary extensions
Data Engineering
- Risk modelling is often very inefficient:
  - A quantitative modeler typically spends 80% of their time gathering and preparing data
  - Specialized data preparation is often difficult to repeat in production environments
- Data Engineering accelerates quantitative modelling:
  - Advanced research labs hire data engineers to support their quantitative modelers
  - Data Engineers are a hybrid of computer programmer and mathematician: they use IT-friendly tools to source and package data into forms that are tailored to the modeler’s toolset (e.g. building or smoothing a time series)
  - Marketing teams use a 1:5 ratio of modelers to data engineers, but 10:1 is common on the “buy side” and so is a better staffing target for a bank
- Data hubs should target data engineers as users:
  - Build sophisticated tools for expert consumers, rather than rudimentary tools for casual users
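As an illustration of the kind of packaging work a data engineer does for a modeler, here is a minimal sketch of smoothing a time series. The slide does not name a smoothing method; a trailing simple moving average is assumed here, and the function name `moving_average` is hypothetical.

```python
from collections import deque
from typing import Iterable, List, Optional


def moving_average(values: Iterable[float], window: int) -> List[Optional[float]]:
    """Trailing simple moving average.

    Returns None for positions where fewer than `window` observations
    are available, so the output lines up one-to-one with the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque(maxlen=window)  # holds the most recent `window` values
    running = 0.0               # sum of the values currently in buf
    smoothed: List[Optional[float]] = []
    for value in values:
        if len(buf) == window:
            running -= buf[0]   # oldest value is about to be evicted
        buf.append(value)
        running += value
        smoothed.append(running / window if len(buf) == window else None)
    return smoothed
```

Keeping the output aligned with the input means the smoothed series can be joined back to its dates without extra bookkeeping, which is the sort of detail that makes the preparation repeatable in production.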
Developer Lessons Learned
- Productivity and performance improve with native Hadoop tools:
  - The “Hadoop edition” of most legacy ETL packages performs slowly and is poorly integrated with Hadoop; you are usually just buying an HDFS adapter
- Learn the native tools; it’s easier than you think:
  - A Java programmer can learn Map/Reduce in a week
  - Most end-users already know how to use SQL and Python
- Use Pig to tune your SQL queries:
  - The best optimization for Hive SQL is often to structure data on ingestion in a Hadoop-friendly way
- You will find lots of small bugs in Hadoop:
  - Your Hadoop vendor’s support team is a critical resource for your success
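One common reading of "structure data on ingestion in a Hadoop-friendly way" is to land records in partition directories that Hive can prune at query time. The sketch below only plans that layout using Hive's `column=value` directory naming; the base path and the `business_date` and `desk` columns are hypothetical, and the slide does not say which layout the author had in mind.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_dir(base: str, business_date: str, desk: str) -> str:
    """Hive-style partition directory: one 'column=value' level per key."""
    return f"{base}/business_date={business_date}/desk={desk}"


def plan_ingestion(base: str, rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group incoming rows by the partition directory they should land in."""
    plan: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        target = partition_dir(base, row["business_date"], row["desk"])
        plan[target].append(row)
    return dict(plan)
```

A query that filters on `business_date` then reads only the matching directories instead of scanning the whole table.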
Risk Architecture Insights
- Hadoop is a compute grid:
  - YARN is functionally equivalent to DataSynapse or Platform Symphony
- You can wrap most computations using map/reduce:
  - Writing a map/reduce wrapper to feed data to your C#, Java, C++, or Python applications is surprisingly easy; a hundred lines of code usually does it
- Use Hadoop to bring the computation to the data:
  - Re-process your data files into computationally efficient HDFS blocks
  - Eliminating movement of data in a compute-centric risk application improves performance dramatically
  - Still need caching of intermediate valuation products (e.g. zero curves)
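A map/reduce wrapper of the kind described above can be sketched in the Hadoop Streaming style, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. This is an illustration, not the author's code: `price_trade` is a hypothetical stand-in for an existing valuation routine, and the `portfolio,notional,rate` record layout is assumed.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def price_trade(notional: float, rate: float) -> float:
    """Hypothetical placeholder for an existing pricing function."""
    return notional * rate


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Parse 'portfolio,notional,rate' records; emit 'portfolio<TAB>value'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        portfolio, notional, rate = line.split(",")
        yield f"{portfolio}\t{price_trade(float(notional), float(rate))}"


def reducer(lines: Iterable[str]) -> Iterator[str]:
    """Sum values per key. Input must arrive sorted by key, which is how
    Hadoop Streaming delivers mapper output to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(value) for _, value in group)
        yield f"{key}\t{total}"


# Run as "script.py map" or "script.py reduce"; does nothing without an argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

The same two functions can be exercised locally by sorting the mapper output before passing it to the reducer, which mimics the shuffle step without a cluster.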
Infrastructure Lessons Learned
- Pay attention to the network:
  - Hadoop needs a fast network backbone between nodes
  - Applications and databases that draw data from Hadoop (e.g. Tableau) should be co-located
- Hadoop grids should cost less than $1,000/TB:
  - Including hardware and support subscription for a major Hadoop distribution
  - Hadoop reference configurations are based on mid-price commodity hardware, so use that
  - Virtualization will provide cheaper infrastructure, but higher node counts offset savings by driving up support subscription costs

Storage Costs (per TB)
Hadoop      $1,000
SAN         $5,000
Database    $12,000
Source: InformationWeek, 07/27/2012
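The cost argument above can be checked with simple arithmetic. In this sketch the per-TB rates come from the Storage Costs table, while the hardware totals, node counts and per-node subscription price are hypothetical numbers chosen only to show how a higher node count can erase a hardware saving. It also assumes support is priced per node.

```python
# Per-TB figures from the Storage Costs table (InformationWeek, 2012).
COST_PER_TB = {"Hadoop": 1_000, "SAN": 5_000, "Database": 12_000}


def storage_cost(terabytes: float) -> dict:
    """Cost of holding `terabytes` on each storage tier at the quoted rates."""
    return {tier: terabytes * rate for tier, rate in COST_PER_TB.items()}


def grid_cost_per_tb(hardware_cost: float, node_count: int,
                     subscription_per_node: float, usable_tb: float) -> float:
    """All-in cost per usable TB: hardware plus per-node support subscription."""
    return (hardware_cost + node_count * subscription_per_node) / usable_tb


# Hypothetical inputs: the virtualized grid has cheaper hardware but twice the nodes.
physical = grid_cost_per_tb(120_000, 20, 4_000, 200)
virtual = grid_cost_per_tb(90_000, 40, 4_000, 200)
```

With these made-up inputs the physical grid meets the $1,000/TB target and the virtualized grid does not, even though its hardware is cheaper.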
Infrastructure Lessons Learned
- Don’t try to prevent infrastructure failure:
  - Hadoop is very fault tolerant: it is designed to handle an annual equipment failure rate of 8%
  - Do not use fault-tolerant hardware; use JBOD instead of RAID arrays
  - A well-designed Hadoop grid will keep running for the 24 hours it takes your hardware vendor to replace a broken machine under a normal support contract
- The best back-up for Hadoop is Hadoop:
  - Hadoop is the cheapest form of on-line storage available, and is cost-competitive with and more reliable than tape
  - Replicate your Hadoop grid to a second grid at a different site for a high-grade disaster recovery solution
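To put the failure figures above in perspective, the following sketch works through the arithmetic. The 8% annual failure rate and 24-hour replacement window come from the slide. Independent node failures and a replication factor of 3 (the HDFS default) are assumptions, and the function names are hypothetical.

```python
ANNUAL_FAILURE_RATE = 0.08   # from the slide
REPLACEMENT_HOURS = 24       # from the slide
HOURS_PER_YEAR = 365 * 24


def expected_failures_per_year(node_count: int,
                               rate: float = ANNUAL_FAILURE_RATE) -> float:
    """Average number of node failures to expect in a year."""
    return node_count * rate


def expected_nodes_down(node_count: int,
                        rate: float = ANNUAL_FAILURE_RATE,
                        repair_hours: float = REPLACEMENT_HOURS) -> float:
    """Average number of nodes awaiting replacement at any given moment,
    assuming failures are independent and spread evenly over the year."""
    return node_count * rate * repair_hours / HOURS_PER_YEAR


def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Usable space when every block is stored `replication` times on plain disks."""
    return raw_tb / replication
```

For a 100-node grid this gives about eight failures a year and, on average, far less than one node out of service at a time, which is why block replication rather than RAID is expected to carry the redundancy.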
Technology Themes in 2016
Technology Themes for 2016
- Mix-and-match SQL engines:
  - Native Hadoop SQL engines lack many advanced features found in database SQL engines
  - Oracle and IBM are unbundling their Hadoop implementations of PL/SQL and DB2
  - Oracle’s PL/SQL engine for Hadoop runs on Cloudera and could be available on Hortonworks
  - IBM is releasing Big SQL (DB2) for ODP, meaning it won’t be available on Cloudera
- Open Data Platform: FUD or fantastic?
  - Pivotal has used ODP to partner with Hortonworks and focus on their other tools
  - IBM has promised to release all of their data science tools for ODP, but has been slow to deliver
- IBM “all in” on Spark:
  - IBM’s data science tools (e.g. Big R) complement typical Spark use cases (e.g. clustering)
- Tableau displacing Cognos & BOBJ
Data Governance Themes for 2016
- Native Hadoop Data Governance:
  - Hortonworks has partnered with JPMorgan, Merck and Aetna to build an advanced Hadoop data governance solution in the Apache Atlas project
  - Atlas is intended to govern Hadoop data in a federated governance model; partner adoption will drive success
- Federated Data Governance:
  - The Big 6 have all adopted IBM IGC as their enterprise RDARR lineage and metadata solution
  - IBM provides REST APIs to integrate IGC with non-IBM products
  - Will ODP partners Hortonworks and IBM manage to establish Atlas on IGC as the definitive Hadoop solution in a distributed governance model?
Risk Technology Themes for 2016
- Model development on Hadoop:
  - As RDARR data hubs hit critical mass, risk model development will gravitate to Hadoop-based tools
- Notebook workspaces:
  - Increased use of Hadoop modelling environments will drive demand for Notebook environments based on Jupyter and Apache Zeppelin (e.g. IBM Knowledge Anyhow)
- On-Demand Risk on Hadoop:
  - Next generation on-demand risk applications will converge the stand-alone compute grid, data cache and persistence onto the Hadoop stack to eliminate data movement, giving better performance and lower costs
Questions?