delivering a 'big data ready' minimum viable product

DELIVERING A 'BIG DATA READY' MVP Gregory Chomatas Dublin Google Developers Group - 2013 July 30th

Upload: gregory-chomatas

Post on 13-Jun-2015


DESCRIPTION

Most talk about big data takes an "a posteriori" view: an organization overwhelmed by huge amounts of log files and numerous data sources scattered among its departments decides to bring some order to the mess and get some value out of its "big data", usually by building a Hadoop cluster. In this presentation I take the opposite direction and try to demonstrate how to proactively design and build product architectures that remain simple and lean while anticipating big data complexities and solving them easily and elegantly from day one.

TRANSCRIPT

Page 1: Delivering a 'Big Data Ready' minimum viable product

DELIVERING A 'BIG DATA READY' MVP

Gregory Chomatas

Dublin Google Developers Group - 2013 July 30th

Page 2: Delivering a 'Big Data Ready' minimum viable product

CHOMATAS GREGORY

SW Engineer

Entrepreneur

Betaconcept / Astroboa: Founder

Aquinetix: Co-founder / CTO

http://linkedin.com/in/gchomatas

t: @gchomatas

http://www.astroboa.org

Page 3: Delivering a 'Big Data Ready' minimum viable product
Page 4: Delivering a 'Big Data Ready' minimum viable product
Page 5: Delivering a 'Big Data Ready' minimum viable product
Page 6: Delivering a 'Big Data Ready' minimum viable product

7 YEARS AGO I REALIZED...

Page 7: Delivering a 'Big Data Ready' minimum viable product

TOO MUCH RDBMS SODA

Page 8: Delivering a 'Big Data Ready' minimum viable product

LOTS OF OBJECT-RELATIONAL MISMATCH

Page 9: Delivering a 'Big Data Ready' minimum viable product

DB IS NOT THE CENTER OF MY APPLICATION

Domain Driven Design / Behaviour Driven Design

vs

Database Driven Design

Page 10: Delivering a 'Big Data Ready' minimum viable product

AT THAT TIME NOT MANY ALTERNATIVES EXISTED

so we decided to roll our own data store solution...

Page 11: Delivering a 'Big Data Ready' minimum viable product

ASTROBOA TO THE RESCUE

Hybrid Document-Graph Store focused on data semantics

Similar to Google Datastore & OrientDB

External 'app independent' Semantic Data Model
Model as you go
Security per Entity instance / property
Versioned Entities
Automated REST APIs encapsulating the data layer
Hyperlinked Resources
Polyglot Persistence (Experimental)*

*Not available in the public version

Page 12: Delivering a 'Big Data Ready' minimum viable product

THE 'BIGNESS' IN BIG DATA

Two main paths to the realization of 'BIGNESS'

Luckily both paths converge to common principles & tools that can manage BIG Complexity & BIG Volume

Page 13: Delivering a 'Big Data Ready' minimum viable product

BIG DATA ENLIGHTENMENT

Page 14: Delivering a 'Big Data Ready' minimum viable product

BIG 'DATA PROBLEMS' (COMPLEXITY)

single point of failure / resilience
cross data center
human fault tolerance
store / search unstructured or semi-structured data
flexible data modeling (e.g. traverse relationships)
data versioning
polyglot programming
multitenancy
share / data as a service
semantic web / multiple formats - endpoints

Flexible Options / Ease of operations

Page 15: Delivering a 'Big Data Ready' minimum viable product

'BIG DATA' PROBLEMS (VOLUME)

high volume
high velocity
real-time APIs / act in real time
data as others service / dirty data from open sources
log collection / aggregation

LINEAR / HORIZONTAL SCALING

Page 16: Delivering a 'Big Data Ready' minimum viable product

I AM NOT A BIG DATA START-UP!

Start-up = Growth (5%-10%) / week
1000 writes per aquaculture farm per day
120 farms on public beta = 120000 writes/day
1st month: 176 farms = 176000 writes/day
6th month: 1181 farms = 1.2M writes/day
1st year: 17045 farms = 17M writes/day (200/sec)
2nd year: 2421143 farms = 2.4B writes/day (27777/sec)

ARE YOU SURE? A successful SaaS is a big data service
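
The growth figures on this slide look consistent with compounding roughly 10% per week from 120 farms, counting a month as 4 weeks. That reading is an inference from the numbers, not something the slide states. A minimal sketch of the arithmetic under those assumptions (function names and defaults are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def farms_after(weeks, start_farms=120, weekly_growth=0.10):
    """Number of farms after compounding weekly growth.

    The 10% rate and the 120-farm starting point are assumptions
    chosen to match the slide's figures.
    """
    return start_farms * (1 + weekly_growth) ** weeks


def writes_per_day(farms, writes_per_farm=1000):
    """Daily write volume, assuming 1000 writes per farm per day."""
    return farms * writes_per_farm


if __name__ == "__main__":
    milestones = [("1st month", 4), ("6th month", 24),
                  ("1st year", 52), ("2nd year", 104)]
    for label, weeks in milestones:
        farms = farms_after(weeks)
        daily = writes_per_day(farms)
        print(f"{label}: {farms:,.0f} farms, {daily:,.0f} writes/day, "
              f"{daily / SECONDS_PER_DAY:,.0f} writes/sec")
```

Small differences from the slide come from rounding, for example 2.4B writes/day was rounded before dividing by the seconds in a day.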

Page 17: Delivering a 'Big Data Ready' minimum viable product

IT'S JUST AN MVP - WE WILL ADD ALL THIS BIG DATA STUFF LATER

A Big Data architecture can be simpler than a traditional one
The right data store can increase productivity
Keep it simple but do not compromise the architectural concepts
Balance between technical debt & technical equity
An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness

"In business there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham

Page 18: Delivering a 'Big Data Ready' minimum viable product

KEY BIG DATA ARCHITECTURE FEATURES

Distributed Storage

APPLICATION database vs INTEGRATION database
Mix several data models / polyglot persistence
External Data Schema / Common Data Structures
Data Store encapsulated by an API (Data Services)
Append only / save changes vs state (event sourcing)
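
The last point, saving changes rather than state, can be illustrated with a small in-memory sketch. The event kinds, the `EventLog` class and the cage identifier are hypothetical and only loosely echo the aquaculture domain used elsewhere in the talk. Nothing is ever updated in place, and current state is derived by replaying the log.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """An immutable change record (hypothetical shape)."""
    cage: str
    kind: str       # "stocked" or "loss"
    quantity: int


class EventLog:
    """Append-only log: events can be added and read, never modified."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Tuple[Event, ...]:
        return tuple(self._events)


def fish_count(log: EventLog, cage: str) -> int:
    """Derive the current state of one cage by folding over all its events."""
    total = 0
    for event in log.replay():
        if event.cage != cage:
            continue
        if event.kind == "stocked":
            total += event.quantity
        elif event.kind == "loss":
            total -= event.quantity
    return total
```

Because the log keeps every change, a bug in `fish_count` can be fixed and the state recomputed, which is one way to get the human fault tolerance mentioned earlier.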

Page 19: Delivering a 'Big Data Ready' minimum viable product

KEY BIG DATA ARCHITECTURE FEATURES

Distributed Computing

Asynchronous processing
Real Time Event Processing / Streaming
Simple decoupled services exposed through REST or RPC APIs (business services)
Thick web clients / mob. apps using the REST or Streaming APIs
Client-level multivariate data analysis & complex visualization

Page 20: Delivering a 'Big Data Ready' minimum viable product

THE LAMBDA ARCHITECTURE

by Nathan Marz and James Warren

store raw, immutable, perpetual data

query = function(all data)

combine batch & realtime stream processing to compute arbitrary functions on arbitrary data
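
A toy illustration of "query = function(all data)": a batch view is recomputed from the full master dataset, a realtime view covers only the records that arrived since the last batch run, and a query merges the two. The record shape and function names are assumptions made for this sketch and are not taken from the Lambda Architecture book.

```python
from collections import Counter


def batch_view(master_dataset):
    """Recompute writes per farm from all raw, immutable records."""
    return Counter(record["farm"] for record in master_dataset)


def realtime_view(recent_records):
    """Same computation, restricted to records newer than the last batch run."""
    return Counter(record["farm"] for record in recent_records)


def query(farm, batch, realtime):
    """Answer a query by merging the batch and realtime views."""
    return batch[farm] + realtime[farm]
```

In a real deployment the batch view would come from a batch job over distributed storage and the realtime view from a stream processor. Here both are plain in-memory counters.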

Page 21: Delivering a 'Big Data Ready' minimum viable product

THE LAMBDA ARCHITECTURE

Page 22: Delivering a 'Big Data Ready' minimum viable product

ULTIMATE DESIGN RULE

KEEP it SIMPLE

Page 23: Delivering a 'Big Data Ready' minimum viable product

THE CONVENTIONAL ARCHITECTURE

new data store criteria

Distributed
auto-shard
peer-to-peer
Easy to change schema & queries
Simple to install, configure, operate
one component
Minimize impedance mismatch
Boost productivity

Page 24: Delivering a 'Big Data Ready' minimum viable product

DIRECTLY STORE MY AGGREGATES

{
  "date": "2013-02-28",
  "allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
  "cage": "1GuuHWTaQc-kpPcRV5uBGA",
  "feed": "7IWmy2FATcS9Vh0RB1onXQ",
  "quantity_approved": 12.5,
  "farm": "__uBZUr3RWOqOSkszfbRLw",
  "species": "KDU-2LCjRRynby9HLifc3g",
  "batch": "i6MgxixnSCGwGWb0037wlQ",
  "execution": {
    "feeder": "swp4jhi4Tm6VxY1nueX2yw",
    "quantity_fed": 12.5,
    "species_position_start": "top",
    "species_position_end": "middle",
    "start": "2013-02-28T07:59:57.668Z",
    "end": "2013-02-28T08:00:03.216Z",
    "feeder_position_end": {
      "lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
      "altitude": 40,
      "accuracy": 12
    }
  }
}

Page 25: Delivering a 'Big Data Ready' minimum viable product

THE CANDIDATES

Key-Value       Document           Column       Graph
Riak            MongoDB            Cassandra    Neo4J
Redis           CouchBase          HBase        InfiniteGraph
Pr. Voldemort   OrientDB           Hypertable   OrientDB
MemcacheDB      ElasticSearch      Accumulo     Titan
DynamoDB        Google Datastore   SimpleDB     Virtuoso

Page 26: Delivering a 'Big Data Ready' minimum viable product

MY COOL DATASTORE TIP

elasticsearch document store

No other NoSQL store comes close to the out of the box utility and usability of ElasticSearch

schemaless, multitenant, replicating & sharding document store that implements extensible & advanced search features (geospatial, faceting, filtering, etc.)

REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities

REST Search API with full text search out of the box

MULTI-TENANT friendly with REST API for creating / updating DBs & entity types

Dynamic / Semi-Dynamic / Fixed schema

Page 27: Delivering a 'Big Data Ready' minimum viable product

ELASTICSEARCH POWER

index over 95GB/h/node

8-node cluster: sub-200ms response for complex searches on 10B+ records

(oracle OR mysql) AND replication
apple AND ip*d
john AND city:Dublin
species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
taxicub AND ("Dublin"^2 OR "Cork")

"facets": {"locations": {"terms": {"field": "city"}}}

"terms": [
  {"term": "Dublin", "count": 130},
  {"term": "Cork", "count": 20},
  {"term": "Galway", "count": 1}
]

Page 28: Delivering a 'Big Data Ready' minimum viable product

FACETED BROWSING
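
To make clear what a terms facet returns, here is a local emulation in plain Python. It does not call ElasticSearch. It only reproduces the shape of the "terms" result shown earlier, a count per distinct value of a field, most frequent first. The function name and the sample documents are illustrative.

```python
from collections import Counter


def terms_facet(documents, field):
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"term": term, "count": count}
            for term, count in counts.most_common()]


if __name__ == "__main__":
    docs = [{"city": "Dublin"}, {"city": "Cork"}, {"city": "Dublin"}]
    print(terms_facet(docs, "city"))
```

A faceted browsing UI typically renders each returned term as a clickable filter with its count next to it.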

Page 29: Delivering a 'Big Data Ready' minimum viable product

HISTOGRAMS / GEO DISTANCE

"facets": {
  "Feed_Histogram": {
    "date_histogram": {
      "key_field": "date",
      "value_field": "execution.quantity_fed",
      "interval": "month"
    }
  }
}

"filter": {
  "geo_distance_range": {
    "from": "200km",
    "to": "400km",
    "pin.location": {"lat": 40, "lon": -70}
  }
}

"filter": {
  "geo_polygon": {
    "person.location": {
      "points": [
        {"lat": 40, "lon": -70},
        {"lat": 30, "lon": -80},
        {"lat": 20, "lon": -90}
      ]
    }
  }
}

"filter": {
  "geo_distance": {
    "distance": "200km",
    "pin.location": {"lat": 40, "lon": -70}
  }
}

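Conceptually, a geo_distance filter keeps the documents whose location lies within a given radius of a point. The sketch below emulates that locally with the haversine formula. It is not how ElasticSearch implements the filter, and the field name, function names and spherical-Earth radius are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_distance_filter(documents, center, max_km, field="location"):
    """Keep documents whose `field` is within `max_km` of `center`."""
    return [
        doc for doc in documents
        if haversine_km(center["lat"], center["lon"],
                        doc[field]["lat"], doc[field]["lon"]) <= max_km
    ]
```

A geo_distance_range equivalent would add a lower bound to the same comparison.
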
Page 30: Delivering a 'Big Data Ready' minimum viable product

RDBMS OUT - DOCUMENT STORE IN

Page 31: Delivering a 'Big Data Ready' minimum viable product

WHAT ABOUT MY RELATIONS

Page 32: Delivering a 'Big Data Ready' minimum viable product

LET'S GO POLYGLOT

Page 33: Delivering a 'Big Data Ready' minimum viable product

THE TITAN GRAPH DB

Distributed
Pluggable storage (Cassandra, HBase, BerkeleyDB)
Indexing with ElasticSearch & Lucene
Blueprints Interface
Gremlin Query Language
Rexster Server adds JSON-based REST interface

Page 34: Delivering a 'Big Data Ready' minimum viable product

EASY GRAPH TRAVERSAL WITH GREMLIN

// calculate basic collaborative filtering for user 'Gregory'

m = [:]

g.v('name','Gregory').out('likes').in('likes').out('likes').groupCount(m)

m.sort{-it.value}
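
For readers unfamiliar with Gremlin, the same three-hop traversal can be emulated in Python over a plain dictionary. This is only a sketch of the logic. The `likes` mapping is a hypothetical in-memory stand-in for the graph, and no Titan or Blueprints API is involved.

```python
from collections import Counter


def collaborative_filtering(likes, user):
    """Rank items by how often they are reached via user -> item -> user -> item.

    `likes` maps each user to the list of items they like.
    """
    counts = Counter()
    for item in likes.get(user, []):                  # out('likes')
        for other_items in likes.values():            # in('likes')
            if item in other_items:
                for candidate in other_items:         # out('likes')
                    counts[candidate] += 1            # groupCount(m)
    return counts.most_common()                       # m.sort{-it.value}
```

Like the Gremlin one-liner, this version does not exclude the starting user or the items that user already likes, so a real recommender would filter those out afterwards.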

Page 35: Delivering a 'Big Data Ready' minimum viable product

START ON A SINGLE MACHINE

Page 36: Delivering a 'Big Data Ready' minimum viable product

DATASTORE SELECTION TIPS (1)

Use polyglot persistence with multiple data models
Start with a Document Store as your system of record
Mix it with a key-value Store for keeping sessions, shopping cart, user prefs, counters, caching
Mix it with a Graph store to keep and traverse entity relationships
Use a Column Store as your system of record if you need performance rather than flexibility and you know your data model & queries well
Keep a relational db for queries on transient data (reporting on inter-aggregate relationships)

Page 37: Delivering a 'Big Data Ready' minimum viable product

DATASTORE SELECTION TIPS (2)

Prefer one-component stores rather than many moving parts
Choose a store that makes it easy to experiment with schema and query changes & supports easy data migrations
Prefer stores that can work with both dynamic & fixed schemas (there is always an implicit schema)
In early prototypes avoid Column stores as they have a high cost on schema and query changes

Page 38: Delivering a 'Big Data Ready' minimum viable product

DATASTORE SELECTION TIPS (3)

Choose stores that support auto-sharding
Prefer peer-to-peer replication rather than master-slave
Replication factor N=3 is a good standard choice
Consistency Adjustment
Quorum: W > N/2, W + R > N
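
The quorum rule can be checked mechanically. W > N/2 means two conflicting writes cannot both reach a majority of replicas, and W + R > N means every read set overlaps the most recent write set. A small sketch with illustrative function names that enumerates the (W, R) pairs satisfying both conditions:

```python
def satisfies_quorum(n, w, r):
    """True if write quorum w and read quorum r meet both conditions for n replicas."""
    return w > n / 2 and w + r > n


def quorum_choices(n):
    """All (w, r) pairs that satisfy the quorum conditions for n replicas."""
    return [(w, r)
            for w in range(1, n + 1)
            for r in range(1, n + 1)
            if satisfies_quorum(n, w, r)]


if __name__ == "__main__":
    print(quorum_choices(3))
```

For N=3 the common balanced choice is W=2, R=2, which tolerates one unavailable replica for both reads and writes.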

Page 39: Delivering a 'Big Data Ready' minimum viable product

ALL THAT SAID...

APP CONTEXT is always the determining factor for selecting your store

as well as...

Safety / Stability
Productivity
Community
Performance
Tooling / Ease of operation

Page 40: Delivering a 'Big Data Ready' minimum viable product

DATA MODELING TIPS

Remember that you fit your model to the data store and not Vice Versa (APPLICATION vs INTEGRATION DB)
Use a Schema
Build your aggregates or column families according to your use cases, i.e. DENORMALIZE per your query requirements
Aggregates form the boundaries for ACID operations (transactions)
Pre-compute Question Focused Datasets (materialized views) to provide data organized differently from their primary aggregates

Page 41: Delivering a 'Big Data Ready' minimum viable product

ARE WE FINISHED YET? NOT QUITE!

Do something with our monolithic app

Page 42: Delivering a 'Big Data Ready' minimum viable product

SPLIT THE MONOLITHIC APPLICATION

Wrap data stores into DATA SERVICES
Create BUSINESS SERVICES on top of Data Services
Prefer RESTful APIs for services (ROA)
Use a Binary Serialization Framework to create RPC APIs if performance is a concern (ROA / SOA)
Move MVC* to fat mobile / web client apps that consume the APIs

JavaScript in the browser is one of the world's most widely distributed execution environments & Deployment is trivial!

Page 43: Delivering a 'Big Data Ready' minimum viable product

DECOUPLED SERVICES

FAT CLIENT

SINGLE PAGE APP

Page 44: Delivering a 'Big Data Ready' minimum viable product

API FRAMEWORK / DSL

class API < Grape::API
  version 'v1', :using => :header, :vendor => 'aquinetix.com'
  default_format :json
  content_type :json, "application/json"
  content_type :tsv, "text/tab-separated-values"
  formatter :tsv, Aquinetix::TsvFormatter
  content_type :kml, "text/xml"
  formatter :kml, Aquinetix::KmlFormatter
  mount CageAPI
  mount CageEventsAPI
  mount DeviceAPI
  mount FeedAPI
  mount FeedingAPI
  mount LossCountEventAPI
  mount OxygenSamplingEventAPI
  mount SigninAPI
  mount TemperatureSamplingEventAPI
  mount UserAPI
  add_swagger_documentation markdown: true, base_path: "http://..."
end

Page 45: Delivering a 'Big Data Ready' minimum viable product

API FRAMEWORK / DSL

class FeedingAPI < Grape::API
  resource :feedings do
    desc 'Create a new feeding'
    post do
      execute_farm_obj_create_request 'Feeding'
    end

    desc 'Perform a FULL or PARTIAL update of an existing feeding'
    params do
      requires :id, type: String, desc: "The id (UUID) of..."
      optional :fields, type: String, desc: "Which fields..."
    end
    put '/:id' do
      execute_farm_obj_update_request 'Feeding'
    end

    desc 'Get a feeding by its id (UUID)'
    params do
      requires :id, :type => String, :desc => "Feeding id."
    end
    get '/:id' do
      execute_farm_obj_instance_get_request 'Feeding'
    end
  end
end

Page 46: Delivering a 'Big Data Ready' minimum viable product

SWAGGER UI

Page 47: Delivering a 'Big Data Ready' minimum viable product

MVC* AT THE CLIENT

Mobile app with backbone.js & phonegap
Management / BI Console with AngularJS
Visualization with D3.js
Multivariate Dataset Analysis at the browser with crossfilter.js
App workflow & build with yeoman, grunt, bower

*MVP, MVVM, MVC, MVW

Page 48: Delivering a 'Big Data Ready' minimum viable product

ASYNCHRONOUS / REAL TIME PROCESSING & STREAMING API

RabbitMQ + RabbitMQ Web-Stomp Plugin at the server

SockJS, Stomp.js libs at the client

Real-time event stream processing with ESPER

Alternative message brokers:
node.js + zeromq
kestrel
pusher
kafka (>100k msg/sec)

Alternative Real-time stream processing: Storm

Page 49: Delivering a 'Big Data Ready' minimum viable product

USE CASES

count ratings, votes, click-throughs
block abusive crawlers
rate-limit apis
detect spamming attempts
track performance and trigger alerts
batch process logs
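
As an example of one of these use cases, rate-limiting an API can be done by counting each client's recent events in a sliding time window. The class below is a hypothetical single-process sketch. In the architecture described here the events would arrive through the message broker rather than direct method calls.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within the last `window_sec` seconds."""

    def __init__(self, max_requests, window_sec):
        self.max_requests = max_requests
        self.window_sec = window_sec
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id, now):
        """Return True and record the request if the client is under its limit."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_sec:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Passing `now` explicitly keeps the sketch deterministic, whereas a production limiter would read a clock and would probably keep its counters in a shared key-value store.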

Page 50: Delivering a 'Big Data Ready' minimum viable product

SUBSCRIBE TO STOMP TOPICS FROM JS

ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
@client = Stomp.over(ws)
@client.connect('aquinetix', 'password', ((x) => @on_connect(x)), @on_error, "/")

on_connect: (x) ->
  console.log "Connected to message broker"
  @client.subscribe '/topic/feeding', (message) =>
    feeding = JSON.parse(message.body)
    Aq_Manager.events.trigger 'feeding_execution:arrived', feeding

  @client.subscribe '/topic/position', (message) =>
    position = JSON.parse(message.body)
    Aq_Manager.events.trigger 'worker_position:arrived', position

@client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))

Page 51: Delivering a 'Big Data Ready' minimum viable product

REAL TIME EVENT PROCESSING WITH ESPER

select count(*) as tps, max(retweetCount) as maxRetweets
from TwitterEvent.win:time_batch(1 sec)

select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
       MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
from FraudWarningEvent.win:time(30 min) as fraud,
     WithdrawalEvent.win:time(30 sec) as withdraw
where fraud.accountNumber = withdraw.accountNumber
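
To show what the first statement computes, the sketch below groups events into fixed one-second batches and reports a count and a maximum per batch. It is a plain-Python approximation of a time-batch window, not Esper itself. The tuple layout and function name are assumptions, and unlike a real engine it works on an already collected list and skips empty windows.

```python
from collections import defaultdict


def time_batch_stats(events, window_sec=1):
    """Aggregate (timestamp_sec, retweet_count) tuples per fixed time window."""
    batches = defaultdict(list)
    for timestamp, retweets in events:
        batches[int(timestamp // window_sec)].append(retweets)
    return [
        {"window_start": index * window_sec,
         "tps": len(values),
         "maxRetweets": max(values)}
        for index, values in sorted(batches.items())
    ]
```

The second statement, a join of two sliding windows on account number, would need both event streams buffered for 30 minutes and 30 seconds respectively, which is the bookkeeping an event processing engine takes care of.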

Page 52: Delivering a 'Big Data Ready' minimum viable product

LOG ACTIVITY AND OPERATIONAL DATA

Today a critical part of the production features of websites

Logstash + ElasticSearch + Kibana 3

Page 53: Delivering a 'Big Data Ready' minimum viable product

WRAP UP

Should availability, robustness & scalability be added to your hypotheses & value proposition?

if YES then:

Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:

Increased team productivity, Increased readiness and agility, Sustainability

Build your data models around your use cases rather than around your database and experiment with a polyglot persistence strategy

Start with the technologies that are easiest to install, configure & operate.

Keep it SIMPLE & SUSTAINABLE

Page 54: Delivering a 'Big Data Ready' minimum viable product

LINKS / REFERENCES

http://www.rabbitmq.com/web-stomp.html

https://github.com/jmesnil/stomp-websocket/

Introduction to NoSQL - Martin Fowler, goto; conference

Martin Fowler at NoSQL Matters conference

Book on the Lambda Architecture

Talk on Lambda Architecture

William Pietri - Going the Distance: Building a Sustainable Startup

Don't Let the Minimum Win Over the Viable - Harvard Business Review

ElasticSearch Document DB & Search Engine

Cassandra Column DB

Titan Graph DB

Astroboa Semantic Document Store

Page 55: Delivering a 'Big Data Ready' minimum viable product

LINKS / REFERENCES

https://github.com/sockjs/sockjs-client

https://github.com/robey/kestrel

https://github.com/JustinTulloss/zeromq.node

http://kafka.apache.org/index.html

https://github.com/nathanmarz/storm

https://developers.helloreverb.com/swagger/

https://github.com/wordnik/swagger-ui