Data storage: which system for which use?
Post on 21-Jan-2018
#DevoxxMA @zouheircadi
STORAGE: WHICH SYSTEM FOR WHICH USE
WHO AM I
• @ZouheirCADI
• JEE architect (big data, perf., quality, ops, app, …)
• Lecturer at ENST
• Co-organizer of Devoxx France
• (former…) co-organizer of the Paris Java User Group
AGENDA
• Review of storage systems (OLAP and OLTP)
  • RDBMS
  • OLAP (Hadoop and Spark)
  • OLTP
    • Key-Value: memcached
    • Document: couchdb
    • Column family
    • Search
• Conclusion
Why ?
• Share data
  • Many users
• Expose a data model
  • An organized one?
• Scalability
  • Depending on users or data processing
• Flexibility
  • Embrace change
RDBMS
Key date
• 80s
RDBMS
Relational Database Management Systems were invented to let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written. Curt Monash, analyst/blogger
RDBMS
• Relational databases organize data in tables
  • Which are made of many rows.
  • Each row has data in each of several columns (every row in a table has the same columns)
• Relationships are implicit
Emp
empno  ename  job        deptno
7839   King   President  10
7698   Blake  Manager    20
Dept
deptno  dname    loc
10      Account  NY
20      Sales    CHI
RDBMS – KEY CONCEPTS
1st: Physical data independence
LOGICAL MODEL
PHYSICAL FILES: fseek, fopen, fread
© http://www.slideshare.net/billhoweuw/dataintensive-scalable-science
2nd: Relational algebra
• Select, Project, Join
• Union, Intersection, Difference
© http://www.slideshare.net/billhoweuw/dataintensive-scalable-science
RDBMS
• Logical expression of queries
SELECT e.ename, d.dname
FROM EMP e JOIN DEPT d ON e.deptno = d.deptno
WHERE e.ename = 'King'
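The query on this slide can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the column types and the in-memory database are assumptions, while the rows come from the Emp and Dept tables shown earlier.

```python
import sqlite3

# In-memory database holding the Emp and Dept rows from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT, loc TEXT);
    CREATE TABLE emp  (empno INTEGER PRIMARY KEY, ename TEXT, job TEXT,
                       deptno INTEGER REFERENCES dept(deptno));
    INSERT INTO dept VALUES (10, 'Account', 'NY'), (20, 'Sales', 'CHI');
    INSERT INTO emp  VALUES (7839, 'King', 'President', 10),
                            (7698, 'Blake', 'Manager', 20);
""")

# The query states what is wanted; the engine decides how to execute it.
rows = conn.execute("""
    SELECT e.ename, d.dname
    FROM emp e JOIN dept d ON e.deptno = d.deptno
    WHERE e.ename = ?
""", ("King",)).fetchall()
print(rows)
```

Nothing in the query mentions files, offsets or access paths, which is the physical data independence point made on the earlier slide.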
Select T1.Col2 From Table1 T1 Inner Join Table2 T2 ON T1.Col1 = T2.Col1
Plan: Table scan, Table scan → Hash match → Select
Select T1.Col2 From Table1 T1 Inner Join Table2 T2 ON T1.Col1 = T2.Col1 Where T1.col1 = 1
Plan: Table scan, Table scan → Nested loops → Select
© https://sqlcommitted.wordpress.com/tag/hash-match-join/
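The two physical operators named in these plans can be sketched in a few lines of Python. This is an illustration of the general techniques, not the code of any particular database engine; the table contents and function names are made up for the example.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [(lrow, rrow) for lrow in left for rrow in right
            if lrow[key] == rrow[key]]

def hash_join(left, right, key):
    """Build a hash table on one input, then probe it with the other."""
    buckets = {}
    for rrow in right:
        buckets.setdefault(rrow[key], []).append(rrow)
    return [(lrow, rrow) for lrow in left for rrow in buckets.get(lrow[key], [])]

# Hypothetical contents for Table1 and Table2.
table1 = [{"col1": 1, "col2": "x"}, {"col1": 2, "col2": "y"}]
table2 = [{"col1": 1}, {"col1": 2}, {"col1": 2}]

same = nested_loop_join(table1, table2, "col1") == hash_join(table1, table2, "col1")
print(same)
```

Both functions return the same rows; the optimizer picks one or the other from its cost estimates, which is why the same logical query can end up with different plans.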
Transaction
• Atomicity: transactions are all or nothing
• Consistency: only valid data is saved
• Isolation: transactions do not affect each other
• Durability: written data will not be lost
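Atomicity and consistency can be observed with sqlite3 from the Python standard library. The account table, the names and the amounts below are hypothetical; the sketch only shows that a failing statement undoes the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount):
    """Move money from alice to bob; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? "
                         "WHERE name = 'bob'", (amount,))
            conn.execute("UPDATE account SET balance = balance - ? "
                         "WHERE name = 'alice'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(500)  # violates the CHECK constraint on alice's balance
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```

The credit to bob is applied first and then rolled back when the debit fails, so the stored data stays valid.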
Indexes
• Easy to produce
• Easy to use
Scalability
• Vertical scalability (scale up/down)
  • More resources to a single node
Scalability
• Horizontal scalability (scale out/in)
  • Add more nodes to a system
Shortcomings
• Scalability (almost not scalable…)
• SPOF
• Difficult to serve users worldwide
NoSQL
• Not Only SQL
• Nothing to do with SQL
• Relaxation of transaction constraints in distributed systems
• CAP
CAP
• Consistency
  • Every read receives the most recent write or an error
• Availability
  • Every request receives a response, without guarantee that it contains the most recent version
• Partition tolerance
  • The system continues to operate despite arbitrary partitioning due to network failure
  • If operation continues during a partition, you might sacrifice consistency
  • If not, you might sacrifice availability
• NoSQL may sacrifice consistency
https://en.wikipedia.org/wiki/CAP_theorem
NoSQL
• More pragmatically
  • Partitioning (load distribution)
  • Replication (fault tolerance)
  • Horizontal scalability
    • On commodity hardware
  • Simple API
  • OLTP
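Partitioning and replication can be sketched as follows. This is a deliberately naive placement (hash modulo the node count, replicas on the next nodes), not the scheme of any specific product; the node names and the replication factor are assumptions.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICAS = 3                                      # assumed replication factor

def owners(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes storing `key`: a primary chosen by hash, then its successors."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

placement = {key: owners(key) for key in ["user:1", "user:2", "user:3"]}
for key, replica_nodes in placement.items():
    print(key, "->", replica_nodes)
```

Hashing spreads the keys over the nodes (load distribution), and every key lives on three distinct nodes (fault tolerance). Real systems usually prefer consistent hashing so that adding a node moves only a fraction of the keys.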
OLAP
M/R
Key dates
• October 2003: GFS paper released
• December 2004: MapReduce: Simplified Data Processing on Large Clusters
• January 2006: Hadoop created
• October 2006: 600-machine Hadoop cluster at Yahoo
• April 2007: 1000-machine Hadoop cluster at Yahoo
https://en.wikipedia.org/wiki/Apache_Hadoop
Map Reduce
• MR
  • Abstraction
  • Programming model
• Implementations
  • Open source
    • Hadoop
    • Less well known: Couchdb, Infinispan, Riak
  • Proprietary: Google
Map Reduce
• MapReduce is
  • a high level programming model
  • and an associated implementation
  • for processing and generating large data sets
  • with a parallel, distributed algorithm on a cluster.
© https://en.wikipedia.org/wiki/MapReduce
[Diagram: three map() tasks emit <key,value> pairs that are consumed by two reduce() tasks]
Input:
devoxx morocco devoxx france devoxx poland great conference great conference devoxx taroudant
Splits:
devoxx morocco devoxx france
devoxx poland great conference
great conference devoxx taroudant
Map:
devoxx,1 morocco,1 devoxx,1 france,1
devoxx,1 poland,1 great,1 conference,1
great,1 conference,1 devoxx,1 taroudant,1
Shuffle:
devoxx,1 devoxx,1 devoxx,1 devoxx,1 morocco,1 france,1
poland,1 great,1 conference,1 great,1 conference,1 taroudant,1
Reduce:
devoxx,4 morocco,1 france,1
poland,1 great,2 conference,2 taroudant,1
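The word count walked through on this slide can be reproduced on a single machine. The sketch below imitates the three phases with plain Python functions; the function names are illustrative and no Hadoop API is involved.

```python
from collections import defaultdict

splits = [
    "devoxx morocco devoxx france",
    "devoxx poland great conference",
    "great conference devoxx taroudant",
]

def map_phase(split):
    """Emit a (word, 1) pair for every word of one input split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    """Group the values emitted by all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

mapped = [map_phase(s) for s in splits]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each call to map_phase and reduce_phase would run on a different machine, and the shuffle would move data over the network.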
Hadoop structure
• Data storage: HDFS
• Data processing: MAPREDUCE
© https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
© http://stackoverflow.com/questions/31044575/mapreduce-2-vs-yarn-applications
© https://www.mapr.com/blog/how-job-execution-framework-mapreduce-v1-v2
© Hadoop, The Definitive Guide, Third Edition, Tom White, O'Reilly Ed.
Hadoop ecosystem
© Hadoop, The Definitive Guide, Third Edition, Tom White, O'Reilly Ed.
Hadoop conclusion
• Read-only data with simple processing
  • Map-Reduce
  • Move computation to data
• Parallelization and distribution (high scalability)
• Fault tolerance
• Status and monitoring
• "One person deployment"
© https://en.wikipedia.org/wiki/MapReduce
When ?
BIG DATA
VOLUME, VELOCITY, VARIETY
Software companies
M/R shortcomings
• Forces your pipeline into Map/Reduce tasks
  • Other workflows (filter, join, map-reduce-map…)
• Reads from disk for every M/R task
  • Iterative algorithms
• Only a native Java programming interface
  • Support for other languages: streaming module
  • Interactive shell
Hadoop conclusion
• Major slowness problem
  • MapReduce is slow, but it is currently the only option for processing data on HDFS
• Contradictory vendor roadmaps
  • Vendor strategy (Google)
Hadoop conclusion
• Map-Reduce has served a great purpose, though: many, many companies, research labs and individuals are successfully bringing Map-Reduce to bear on problems to which it is suited: brute-force processing with an optional aggregation.
http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/
Hadoop conclusion
• But more important in the longer term, to my mind, is the way that Map-Reduce provided the justification for re-evaluating the ways in which large-scale data processing platforms are built (and purchased!).
http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/
Hadoop conclusion
• It's well known in the industry that more than 10 years ago Google invented MapReduce, the technology at the heart of first-generation Hadoop. It's less well known that Google moved away from MapReduce several years ago. Today at its Google I/O 2014…
https://www.datanami.com/2014/06/25/google-re-imagines-mapreduce-launches-dataflow/
Hadoop conclusion
• …Today at its Google I/O 2014 conference, the Web giant unveiled a possible successor to MapReduce called Dataflow, which it's selling through its hosted cloud service.
https://www.datanami.com/2014/06/25/google-re-imagines-mapreduce-launches-dataflow/
Spark
Key dates
• 2009: AMPLab, University of California, Berkeley
  • Original aim: proof of concept for Mesos
• 2012: 0.5.1
[Diagram: the driver node runs the driver program and its Spark context, which talks to the cluster manager; each worker node hosts an executor with a cache and several tasks]
Spark
• Resilient Distributed Datasets (RDD)
  • An RDD is a resilient and distributed collection of records
• Motivation
  • Iterative algorithms in machine learning
• Supports 2 types of operations
  • Transformations
  • Actions
Spark - RDD
[Diagram: one RDD partitioned across Server 1, Server 2 and Server 3]
Spark
• Transformations
  • Functions that return another RDD
  • Map
  • FlatMap
  • Filter
  • Coalesce
  • GroupByKey
Spark – Transformation : Map
Input: "Hello World", "This Is Devoxx", "Morocco", "Held In", "Casablanca"
.map(_.toLowerCase)
Output: "hello world", "this is devoxx", "morocco", "held in", "casablanca"
Spark – Transformation : flatMap
Input: "hello world", "this is devoxx", "morocco", "held in", "casablanca"
.flatMap(line => line.split("\\s+"))
Output: hello, world, this, is, …, devoxx
Spark – Transformation : map
Input: hello, world, this, is, …, devoxx
.map(word => (word, 1))
Output: (hello,1), (world,1), (this,1), (is,1), …, (devoxx,1)
Spark – Transformation : groupByKey
(a,1) (b,1)
(a,1) (a,1) (b,1) (b,1)
(a,1) (a,1) (a,1) (b,1) (b,1) (b,1)
(a,1) (a,1) (b,1) (b,1)
(a,1) (a,1) (a,1) (b,1) (b,1) (b,1)
Spark – Transformation : reduceByKey
(a,1) (b,1)
(a,1) (a,1) (b,1) (b,1)
(a,1) (a,1) (a,1) (b,1) (b,1) (b,1)
(a,1) (a,1) (a,1) (a,1) (a,1) (a,1) → (a,6)
(b,1) (b,1) (b,1) (b,1) (b,1) (b,1) → (b,6)
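The practical difference between the two transformations is how much data crosses the network. The sketch below is a plain Python emulation, not Spark code: it assumes three partitions holding the pairs from the slides and counts the records that would be shuffled in each strategy.

```python
from collections import Counter, defaultdict

# Three partitions of (key, 1) pairs, as on the slides.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]

def group_by_key_then_sum(parts):
    """Shuffle every pair unchanged, then sum per key."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {k: sum(v) for k, v in grouped.items()}, len(shuffled)

def reduce_by_key(parts):
    """Pre-aggregate inside each partition, shuffle the partial sums, then merge."""
    partials = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        partials.extend(local.items())
    total = Counter()
    for key, value in partials:
        total[key] += value
    return dict(total), len(partials)

print(group_by_key_then_sum(partitions))
print(reduce_by_key(partitions))
```

Both strategies produce the same totals, but the second one shuffles one record per key and per partition instead of one record per input pair.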
Spark
• Actions
  • Functions that trigger computation and return something that isn't an RDD
  • collect(): copy all elements to the driver
  • count()
  • collectAsMap()
  • takeSample()
  • take(n): copy the first n elements
  • reduce(func): aggregates elements with func (takes 2 elements, returns one)
  • saveAsTextFile(fileName): save to local storage or HDFS
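All in one
import org.apache.spark.SparkContext

val sc = new SparkContext()                          // settings come from the launch configuration
val docs = sc.textFile("hdfs://<path>")              // one record per line
val low = docs.map(line => line.toLowerCase)
val words = low.flatMap(line => line.split("\\s+"))  // one record per word
val counts = words.map(word => (word, 1))
val frequency = counts.reduceByKey(_ + _)            // (word, occurrences)
val top = frequency.map(_.swap).top(N)               // N: number of entries to keep, not defined on the slide
top.foreach(println)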
Spark
• Caching
  • By default, each job is reprocessed from HDFS
  • Calling .cache() on an RDD marks it for caching
  • The data is actually cached at the first computation (lazy)
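Lazy evaluation and caching can be illustrated without a cluster. The class below is a toy stand-in written for this example, not the Spark API: transformations only record work, an action runs it, and cache() keeps the result of the first run.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a computation, runs it only on an action."""

    def __init__(self, compute):
        self._compute = compute      # function producing the list of records
        self._cache_enabled = False
        self._cached = None

    def map(self, func):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(lambda: [func(x) for x in self._records()])

    def cache(self):
        # Only marks the dataset; the data is stored at the first computation.
        self._cache_enabled = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        records = self._compute()
        if self._cache_enabled:
            self._cached = records
        return records

    def count(self):
        # Action: triggers the computation.
        return len(self._records())

reads = []  # one entry per read of the (simulated) source

def load():
    reads.append(1)
    return ["Hello World", "This Is Devoxx"]

lower = LazyDataset(load).map(str.lower).cache()
reads_before_action = len(reads)
lower.count()
lower.count()
print(reads_before_action, len(reads))
```

The source is not read when the pipeline is declared, and it is read only once across the two actions because the mapped dataset was marked for caching.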
Spark
• Directed Acyclic Graphs (DAGs)
  • Nodes are RDDs
  • Arrows are transformations
Spark
• Batch
• Streaming
• Iterative
• Interactive
GOOGLE TRENDS: SPARK vs. STORM vs. HIVE
OLTP
Key dates
• BigTable (Google): 2004
• Dynamo (Amazon): 2007
Data model
• Key-Value
• Document
• Column
Key-value
• Associative array (map)
• Query model: PUT, GET, DELETE
KEY → VALUE
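The whole query model fits in a few lines. The class below is a minimal in-memory sketch with made-up names; it does not mirror the client API of memcached or of any other product.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing PUT, GET and DELETE."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", b"opaque bytes")   # the store never looks inside the value
value = store.get("session:42")
store.delete("session:42")
print(value, store.get("session:42"))
```

Because the value is opaque, lookups are only possible by key; anything richer has to be built by the application.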
Document
{
  "id": "987GREHLKE878YEFB",
  "images": ["url1", "url2", "url3"],
  "prix": "1290",
  "type": "APPARTEMENT",
  "etage": "2",
  "pieces": "2",
  "chambres": "1",
  "surface": "20",
  "description": "desc...",
  "ville": "PARIS",
  "arrondissement": "75004",
  "departement": "IDF"
}
Document
• Standard encoding format: JSON, BSON, …
• Query model
  • CRUD (Create, Read, Update, Delete)
  • Select based on document content
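Selecting on document content, as opposed to selecting by key, can be sketched with plain dictionaries. The find helper and the second document are hypothetical; the first document reuses fields from the sample listing on the Document slide.

```python
documents = [
    {"id": "987GREHLKE878YEFB", "type": "APPARTEMENT", "ville": "PARIS",
     "prix": "1290", "images": ["url1", "url2", "url3"]},
    # Hypothetical second document with a different set of fields.
    {"id": "HYPOTHETICAL-2", "type": "MAISON", "ville": "LYON", "prix": "2100"},
]

def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in criteria.items())]

matches = find(documents, ville="PARIS", type="APPARTEMENT")
print([d["id"] for d in matches])
```

The two documents do not share the same fields, and the query still works, which is the flexibility a document store trades against a fixed schema.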
{Column}
• Column family stores
  • BigTable, HBase, Hypertable, Cassandra
• Column stores
  • C-Store, Vertica
© http://dbmsmusings.blogspot.fr/2010/03/distinguishing-two-major-types-of_29.html
Data model
Relational DB  Databases  Tables       Rows       Columns
MongoDB        db         Collections  Documents  Fields
ElasticSearch  Indices    Types        Documents  Fields
© http://www.slideshare.net/yellow7/cassandra-backgroundandarchitecture
Column family stores
• Persistent (distributed) maps
Column family stores
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
© http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
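The nested map can be modelled directly. The class below is a toy, single-process sketch with illustrative names; it sorts columns at read time instead of maintaining a real sorted map, and it ignores persistence and distribution.

```python
from collections import defaultdict

class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column_key, value):
        self._rows[row_key][column_key] = value

    def get_row(self, row_key):
        """Return the row's (column_key, value) pairs sorted by column key."""
        return sorted(self._rows.get(row_key, {}).items())

    def slice(self, row_key, start, end):
        """Return the columns whose key lies in [start, end], in sorted order."""
        return [(c, v) for c, v in self.get_row(row_key) if start <= c <= end]

users = ColumnFamily()
users.put("user:1", "name", "King")
users.put("user:1", "job", "President")
users.put("user:2", "name", "Blake")   # rows need not share the same columns
print(users.get_row("user:1"))
print(users.slice("user:1", "k", "z"))
```

Keeping the columns sorted is what makes range reads over column keys cheap, which is why data modeling for these stores starts from the queries.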
Column family stores
Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
© http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Column family stores (bigTable)
© https://cloud.google.com/bigtable/docs/schema-design
Replication model
• Master-less
  • Cassandra, DynamoDB, Riak
• Master-slave
  • MongoDB, Redis, HBase
• Master-Master (or Master-Slave)
  • CouchDB
© http://www.slideshare.net/yellow7/cassandra-backgroundandarchitecture
Comparison criteria
• Data model
• Query model
• Replication model
• Consistency model
• Licensing, support, community
System Architecture
Why the schema-less explosion
• Start-ups vs old-school enterprises
  • (with a very short TTM)
Why the schema-less explosion
• Allowed by business rules
Why the schema-less explosion: 3V
Constraints on using NoSQL
• Transactions
  • Handing conflict resolution over to the client can hardly be considered progress.
  • A necessary evil, often dictated by the business
http://db-engines.com/en/ranking
https://www.gartner.com/doc/reprints?id=1-2PMFPEN&ct=151013&st=sb
https://www.google.com/trends/explore?date=2008-03-18%202016-10-18&q=RDBMS,NOSQL
Why ?
• Share data
  • Many users
• Expose a data model
  • An organized one?
• Scalability
  • Depending on users or data processing
• Flexibility
  • Embrace change
URL REFERENCES
• Hadoop, The Definitive Guide, Third Edition, Tom White, ISBN: 978-1-449-31152-0, O'Reilly Ed.
• https://www.postgresql.org/about/
• https://blog.codeship.com/unleash-the-power-of-storing-json-in-postgres/
• https://opentextbc.ca/dbdesign/chapter/chapter-5-data-modelling/
• http://coronet.iicm.edu/is/scripts/lesson03.pdf
• https://opentextbc.ca/dbdesign/chapter/chapter-3-characteristics-and-benefits-of-a-database/
• http://gerardnico.com/wiki/relation/rdbms
• https://en.wikipedia.org/wiki/Scalability
• http://siliconangle.com/blog/2016/06/27/google-tools-up-with-its-spanner-database-looks-for-a-fight-with-aws/
• http://www.cattell.net/datastores/Datastores.pdf
• https://en.wikipedia.org/wiki/Apache_Hadoop
• https://en.wikipedia.org/wiki/MapReduce
• https://www.linkedin.com/pulse/rdbms-follows-acid-property-nosql-databases-base-does
• https://www.quora.com/Hadoop-Why-are-companies-investing-so-much-into-Hadoop-if-Google-released-the-MapReduce-paper-back-in-2004-Are-companies-just-going-to-follow-the-road-map-Google-created-Big-Table-Pregel-Dremel-etc-It-seems-to-me-that-companies-will-always-be-behind-the-curve
• http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/
• https://www.mapr.com/ebooks/spark/01-what-is-apache-spark.html
• https://www.digitalocean.com/community/tutorials/a-comparison-of-nosql-database-management-systems-and-models
• https://cloud.google.com/bigtable/docs/overview
• https://cloud.google.com/bigtable/docs/schema-design
• https://en.wikipedia.org/wiki/Dremel_(software)
• https://www.gartner.com/doc/reprints?id=1-2PMFPEN&ct=151013&st=sb
• http://www.infoworld.com/article/3056637/database/nosql-chips-away-at-oracle-ibm-and-microsoft-dominance.html
• http://www.slideshare.net/billhoweuw/dataintensive-scalable-science