![Page 1: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/1.jpg)
Big Data software infrastructures
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)
Gábor Hermann, Zoltán Zvara, András Benczúr
Informatics Laboratory, “Big Data – Momentum” research group
MTA SZTAKI, 10/11/2016
![Page 2: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/2.jpg)
Outline
- Motivation for “Big Data”
- “Big Data” is about software infrastructure
- Batch and streaming approaches
- What we do at SZTAKI
- Solving a problem: handling data skew
![Page 3: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/3.jpg)
Some data about data
- Google was guesstimated to have over 1M machines in 2012
- Large Hadron Collider (LHC) collisions generated about 75 petabytes in the past 3 years
- Facebook: 31.25 million messages every minute (2015)
![Page 4: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/4.jpg)
The Project Triangle
- Small machine to store and process data: slow
- Large servers are fast but costly
- Distributed computing?
![Page 5: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/5.jpg)
Problem with distributed data processing
- When failures happen…
  - The left side will not know about new data on the right
  - An immediate response from the left side might give an incorrect answer
- If failures can happen (the system is partitioned), we can choose either correctness (consistency) or fast response (availability)
![Page 6: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/6.jpg)
CAP (Fox & Brewer) Theorem
[Figure: triangle with corners C, A and P]
Theorem: You may choose two of C-A-P
- Consistency (Good)
- Availability (Fast)
- Partition-resilience (Cheap)
AP: some replicas may give erroneous answer
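
The trade-off can be made concrete with a toy simulation. This is only a sketch under stated assumptions: the `Replica` class, the `"AP"`/`"CP"` mode labels and the `demo` helper are hypothetical names invented for illustration, and real systems replicate data in far more sophisticated ways.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised by a CP replica that refuses to answer during a partition."""


@dataclass
class Replica:
    name: str
    mode: str  # "AP" or "CP" (hypothetical labels for this sketch)
    data: dict = field(default_factory=dict)
    partitioned: bool = False

    def read(self, key):
        # A CP replica gives up availability when it cannot reach its peer.
        if self.partitioned and self.mode == "CP":
            raise Unavailable(f"{self.name} cannot confirm the latest value")
        # An AP replica always answers, possibly with stale data.
        return self.data.get(key)


def write(replicas, origin, key, value):
    """Apply a write at `origin` and copy it to reachable peers only."""
    origin.data[key] = value
    for r in replicas:
        if r is not origin and not (r.partitioned or origin.partitioned):
            r.data[key] = value


def demo(mode):
    left, right = Replica("left", mode), Replica("right", mode)
    replicas = [left, right]
    write(replicas, right, "x", 1)               # reaches both sides
    left.partitioned = right.partitioned = True  # the network splits
    write(replicas, right, "x", 2)               # only the right side sees this
    try:
        return left.read("x")
    except Unavailable:
        return "unavailable"


print(demo("AP"))  # the left side answers with the old value
print(demo("CP"))  # the left side refuses to answer
```

In AP mode the left replica stays fast but returns the value written before the partition; in CP mode it stays correct by not answering at all.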
![Page 7: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/7.jpg)
Failures do happen in distributed data processing!
- We must have P, need to choose between A and C
  - Fast response vs. correct results
  - Most applications need fast response
  - Best we can do: eventual consistency (if connection resumes and data can be exchanged)
- “Big Data” today is mostly about software infrastructure
  - Trying to do the best
![Page 8: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/8.jpg)
Approach 1: batch processing
- Process the whole dataset
  - Consistent, but takes time (hours, days)
  - If failure happens, wait for recovery (choosing CP)
- Apache Hadoop
  - MapReduce, HDFS (distributed file system)
  - Data in chunks across many machines, replicated
  - Bring computation close to the data
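
To illustrate the MapReduce model, here is a minimal single-process word count. It is a sketch, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and the two strings in `chunks` stand in for data chunks that a real cluster would store on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit (word, 1) for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what lets a cluster
    # run the map step next to where the chunk is stored.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    return reduce_phase(shuffle(mapped))


chunks = ["big data is big", "data is about software"]
counts = word_count(chunks)
print(counts)
```

The whole input must be processed before any count is final, which is why this style of processing is consistent but slow to respond.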
![Page 9: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/9.jpg)
Beyond Hadoop and MapReduce (batch)
- Hadoop: the first open-source distributed software implementing MapReduce
- Limitations
  - Join and more complex primitives
  - Graphs, machine learning
- Alternatives
  - Graph processing: Apache Giraph, Apache Hama, …
  - In-memory based: Apache Spark
  - Streaming dataflow engine: Apache Flink
![Page 10: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/10.jpg)
Approach 2: stream processing
- Continuously process all incoming data
  - Faster response time (low latency, within 1 sec)
  - If failure: wait for recovery
  - Still choosing CP, but sophisticated recovery mechanisms give lower downtime
  - Harder to implement and reason about
- Stream processing frameworks
  - Apache Flink, Apache Storm, Apache Spark
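
As a contrast with batch processing, the sketch below counts events per key in fixed-size (tumbling) windows and emits a result as soon as each window closes. It is a plain-Python illustration, not the API of Flink, Storm or Spark; `tumbling_window_counts` is a hypothetical name, and the sketch assumes events arrive in timestamp order and ignores late data and failure recovery.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Yield (window_start, counts) each time a window closes.

    `events` is an iterable of (timestamp, key) pairs in timestamp order.
    """
    current_start = None
    counts = defaultdict(int)
    for timestamp, key in events:
        start = timestamp - timestamp % window_size
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The first event of a later window closes the current one.
            yield current_start, dict(counts)
            current_start, counts = start, defaultdict(int)
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the last, still-open window


events = [(0, "a"), (1, "b"), (2, "a"), (5, "a"), (6, "c"), (11, "b")]
results = list(tumbling_window_counts(events, window_size=5))
print(results)
```

Because the function is a generator, each window's result is available before the rest of the stream has been read.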
![Page 11: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/11.jpg)
STREAMLINE H2020
New initiative on top of Apache Flink
A general data processing framework to unify batch and stream processing
At SZTAKI: Machine Learning
- DFKI (DE)
- SICS (SE)
- Portugal Telecom (PT)
- Internet Memory (FR)
- Rovio (FI)
- SZTAKI (HU)
B. – Volker Markl (TU Berlin)
![Page 12: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/12.jpg)
Batch and stream: same execution engine
An engine that puts equal emphasis on streaming and batch
[Figure: Flink at the center. Labels: real-time data streams; event logs; historic data; Kafka, RabbitMQ, ...; HDFS, JDBC, ...; low-latency windowing, aggregations, ...; ETL, graphs, machine learning, relational, …]
![Page 13: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/13.jpg)
What we do for “Big Data” at SZTAKI
- Developing Apache Flink (STREAMLINE H2020)
  - Machine Learning algorithms, experimenting
- Projects with industrial partners
  - Using Spark, Flink, Cassandra, Hadoop etc.
- Research
  - Improvements on current systems
  - Ongoing project: handling data skew
![Page 14: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/14.jpg)
Solving a problem: handling data skew
- We have developed an application aggregating telco data
  - After a while, on a real dataset, it could become slow or even crash
- Investigated the problem: data skew
  - 80% of the traffic is generated by 20% of the communication towers
![Page 15: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/15.jpg)
The problem
- Default hashing is not going to distribute the data uniformly
- Data distribution is not known in advance
  - The heavy keys might even change
[Figure: stage boundary & shuffle; labels: even partitioning, skewed data, slow task, slow task]
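
A small calculation shows why hashing alone does not help. The numbers are invented for illustration: 10 towers, of which 2 (20%) produce 800 of 1000 records (80%). `hash_partition` is a stand-in for a default hash partitioner; it relies on small non-negative Python integers hashing to themselves.

```python
def hash_partition(key, num_partitions):
    """Stand-in for a default hash partitioner."""
    return hash(key) % num_partitions


def partition_loads(records_per_key, num_partitions):
    """Number of records each partition receives under hash partitioning."""
    loads = [0] * num_partitions
    for key, n in records_per_key.items():
        loads[hash_partition(key, num_partitions)] += n
    return loads


# Hypothetical traffic: towers 0 and 1 are heavy, the other eight are light.
records_per_tower = {tower: 400 if tower < 2 else 25 for tower in range(10)}
loads = partition_loads(records_per_tower, num_partitions=4)
print(loads)
```

The keys are spread almost evenly (3, 3, 2 and 2 towers per partition), yet two partitions carry 450 records each and the other two only 50, so the tasks handling the heavy towers become the slow ones.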
![Page 16: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/16.jpg)
Our solution: Dynamic Repartitioning
- Monitoring tasks, repartitioning based on that data
  - System aware
  - No significant overhead
- Can handle arbitrary data distributions
  - Does everything on-the-fly
  - Works for streaming and batch
- Pluggable
  - Initially on Spark batch and streaming
  - Plugged into Flink streaming
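
The general idea of repartitioning from monitored key frequencies can be sketched as follows. This is not the algorithm presented in the talk: it is a simple greedy heuristic (heaviest key first, each to the currently least-loaded partition) used here only as an assumption for illustration, and the `observed` histogram, `hash_loads` and `rebalance` are hypothetical. A real system would also have to sample the histogram cheaply and migrate state when keys move.

```python
import heapq


def hash_loads(records_per_key, num_partitions):
    """Load per partition under plain hash partitioning."""
    loads = [0] * num_partitions
    for key, n in records_per_key.items():
        loads[hash(key) % num_partitions] += n
    return loads


def rebalance(records_per_key, num_partitions):
    """Greedy assignment: heaviest keys first, each to the least-loaded partition.

    Returns (key -> partition mapping, load per partition).
    """
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition)
    heapq.heapify(heap)
    assignment = {}
    by_weight = sorted(records_per_key.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, n in by_weight:
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + n, p))
    loads = [0] * num_partitions
    for load, p in heap:
        loads[p] = load
    return assignment, loads


# Hypothetical histogram observed by monitoring: towers 0 and 4 are heavy
# and happen to collide in the same hash partition.
observed = {tower: 400 if tower in (0, 4) else 25 for tower in range(10)}
before = hash_loads(observed, 4)
assignment, after = rebalance(observed, 4)
print(before, after)
```

In this example the largest partition shrinks from 825 to 400 records. A single key is never split, so 400 is the best any key-based partitioner can reach here.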
![Page 17: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/17.jpg)
Execution visualization of Spark jobs
![Page 18: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/18.jpg)
Future in handling data skew
- Generalizing the problem
  - Balancing load in a processing system
  - Balancing resources between processing systems on a cluster (YARN)
![Page 19: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/19.jpg)
Conclusion
- We can tame “Big Data” with better software
- A lot to do…
  - Connection between batch and streaming
  - Highly scalable machine learning (both batch and online)
  - Optimizations (e.g. handling data skew)
  - Better understanding of our tools (e.g. visualization)
  - ...
- Evolving fast, but we can take part in it
![Page 21: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/21.jpg)
References
- STREAMLINE H2020
  - https://streamline.sics.se/
- Dynamic Repartitioning
  - https://spark-summit.org/2016/events/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning/
- Visualizations
  - http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
![Page 22: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/22.jpg)
References (continued)
- Apache Hadoop
  - https://hadoop.apache.org/
- Apache Flink
  - https://flink.apache.org/
- Apache Spark
  - https://spark.apache.org/
![Page 23: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache](https://reader034.vdocuments.net/reader034/viewer/2022042307/5ed386441ed2af207328707f/html5/thumbnails/23.jpg)
References (continued)
- E. A. Brewer. Towards robust distributed systems, 2000
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, 2004
- N. Marz. How to beat the CAP theorem, 2011
  - http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html