big data software infrastructures - hte · 2016-11-10 · big data software infrastructures...

23
Big Data software infrastructures Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI) Gábor Hermann Zoltán Zvara András Benczúr Informatics Laboratory „Big Data – Momemtum” research group MTA SZTAKI, 10/11/2016

Upload: others

Post on 28-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

BigDatasoftwareinfrastructuresInstitutefor ComputerScienceandControl,Hungarian Academy ofSciences (MTASZTAKI)

GáborHermannZoltánZvaraAndrásBenczúrInformatics Laboratory„BigData– Momemtum”research group

MTASZTAKI,10/11/2016

Page 2: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

2

Outline

n Motivationfor“BigData”n “BigData”isaboutsoftwareinfrastructuren Batchandstreamingapproachesn WhatwedoatSZTAKIn Solvingaproblem:handlingdataskew

Page 3: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

3

Somedataaboutdata

n Google was guesstimated to have over1Mmachines in2012

n LargeHadronCollider(LHC)collisionsgeneratedabout75petabytesinpast3 years

n Facebook31.25millionmessageseveryminute(2015)

Page 4: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

4

TheProjectTriangle

n Smallmachinetostoreandprocessdata:slown Largeserversarefastbutcostlyn Distributedcomputing?

Page 5: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

5

Problemwithdistributeddataprocessing

n Whenfailureshappen…n Leftside will not know aboutnew data on right

n Immediate response fromleftsidemightgiveincorrectanswer

n Iffailurescanhappen(partitioned)wecaneitherchoosecorrectness(consistence)orfastresponse(availability)

Page 6: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

6

CAP(Fox&Brewer)Theorem

C

A P

Theorem: You may choose two of C-A-PConsistency(Good)

Availability(Fast)

Partition-resilience(Cheap)AP: some replicas may give

erroneous answer

Page 7: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

7

Failuresdohappenindistributeddataprocessing!

n WemusthaveP,needtochoosebetweenAandC

n Fastresponsevs.correctresultsn Mostapplicationsneedfastresponsen Bestwecando:eventualconsistency

(ifconnectionresumesanddatacanbeexchanged)

n „BigData” today ismostly about softwareinfrastructuren Tryingtodothebest

Page 8: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

8

Approach1:batchprocessing

n Processthewholedatasetn Consistent,buttakestime(hours,days)n Iffailurehappens,waitforrecovery(choosingCP)

n ApacheHadoopn MapReduce,HDFS(distributedfilesystem)n Datainchunksacrossmanymachines,replicatedn Bringcomputationclosetothedata

Page 9: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

9

BeyondHadoop andMapReduce (batch)

n MapReduce hasthefirstopensourcedistributedsoftware,Hadoopn Limitationsn Joinandmorecomplexprimitivesn Graphs,machinelearning

n Alternativesn Graphprocessing:ApacheGiraph,ApacheHAMA,…n Inmemorybased:ApacheSparkn Streamingdataflowengine:ApacheFlink

Page 10: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

10

Approach2:streamprocessing

n Continuouslyprocessallincomingdatan Fasterresponsetime(low-latency,within1sec)n Iffailure:waitforrecoveryn StillchoosingPC,butsophisticatedrecoverymechanismsgive

lowerdowntime

n Hardertoimplementandreasonaboutn Streamprocessingframeworksn ApacheFlink,ApacheStorm, ApacheSpark

Page 11: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

11

STREAMLINEH2020

NewinitiativeontopofApacheFlinkA general data processing framework to unify batch and stream processing

At SZTAKI: Machine Learning

n DFKI(DE)n SICS(SE)n PortugalTelecom(PT)n InternetMemory(FR)n Rovio(FI)n SZTAKI(HU) B.– Volker Markl (TUBerlin)

Page 12: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

12

Flink

Historic data

Kafka, RabbitMQ, ...

HDFS, JDBC, ...

ETL, Graphs,Machine LearningRelational, …

Low latencywindowing, aggregations, ...

Event logs

Real-time data streams

Batchandstream:same execution engineAn engine that puts equal emphasis to

streaming and batch

Page 13: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

13

Whatwedofor“BigData”atSZTAKI

n DevelopingApacheFlink (STREAMLINEH2020)n MachineLearningalgorithms,experimenting

n Projectswithindustrialpartnersn UsingSpark,Flink,Cassandra,Hadoop etc.

n Researchn Improvementsoncurrentsystemsn Ongoingproject:handlingdataskew

Page 14: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

14

Solvingaproblem:handlingdataskew

n Wehavedevelopedanapplicationaggregatingtelco datan Afterawhile,onrealdatasetitcouldbecomesloworeven

crashn Investigatedtheproblem:dataskewn 80%ofthetrafficgeneratedby20%ofthecommunication

towers

Page 15: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

15

Theproblem

n Defaulthashingisnotgoingtodistributethedatauniformly

n Datadistributionisnotknowninadvancen Theheavykeysmightevenchange

stageboundary & shuffle

even partitioning skewed data

slow task

slow task

Page 16: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

16

Oursolution:DynamicRepartitioning

n Monitoringtasks,repartitioningbasedonthatdatan Systemawaren Nosignificantoverhead

n Canhandlearbitrarydatadistributionsn Doeseverythingon-the-flyn Worksforstreamingandbatch

n Pluggablen InitiallyonSparkbatchandstreamingn PluggedintoFlink streaming

Page 17: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

17

Execution visualization ofSpark jobs

Page 18: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

18

Futureinhandlingdataskew

n Generalizingtheproblemn Balancingloadin aprocessingsystemn Balancingresourcesbetween processingsystemsona

cluster(YARN)

Page 19: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

19

Conclusion

n Wecantame“BigData”withbettersoftwaren Lottodo…n Connectionbetweenbatchandstreamingn Highlyscalablemachinelearning(batchandonlineboth)n Optimizations(e.g.handlingdataskew)n Betterunderstandingourtools(e.g.visualization)n ...

n Evolvingfast,butwecantakepartinit

Page 20: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

20

Questions?

n Youcanreachus!n [email protected] [email protected] [email protected]

Page 21: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

21

References

n STREAMLINEH2020n https://streamline.sics.se/

n DynamicRepartitioningn https://spark-summit.org/2016/events/handling-data-skew-

adaptively-in-spark-using-dynamic-repartitioning/

n Visualizationsn http://flink-forward.org/kb_sessions/advanced-visualization-of-

flink-and-spark-jobs/

Page 22: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

22

References(continued)

n ApacheHadoopn https://hadoop.apache.org/

n ApacheFlinkn https://flink.apache.org/

n ApacheSparkn https://spark.apache.org/

Page 23: Big Data software infrastructures - HTE · 2016-11-10 · Big Data software infrastructures Institute for Computer Science and Control ... n Stream processing frameworks n Apache

23

References(continued)

n E.A.Brewer.Towardsrobustdistributedsystems,2000n J.DeanandS.Ghemawat.MapReduce:SimplifiedData

ProcessingonLargeClusters,2004n N.Marz.HowtobeattheCAPtheorem,2011n http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html