big data - inf.uniroma3.ittorlone/bigdata/s7-cern.pdfstreaming (spark or flume) or sqoopjobs...
TRANSCRIPT
BIG DATA
MY EXPERIENCE AT CERN
LUCA MENICHETTI
Università Roma Tre, Big Data, 6 June 2016
CERN
Large Hadron Collider
Experiments
Events
Data Flow
Tier 0 (CERN Computing Centre): Data Recording & Offline Analysis
(for each experiment…)
Storage
200-400 MB/sec
Data flow to permanent storage: 4-6 GB/sec
1.25 GB/sec
1-2 GB/sec
1-2 GB/sec
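To give a feeling for the scale of the 4-6 GB/sec flow to permanent storage listed above, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the rate is sustained for a full 24 hours and that 1 TB = 1000 GB; neither assumption comes from the slides, and the function name `daily_volume_tb` is hypothetical.

```python
# Back-of-the-envelope conversion from a data rate to a daily volume.
# Assumptions (not from the slides): the rate is sustained for 24 hours,
# and decimal units are used (1 TB = 1000 GB).

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_sec: float) -> float:
    """Return the volume in TB accumulated in one day at the given rate."""
    return rate_gb_per_sec * SECONDS_PER_DAY / 1000.0


low = daily_volume_tb(4.0)   # lower bound of the quoted range
high = daily_volume_tb(6.0)  # upper bound of the quoted range
print(f"{low:.1f} - {high:.1f} TB/day")
```

Under these assumptions the quoted range corresponds to a few hundred TB per day; the real daily volume depends on how long the accelerator is actually taking data.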
Reconstruction and archival
6/8/16 DOCUMENT REFERENCE
Tiers - WLCG
Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution
Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~130 centres):
• Simulation
• End-user analysis
WLCG
Hadoop
◦ Experiments and IT services running 24/7
◦ Millions of jobs submitted daily in the grid (physicists)
◦ Monitoring data for each service are collected and properly stored independently
◦ Cross-project analysis activities are coordinated by working groups that often share a common platform on which to dump data and run jobs (IT)
◦ Among these: Hadoop Service provided by CERN IT
  ◦ A common repository (data lake)
  ◦ A production environment for other services
Main activities
◦ Service provider
  ◦ Cluster(s) maintenance (ROTA)
  ◦ Framework/application troubleshooting
  ◦ Analysis environment configuration (clients)
  ◦ External service integration (from the transport layer up to the UI)
  ◦ …
◦ Data analysis
  ◦ Mainly about resource utilization and job performance
  ◦ File format and framework evaluation
  ◦ User support
  ◦ …
Data Flow: ETL
◦ Data are stored in HDFS using REST APIs, streaming (Spark or Flume) or Sqoop jobs
◦ Extraction, Transformation and Load (or ELT) procedures run daily for each dataset
◦ Results are CSV, JSON, Avro, Parquet, …
◦ Each dataset can be present more than once
  ◦ Written with different technologies or formats
  ◦ Merged with other datasets (denormalization)
  ◦ Written with fewer or more fields
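The idea that one dataset can exist in several variants can be sketched in a few lines. This is only an illustration in plain Python (standard library only), not the tooling used at CERN: the records, field names (`job_id`, `host`, `cpu_sec`, `wall_sec`, `site`) and helper functions are all hypothetical, and in-memory strings stand in for files on HDFS.

```python
import csv
import io
import json

# Hypothetical job records, standing in for one ingested dataset.
jobs = [
    {"job_id": 1, "host": "node01", "cpu_sec": 340, "wall_sec": 400},
    {"job_id": 2, "host": "node02", "cpu_sec": 90, "wall_sec": 300},
]

# Hypothetical host metadata, standing in for a second dataset.
hosts = {"node01": {"site": "Geneva"}, "node02": {"site": "Wigner"}}


def to_csv(records, fields):
    """Serialize records as CSV, keeping only the requested fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json_lines(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def denormalize(records, lookup, key):
    """Merge each record with the matching entry of a lookup dataset."""
    return [{**r, **lookup.get(r[key], {})} for r in records]


# Same dataset, different formats and different numbers of fields.
full_csv = to_csv(jobs, ["job_id", "host", "cpu_sec", "wall_sec"])
slim_csv = to_csv(jobs, ["job_id", "cpu_sec"])

# Same dataset, merged with another one (denormalization).
merged = denormalize(jobs, hosts, "host")
merged_json = to_json_lines(merged)

print(slim_csv)
print(merged_json)
```

Each variant trades storage for convenience: the slim copy is cheaper to scan, while the denormalized copy avoids a join at analysis time.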
Data Analysis within Hadoop
◦ How to answer your question?
◦ Different frameworks and tools can be used, depending on the use case:
  ◦ Data size
  ◦ Frequency
  ◦ Number of fields per record
  ◦ Final result
Apache Spark
◦ Fast (in-memory approach)
◦ Easy to learn and to use
◦ RDD and DataFrame bring the focus onto the dataset
◦ Spark Components!
  ◦ Spark SQL, MLlib, GraphX
Analysis example - workflow
Dashboard (experiment jobs)
LSF (batch jobs)
LanDB (host info)
Sqoop, Flume
Analysis Examples
◦ Job efficiency: Wigner vs Geneva
  ◦ Spark, Python (Pandas)
◦ Memory profiling
  ◦ Spark (SQL)
◦ Data popularity (block replica location)
  ◦ Pig, Spark (SQL, GraphX)
◦ Job monitoring system discrepancy analysis
  ◦ Spark, Python (Pandas)
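A minimal sketch of what a per-site job efficiency comparison might look like. The slides do not define job efficiency; the code assumes the common definition of total CPU time divided by total wall-clock time. Plain Python stands in here for the Spark and Pandas tooling named above, and the sample records and field names (`site`, `cpu_sec`, `wall_sec`) are invented for illustration.

```python
from collections import defaultdict

# Invented sample records; real inputs would come from the job monitoring data.
jobs = [
    {"site": "Geneva", "cpu_sec": 340.0, "wall_sec": 400.0},
    {"site": "Geneva", "cpu_sec": 160.0, "wall_sec": 200.0},
    {"site": "Wigner", "cpu_sec": 90.0, "wall_sec": 300.0},
    {"site": "Wigner", "cpu_sec": 210.0, "wall_sec": 300.0},
]


def efficiency_by_site(records):
    """Return total CPU time / total wall-clock time for each site.

    This definition of efficiency is an assumption, not taken from the slides.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for r in records:
        cpu[r["site"]] += r["cpu_sec"]
        wall[r["site"]] += r["wall_sec"]
    # Skip sites with no recorded wall-clock time to avoid dividing by zero.
    return {site: cpu[site] / wall[site] for site in wall if wall[site] > 0}


result = efficiency_by_site(jobs)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.2f}")
```

Summing before dividing weights long jobs more heavily than short ones; averaging per-job ratios instead would be an equally defensible choice and can give a different ranking.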
Web Notebooks
◦ It is “an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media” [http://ipython.org/notebook.html]
IPython / Jupyter
Jupyter example - matplotlib
Zeppelin
Zeppelin – Example DF
Zeppelin – Example Chart
Zeppelin – Example Plot
Plot labels: WallClock, CPU; circle size: job duration
The end