mining sensor data - roma tre universitytorlone/bigdata/s1-streaming.pdf · not only big data •...
Mining Sensor Data
Donatella Firmani
Preliminary comments
• Internet of Things = connected-everything world
• According to Cisco, there will be 21 billion connected devices by 2018.
• Analytics of sensor-generated data is mostly about real-time analytics of time series data
Overview of this lecture
• Real-world example
• Theory (data streaming)
• Practice (Spark exercise)
Real-world example
Data collection
• How much climate data at NASA?
  • MERRA* reanalysis collection: ~200 TB
  • Total data holdings of the NASA Center for Climate Simulation (NCCS): ~40 PB
• Intergovernmental Panel on Climate Change
  • Fifth Assessment Report: ~5 PB (data online now)
  • Sixth Assessment Report: ~100 PB (to be created within the next 5 to 6 years)
* Modern-Era Retrospective Analysis for Research and Applications
MERRA technologies
• Hadoop File System
  • Native MERRA files are sequenced and ingested into the Hadoop cluster in triplicated 640 MB blocks.
  • Total size of the MERRA HDFS repository: ~480 TB.
• MapReduce
  • 36-node Dell cluster, 576 Intel 2.6 GHz Sandy Bridge cores, 1300 TB raw storage, 1250 GB RAM, 11.7 TF theoretical peak compute capacity.
  • FDR InfiniBand network with peak TCP/IP speeds > 20 Gbps.
Impact
• Wei experiment (contribution of irrigation to precipitation)
• Traditional:
  • ~8.4 TB transferred from archive to local workstation (weeks)
  • Clipping and averaging performed by a Fortran program on the local workstation (days)
• MERRA:
  • Clipping and averaging performed by MERRA (less than one day)
  • ~35 GB of final product moved to the local workstation
  • Significant time savings in data wrangling
  • Rapid screening over monthly means files takes minutes
Other Applications
• Military
  • Activity monitoring
  • Event detection
• Cosmological
  • Space station data
  • Space telescopes
• Mobile
  • Wearable sensors
  • Social sensing
Theory
Problem definition
Data Mining (discovery of meaningful patterns in collections of data)
Data Streaming
Big Data
Not only big data
• If the insight being sought through analytics needs a global context, then all the data will be sent to a backend Big Data platform (e.g. a NoSQL database)
• Otherwise:
  • All the data may not end up in a Big Data platform (there may be hub nodes in a sensor network which collect and aggregate data from a set of sensors)
  • The data arriving at the Big Data platform may not always be the raw sensor data. It may be data aggregated and preprocessed at the network edge.
Data streaming
• Characteristics of data streams are:
  ◦ Continuous flow of data
  ◦ Infinite length
  ◦ Examples: network traffic, sensor data, call center records
Challenges
• Data streaming challenges
  • Volume
• Specific sensor challenges
  • One pass of the data
  • Temporal component (no straightforward adaptation of one-pass algorithms)
  • Data is often uncertain
  • Often mined in a distributed fashion (intermediate sensor nodes have limited processing power)
Typical problems
• Frequent items
• Frequent itemsets:
  • recurring groups of elements
  • used for forecasting
• Clustering: group similar items
• Classification: learn a model based on examples
Typical computational models
• Over the entire data stream
  • considers the data from the beginning until now
  • called the landmark data model
• Over a window
  • considers the data from now back to a certain range in the past
  • sliding window model
• Hybrid
  • associates weights with the data in the stream, and gives higher weights to recent data than to data in the past
  • damped window model
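As a rough illustration of how the three models differ, the sketch below counts the occurrences of one item under each of them. The function names, the window size `w` and the decay value are illustrative assumptions, not part of the lecture.

```python
from collections import deque


def landmark_count(stream, item):
    """Landmark model: count over everything seen since the beginning."""
    return sum(1 for x in stream if x == item)


def sliding_count(stream, item, w):
    """Sliding window model: count over the last w elements only."""
    window = deque(maxlen=w)  # older elements fall out automatically
    for x in stream:
        window.append(x)
    return sum(1 for x in window if x == item)


def damped_count(stream, item, decay):
    """Damped window model: every arrival multiplies old weights by decay < 1."""
    weight = 0.0
    for x in stream:
        weight = weight * decay + (1.0 if x == item else 0.0)
    return weight


stream = ["a", "b", "a", "c", "a", "b"]
print(landmark_count(stream, "a"))     # whole stream
print(sliding_count(stream, "a", 3))   # last three elements: c, a, b
print(damped_count(stream, "a", 0.5))  # recent occurrences weigh more
```

The landmark and damped counts can be maintained with a single number per item, while the sliding window has to remember the last `w` elements.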
Discussed in this lecture
• Frequent items
• Clustering
Classic Frequent Pattern Mining
[Figure: Venn diagram of "Customer buys diaper" and "Customer buys beer", overlapping in "Customer buys both"]
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk
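To make the notion of frequency concrete, the following sketch computes the support of an itemset (the fraction of transactions containing it) on the five transactions of the table. The helper names and the 50% threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# Transactions copied from the table (Tid -> items bought).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def frequent_itemsets(size, min_support):
    """All itemsets of the given size whose support is at least min_support."""
    all_items = sorted(set().union(*transactions.values()))
    return [
        set(candidate)
        for candidate in combinations(all_items, size)
        if support(set(candidate)) >= min_support
    ]


print(support({"Beer"}))            # 3 of 5 transactions
print(support({"Beer", "Diaper"}))  # 3 of 5 transactions
print(frequent_itemsets(2, 0.5))    # pairs bought together in at least half the baskets
```

This exact approach scans every transaction and enumerates candidates, which is what becomes infeasible on an unbounded stream.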
Over the entire stream
[Figure: stream]
Identify all elements whose current frequency exceeds the support threshold s = 0.1%.
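The slide does not name an algorithm for this problem. One classic one-pass option is the Misra-Gries summary, sketched below as an illustration only: with `k - 1` counters it keeps every element occurring more than `n / k` times in a stream of length `n`, so a threshold of s = 0.1% would correspond to `k = 1000`. The function name and the toy stream are assumptions.

```python
def misra_gries(stream, k):
    """One-pass summary with at most k - 1 counters.

    Every element occurring more than n / k times in a stream of length n
    is guaranteed to survive in the returned dict; counts are underestimates.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

The summary may also contain elements that are not frequent, so a second pass (or an accepted approximation error) is needed to confirm the candidates.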
Related problem
[Figure: stream]
Identify all subsets of items whose current frequency exceeds s = 0.1%
Over a window
[Figure: stream divided into bucket 1, bucket 2, bucket 3]
Divide the stream into possibly overlapping buckets (“sliding window”). It is possible to hold the transactions in each bucket in main memory (i.e., keep exact counters for items in the buckets).
Hybrid
• Damped window:
  • decay factor: the weight of each transaction is multiplied by a factor f < 1 when a new transaction arrives
  • the overall effect is an exponential decay function
  • effective for evolving data streams, since recent transactions are counted more significantly
• Out of the scope of this lecture
  • We focus on “over the entire stream”
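Written out, the decay rule gives the following weights. The notation is an assumption introduced here: transactions t_1, ..., t_n are numbered by arrival, w_n(t_i) is the weight of t_i after n arrivals, and c_n(x) is the damped count of item x.

```latex
w_n(t_i) = f^{\,n-i}, \qquad
c_n(x) = \sum_{i=1}^{n} f^{\,n-i}\,\mathbf{1}[x \in t_i], \qquad
c_n(x) = f \cdot c_{n-1}(x) + \mathbf{1}[x \in t_n], \quad 0 < f < 1 .
```

The last identity shows that the damped count can be updated in constant time per arrival, without storing past transactions.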
Synopsis 1/2
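• Fit the entire stream within the available space?
  • impossible
  • compute statistical properties of the frequency vector, instead of the vector itself
  • acceptable to generate approximate solutions
• Frequency vector: histogram of our stream
• p-th frequency moment of the stream: $F_p = \sum_i f_i^p$, where $f_i$ is the number of occurrences of element i in the stream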
Notable moments
• zero frequency moment: number of distinct elements in our stream (number of non-zero entries of the frequency vector)
• first frequency moment: number of elements in the stream.
• second frequency moment: classic statistic for streaming applications
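For reference, the moments can be computed exactly when the whole frequency vector fits in memory. The sketch below takes the p-th frequency moment to be the sum of the p-th powers of the element frequencies (the standard definition); the function name and the sample stream are illustrative.

```python
from collections import Counter


def frequency_moment(stream, p):
    """F_p = sum over distinct elements of (frequency ** p); F_0 counts distinct elements."""
    freq = Counter(stream)  # the frequency vector (histogram) of the stream
    return sum(count ** p for count in freq.values())


stream = [3, 1, 2, 4, 2, 3, 5]
print(frequency_moment(stream, 0))  # number of distinct elements
print(frequency_moment(stream, 1))  # length of the stream
print(frequency_moment(stream, 2))  # second moment
```

This baseline needs one counter per distinct element, which is exactly the cost that sketches try to avoid.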
Synopsis 2/2
• Approximate solutions by summarizing the data:
  • Sampling
  • Hash sketches
    • Distinct items
    • FM (Flajolet-Martin) sketches
  • Linear-projection sketches
    • 2nd frequency moment
    • AMS (Alon, Matias and Szegedy) sketches
Warm-Up
• Stream contains d − 1 distinct integers x ∈ [1, d] in an arbitrary order
• Compute the missing integer k?
  • initialize counter a = 1 ⊕ 2 ⊕ … ⊕ d
  • update(x): a = a ⊕ x
  • query(): return a
• k is the only integer that appears once in the XOR sequence, so a = k
• memory: log d + 1 = O(log d) bits
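The warm-up translates almost line by line into code. This is a minimal sketch; the class name is an assumption, while `update` and `query` follow the operations listed on the slide.

```python
from functools import reduce
from operator import xor


class MissingIntegerFinder:
    """Finds the one integer of [1, d] absent from a stream of d - 1 distinct integers."""

    def __init__(self, d):
        # a = 1 xor 2 xor ... xor d
        self.a = reduce(xor, range(1, d + 1), 0)

    def update(self, x):
        # Each integer present in the stream cancels its own copy in a.
        self.a ^= x

    def query(self):
        return self.a


finder = MissingIntegerFinder(d=6)
for x in [4, 1, 6, 2, 5]:  # 3 is missing
    finder.update(x)
print(finder.query())
```

Only the single counter `a` is stored, regardless of the stream length.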
FM Sketch 1/2
• Assume a hash function h(x) that maps incoming values x in [0, …, N−1] uniformly across [0, …, 2^L − 1], where L = O(log N)
• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  • A value x is mapped to lsb(h(x))
• Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
  • For each incoming value x, set BITMAP[lsb(h(x))] = 1
Example: x = 5, h(x) = 101100, lsb(h(x)) = 2
Position: 5 4 3 2 1 0
BITMAP:   0 0 0 1 0 0
FM Sketch 2/2
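• By uniformity through h(x): $\Pr[\text{BITMAP}[k] = 1] = \Pr[h(x) \text{ ends in } 1\,0^k] = \frac{1}{2^{k+1}}$
  • Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], ...
• Let R = position of the rightmost zero in BITMAP
  • Use as indicator of log(d)
• [FM85] prove that $E[R] = \log(\varphi d)$, where $\varphi = .7735$
  • Estimate $d = 2^R / \varphi$
  • Average several instances (different hash functions) to reduce estimator variance
[Figure: BITMAP drawn from position L−1 (left) to position 0 (right): all 0s at positions >> log(d), a fringe of 0/1s around log(d), all 1s at positions << log(d)]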
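The FM construction can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-256 stands in for the uniform hash function h, L is fixed to 32, a hash value of 0 is mapped to the last position, and the class and method names are hypothetical.

```python
import hashlib

L = 32        # number of bitmap positions (assumed; the slides only require L = O(log N))
PHI = 0.7735  # correction factor from [FM85], as quoted on the slide


def h(x, seed=0):
    """Hash x to an integer in [0, 2^L - 1] (SHA-256 stands in for a uniform hash)."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def lsb(y):
    """Position of the least-significant 1 bit of y (L - 1 if y == 0, by convention here)."""
    if y == 0:
        return L - 1
    return (y & -y).bit_length() - 1


class FMSketch:
    def __init__(self, seed=0):
        self.seed = seed
        self.bitmap = [0] * L

    def add(self, x):
        self.bitmap[lsb(h(x, self.seed))] = 1

    def merge(self, other):
        """Component-wise OR; valid only for sketches built with the same hash."""
        merged = FMSketch(self.seed)
        merged.bitmap = [a | b for a, b in zip(self.bitmap, other.bitmap)]
        return merged

    def estimate(self):
        # R = position of the lowest-index zero bit.
        r = self.bitmap.index(0) if 0 in self.bitmap else L
        return (2 ** r) / PHI


sketch = FMSketch()
for x in [5, 7, 5, 9, 7, 5]:
    sketch.add(x)
print(sketch.estimate())
```

A single bitmap gives a coarse estimate (a power of two divided by φ); building several sketches with different `seed` values and averaging R is the variance reduction mentioned on the slide.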
Hash sketches properties
• Composable: component-wise OR/add distributed sketches together
  • Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality
• Distributed setting:
  • performs local computation at each node
  • merges these sketches into a single global sketch
• Delete-proof: just use counters instead of bits in the sketch locations
  • +1 for inserts, −1 for deletes
AMS Sketch
• Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N), seen as a stream of i-values
  Stream: 3, 1, 2, 4, 2, 3, 5, ...
  f(1) f(2) f(3) f(4) f(5)
   1    2    2    1    1
• Basic construct: randomized linear projection of f() = project onto the dot product of the f-vector and ξ
  $\langle f, \xi \rangle = \sum_i f(i)\,\xi_i$, where ξ = vector of random values
• Simple to compute over the stream: add $\xi_i$ upon seeing the i-th value
• Tunable probabilistic guarantees on approximation error
Linear sketches properties
• Composable: simply add independently built projections
• Delete-proof: just subtract $\xi_i$ to delete an i-th value occurrence
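A linear-projection sketch for the second frequency moment can be illustrated as follows. This is a simplified sketch, not the full AMS estimator: it averages the squared projections instead of taking a median of means, and SHA-256 stands in for the 4-wise independent ±1 family that the analysis requires. All names are assumptions.

```python
import hashlib


def xi(i, seed):
    """Pseudo-random +1/-1 value for item i (SHA-256 stands in for a 4-wise independent family)."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return 1 if digest[0] & 1 else -1


class AMSSketch:
    def __init__(self, num_counters=64):
        # Counter j holds the projection <f, xi_j> = sum_i f(i) * xi_j(i).
        self.counters = [0] * num_counters

    def update(self, i, delta=1):
        """delta=+1 inserts one occurrence of i, delta=-1 deletes one."""
        for j in range(len(self.counters)):
            self.counters[j] += delta * xi(i, j)

    def merge(self, other):
        """Sketches built with the same xi vectors are simply added."""
        merged = AMSSketch(len(self.counters))
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged

    def estimate_f2(self):
        """Average of squared projections; each square has expectation F_2."""
        return sum(c * c for c in self.counters) / len(self.counters)


sketch = AMSSketch()
for i in [3, 1, 2, 4, 2, 3, 5]:
    sketch.update(i)
print(sketch.estimate_f2())  # the exact F_2 of this stream is 11
```

Because every counter is a linear function of the frequency vector, both properties on the slide follow directly: merging is addition and deletion is an update with `delta=-1`.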
Classic clustering
• The goal of clustering is to
  • group data points that are close (or similar) to each other
  • identify groupings (or clusters) in an unsupervised manner
    • Unsupervised: no information is provided to the algorithm on which data points belong to which clusters
• Example
[Figure: scatter plot of data points (x)]
Data Stream Clustering
• Clusters over user-specified time horizons
  • “micro-clustering”
• Online component:
  • periodically stores detailed summary statistics
• Offline component:
  • uses only the summary statistics to do clustering
[Figure: view of micro-clusters and view of macro-clusters]
Micro-clusters
• A micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline macro-clustering.
• The micro-clusters are stored at snapshots.
[Figure: timeline of snapshots]
What to Store in a Micro-Cluster
• Select relevant properties useful for maintaining clusters dynamically
  • Only additive/subtractive properties
  • we don’t have to compute them from scratch at each snapshot
• Examples
  • first-order moment
  • second-order moment
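The reason for storing only additive properties can be seen in a small sketch: the point count, the per-dimension sum (first-order moment) and the per-dimension sum of squares (second-order moment) can be merged and subtracted component-wise, and still yield the centroid and variance. The class and method names are assumptions.

```python
class MicroCluster:
    """Additive summary of a set of d-dimensional points."""

    def __init__(self, dim):
        self.n = 0                     # number of points
        self.linear_sum = [0.0] * dim  # first-order moment, per dimension
        self.square_sum = [0.0] * dim  # second-order moment, per dimension

    def add_point(self, point):
        self.n += 1
        for k, v in enumerate(point):
            self.linear_sum[k] += v
            self.square_sum[k] += v * v

    def merged_with(self, other):
        out = MicroCluster(len(self.linear_sum))
        out.n = self.n + other.n
        out.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        out.square_sum = [a + b for a, b in zip(self.square_sum, other.square_sum)]
        return out

    def minus(self, other):
        out = MicroCluster(len(self.linear_sum))
        out.n = self.n - other.n
        out.linear_sum = [a - b for a, b in zip(self.linear_sum, other.linear_sum)]
        out.square_sum = [a - b for a, b in zip(self.square_sum, other.square_sum)]
        return out

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        return [q / self.n - (s / self.n) ** 2
                for s, q in zip(self.linear_sum, self.square_sum)]


c = MicroCluster(2)
for p in [(1.0, 2.0), (3.0, 4.0)]:
    c.add_point(p)
print(c.centroid(), c.variance())
```

No individual point is kept, so the memory per micro-cluster is independent of how many points it has absorbed.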
Macro-Cluster Creation
• The current time is T and the window size is h. That means the user wants to find the clusters formed in (T−h, T).
• Approach:
  • 1st step: find the snapshot for T, get the micro-cluster set S(T).
  • 2nd step: find the snapshot for T−h, get the micro-cluster set S(T−h).
  • Use S(T) − S(T−h)
    • Specifically, we have a merged cluster with ID list (C1, C2, C3) in S(T) and a cluster with ID C1 in S(T−h).
    • Since C1 was formed before T−h, it should not contribute to the micro-clusters formed in (T−h, T).
  • Run k-means on the remaining micro-clusters
Example
C_ID: [C1]          Time: T-h
C_ID: [C1, C2, C3]  Time: T
C_ID: [C2, C3]      Result: T-h
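The snapshot subtraction of this example can be sketched as follows, assuming each snapshot maps an ID list to additive statistics (count, sum, sum of squares) of 1-D points. The function name, the data layout and the numbers are made up for illustration.

```python
def subtract_snapshots(s_now, s_past):
    """Return the micro-clusters of s_now restricted to the horizon (T-h, T).

    Each snapshot maps a frozenset of cluster ids to additive statistics
    (n, linear_sum, square_sum) for 1-D points.
    """
    result = {}
    for ids, (n, ls, ss) in s_now.items():
        for old_ids, (n0, ls0, ss0) in s_past.items():
            if old_ids <= ids:
                # The older cluster was absorbed into this one: remove its share.
                # Keep the original ids if the cluster merely grew without merging.
                ids = (ids - old_ids) or ids
                n, ls, ss = n - n0, ls - ls0, ss - ss0
        if n > 0:
            result[ids] = (n, ls, ss)
    return result


s_past = {frozenset({"C1"}): (10, 50.0, 260.0)}               # S(T-h)
s_now = {frozenset({"C1", "C2", "C3"}): (25, 170.0, 1300.0)}  # S(T)
print(subtract_snapshots(s_now, s_past))
```

The statistics that remain describe only the points that arrived in (T−h, T); these remaining micro-clusters are what the offline step would pass to k-means.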
Distributed setting
• Expensive:
  • transmit all of the data to a centralized server
  • natural approach
• Efficient:
  • performs local clustering at each node
  • merges these different clusters into a single global clustering
  • low communication cost
Practice
Gap between theory and practice
• Somehow formal methods and popular technologies are disjoint nowadays
  • Methods → re-thinking classical problems
  • Technologies → making computation feasible
• For lack of time, our discussion on practice focuses on more basic problems (level shift) than those discussed in theory (note: the design patterns are valid in general)
Exercise: Level shift
• Detection of outliers in a sensor stream using Spark.
• Example: consider some product being shipped in temperature-controlled containers. The customer has a Service Level Agreement (SLA) with the transportation company, which defines how the temperature is maintained within a predefined range.
  • Mean temperature within a time window has to be below a predefined upper limit or above some predefined lower threshold.
  • Some minimum percentage of the data within a time window has to be below some upper threshold or above some lower threshold.
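The two SLA conditions can be expressed as simple predicates over one window of readings. This sketch assumes a two-sided range (both a lower and an upper limit apply); the function names, limits and temperature values are made up for illustration.

```python
def mean_within_limits(window, lower, upper):
    """Condition 1: the window mean stays inside [lower, upper]."""
    mean = sum(window) / len(window)
    return lower <= mean <= upper


def enough_samples_within_limits(window, lower, upper, min_fraction):
    """Condition 2: at least min_fraction of the samples lie inside [lower, upper]."""
    inside = sum(1 for t in window if lower <= t <= upper)
    return inside / len(window) >= min_fraction


window = [4.1, 4.3, 3.9, 8.2, 4.0]  # temperatures in one time window (made-up values)
print(mean_within_limits(window, lower=2.0, upper=6.0))
print(enough_samples_within_limits(window, 2.0, 6.0, min_fraction=0.9))
```

The single reading of 8.2 is absorbed by the mean but breaks the percentage condition, which shows why the two conditions are not interchangeable.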
Caveat
• In machine learning parlance, the problem we are solving is supervised outlier detection. It’s supervised because we are specifying the outlier conditions explicitly through the SLA.
Sample architecture
[Figure: four-node cluster running HDFS, YARN and Spark
  node addresses: 192.160.27.100, 192.160.27.101, 192.160.27.102, 192.160.27.103
  node labels: NodeManager, SecondaryNodeM; DataNode, NodeManager; ResourceManager, Spark Driver; DataNode, NodeManager]
Sample software
• https://github.com/pranab/ruscello
  • Java: streaming algorithms for level shift detection (can be used by any stream computation framework, e.g. Storm, Spark Streaming)
  • Scala: Spark shell
  • Python: everything else
Spark Streaming
• Spark unifies batch and real-time processing
• Notable characteristics of Spark Streaming:
  • Messages are processed in micro-batches, where the stream is essentially a sequence of RDDs
  • RDDs from the stream are processed like normal Spark offline RDD processing.
Sensor data generation
• Temperature at desired level + some random noise
• Random temperature shifts upper/lower
• Sensor data:
  • Sensor ID
  • Timestamp
  • Temperature
• Data can be piped to
  • Socket server (Spark Streaming has a socket stream receiver)
  • Kafka queue
  • HDFS
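A generator along these lines could look like the sketch below. It is not the script from the referenced repository: the function name, the default temperature, noise, shift size and shift probability are all illustrative assumptions.

```python
import random
import time


def generate_readings(sensor_ids, n, base=4.0, noise=0.5, shift=5.0,
                      shift_prob=0.05, seed=None):
    """Yield (sensor_id, timestamp, temperature) tuples.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    now = int(time.time())
    for step in range(n):
        for sensor_id in sensor_ids:
            temperature = base + rng.uniform(-noise, noise)
            if rng.random() < shift_prob:
                # Occasional level shift, upward or downward.
                temperature += rng.choice([-shift, shift])
            yield sensor_id, now + step, round(temperature, 2)


for record in generate_readings(["U4W8U4L3", "HCEJRWFP"], n=3, seed=42):
    print(",".join(str(field) for field in record))
```

The printed comma-separated lines could then be written to a socket, a queue or a file, depending on which input the streaming job reads from.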
Input window
• Since we are dealing with time series data, we use a time-bound window → i.e., every 30 sec
• If data samples arrive at regular intervals and the variability in sampling period is negligible, we can use a size-bound window → i.e., every 10 samples
• As each data sample arrives
  • Add data to the window object
  • Verify the SLA condition expression
  • If true, then the violation is appended to the object state
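The per-sample steps can be sketched with a size-bound window object. This is a simplified stand-in, not the window class of the referenced software: the class name, the mean-based condition and the sample values are assumptions.

```python
from collections import deque


class SizeBoundWindow:
    """Keeps the last `size` samples and records SLA violations in its state."""

    def __init__(self, size, lower, upper):
        self.samples = deque(maxlen=size)
        self.lower = lower
        self.upper = upper
        self.violations = []  # (timestamp, window mean) of each violation

    def add(self, timestamp, temperature):
        self.samples.append(temperature)
        if len(self.samples) < self.samples.maxlen:
            return  # wait until the window is full
        mean = sum(self.samples) / len(self.samples)
        if not (self.lower <= mean <= self.upper):
            self.violations.append((timestamp, mean))


window = SizeBoundWindow(size=3, lower=2.0, upper=6.0)
for ts, temp in enumerate([4.0, 4.0, 4.0, 10.0, 10.0, 4.0, 4.0, 4.0]):
    window.add(ts, temp)
print(len(window.violations))
```

In a streaming job one such object would be kept per sensor ID as the state that is updated by every micro-batch.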
Output stream
• Spark returns a stream of RDDs, where each RDD is composed of (sensor ID, state object)
  • Query the state object for the number of violations
• Sample output:
device:U4W8U4L3 num violations:102
device:HCEJRWFP num violations:194
device:U4W8U4L3 num violations:102
device:HCEJRWFP num violations:194
device:U4W8U4L3 num violations:247
device:HCEJRWFP num violations:411
(We could also produce a more detailed output containing the timestamp and mean temperature reading for each violation of each sensor.)
References
Useful References
• Schnase, John L., et al. "MERRA analytic services: meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service." Computers, Environment and Urban Systems (2014).
• Aggarwal, Charu C., ed. Managing and Mining Sensor Data. Springer Science & Business Media, 2013.
• http://pkghosh.wordpress.com/2015/02/19/real-time-detection-of-outliers-in-sensor-data-using-spark-streaming