Mining Sensor Data - Roma Tre University, torlone/bigdata/S1-streaming.pdf

Mining Sensor Data Donatella Firmani


Post on 04-Oct-2020



Preliminary comments

• Internet of Things = connected-everything world
  ◦ According to Cisco, there will be 21 billion connected devices by 2018.
• Analytics of sensor-generated data is mostly about real-time analytics of time series data


Overview of this lecture

• Real-world example
• Theory (data streaming)
• Practice (Spark exercise)


Real-world example


Data collection

• How much climate data at NASA?
  ◦ MERRA* Reanalysis Collection: ~200 TB
  ◦ Total data holdings of the NASA Center for Climate Simulation (NCCS): ~40 PB
• Intergovernmental Panel on Climate Change
  ◦ Fifth Assessment Report: ~5 PB (data online now)
  ◦ Sixth Assessment Report: ~100 PB (to be created within the next 5 to 6 years)

* Modern-Era Retrospective Analysis for Research and Applications


MERRA technologies

• Hadoop File System
  ◦ Native MERRA files are sequenced and ingested into the Hadoop cluster in triplicated 640 MB blocks.
  ◦ Total size of the MERRA HDFS repository: ~480 TB.
• MapReduce
  ◦ 36-node Dell cluster, 576 Intel 2.6 GHz Sandy Bridge cores, 1300 TB raw storage, 1250 GB RAM, 11.7 TF theoretical peak compute capacity.
  ◦ FDR InfiniBand network with peak TCP/IP speeds >20 Gbps.


Impact

• Wei experiment (contribution of irrigation to precipitation)
• Traditional:
  ◦ ~8.4 TB transferred from archive to local workstation (weeks)
  ◦ Clipping and averaging performed by a Fortran program on the local workstation (days)
• MERRA:
  ◦ Clipping and averaging performed by MERRA (less than one day)
  ◦ ~35 GB of final product moved to local workstation
  ◦ Significant time savings in data wrangling
  ◦ Rapid screening over monthly means files takes minutes


Other Applications

• Military
  ◦ Activity monitoring
  ◦ Event detection
• Cosmological
  ◦ Space station data
  ◦ Space telescopes
• Mobile
  ◦ Wearable sensors
  ◦ Social sensing


Theory


Problem definition

[Figure: three overlapping areas]
• Data Mining (discovery of meaningful patterns in collections of data)
• Data Streaming
• Big Data


Not only big data

• If the insight being sought through analytics needs a global context, then all the data will be sent to a back-end Big Data platform (e.g. a NoSQL database)
• Otherwise:
  ◦ All the data may not end up in a Big Data platform (there may be hub nodes in a sensor network which collect and aggregate data from a set of sensors)
  ◦ The data arriving at the Big Data platform may not always be the raw sensor data. It may be data aggregated and preprocessed at the network edge.


Data streaming

• Characteristics of data streams are:
  ◦ Continuous flow of data
  ◦ Infinite length
  ◦ Examples: network traffic, sensor data, call center records


Challenges

• Data streaming challenges
  ◦ Volume
• Specific sensor challenges
  ◦ One pass of the data
  ◦ Temporal component (no straightforward adaptation of one-pass algorithms)
  ◦ Data is often uncertain
  ◦ Often mined in a distributed fashion (intermediate sensor nodes have limited processing power)


Typical problems

• Frequent items
• Frequent itemsets:
  ◦ recurring groups of elements
  ◦ used for forecasting
• Clustering: group similar items
• Classification: learn a model based on examples


Typical computational models

• over the entire data stream
  ◦ considers the data from the beginning until now
  ◦ called landmark data model
• over a window
  ◦ considers the data from now up to a certain range in the past
  ◦ sliding window model
• hybrid
  ◦ associates weights with the data in the stream, and gives higher weights to recent data than to those in the past
  ◦ damped window model


Discussed in this lecture

• Frequent items
• Clustering


Classic Frequent Pattern Mining

[Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
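As a minimal sketch of what "frequent" means on the transactions listed above, the following Python counts the support of an itemset, i.e. the fraction of transactions containing it. The function names `support` and `frequent_itemsets` and the 60% threshold are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

# Transactions from the table above (Tid -> items bought).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def frequent_itemsets(transactions, size, min_support):
    """All itemsets of the given size whose support is at least min_support."""
    items = sorted(set().union(*transactions.values()))
    return [c for c in combinations(items, size)
            if support(c, transactions) >= min_support]

print(support({"Beer", "Diaper"}, transactions))   # 3 of the 5 transactions
print(frequent_itemsets(transactions, 2, 0.6))     # pairs bought together often
```

This brute-force enumeration is only workable for a small, stored table; the streaming setting discussed next cannot afford it.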


Over the entire stream

[Figure: stream]

Identify all elements whose current frequency exceeds support threshold s = 0.1%.


Related problem

[Figure: stream]

Identify all subsets of items whose current frequency exceeds s = 0.1%.


Over a window

[Figure: stream divided into bucket 1, bucket 2, bucket 3]

• Divide the stream into possibly overlapping buckets ("sliding window")
• It is possible to hold the transactions in each bucket in main memory (i.e., keep exact counters for items in the buckets)
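The idea of keeping exact counters for a window held in memory can be sketched as follows. This is a simplified illustration that assumes a single window sliding by one transaction at a time; the class name `SlidingWindowCounter` and its parameters are hypothetical.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Exact item counts over the last `size` transactions of a stream."""

    def __init__(self, size):
        self.size = size
        self.window = deque()     # transactions currently held in memory
        self.counts = Counter()   # exact counters for items in the window

    def add(self, transaction):
        self.window.append(transaction)
        self.counts.update(transaction)
        if len(self.window) > self.size:
            expired = self.window.popleft()
            self.counts.subtract(expired)
            self.counts += Counter()  # drops items whose count fell to zero

    def frequent(self, min_support):
        """Items contained in at least min_support of the window."""
        threshold = min_support * len(self.window)
        return {item for item, c in self.counts.items() if c >= threshold}

w = SlidingWindowCounter(size=3)
for t in [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]:
    w.add(t)
print(dict(w.counts), w.frequent(0.6))
```

Memory use is bounded by the window size, which is what makes exact counting feasible here but not over the entire stream.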


Hybrid

• Damped window:
  ◦ decay factor: the weight of each transaction is multiplied by a factor f < 1 when a new transaction arrives
  ◦ the overall effect is an exponential decay function
  ◦ effective for evolving data streams, since recent transactions are counted more significantly
• Out of the scope of this lecture
• We focus on "over the entire stream"
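Although the damped window is outside the scope of the lecture, the decay rule is short enough to sketch. The class name `DampedCounter` and the choice f = 0.5 are assumptions made for illustration only.

```python
class DampedCounter:
    """Per-item weights under a damped window with decay factor f < 1."""

    def __init__(self, f):
        if not 0 < f < 1:
            raise ValueError("decay factor must satisfy 0 < f < 1")
        self.f = f
        self.weights = {}

    def add(self, transaction):
        # Every earlier transaction loses weight by a factor f ...
        for item in self.weights:
            self.weights[item] *= self.f
        # ... and the new transaction enters with weight 1.
        for item in transaction:
            self.weights[item] = self.weights.get(item, 0.0) + 1.0

d = DampedCounter(f=0.5)
for t in [["a"], ["b"], ["a"]]:
    d.add(t)
print(d.weights)  # "a": 0.25 (oldest) + 1 (newest), "b": 0.5
```

A transaction that arrived j steps ago contributes f**j, which is the exponential decay mentioned above.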


Synopsis 1/2

• Fit the entire stream within the available space?
  ◦ impossible
  ◦ compute statistical properties of the frequency vector, instead of the vector itself
  ◦ acceptable to generate approximate solutions
• Frequency vector: histogram of our stream
• p-th frequency moment of the stream: $F_p = \sum_{i=1}^{N} f(i)^p$, where f(i) is the number of occurrences of value i


Notable moments

• zeroth frequency moment: number of distinct elements in our stream (number of non-zero entries of the frequency vector)
• first frequency moment: number of elements in the stream
• second frequency moment: classic statistic for streaming applications
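For reference, the moments can be computed exactly when the whole frequency vector fits in memory, using the standard definition F_p = sum over i of f(i)^p. The sketch below does that on a small example stream chosen for illustration; the sketching techniques that follow approximate the same quantities without storing the vector.

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact p-th frequency moment: sum of f(i)**p over values that occur."""
    freq = Counter(stream)  # frequency vector, non-zero entries only
    return sum(count ** p for count in freq.values())

stream = [3, 1, 2, 4, 2, 3, 5]
print(frequency_moment(stream, 0))  # number of distinct elements
print(frequency_moment(stream, 1))  # number of elements in the stream
print(frequency_moment(stream, 2))  # second frequency moment
```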


Synopsis 2/2

• Approximate solutions by summarizing the data:
  ◦ Sampling
  ◦ Hash sketches
      ◦ Distinct items
      ◦ FM (Flajolet-Martin) sketches
  ◦ Linear-projection sketches
      ◦ 2nd frequency moment
      ◦ AMS (Alon, Matias and Szegedy) sketches


Warm-Up

• Stream contains d−1 distinct integers x ∈ [1, d] in an arbitrary order
• Compute the missing integer k?
  ◦ initialize counter a = 1 ⊕ 2 ⊕ … ⊕ d
  ◦ update(x): a = a ⊕ x
  ◦ query(): return a
• k is the only integer that appears once in the XOR sequence, so a = k
• memory: log d + 1 = O(log d) bits
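The three operations above translate almost directly into code. The class name `MissingIntegerFinder` is hypothetical; `update` and `query` follow the names used on the slide.

```python
from functools import reduce
from operator import xor

class MissingIntegerFinder:
    """Finds the one integer of [1, d] missing from a stream, in O(log d) bits."""

    def __init__(self, d):
        # a = 1 xor 2 xor ... xor d
        self.a = reduce(xor, range(1, d + 1), 0)

    def update(self, x):
        # Every integer seen in the stream cancels its copy in the counter.
        self.a ^= x

    def query(self):
        return self.a

finder = MissingIntegerFinder(d=8)
for x in [3, 7, 1, 8, 2, 6, 4]:  # 5 never arrives
    finder.update(x)
print(finder.query())
```

The order of arrival does not matter because XOR is commutative and associative.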


FM Sketch 1/2

• Assume a hash function h(x) that maps incoming values x in [0, …, N-1] uniformly across [0, …, 2^L-1], where L = O(log N)
• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  ◦ A value x is mapped to lsb(h(x))
• Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
  ◦ For each incoming value x, set BITMAP[lsb(h(x))] = 1

Example: x = 5, h(x) = 101100, lsb(h(x)) = 2

Position   5 4 3 2 1 0
BITMAP     0 0 0 1 0 0


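FM Sketch 2/2

• By uniformity through h(x): $\Pr[\mathrm{BITMAP}[k] = 1] = \Pr[10^k] = \frac{1}{2^{k+1}}$, where $10^k$ stands for a hash value ending in a 1 followed by k zeros
  ◦ Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], ...
• Let R = position of rightmost zero in BITMAP
  ◦ Use as indicator of log(d)
• [FM85] prove that $E[R] = \log(\phi d)$, where $\phi = .7735$
  ◦ Estimate $d = 2^R / \phi$
  ◦ Average several instances (different hash functions) to reduce estimator variance

[Figure: BITMAP from position L-1 down to position 0. Positions >> log(d) are all 0, positions << log(d) are all 1, with a fringe of 0/1s around log(d).]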
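A compact sketch of the FM estimator described above follows. It is an illustration under stated assumptions: the hash `h` is built from SHA-256 with a seed (the slides do not prescribe a hash function), L = 32 bits, and R is averaged over 64 independently seeded sketches before applying d = 2^R / φ.

```python
import hashlib

L = 32        # bitmap length in bits (assumed)
PHI = 0.7735  # constant from [FM85], as quoted above

def h(x, seed):
    """Hypothetical hash: maps x roughly uniformly onto [0, 2**L - 1]."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:L // 8], "big")

def lsb(y):
    """Position of the least-significant 1 bit of y (L - 1 if y == 0)."""
    return (y & -y).bit_length() - 1 if y else L - 1

class FMSketch:
    def __init__(self, seed):
        self.seed = seed
        self.bitmap = [0] * L

    def add(self, x):
        self.bitmap[lsb(h(x, self.seed))] = 1

    def rightmost_zero(self):
        """R = lowest position whose bit is still 0."""
        for pos, bit in enumerate(self.bitmap):
            if bit == 0:
                return pos
        return L

def estimate_distinct(stream, instances=64):
    """Average R over several sketches, then return 2**R / PHI."""
    sketches = [FMSketch(seed) for seed in range(instances)]
    for x in stream:
        for s in sketches:
            s.add(x)
    mean_r = sum(s.rightmost_zero() for s in sketches) / instances
    return 2 ** mean_r / PHI

stream = list(range(1000)) * 2  # 1000 distinct values, each seen twice
print(round(estimate_distinct(stream)))
```

Repeated values set bits that are already set, so duplicates never change the estimate; only the number of distinct values matters.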


Hash sketches properties

• Composable: component-wise OR/add distributed sketches together
  ◦ Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality
  ◦ Distributed setting:
      ◦ performs local computation at each node
      ◦ merges these sketches into a single global sketch
• Delete-proof: just use counters instead of bits in the sketch locations
  ◦ +1 for inserts, -1 for deletes
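Both properties can be checked on a toy example. The sketch below assumes a SHA-256 based stand-in for lsb(h(x)) and a 16-bit bitmap; all function names are hypothetical.

```python
import hashlib

L = 16  # bitmap length in bits (assumed)

def position(x):
    """Hypothetical lsb(h(x)), with h built from SHA-256."""
    y = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:2], "big")
    return (y & -y).bit_length() - 1 if y else L - 1

def build_bitmap(values):
    bitmap = [0] * L
    for x in values:
        bitmap[position(x)] = 1
    return bitmap

def merge_or(a, b):
    """Composable: component-wise OR of two locally built sketches."""
    return [p | q for p, q in zip(a, b)]

def update_counters(counters, x, delta):
    """Delete-proof variant: +1 for an insert, -1 for a delete."""
    counters[position(x)] += delta

# Two nodes see overlapping sets; the merged sketch equals the sketch
# of the union, so it can be used to estimate |S1 U S2|.
merged = merge_or(build_bitmap(range(0, 600)), build_bitmap(range(400, 1000)))
print(merged == build_bitmap(range(0, 1000)))

# Insert 1, 2, 3 and then delete 3: same state as inserting only 1 and 2.
counters = [0] * L
for x in (1, 2, 3):
    update_counters(counters, x, +1)
update_counters(counters, 3, -1)
after_delete = [int(c > 0) for c in counters]
print(after_delete == build_bitmap([1, 2]))
```

Only the small sketches travel between nodes, which is what keeps the communication cost low.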


AMS Sketch

• Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values
  ◦ Example: the stream 3, 1, 2, 4, 2, 3, 5, ... gives f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
• Basic construct: randomized linear projection of f() = project onto the dot product of the f-vector and ξ:
  $\langle f, \xi \rangle = \sum_i f(i)\,\xi_i$, where ξ = vector of random values
  ◦ Simple to compute over the stream: add $\xi_i$ whenever the i-th value is seen
  ◦ Tunable probabilistic guarantees on approximation error
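The projection can be sketched in a few lines. The slides only say that ξ is a vector of random values; the code below assumes the common choice of ±1 values (here derived from SHA-256, a stand-in for a proper 4-wise independent family) and estimates the second frequency moment as the average of the squared projections over several seeds.

```python
import hashlib

def xi(i, seed):
    """Hypothetical random value for item i: +1 or -1, derived from a hash."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return 1 if digest[0] & 1 else -1

class AMSSketch:
    def __init__(self, instances=400):
        self.z = [0] * instances  # one projection <f, xi> per seed

    def update(self, i, delta=1):
        # delta=+1 adds an occurrence of value i, delta=-1 deletes one.
        for seed in range(len(self.z)):
            self.z[seed] += delta * xi(i, seed)

    def merge(self, other):
        """Projections are linear, so independently built sketches add up."""
        merged = AMSSketch(len(self.z))
        merged.z = [a + b for a, b in zip(self.z, other.z)]
        return merged

    def estimate_f2(self):
        """Average of squared projections, an estimate of sum f(i)**2."""
        return sum(z * z for z in self.z) / len(self.z)

stream = [3, 1, 2, 4, 2, 3, 5]
sketch = AMSSketch()
for i in stream:
    sketch.update(i)
print(sketch.estimate_f2())  # the exact second moment of this stream is 11
```

Increasing the number of instances tightens the estimate at the cost of more counters, which is the tunable trade-off mentioned above.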


Linear sketches properties

• Composable: simply add independently built projections
• Delete-proof: just subtract to delete an i-th value occurrence


Classic clustering

• The goal of clustering is to
  ◦ group data points that are close (or similar) to each other
  ◦ identify groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which data points belong to which clusters
• Example:

[Figure: scatter plot of data points (x) forming groups]


Data Stream Clustering

• Clusters over user-specified time horizons
  ◦ "micro-clustering"
• Online component:
  ◦ periodically stores detailed summary statistics
• Offline component:
  ◦ uses only the summary statistics to do clustering

[Figure: view of micro-clusters and view of macro-clusters]


Micro-clusters

• A micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline macro-clustering.
• The micro-clusters are stored at snapshots.

[Figure: sequence of snapshots over time]


What to Store in a Micro-Cluster

• Select relevant properties useful for maintaining clusters dynamically
• Only additive/subtractive properties
  ◦ we don't have to compute them from scratch at each snapshot
• Examples
  ◦ first-order moment
  ◦ second-order moment
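A minimal sketch of such a summary is shown below. It assumes a micro-cluster stores the point count together with per-dimension sums (first-order moment) and sums of squares (second-order moment); the class and method names are hypothetical.

```python
class MicroCluster:
    """Additive summary of a group of points: count, sum, sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim    # first-order moment
        self.squared_sum = [0.0] * dim   # second-order moment

    def add_point(self, point):
        self.n += 1
        for j, x in enumerate(point):
            self.linear_sum[j] += x
            self.squared_sum[j] += x * x

    def merge(self, other):
        # Additivity: the summary of a union is the sum of the summaries.
        self.n += other.n
        for j in range(len(self.linear_sum)):
            self.linear_sum[j] += other.linear_sum[j]
            self.squared_sum[j] += other.squared_sum[j]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.squared_sum)]

a = MicroCluster(dim=2)
a.add_point((1.0, 2.0))
b = MicroCluster(dim=2)
b.add_point((3.0, 4.0))
a.merge(b)
print(a.n, a.centroid(), a.variance())
```

Centroid and spread are recovered from the three stored quantities alone, so the raw points never need to be kept.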


Macro-Cluster Creation

• The current time is T and the window size is h: the user wants to find the clusters formed in (T-h, T).
• Approach:
  ◦ 1st step: find the snapshot for T, get the micro-cluster set S(T).
  ◦ 2nd step: find the snapshot for T-h, get the micro-cluster set S(T-h).
  ◦ Use S(T) - S(T-h)
      ◦ Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T) and a cluster with Id C1 in S(T-h).
      ◦ Since C1 was formed before T-h, it should not contribute to the micro-clusters formed in (T-h, T).
  ◦ Run K-means on the remaining micro-clusters


Example

C_ID: [C1]            Time: T-h
C_ID: [C1, C2, C3]    Time: T
C_ID: [C2, C3]        Result: T-h
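The subtraction step in this example can be sketched as follows. The representation is an assumption: each snapshot maps a set of cluster ids to a (count, linear sum) pair, the numbers are made up, and second-order sums would be subtracted in the same way. The K-means step on the result is omitted.

```python
def subtract_snapshot(current, past):
    """Return S(T) - S(T-h): remove from each current micro-cluster the
    contribution of past micro-clusters whose ids it has absorbed."""
    result = {}
    for ids, (n, linear_sum) in current.items():
        ids, linear_sum = set(ids), list(linear_sum)
        for old_ids, (old_n, old_sum) in past.items():
            if set(old_ids) <= ids:
                ids -= set(old_ids)
                n -= old_n
                linear_sum = [a - b for a, b in zip(linear_sum, old_sum)]
        if n > 0:  # keep only clusters with points newer than T-h
            result[frozenset(ids)] = (n, linear_sum)
    return result

# Hypothetical statistics: 30 points at time T, 10 of which were in C1 at T-h.
s_t = {frozenset({"C1", "C2", "C3"}): (30, [60.0, 90.0])}
s_t_minus_h = {frozenset({"C1"}): (10, [10.0, 20.0])}
print(subtract_snapshot(s_t, s_t_minus_h))
```

This works only because the stored properties are subtractive, which is why the micro-cluster summaries are restricted to such properties.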


Distributed setting

• Expensive:
  ◦ transmit all of the data to a centralized server
  ◦ natural approach
• Efficient:
  ◦ performs local clustering at each node
  ◦ merges these different clusters into a single global clustering
  ◦ low communication cost


Practice


Gap between theory and practice

• Somehow, formal methods and popular technologies are disjoint nowadays
  ◦ Methods → re-thinking classical problems
  ◦ Technologies → making computation feasible
• For lack of time, our discussion on practice focuses on more basic problems (level shift) than those discussed in theory (Note: design patterns are valid in general)


Exercise: Level shift

• Detection of outliers in a sensor stream using Spark.
• Example: Consider some product being shipped in temperature-controlled containers. The customer has a Service Level Agreement (SLA) with the transportation company, which defines how the temperature is maintained within a predefined range.
  ◦ Mean temperature within a time window has to be below a predefined upper limit or above some predefined lower threshold.
  ◦ Some minimum percentage of the data within a time window has to be below some upper threshold or above some lower threshold.
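The two SLA conditions can be written as small predicates over the temperatures in one window. This is a sketch under an assumed reading of the conditions, namely that both an upper and a lower limit apply; the function names and the threshold values are hypothetical and are not taken from the ruscello project.

```python
def mean_violates(temps, lower, upper):
    """True if the window mean leaves the allowed range [lower, upper]."""
    mean = sum(temps) / len(temps)
    return mean < lower or mean > upper

def fraction_violates(temps, lower, upper, min_fraction):
    """True if fewer than min_fraction of the readings lie in [lower, upper]."""
    in_range = sum(1 for t in temps if lower <= t <= upper)
    return in_range / len(temps) < min_fraction

window = [4.0, 5.0, 12.0]  # one reading far above the allowed range
print(mean_violates(window, lower=2.0, upper=6.0))
print(fraction_violates(window, lower=2.0, upper=6.0, min_fraction=0.8))
```

The second condition is less sensitive to a single extreme reading than the mean, since it only counts how many readings fall outside the range.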


Caveat

• In Machine Learning parlance, the problem we are solving is supervised outlier detection. It is supervised because we are specifying the outlier conditions explicitly through the SLA.


Sample architecture

[Figure: four-node cluster at 192.160.27.100, 192.160.27.101, 192.160.27.102 and 192.160.27.103, running HDFS, YARN and Spark. Node roles shown: NodeManager + SecondaryNodeM; DataNode + NodeManager; ResourceManager + Spark Driver; DataNode + NodeManager.]


Sample software

• https://github.com/pranab/ruscello
  ◦ Java: streaming algorithms for level shift detection (can be used by any stream computation framework, e.g. Storm, Spark Streaming)
  ◦ Scala: Spark shell
  ◦ Python: everything else


Spark Streaming

• Spark unifies batch and real-time processing
• Spark Streaming notable characteristics:
  ◦ Messages are processed in micro-batches, where the stream is essentially a sequence of RDDs
  ◦ RDDs from the stream are processed like normal Spark offline RDD processing.


Sensor data generation

• Temperature at desired level + some random noise
• Random temperature shifts upper/lower
• Sensor data:
  ◦ Sensor ID
  ◦ Timestamp
  ◦ Temperature
• Data can be piped to
  ◦ Socket server (Spark Streaming has a socket stream receiver)
  ◦ Kafka queue
  ◦ HDFS
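A generator along these lines might look like the sketch below. It is not the generator from the ruscello repository: the function name, the comma-separated record layout, and all default parameters (desired level, noise, shift size and probability) are assumptions for illustration.

```python
import random
import time

def generate_readings(sensor_ids, n, desired=5.0, noise=0.5,
                      shift_prob=0.05, shift=4.0, seed=0,
                      start=None, interval=1):
    """Yield 'sensorID,timestamp,temperature' lines, n rounds per sensor."""
    rng = random.Random(seed)  # seeded, so runs are reproducible
    ts = int(time.time()) if start is None else start
    for _ in range(n):
        for sid in sensor_ids:
            # Desired level plus some random noise ...
            temp = desired + rng.gauss(0, noise)
            # ... with an occasional random shift up or down.
            if rng.random() < shift_prob:
                temp += rng.choice([-shift, shift])
            yield f"{sid},{ts},{temp:.2f}"
        ts += interval

lines = list(generate_readings(["U4W8U4L3", "HCEJRWFP"], n=5, start=1000))
for line in lines[:4]:
    print(line)
```

Each line could then be written to a socket, a Kafka topic, or a file on HDFS, depending on the receiver used.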


Input window

• Since we are dealing with time series data, we use a time-bound window → i.e., every 30 sec
• If data samples arrive at regular intervals and the variability in sampling period is negligible, we can use a size-bound window → i.e., every 10 samples
• As each data sample arrives
  ◦ Add data to the window object
  ◦ Verify SLA condition expression
  ◦ If true, then the violation is appended in the object state
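The per-sample steps above can be sketched with a size-bound window object. The sketch assumes the window slides by one sample and is checked once it is full, and it uses only the mean-temperature condition; the class name `SizeBoundWindow` and the limits are hypothetical.

```python
from collections import deque

class SizeBoundWindow:
    """Keeps the last `size` temperatures and records SLA violations."""

    def __init__(self, size, lower, upper):
        self.size = size
        self.lower = lower
        self.upper = upper
        self.samples = deque(maxlen=size)  # oldest sample drops out when full
        self.violations = []               # object state: (timestamp, mean)

    def add(self, timestamp, temperature):
        # 1. Add data to the window object.
        self.samples.append(temperature)
        if len(self.samples) < self.size:
            return
        # 2. Verify the SLA condition on the current window.
        mean = sum(self.samples) / self.size
        # 3. If violated, append the violation to the object state.
        if mean < self.lower or mean > self.upper:
            self.violations.append((timestamp, mean))

window = SizeBoundWindow(size=3, lower=2.0, upper=6.0)
for ts, temp in enumerate([5.0, 5.0, 5.0, 9.0, 9.0, 9.0]):
    window.add(ts, temp)
print(len(window.violations), window.violations[-1])
```

In a streaming job one such object would be kept per sensor ID, so that each sensor accumulates its own violation state.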


Output stream

• Spark returns a stream of RDDs, where each RDD is comprised of (sensor ID, state object)
• Query the state object for the number of violations
• Sample output:

device:U4W8U4L3 num violations:102

device:HCEJRWFP num violations:194

device:U4W8U4L3 num violations:102

device:HCEJRWFP num violations:194

device:U4W8U4L3 num violations:247

device:HCEJRWFP num violations:411

(We could also produce a more detailed output containing the timestamp and mean temperature reading for each violation of each sensor.)


References


Useful References

• Schnase, John L., et al. "MERRA Analytic Services: Meeting the big data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service." Computers, Environment and Urban Systems (2014).
• Aggarwal, Charu C., ed. Managing and Mining Sensor Data. Springer Science & Business Media, 2013.
• http://pkghosh.wordpress.com/2015/02/19/real-time-detection-of-outliers-in-sensor-data-using-spark-streaming
