awesome big data algorithms
TRANSCRIPT
-
7/27/2019 Awesome Big Data Algorithms
1/37
Awesome Big Data Algorithms
http://xkcd.com/1185/
-
Awesome Big Data Algorithms
C. Titus Brown
Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
-
Welcome!
More of a computational scientist than a computer scientist; will be using simulations to demo & explore algorithm behavior.
Send me questions/comments @ctitusbrown, or [email protected].
-
Features
I will be using Python rather than C++, because Python is easier to read.
I will be using IPython Notebook to demo.
I apologize in advance for not covering your favorite data structure or algorithm.
-
Outline
The basic idea
Three examples
  Skip lists (a fast key/value store)
  HyperLogLog counting (counting discrete elements)
  Bloom filters and CountMin Sketches
Folding, spindling, and mutilating DNA sequence
References and further reading
-
The basic idea
Problem: you have a lot of data to count, track, or otherwise analyze.
This data is Data of Unusual Size, i.e. you can't just brute-force the analysis.
For example,
  Count the approximate number of distinct elements in a very large (infinite?) data set.
  Optimize queries by using an efficient but approximate prefilter.
  Determine the frequency distribution of distinct elements in a very large data set.
-
Online and streaming vs. offline
Large is hard; infinite is much easier.
Offline algorithms analyze an entire data set all at once.
Online algorithms analyze data serially, one piece at a time.
Streaming algorithms are online algorithms that can be used for very memory- & compute-limited analysis.
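To make the offline/online distinction concrete, here is a minimal sketch using a plain average. The names `offline_mean` and `OnlineMean` are made up for this illustration; the online version keeps only two numbers of state no matter how long the stream is.

```python
def offline_mean(values):
    """Offline: needs the whole data set in hand before it can answer."""
    values = list(values)
    return sum(values) / len(values)


class OnlineMean:
    """Online: sees one value at a time and keeps O(1) state."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: move the running mean toward x by 1/n of the gap.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


data = [3, 1, 4, 1, 5, 9, 2, 6]
om = OnlineMean()
for x in data:          # could just as well be an endless stream
    om.update(x)

print(om.mean, offline_mean(data))
```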
-
Exact vs random or probabilistic
Often an approximate answer is sufficient, esp if you can place bounds on how wrong the approximation is likely to be.
Often random algorithms or probabilistic data structures can be found with good typical behavior but bad worst-case behavior.
-
For one (stupid) example
You can trim 8 bits off of integers for the purpose of averaging them.
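A quick simulation of that example, assuming "trim 8 bits" means dropping the low 8 bits of each integer. The value range and sample size are arbitrary choices for the demo. Each value loses at most 255, so the error in the average is bounded by 256 regardless of how many values are averaged.

```python
import random

random.seed(1)
# Made-up test data: 100,000 integers between 2**24 and 2**32.
values = [random.randrange(1 << 24, 1 << 32) for _ in range(100_000)]

exact = sum(values) / len(values)

# Drop the low 8 bits of every value, average, then scale back up.
trimmed = (sum(v >> 8 for v in values) / len(values)) * 256

absolute_error = exact - trimmed          # always in [0, 256)
relative_error = absolute_error / exact

print(exact, trimmed, relative_error)
```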
-
Skip lists
A randomly indexed improvement on linked lists.
Each node can belong to one or more vertical levels, which allow fast search/insertion/deletion: ~O(log(n)) typically!
Wikipedia
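A minimal skip list sketch along these lines. This is not the code demoed in the talk; the class and method names, `MAX_LEVEL = 16`, and the promotion probability `p = 0.5` are all assumptions made for illustration. Each new node gets a random height, and searches start at the top level and drop down.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level   # forward[i] = next node on level i


class SkipList:
    MAX_LEVEL = 16   # arbitrary cap for this sketch

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.head = Node(None, None, self.MAX_LEVEL)   # sentinel, never holds data
        self.level = 1                                 # levels currently in use

    def _random_level(self):
        # Flip coins: each node is promoted one more level with probability p.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _find_predecessors(self, key):
        # For every level, the last node whose key is strictly less than `key`.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value          # key already present: overwrite
            return
        level = self._random_level()
        self.level = max(self.level, level)
        node = Node(key, value, level)
        for i in range(level):               # splice into each level it belongs to
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key, default=None):
        candidate = self._find_predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return default

    def keys(self):
        # Level 0 is an ordinary sorted linked list of every node.
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


sl = SkipList(seed=42)
for k in [5, 3, 8, 1, 9, 2]:
    sl.insert(k, k * k)
print(list(sl.keys()), sl.get(8), sl.get(7))
```

The O(log(n)) behavior is only typical: an unlucky sequence of coin flips can leave every node on one level, which degrades to a plain linked list.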
-
Skip lists
A randomly indexed improvement on linked lists.
Very easy to implement; asymptotically good behavior.
From reddit: "if someone held a gun to my head and asked me to implement an efficient set/map storage, I would implement a skip list."
(Response: does this happen to you a lot??)
Wikipedia
-
Channel randomness!
If you can construct or rely on randomness, then you can easily get good typical behavior.
Note, a good hash function is essentially the same as a good random number generator.
-
HyperLogLog cardinality counting
Suppose you have an incoming stream of many, many objects.
And you want to track how many distinct items there are, and you want to accumulate the count of distinct objects over time.
-
Relevant digression:
Flip some unknown number of coins. Q: what is something simple to track that will tell you roughly how many coins you've flipped?
A: longest run of HEADs. Long runs are very rare and are correlated with how many coins you've flipped.
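A small simulation of the coin-flip digression. The function name, the flip counts, and the 20 trials per setting are arbitrary choices for this sketch. The longest run grows roughly like log2 of the number of flips, so 2 raised to the run length gives a crude, order-of-magnitude guess at how many coins were flipped.

```python
import random


def longest_run_of_heads(n_flips, rng):
    """Flip n_flips fair coins and return the longest streak of heads."""
    longest = current = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:      # heads
            current += 1
            longest = max(longest, current)
        else:                       # tails resets the streak
            current = 0
    return longest


rng = random.Random(0)
for n in (10, 1_000, 100_000):
    runs = [longest_run_of_heads(n, rng) for _ in range(20)]
    avg = sum(runs) / len(runs)
    # 2**avg is only a rough estimate of n; individual trials vary a lot.
    print(n, avg, 2 ** avg)
```

Any single trial is noisy, which is why averaging over many independent estimates matters.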
-
Cardinality counting with HyperLogLog
Essentially, use longest run of 0-bits observed in a hash value.
Use multiple hash functions so that you can take the average.
Take harmonic mean + low/high sampling adjustment => result.
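A compact HyperLogLog sketch, with several assumptions called out. Instead of literally running multiple hash functions, it uses the usual trick of one hash whose first `b` bits pick one of `m = 2**b` registers. The hash is SHA-1 truncated to 64 bits (any well-mixed hash would do), and only the low-range (linear counting) adjustment is included, since a 64-bit hash makes the high-range one unnecessary at this scale. The `alpha` constant is the standard approximation for m >= 128.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                          # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this approximation holds for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")    # 64-bit hash value
        j = x >> (64 - self.b)                   # first b bits choose a register
        w = x & ((1 << (64 - self.b)) - 1)       # remaining 64-b bits
        # Rank = number of leading zero bits in w, plus one.
        rank = (64 - self.b) - w.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        # Harmonic mean of 2**register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        if estimate <= 2.5 * self.m:
            # Low-range adjustment: fall back to linear counting.
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(100_000):
    hll.add(i % 20_000)        # 100,000 items, 20,000 distinct
print(round(hll.count()))
```

With 1,024 registers the expected relative error is about 1.04 / sqrt(1024), or roughly 3%, and the whole structure is 1,024 small integers regardless of stream length.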
-
Bloom filters
A set membership data structure that is probabilistic but only yields false positives.
Trivial to implement; hash function is main cost; extremely memory efficient.
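A minimal Bloom filter sketch to show how little there is to it. The sizes (`n_bits`, `n_hashes`) and the use of salted SHA-256 as the hash family are assumptions for this illustration, not tuned values. An item is reported present only if all of its bit positions are set, so a stored item can never be missed, but an unseen item can collide.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)   # packed bit array

    def _positions(self, item):
        # Derive n_hashes positions by salting one hash with the index i.
        data = str(item).encode("utf-8")
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for word in ("skip", "list", "bloom"):
    bf.add(word)
print("bloom" in bf, "filter" in bf)
```

The filter above uses 8 KB no matter what the items are; the false positive rate climbs as more items are added.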
-
My research applications
Biology is fast becoming a data-driven science.
http://www.genome.gov/sequencingcosts/
-
Shotgun sequencing analogy:
feeding books into a paper shredder, digitizing the shreds, and reconstructing the book.
Although for books, we often know the language and not just the alphabet :)
-
Shotgun sequencing is:
Randomly ordered.
Randomly sampled.
Too big to efficiently do multiple passes.
-
Shotgun sequencing
[Figure: genome (unknown) with reads (randomly chosen; have errors) laid out against it.]
Coverage is simply the average number of reads that overlap each true base in genome.
Here, the coverage is ~10; just draw a line straight down from the top through all of the reads.
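The coverage definition can be checked with a toy simulation. The genome length, read length, and read count below are made-up numbers chosen to give 10x, and reads are placed uniformly at random with no errors.

```python
import random


def per_base_depth(genome_len, read_len, n_reads, rng):
    """Place reads at random start positions; count reads over each base."""
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


rng = random.Random(0)
genome_len, read_len, n_reads = 10_000, 100, 1_000
depth = per_base_depth(genome_len, read_len, n_reads, rng)

# Average depth over all bases; equals n_reads * read_len / genome_len.
coverage = sum(depth) / genome_len
print(coverage, min(depth), max(depth))
```

The average is fixed by the totals, but individual bases land well above and below it, which is the reason random sampling has to be deep.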
-
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (300 Gbp for human).
-
Streaming algorithm to do so: digital normalization
[Figure, built up over several slides: the true sequence (unknown) with reads (randomly sequenced) accumulating beneath it. Final panel label: redundant reads (not needed for assembly).]
If the next read is from a high-coverage region, discard it.
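One way to sketch that streaming rule in code. The assumptions here: a read's coverage is estimated as the median count of its k-mers among the reads kept so far, an exact `Counter` stands in for a memory-efficient approximate counter, and reverse complements and sequencing errors are ignored. The values of `k` and `cutoff` are illustrative only.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=5):
    """Single pass over reads; yield only reads from low-coverage regions."""
    counts = Counter()                    # k-mer counts of the reads kept so far
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:                # read shorter than k: dropped in this sketch
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)     # only kept reads add to the counts
            yield read
        # otherwise the read is from a high-coverage region: discard


# Toy data: a random 1,000-base "genome" sampled at ~100x with 100-base reads.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1_000))
reads = []
for _ in range(1_000):
    start = rng.randrange(len(genome) - 100 + 1)
    reads.append(genome[start:start + 100])

kept = list(digital_normalization(reads, k=20, cutoff=5))
print(len(reads), len(kept))
```

Each read is examined once and never revisited, so the input can be an arbitrarily long stream.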
-
Storing data this way is better than best-possible information-theoretic storage.
Pell et al., PNAS 2012
-
Use Bloom filter to store graphs
Pell et al., PNAS 2012
Graphs only gain nodes because of Bloom filter false positives.
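A rough sketch of the idea, assuming the graph is a de Bruijn graph whose nodes are k-mers. Only the nodes go into the Bloom filter; edges are implicit and are recovered by asking the filter about each possible one-base extension. The filter parameters, helper names, and the toy sequence are made up for illustration, and this is not the implementation from the paper.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def load_sequence(bf, seq, k):
    """Store every k-mer of seq as a graph node."""
    for i in range(len(seq) - k + 1):
        bf.add(seq[i:i + k])


def neighbors(bf, kmer):
    """Stored k-mers that overlap `kmer` by k-1 bases on either side."""
    found = []
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            # A false positive here shows up as an extra node, never a missing one.
            if candidate in bf and candidate not in found:
                found.append(candidate)
    return found


k = 7
seq = "ACGTTGCATGGCCAATCGGATTC"
graph = BloomFilter()
load_sequence(graph, seq, k)
print(neighbors(graph, seq[3:3 + k]))
```

Real neighbors are always found because a Bloom filter has no false negatives; the only possible error is an occasional spurious extra node.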
-
Some assembly details
This was completely intractable.
Implemented in C++ and Python; good practice (?)
We've changed scaling behavior from data to information.
Practical scaling for ~soil metagenomics is 10x: need
-
Concluding thoughts
Channel randomness.
Embrace streaming.
Live with minor uncertainty.
Don't be afraid to discard data.
(Also, I'm an open source hacker who can confer PhDs, in exchange for long years of low pay living in Michigan. E-mail me! And don't talk to Brett Cannon about PhDs first.)
-
References
Skip lists: Wikipedia, and John Shipman's code: http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf
HyperLogLog: Aggregate Knowledge's blog, http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ and https://github.com/svpcom/hyperloglog
Bloom filters: Wikipedia
Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html