awesome big data algorithms

7/27/2019 Awesome Big Data Algorithms


Awesome Big Data Algorithms

http://xkcd.com/1185/



Awesome Big Data Algorithms

C. Titus Brown

[email protected]

Asst Professor, Michigan State University

(Microbiology, Computer Science, and BEACON)



Welcome!

More of a computational scientist than a computer scientist; will be using simulations to demo & explore algorithm behavior.

Send me questions/comments @ctitusbrown, or [email protected].



Features

I will be using Python rather than C++, because Python is easier to read.

I will be using IPython Notebook to demo.

I apologize in advance for not covering your favorite data structure or algorithm.



Outline

The basic idea

Three examples:
  Skip lists (a fast key/value store)
  HyperLogLog counting (counting discrete elements)
  Bloom filters and CountMin Sketches

Folding, spindling, and mutilating DNA sequence

References and further reading



The basic idea

Problem: you have a lot of data to count, track, or otherwise analyze.

This data is Data of Unusual Size, i.e. you can't just brute force the analysis.

For example,
  Count the approximate number of distinct elements in a very large (infinite?) data set.
  Optimize queries by using an efficient but approximate prefilter.
  Determine the frequency distribution of distinct elements in a very large data set.



Online and streaming vs. offline

Large is hard; infinite is much easier.

Offline algorithms analyze an entire data set all at once.

Online algorithms analyze data serially, one piece at a time.

Streaming algorithms are online algorithms that can be used for very memory & compute limited analysis.
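
To make the offline/online distinction concrete, here is a minimal sketch that computes a mean both ways. The names `RunningMean` and `offline_mean` are illustrative, not from the talk; the online version keeps only two numbers no matter how long the stream is.

```python
class RunningMean:
    """Online mean: sees one value at a time and stores only two numbers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def add(self, x):
        # Incremental update: nudge the mean toward x by 1/n of the gap.
        self.n += 1
        self.mean += (x - self.mean) / self.n


def offline_mean(values):
    """Offline mean: needs the whole data set in memory at once."""
    values = list(values)
    return sum(values) / len(values)


# A generator stands in for a stream that cannot be replayed.
stream = (x * x for x in range(1, 1001))
rm = RunningMean()
for value in stream:
    rm.add(value)

print(rm.mean, offline_mean(x * x for x in range(1, 1001)))
```

The online version never needs to know how long the stream is, which is one way to read "infinite is much easier".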



Exact vs random or probabilistic

Often an approximate answer is sufficient, esp if you can place bounds on how wrong the approximation is likely to be.

Often random algorithms or probabilistic data structures can be found with good typical behavior but bad worst case behavior.



For one (stupid) example

You can trim 8 bits off of integers for the purpose of averaging them.
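
A small sketch of that example, assuming non-negative integers and assuming "trim" means dropping the 8 low-order bits. The function names are illustrative. The useful property is that the error is bounded: the trimmed average is low by less than 2**8.

```python
import random


def exact_mean(values):
    """Average using every bit of every value."""
    return sum(values) / len(values)


def trimmed_mean(values, bits=8):
    """Average after dropping the low `bits` bits of each value.

    Each value contributes value >> bits, so the running sum is smaller.
    Shifting the total back restores the scale; the discarded bits make
    the result low by less than 2**bits.
    """
    total = sum(v >> bits for v in values)
    return (total << bits) / len(values)


random.seed(1)
data = [random.randrange(2 ** 32) for _ in range(100_000)]
exact = exact_mean(data)
approx = trimmed_mean(data)
print(exact, approx, (exact - approx) / exact)
```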



Skip lists

A randomly indexed improvement on linked lists.

Each node can belong to one or more vertical levels, which allow fast search/insertion/deletion: ~O(log(n)) typically!

Wikipedia



Skip lists

A randomly indexed improvement on linked lists.

Very easy to implement; asymptotically good behavior.

From reddit: "if someone held a gun to my head and asked me to implement an efficient set/map storage, I would implement a skip list."

(Response: "does this happen to you a lot??")

Wikipedia
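
To back up the "very easy to implement" claim, here is a minimal key/value skip list sketch. It is not the code referenced in the talk; the class names, `MAX_LEVEL = 16`, and the promotion probability `p = 0.5` are assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # one "next" pointer per level


class SkipList:
    MAX_LEVEL = 16  # assumption: plenty for small demos

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.level = 1  # number of levels currently in use
        self.head = Node(None, None, self.MAX_LEVEL)

    def _random_level(self):
        # Flip coins: each success promotes the node one level higher.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _predecessors(self, key):
        # For each level, the last node whose key is < key.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def get(self, key, default=None):
        node = self._predecessors(key)[0].forward[0]
        if node is not None and node.key == key:
            return node.value
        return default

    def insert(self, key, value):
        update = self._predecessors(key)
        node = update[0].forward[0]
        if node is not None and node.key == key:
            node.value = value  # overwrite an existing key
            return
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(key, value, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def delete(self, key):
        update = self._predecessors(key)
        node = update[0].forward[0]
        if node is None or node.key != key:
            return False
        for i in range(len(node.forward)):
            update[i].forward[i] = node.forward[i]
        return True

    def keys(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


sl = SkipList(seed=42)
for k in [5, 1, 9, 3, 7]:
    sl.insert(k, str(k))
print(list(sl.keys()), sl.get(7), sl.get(4))
```

The only random choice is the level of each new node; that is what gives the typical (not guaranteed) O(log(n)) search path.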


13/37

Channel randomness!

If you can construct or rely on randomness, then you can easily get good typical behavior.

Note, a good hash function is essentially the same as a good random number generator.
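
One quick way to see the hash/random-number connection is to hash very non-random input (consecutive integers) and check that the results spread evenly. This sketch uses SHA-1 from Python's `hashlib` as the "good hash function"; the helper name `bucket` and the bucket count are illustrative choices.

```python
import hashlib
from collections import Counter


def bucket(item, n_buckets=16):
    """Map an item to a bucket using the first 8 bytes of its SHA-1 digest."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


# Consecutive integers are as structured as input gets, yet the bucket
# counts come out close to the 1000-per-bucket a uniform RNG would give.
counts = Counter(bucket(i) for i in range(16000))
print(sorted(counts.values()))
```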


14/37

HyperLogLog cardinality counting

Suppose you have an incoming stream of many, many objects.

And you want to track how many distinct items there are, and you want to accumulate the count of distinct objects over time.


15/37

Relevant digression:

Flip some unknown number of coins. Q: what is something simple to track that will tell you roughly how many coins you've flipped?

A: longest run of HEADs. Long runs are very rare and are correlated with how many coins you've flipped.
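
In the spirit of using simulations to explore behavior, here is a small sketch of the coin-flip experiment. The function name `longest_run`, the 20 trials per size, and the seed are arbitrary choices. The average longest run should grow by roughly a constant each time the number of flips is multiplied by 10, i.e. it tracks the logarithm of the flip count.

```python
import random


def longest_run(flips):
    """Length of the longest run of heads (truthy values) in a sequence."""
    longest = current = 0
    for heads in flips:
        current = current + 1 if heads else 0
        longest = max(longest, current)
    return longest


rng = random.Random(0)
avg_run = {}
for n in (10, 100, 1000, 10000, 100000):
    trials = [longest_run(rng.random() < 0.5 for _ in range(n))
              for _ in range(20)]
    avg_run[n] = sum(trials) / len(trials)
    print(n, avg_run[n])
```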



Cardinality counting with HyperLogLog

Essentially, use longest run of 0-bits observed in a hash value.

Use multiple hash functions so that you can take the average.

Take harmonic mean + low/high sampling adjustment => result.
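
The three steps above can be sketched as follows. This is a simplified illustration, not the svpcom/hyperloglog code listed in the references. One detail differs from the slide's wording: the usual formulation uses a single hash function and lets its first `b` bits pick one of `m = 2**b` registers, which plays the role of "multiple hash functions". The class name, the default `b=10`, and the use of SHA-1 truncated to 64 bits are assumptions; with a 64-bit hash, only the low-range adjustment is included.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                      # bits used to choose a register
        self.m = 1 << b                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        width = 64 - self.b
        j = x >> width                               # register index
        w = x & ((1 << width) - 1)                   # remaining bits
        # Rank = run of leading 0-bits in w, plus one.
        rank = width - w.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        # Harmonic mean of 2**register across all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Low-range adjustment: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(b=10)
for i in range(50000):
    hll.add(i)
    hll.add(i)          # duplicates do not change the estimate
print(round(hll.count()))
```

With 1024 registers the expected relative error is about 1.04 / sqrt(1024), roughly 3%, while the state is only 1024 small integers regardless of stream length.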



Bloom filters

A set membership data structure that is probabilistic but only yields false positives.

Trivial to implement; hash function is main cost; extremely memory efficient.
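
A minimal Bloom filter sketch, to show how little code is involved. The class name, the default sizes, and the choice of salted SHA-1 digests as the hash family are assumptions made for illustration; a faster non-cryptographic hash would be the more typical choice in practice.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=8192, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        data = str(item).encode("utf-8")
        for seed in range(self.n_hashes):
            digest = hashlib.sha1(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
members = ["item-%d" % i for i in range(500)]
for m in members:
    bf.add(m)

false_positives = sum(("other-%d" % i) in bf for i in range(1000))
print(all(m in bf for m in members), false_positives)
```

Bits are only ever set, never cleared, which is why an item that was added can never be reported absent.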



My research applications

Biology is fast becoming a data-driven science.

http://www.genome.gov/sequencingcosts/



Shotgun sequencing analogy:

feeding books into a paper shredder, digitizing the shreds, and reconstructing the book.

Although for books, we often know the language and not just the alphabet :)



Shotgun sequencing is:

Randomly ordered.
Randomly sampled.
Too big to efficiently do multiple passes.



Shotgun sequencing

[Figure: genome (unknown) above a stack of reads (randomly chosen; have errors)]

Coverage is simply the average number of reads that overlap each true base in genome.

Here, the coverage is ~10; just draw a line straight down from the top through all of the reads.



Random sampling => deep sampling needed

Typically 10-100x needed for robust recovery (300 Gbp for human)
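
The 300 Gbp figure follows from simple arithmetic, sketched below. The human genome size of about 3 Gbp is an assumption supplied here (it is not stated on the slide), and the function names are illustrative.

```python
def coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return n_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target coverage."""
    return target_coverage * genome_size


HUMAN_GENOME_BP = 3_000_000_000   # assumption: roughly 3 Gbp

print(bases_needed(10, HUMAN_GENOME_BP) / 1e9, "Gbp at 10x")
print(bases_needed(100, HUMAN_GENOME_BP) / 1e9, "Gbp at 100x")
```

So the slide's 300 Gbp corresponds to the top of the range, 100x coverage of a 3 Gbp genome.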



Streaming algorithm to do so: digital normalization

[Figure: true sequence (unknown) above reads (randomly sequenced)]



Digital normalization

[Figure sequence: reads accumulate one at a time under the true sequence (unknown)]

If next read is from a high coverage region - discard

Redundant reads (not needed for assembly)
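
The discard rule can be sketched as a single streaming pass. The slides do not say how "high coverage" is decided; this sketch assumes a read's coverage is estimated by the median count of its k-mers among the reads kept so far. It also uses an exact `Counter` where a memory-bounded structure such as a CountMin sketch would be used at scale. The values `k=4` and `cutoff=3` are toy values for the demo.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=4, cutoff=3):
    """Yield only reads whose estimated coverage is still below cutoff."""
    counts = Counter()   # exact counts; stand-in for a CountMin sketch
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue                       # read shorter than k
        if median(counts[w] for w in words) >= cutoff:
            continue                       # high-coverage region: discard
        counts.update(words)               # only kept reads are counted
        yield read


reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = list(digital_normalization(reads))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Each read is looked at once and either kept or dropped on the spot, so the pass is online and never revisits earlier data.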



Storing data this way is better than best-possible information-theoretic storage.

Pell et al., PNAS 2012



Use Bloom filter to store graphs

Pell et al., PNAS 2012

Graphs only gain nodes because of Bloom filter false positives.
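
A toy sketch of the idea, under several assumptions: nodes are k-mers, an edge exists when two stored k-mers overlap by k-1 bases, only the forward strand is handled, and the compact `BloomFilter` class here is illustrative rather than the implementation from the paper. Edges are never stored; they are discovered by asking the filter about each of the four possible extensions.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha1(bytes([seed]) + item.encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, bloom):
    """Successor k-mers the filter reports as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


seq = "ATGGCGTACGTTAGC"
k = 5
bloom = BloomFilter()
for km in kmers(seq, k):
    bloom.add(km)

# Walk the implicit graph from the first k-mer.
node = seq[:k]
print(node, "->", neighbors(node, bloom))
```

Because a Bloom filter has no false negatives, every real successor is always found; a false positive can only make an extra k-mer appear, which is the "graphs only gain nodes" property.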



Some assembly details

This was completely intractable.
Implemented in C++ and Python; good practice (?)
We've changed scaling behavior from data to information.
Practical scaling for ~soil metagenomics is 10x: need



Concluding thoughts

Channel randomness.
Embrace streaming.
Live with minor uncertainty.
Don't be afraid to discard data.

(Also, I'm an open source hacker who can confer PhDs, in exchange for long years of low pay living in Michigan. E-mail me! And don't talk to Brett Cannon about PhDs first.)



References

Skip Lists: Wikipedia, and John Shipman's code:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf

HyperLogLog: Aggregate Knowledge's blog,
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
And: https://github.com/svpcom/hyperloglog

Bloom Filters: Wikipedia

Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html

[email protected]
