Awesome Big Data Algorithms


  • 7/27/2019 Awesome Big Data Algorithms

    1/37

    Awesome Big Data Algorithms

    http://xkcd.com/1185/

    Awesome Big Data Algorithms

    C. Titus Brown

    [email protected]

    Asst Professor, Michigan State University

    (Microbiology, Computer Science, and BEACON)

    Welcome!

    More of a computational scientist than a computer scientist; will be using simulations to demo & explore algorithm behavior.

    Send me questions/comments @ctitusbrown, [email protected].

    Features

    I will be using Python rather than C++, because Python is easier to read.

    I will be using IPython Notebook to demo.

    I apologize in advance for not covering your favorite data structure or algorithm.

    Outline

      • The basic idea
      • Three examples
          • Skip lists (a fast key/value store)
          • HyperLogLog Counting (counting discrete elements)
          • Bloom filters and CountMin Sketches
      • Folding, spindling, and mutilating DNA sequence
      • References and further reading

    The basic idea

      • Problem: you have a lot of data to count, track, or otherwise analyze.
      • This data is Data of Unusual Size, i.e. you can't just brute force the analysis.
      • For example,
          • Count the approximate number of distinct elements in a very large (infinite?) data set
          • Optimize queries by using an efficient but approximate prefilter
          • Determine the frequency distribution of distinct elements in a very large data set.

    Online and streaming vs. offline

    Large is hard; infinite is much easier.

    Offline algorithms analyze an entire data set all at once.

    Online algorithms analyze data serially, one piece at a time.

    Streaming algorithms are online algorithms that can be used for very memory & compute limited analysis.
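To make the online/streaming distinction concrete, here is a minimal sketch of a streaming statistic: a running mean and variance that sees each value once and keeps only three numbers. The class name `RunningStats` and the choice of Welford's update are illustrative assumptions, not something taken from the talk.

```python
class RunningStats:
    """Online mean and variance (Welford's update) in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the summary; the value itself is not stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in (2, 4, 4, 4, 5, 5, 7, 9):  # could just as well be an unbounded generator
    stats.update(value)
print(stats.mean, stats.variance)
```

An offline version would first collect every value into a list; the streaming version can report an answer at any point in the stream.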

    Exact vs random or probabilistic

    Often an approximate answer is sufficient, especially if you can place bounds on how wrong the approximation is likely to be.

    Often random algorithms or probabilistic data structures can be found with good typical behavior but bad worst case behavior.

    For one (stupid) example

    You can trim 8 bits off of integers for the purpose of averaging them.
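A quick simulation of the bit-trimming example, as a sketch under assumed inputs (10,000 random 32-bit integers; the helper name `trim_low_bits` is made up for illustration). Dropping the low 8 bits can lower each value by at most 255, so the average can be off by at most 255 as well, which is a bound you know before looking at the data.

```python
import random

rng = random.Random(42)
values = [rng.randrange(2 ** 32) for _ in range(10_000)]


def trim_low_bits(x, bits=8):
    # Zero out the lowest `bits` bits of a non-negative integer.
    return (x >> bits) << bits


exact_mean = sum(values) / len(values)
trimmed_mean = sum(trim_low_bits(v) for v in values) / len(values)

absolute_error = exact_mean - trimmed_mean      # always in [0, 255]
relative_error = absolute_error / exact_mean
print(absolute_error, relative_error)
```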

    Skip lists

    A randomly indexed improvement on linked lists.

    Each node can belong to one or more vertical levels, which allow fast search/insertion/deletion ~O(log(n)) typically!

    wikipedia
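A minimal skip list sketch, to show how the random vertical levels work. This is not the code demoed in the talk; the names (`SkipList`, `_Node`, `MAX_HEIGHT`) and the coin-flip probability of 1/2 per level are assumptions chosen for illustration.

```python
import random


class _Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height  # forward[i] is the next node on level i


class SkipList:
    MAX_HEIGHT = 16

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_HEIGHT)  # sentinel, holds no key
        self._height = 1

    def _random_height(self):
        # Flip coins: each additional level is added with probability 1/2.
        height = 1
        while height < self.MAX_HEIGHT and self._rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        # For each level, find the last node whose key is smaller than `key`.
        update = [self._head] * self.MAX_HEIGHT
        node = self._head
        for level in range(self._height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        return update

    def get(self, key, default=None):
        candidate = self._predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return default

    def insert(self, key, value):
        update = self._predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # key already present: overwrite
            return
        height = self._random_height()
        self._height = max(self._height, height)
        node = _Node(key, value, height)
        for level in range(height):  # splice the node into each of its levels
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node

    def delete(self, key):
        update = self._predecessors(key)
        target = update[0].forward[0]
        if target is None or target.key != key:
            return False
        for level in range(len(target.forward)):  # unlink from each level
            if update[level].forward[level] is target:
                update[level].forward[level] = target.forward[level]
        return True

    def keys(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


store = SkipList(seed=1)
for k in [5, 1, 9, 3, 7]:
    store.insert(k, str(k))
print(list(store.keys()), store.get(7))
```

Search starts on the sparsest level and drops down a level whenever the next key would overshoot, which is where the typical O(log(n)) cost comes from. An unlucky run of coin flips can still produce a degenerate list, hence "typically".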

    Skip lists

    A randomly indexed improvement on linked lists.

    Very easy to implement; asymptotically good behavior.

    From reddit, "if someone held a gun to my head and asked me to implement an efficient set/map storage, I would implement a skip list."

    (Response: "does this happen to you a lot??")

    wikipedia

    Channel randomness!

    If you can construct or rely on randomness, then you can easily get good typical behavior.

    Note, a good hash function is essentially the same as a good random number generator.
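One way to see the hash-function-as-random-number-generator point: hash a completely non-random input (consecutive integers) and check that the outputs spread evenly over buckets. This is only a sketch; the helper name `bucket`, the use of SHA-1 from `hashlib`, and the bucket count of 16 are assumptions for illustration.

```python
import hashlib
from collections import Counter


def bucket(item, num_buckets=16):
    # Take the first 8 bytes of the SHA-1 digest as an integer, then pick a bucket.
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# 16,000 sequential integers into 16 buckets: about 1,000 per bucket if well mixed.
counts = Counter(bucket(i) for i in range(16_000))
print(min(counts.values()), max(counts.values()))
```

Unlike a random number generator, the hash gives the same "random" value every time it sees the same item, which is exactly what the sketches in the rest of the talk rely on.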

    HyperLogLog cardinality counting

    Suppose you have an incoming stream of many, many objects.

    And you want to track how many distinct items there are, and you want to accumulate the count of distinct objects over time.

    Relevant digression:

    Flip some unknown number of coins. Q: what is something simple to track that will tell you roughly how many coins you've flipped?

    A: longest run of HEADs. Long runs are very rare and are correlated with how many coins you've flipped.
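The coin-flip claim is easy to check by simulation. A rough sketch (function names, seed, and trial count are arbitrary choices): the average longest run of heads grows by about one for every doubling of the number of flips, so it tracks log2 of the flip count.

```python
import math
import random


def longest_run_of_heads(num_flips, rng):
    longest = current = 0
    for _ in range(num_flips):
        if rng.random() < 0.5:  # heads: extend the current run
            current += 1
            longest = max(longest, current)
        else:                   # tails: the run is broken
            current = 0
    return longest


def average_longest_run(num_flips, trials=50, seed=0):
    rng = random.Random(seed)
    total = sum(longest_run_of_heads(num_flips, rng) for _ in range(trials))
    return total / trials


for exponent in (6, 10, 14):
    n = 2 ** exponent
    print(n, average_longest_run(n), math.log2(n))
```

A single longest run is a noisy estimate, which is why averaging over many independent observations matters.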

    Cardinality counting with HyperLogLog

    Essentially, use longest run of 0-bits observed in a hash value.

    Use multiple hash functions so that you can take the average.

    Take harmonic mean + low/high sampling adjustment => result.
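A compact HyperLogLog sketch along these lines. Two assumptions to flag: instead of literally running several hash functions, it uses the usual HyperLogLog trick of one hash whose first `p` bits select a register, so each register behaves like a separate hash function; and it uses a 64-bit slice of SHA-1, so only the low-range adjustment (linear counting) is implemented and the high-range adjustment is left out. The class and parameter names are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative, not tuned)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # rank = (length of the run of leading 0-bits in `rest`) + 1
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic mean of 2**rank over all registers, scaled by alpha * m.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        empty = self.registers.count(0)
        if estimate <= 2.5 * self.m and empty:
            # Low-range adjustment: fall back to linear counting.
            estimate = self.m * math.log(self.m / empty)
        return estimate


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # repeats leave the registers unchanged
print(round(hll.count()))
```

With 4,096 registers the expected relative error is about 1.04 / sqrt(4096), or roughly 1.6%, while memory use is fixed no matter how long the stream runs.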

    Bloom filters

    A set membership data structure that is probabilistic but only yields false positives.

    Trivial to implement; hash function is main cost; extremely memory efficient.
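"Trivial to implement" is fair; a minimal sketch fits in about twenty lines. The sizes (8,192 bits, 4 hashes) and the way the hash positions are derived (salting SHA-1 with the hash index) are assumptions picked for readability, not speed.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive `num_hashes` bit positions by salting one hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
for i in range(500):
    seen.add(f"key-{i}")

false_positives = sum(f"other-{i}" in seen for i in range(1000))
print("key-42" in seen, false_positives)
```

Anything that was added is always reported as present. Items that were never added are occasionally reported as present too, at a rate set by the number of bits, the number of hashes, and the number of items stored.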

    My research applications

    Biology is fast becoming a data-driven science.

    http://www.genome.gov/sequencingcosts/

    Shotgun sequencing analogy:

    feeding books into a paper shredder, digitizing the shreds, and reconstructing the book.

    Although for books, we often know the language and not just the alphabet :)

    Shotgun sequencing is --

      • Randomly ordered.
      • Randomly sampled.
      • Too big to efficiently do multiple passes

    Shotgun sequencing

    [Figure: Genome (unknown); Reads (randomly chosen; have errors)]

    Coverage is simply the average number of reads that overlap each true base in genome.

    Here, the coverage is ~10; just draw a line straight down from the top through all of the reads.
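The coverage definition can be sketched with a toy simulation (all numbers here are made up: a 10,000-base genome, 1,000 error-free reads of 100 bases each, placed uniformly at random). Average coverage works out to reads x read length / genome length, but the depth at individual positions varies quite a bit around that average.

```python
import random


def simulate_depth(genome_len, read_len, num_reads, seed=0):
    """Per-base read depth for reads placed uniformly at random."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(num_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


depth = simulate_depth(genome_len=10_000, read_len=100, num_reads=1_000)
mean_coverage = sum(depth) / len(depth)  # 1,000 * 100 / 10,000 = 10
print(mean_coverage, min(depth), max(depth))
```

Each entry of `depth` is what you would get by drawing a line straight down through the reads at that position.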

    Random sampling => deep sampling needed

    Typically 10-100x needed for robust recovery (300 Gbp for human)
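The arithmetic behind those numbers, as a back-of-envelope sketch. Two assumptions that are not on the slide: a human genome size of roughly 3 Gbp, and a simple Poisson model in which the chance that a given base is hit by zero reads is e^(-coverage). The function names are illustrative.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # assumption: roughly 3 Gbp


def bases_to_sequence(genome_size_bp, coverage):
    # Total sequenced bases needed to reach a given average coverage.
    return genome_size_bp * coverage


def expected_uncovered_fraction(coverage):
    # Poisson model: probability that a base is covered by zero reads.
    return math.exp(-coverage)


for coverage in (1, 10, 100):
    total = bases_to_sequence(HUMAN_GENOME_BP, coverage)
    missed = expected_uncovered_fraction(coverage) * HUMAN_GENOME_BP
    print(coverage, total, missed)
```

Under this model 1x coverage leaves more than a third of the genome unseen, and even average coverage well above 1x leaves some positions with too few reads to separate real bases from errors, which is why deep sampling is the norm.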

    Streaming algorithm to do so: digital normalization

    [Figure: True sequence (unknown); Reads (randomly sequenced)]

    Digital normalization

    [Figure, built up over five slides: True sequence (unknown); Reads (randomly sequenced)]

    If next read is from a high coverage region - discard

    Redundant reads (not needed for assembly)
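A rough sketch of the discard rule. The slides do not say how "high coverage" is estimated without knowing the true sequence; this sketch assumes the median count of a read's k-mers among the reads kept so far, and it uses an exact `Counter` where a CountMin Sketch would keep memory bounded. The function names and the values of `k` and `cutoff` are illustrative.

```python
import random
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=5):
    """One streaming pass: keep a read only if its region still looks under-covered."""
    counts = Counter()  # exact k-mer counts; a CountMin Sketch would bound memory
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from, so skip it
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to the counts
    return kept


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1_000))
reads = [genome[s:s + 100] for s in (rng.randrange(901) for _ in range(500))]
kept = digital_normalization(reads, k=20, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Each read is looked at once and then either kept or dropped for good, so this fits the single-pass constraint of shotgun data.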

    Storing data this way is better than best-possible information-theoretic storage.

    Pell et al., PNAS 2012

    Use Bloom filter to store graphs

    Pell et al., PNAS 2012

    Graphs only gain nodes because of Bloom filter false positives.
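A toy sketch of the idea, not the implementation from Pell et al.: store every k-mer of the sequence in a Bloom filter and leave the edges implicit, so that the neighbors of a k-mer are found by testing its four possible one-base extensions. Assumptions for brevity: the names `KmerBloom`, `load_sequence`, and `successors` are made up, only forward neighbors are shown, and reverse complements (which a real DNA graph has to handle) are ignored.

```python
import hashlib


class KmerBloom:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def load_sequence(bloom, seq, k):
    # Nodes of the graph are the k-mers; nothing else is stored.
    for i in range(len(seq) - k + 1):
        bloom.add(seq[i:i + k])


def successors(bloom, kmer):
    # Edges are implicit: shift by one base and test all four extensions.
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


graph = KmerBloom()
sequence = "ACGTTGCAAGGCTTAACCGGATCG"
load_sequence(graph, sequence, k=8)
print(successors(graph, sequence[:8]))
```

A Bloom filter never forgets a k-mer it was given, so every real node and edge is still found. A false positive can only make an extra k-mer appear to exist, which is the sense in which the graph only gains nodes.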

    Some assembly details

      • This was completely intractable.
      • Implemented in C++ and Python; good practice (?)
      • We've changed scaling behavior from data to information.
      • Practical scaling for ~soil metagenomics is 10x: need

    Concluding thoughts

      • Channel randomness.
      • Embrace streaming.
      • Live with minor uncertainty.
      • Don't be afraid to discard data.

    (Also, I'm an open source hacker who can confer PhDs, in exchange for long years of low pay living in Michigan. E-mail me! And don't talk to Brett Cannon about PhDs first.)

    References

    Skip Lists: Wikipedia, and John Shipman's code: http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf

    HyperLogLog: Aggregate Knowledge's blog, http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ And: https://github.com/svpcom/hyperloglog

    Bloom Filters: Wikipedia

    Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html

    [email protected]