alexander ponomarenko 2nd international scientific conference … · 2016-09-25 · [y. malkov, a....
TRANSCRIPT
![Page 1: Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate](https://reader034.vdocuments.net/reader034/viewer/2022042406/5f20eecfba544a666c20883c/html5/thumbnails/1.jpg)
Algorithms for Building Highly Scalable Distributed Data Storages
Alexander Ponomarenko
2nd International Scientific Conference “SCIENCE OF THE FUTURE”
September 20-23, 2016, Kazan, Russia.
Hierarchy vs Heterarchy Data
DHT protocols and implementations
• Aerospike
• Apache Cassandra
• BATON Overlay
• Mainline DHT - Standard DHT used by BitTorrent (based on Kademlia)
• CAN (Content Addressable Network)
• Chord
• Koorde
• Kademlia
• Pastry
• P-Grid
• Riak
• Tapestry
• TomP2P
• Voldemort
Applications employing DHTs
• BTDigg: BitTorrent DHT search engine
• cjdns: routing engine for mesh-based networks
• CloudSNAP: a decentralized web application deployment platform
• Codeen: web caching
• Coral Content Distribution Network
• FAROO: peer-to-peer Web search engine
• Freenet: a censorship-resistant anonymous network
• GlusterFS: a distributed file system used for storage virtualization
• GNUnet: Freenet-like distribution network including a DHT implementation
• Hazelcast: open-source in-memory data grid
• I2P: an open-source anonymous peer-to-peer network
• I2P-Bote: serverless secure anonymous e-mail
• JXTA: open-source P2P platform
• Oracle Coherence: an in-memory data grid built on top of a Java DHT implementation
• Retroshare: a friend-to-friend network
• YaCy: a distributed search engine
• Tox: an instant messaging system intended to function as a Skype replacement
• Twister: a microblogging peer-to-peer platform
• Perfect Dark: a peer-to-peer file-sharing application from Japan
Structured Peer-to-Peer Networks: Chord Protocol
Searching for key 54 starting from node «N8». Routing table of node «N8».
Each node, n, maintains a routing table with (at most) m entries, called the finger table. The i-th entry in the table at node n contains the identity of the first node, s, that succeeds n by at least 2^(i-1) on the identifier circle, i.e., s = successor(n + 2^(i-1)), where 1 <= i <= m.
Distance function: d(x, y) = (y – x) mod 2^m
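The finger-table rule and the clockwise distance above can be sketched in a few lines of Python. This is only an illustration: the ring size (m = 6) and the list of node identifiers are assumptions chosen so that a node N8 and a key 54 exist, and the function names are hypothetical. The sketch also assumes the ring contains at least two nodes.

```python
M = 6                                            # identifier bits (assumed); ring size is 2**M
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # assumed node identifiers


def distance(x, y, m=M):
    """Clockwise distance from x to y on the identifier circle."""
    return (y - x) % (2 ** m)


def successor(ident, nodes=NODES, m=M):
    """First node whose identifier equals or follows `ident` clockwise."""
    return min(nodes, key=lambda n: distance(ident, n, m))


def finger_table(n, nodes=NODES, m=M):
    """The i-th entry (1 <= i <= m) is successor(n + 2**(i-1))."""
    return [successor(n + 2 ** (i - 1), nodes, m) for i in range(1, m + 1)]


def lookup(key, start, nodes=NODES, m=M):
    """Return the nodes visited while resolving `key`, beginning at `start`."""
    path = [start]
    n = start
    while distance(n, key, m) != 0:
        succ = successor(n + 1, nodes, m)
        # The key lies between n and its successor: the successor owns it.
        if distance(n, key, m) <= distance(n, succ, m):
            path.append(succ)
            break
        # Otherwise jump to the farthest finger that still precedes the key.
        preceding = [f for f in finger_table(n, nodes, m)
                     if 0 < distance(n, f, m) < distance(n, key, m)]
        n = max(preceding, key=lambda f: distance(n, f, m)) if preceding else succ
        path.append(n)
    return path


print(finger_table(8))   # [14, 14, 14, 21, 32, 42]
print(lookup(54, 8))     # [8, 42, 51, 56]
```

With these assumed nodes, each hop at least halves the remaining clockwise distance to the key, which is why the number of hops grows logarithmically with the ring size.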
Structured Peer-to-Peer Networks: Kademlia
Identifier space of Kademlia
Maymounkov P., Mazieres D. Kademlia: A peer-to-peer information system based on the XOR metric // Peer-to-Peer Systems. – Springer Berlin Heidelberg, 2002. – pp. 53-65.
Distance function: d(x, y) = x xor y
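As a small illustration of the XOR metric, the Python sketch below computes the distance between integer identifiers and picks the identifiers closest to a key. The helper name `closest_nodes` and the sample identifiers are assumptions for the example, not part of Kademlia's specification.

```python
def xor_distance(x, y):
    """Kademlia distance between two integer identifiers."""
    return x ^ y


def closest_nodes(node_ids, key, k=2):
    """The k identifiers closest to `key` under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key))[:k]


print(xor_distance(0b1010, 0b0110))      # 12 (binary 1100)
print(closest_nodes([1, 4, 7, 12], 5))   # [4, 7]
```

The distance is symmetric and is zero only for identical identifiers, so any node can rank other nodes relative to a key without extra coordination.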
To Overcome DHT Disadvantages
• DHT uses very simple distance functions
• Hashing destroys the semantics of the data
• It’s hard to perform complex queries
Use nearest neighbour search in a high-dimensional metric space instead of exact search
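Nearest Neighbor Search
Let $D$ be a domain and $d : D \times D \to \mathbb{R}_{[0, +\infty)}$ a distance function which satisfies the properties:
– strict positiveness: $d(x, y) > 0 \Leftrightarrow x \neq y$,
– symmetry: $d(x, y) = d(y, x)$,
– reflexivity: $d(x, x) = 0$,
– triangle inequality: $d(x, y) + d(y, z) \geq d(x, z)$.
Given a finite set $X = \{p_1, \ldots, p_n\}$ of $n$ points in some metric space $(D, d)$, one needs to build a data structure on $X$ so that for a given query point $q \in D$ one can find a point $p \in X$ which minimizes $d(p, q)$ with as few distance computations as possible.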
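For reference, the trivial baseline for this problem is a linear scan, which always spends n distance computations per query. The Python sketch below shows that baseline with an explicit counter. The function names and sample points are illustrative assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_neighbor(points, q, d):
    """Exact nearest neighbour by linear scan; also counts distance calls."""
    best, best_dist, computations = None, math.inf, 0
    for p in points:
        dist = d(p, q)
        computations += 1
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist, computations


def euclidean(x, y):
    """L2 distance between two coordinate tuples."""
    return math.dist(x, y)


points = [(0, 0), (3, 4), (1, 1)]
print(nearest_neighbor(points, (1, 2), euclidean))   # ((1, 1), 1.0, 3)
```

Any index structure for nearest neighbour search is judged against this count: it is useful only if it answers queries with far fewer than n distance computations.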
Examples of Distance Functions
• Lp Minkowski distance (for vectors)
  • L1 – city-block distance: $L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
  • L2 – Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
  • L∞ – infinity: $L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$
• Edit distance (for strings)
  • minimal number of insertions, deletions and substitutions
  • d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A, B): $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
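The distance functions listed above can be written directly in Python. This is a minimal sketch with hypothetical function names. The edit distance uses the standard dynamic-programming recurrence with unit costs, and the Jaccard distance of two empty sets is taken to be 0 by assumption.

```python
import math


def l1(x, y):
    """City-block distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2(x, y):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def l_inf(x, y):
    """Maximum coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| (0 for two empty sets, by assumption)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


print(edit_distance("application", "applet"))     # 6
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))     # 0.5
```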
Max Common Subgraph Similarity
$sim(G_1, G_2) = \frac{(|V(G_1, G_2)| + |E(G_1, G_2)|)^2}{(|V(G_1)| + |E(G_1)|) \cdot (|V(G_2)| + |E(G_2)|)}$
$d(G_1, G_2) = 1 - sim(G_1, G_2)$
Kleinberg’s Navigable Small World
[Kleinberg J. The small-world phenomenon: An algorithmic perspective // Proceedings of the thirty-second annual ACM symposium on Theory of computing. – ACM, 2000. – pp. 163-170.]
VoroNet, RayNet: A scalable object network based on Voronoi tessellations
Beaumont O. et al. VoroNet: A scalable object network based on Voronoi tessellations. – 2006.
Distance function: $d(x, y) = \sqrt[2]{x^2 + y^2}$
Metrized Small World Algorithm
[Figure: layers u = 1, u = 2, u = 3, ... combine into a navigable small world. Labels: “Top level” – first (oldest) elements; “Bottom” level – all elements; u = log(N); u = log(N)-1; query element; R1; R2]
[Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Scalable distributed algorithm for approximate nearest neighbor search problem in high dimensional general metric spaces,” in Similarity Search and Applications. Springer, 2012, pp. 132–147]
[Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov, “Approximate nearest neighbor algorithm based on navigable small world graphs,” Information Systems, vol. 45, 2014, pp. 61–68.]
[Ponomarenko A. Query-Based Improvement Procedure and Self-Adaptive Graph Construction Algorithm for Approximate Nearest Neighbor Search // International Conference on Similarity Search and Applications. – Springer International Publishing, 2015. – pp. 314-319.]
Boolean non-linear programming formulation for optimal graph structure
Search by greedy algorithm
[Figure: graph with an entry point and a query; vertices carry numeric labels ranging from 0.6 down to 0.2]
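A greedy search of this kind can be sketched as follows: starting from an entry point, repeatedly move to the neighbour closest to the query and stop when no neighbour is closer than the current element. The adjacency-dictionary representation, the function name and the one-dimensional toy data are assumptions made for this example.

```python
def greedy_search(graph, d, query, entry_point):
    """Follow edges towards `query` until no neighbour improves the distance.

    `graph` maps each element to an iterable of its neighbours.
    Returns the element where the walk stopped and the path taken.
    """
    current = entry_point
    current_dist = d(current, query)
    path = [current]
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            dist = d(neighbour, query)
            if dist < best_dist:
                best, best_dist = neighbour, dist
        if best == current:        # local minimum: nothing closer nearby
            return current, path
        current, current_dist = best, best_dist
        path.append(current)


# Toy graph over points on a line (assumed data).
graph = {0: [2], 2: [0, 5], 5: [2, 9], 9: [5]}
result, path = greedy_search(graph, lambda a, b: abs(a - b), query=6, entry_point=0)
print(result, path)   # 5 [0, 2, 5]
```

The walk can stop at a local minimum that is not the true nearest neighbour, which is why the result of a single greedy search is only approximate.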
Construction algorithm
[Figure: graph with numeric labels between 0.2 and 0.9]
Datasets
• CoPhIR (L2) is a collection of 208-dimensional vectors extracted from images in MPEG7 format.
• SIFT is a part of the TexMex dataset collection available at http://corpus-texmex.irisa.fr. It has one million 128-dimensional vectors. Each vector corresponds to a descriptor extracted from image data using the Scale Invariant Feature Transformation (SIFT).
• Unif64 is a synthetic dataset of 64-dimensional vectors. The vectors were generated randomly, independently and uniformly in the unit hypercube.
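A synthetic dataset like the uniform one described above is easy to reproduce. The sketch below draws vectors independently and uniformly from the unit hypercube using the Python standard library. The dataset size, the seed and the function name are assumptions, since the slides do not state how many vectors were generated.

```python
import random


def uniform_dataset(n, dim=64, seed=0):
    """n vectors drawn independently and uniformly from the unit hypercube."""
    rng = random.Random(seed)   # fixed seed (assumed) for repeatability
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


data = uniform_dataset(1000)    # 1000 is an arbitrary illustrative size
print(len(data), len(data[0]))  # 1000 64
```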
Performance of a 10-NN search for $L_2$: plots in the same column correspond to the same dataset
[Ponomarenko A. et al. Comparative Analysis of Data Structures for Approximate Nearest Neighbor Search // DATA ANALYTICS 2014, The Third International Conference on Data Analytics. – 2014. – pp. 125-130.]
KL-divergence: $d(x, y) = \sum_i x_i \log \frac{x_i}{y_i}$
Final16, Final64, and Final256 are sets of 0.5 million topic histograms generated using Latent Dirichlet Allocation (LDA).
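The KL-divergence is not symmetric, so it is not a metric in the sense defined earlier. The Python sketch below makes that concrete. It assumes both inputs are histograms that sum to 1 and that y_i > 0 wherever x_i > 0. Terms with x_i = 0 are skipped, following the usual convention.

```python
import math


def kl_divergence(x, y):
    """sum_i x_i * log(x_i / y_i); terms with x_i == 0 contribute nothing."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))   # about 0.144
print(kl_divergence(q, p))   # about 0.131, so d(p, q) != d(q, p)
```

Because only distance values are needed, a search structure that never relies on symmetry or the triangle inequality can still be applied to such data.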
Wikipedia dataset
Vector Space Model
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
$sim(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|} = \frac{\sum_{i=1}^{n} w_{i,j} w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$
Wikipedia (cosine similarity) is a dataset that contains 3.2 million vectors represented in a sparse format. This set has an extremely high dimensionality (more than 100 thousand elements). Yet, the vectors are sparse: on average only about 600 elements are non-zero.
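For sparse vectors, the cosine similarity above only needs the non-zero weights. The sketch below stores each vector as a dictionary from term index to weight. That representation is an assumption made for illustration, not necessarily the format used for the dataset. Both vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {index: weight} dicts."""
    if len(a) > len(b):          # iterate over the shorter vector
        a, b = b, a
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


doc = {0: 1.0, 5: 2.0}
query = {5: 2.0, 9: 1.0}
print(cosine_similarity(doc, query))   # about 0.8
```

The cost of one similarity computation is proportional to the number of non-zero entries (about 600 on average here) rather than to the full dimensionality.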
Scaling of methods on Wikipedia dataset
[Figure: distance computations (0 to 1,800,000) vs. number of elements (0 to 4,000,000); series: perm. vp-tree, perm. incr. sort., msw, perm. inv. index]
Wikipedia is a dataset that contains 3.2 million vectors represented in a sparse format. Each vector corresponds to the term frequency vector of a Wikipedia page extracted using the gensim library. This set has an extremely high dimensionality (more than 100 thousand elements).
Recall = 0.9
Scaling of MSW data structure
[Figure: distance computations (0 to 40,000) vs. number of objects (10,000 to 10,000,000)]
Summing up
• The algorithm is very simple
• The algorithm uses only distance values between the objects, making it suitable for arbitrary spaces
• The proposed data structure has no root element
• All operations (addition and search) use only local information and can be initiated from any element that was previously added to the structure
• Accuracy of the approximate search can be tuned without rebuilding the data structure
• The algorithm is highly scalable in both size and data dimensionality
A good base for building real-world similarity search applications with extreme dataset sizes and high dimensionality
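The properties listed above can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation. It assumes that a new element is linked to the k closest elements found by a few greedy searches started from random existing elements, and that search accuracy is tuned by the number of such attempts. All names, parameters and the toy data are hypothetical.

```python
import math
import random


def greedy_search(graph, d, query, entry):
    """Move to the closest neighbour until no neighbour is closer to `query`."""
    current, current_dist = entry, d(entry, query)
    while True:
        closer = min(graph[current], key=lambda e: d(e, query), default=current)
        closer_dist = d(closer, query)
        if closer_dist >= current_dist:
            return current
        current, current_dist = closer, closer_dist


def multi_search(graph, d, query, attempts, rng):
    """Several greedy searches from random entry points; more attempts, higher accuracy."""
    elements = list(graph)
    results = {greedy_search(graph, d, query, rng.choice(elements))
               for _ in range(attempts)}
    return min(results, key=lambda e: d(e, query))


def add(graph, d, new, k, attempts, rng):
    """Link `new` to the k closest elements seen by a few greedy searches."""
    elements = list(graph)
    candidates = set()
    for _ in range(attempts if elements else 0):
        local_min = greedy_search(graph, d, new, rng.choice(elements))
        candidates |= {local_min} | graph[local_min]
    graph[new] = set(sorted(candidates, key=lambda e: d(e, new))[:k])
    for e in graph[new]:
        graph[e].add(new)          # edges are kept bidirectional


rng = random.Random(0)             # fixed seed (assumed) for repeatability
points = [(x, y) for x in range(5) for y in range(5)]   # toy data
rng.shuffle(points)
graph = {}
for p in points:
    add(graph, math.dist, p, k=3, attempts=3, rng=rng)
found = multi_search(graph, math.dist, (2.2, 3.1), attempts=5, rng=rng)
```

Only calls to the distance function `d` touch the data, there is no root element, and `attempts` can be changed per query without rebuilding the graph.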
Source Code
https://github.com/searchivarius/NonMetricSpaceLib
https://github.com/aponom84/MetrizedSmallWorld
Questions?
Why doesn’t CERN use DHT?