GraphLab: A New Parallel Framework for Machine Learning
A Distributed Abstraction for Large-Scale Machine Learning
Carlos Guestrin, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Haijie Gu
Carnegie Mellon

Needless to Say, We Need Machine Learning for Big Data
48 Hours of YouTube Video a Minute
24 Million Wikipedia Pages
750 Million Facebook Users
6 Billion Flickr Photos
"Data [is] a new class of economic asset, like currency or gold."
How will we design and implement parallel learning systems?

Big Learning: A Shift Towards Parallelism
GPUs, Multicore, Clusters, Clouds, Supercomputers
ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication. The resulting code is difficult to maintain, extend, and debug. Graduate students: avoid these problems by using high-level abstractions.

Data Parallelism (MapReduce)
CPU 1, CPU 2, CPU 3, CPU 4
Solve a huge number of independent subproblems. MapReduce has two parts: a Map stage and a Reduce stage. The Map stage represents embarrassingly parallel computation; that is, each computation is independent and can be performed on a different machine without any communication.

MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics.
Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).
Is there more to Machine Learning?
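The two stages can be sketched in a few lines of Python (an illustrative sketch of the programming model, not a distributed implementation; `map_fn` and `reduce_fn` are hypothetical names):

```python
from functools import reduce

def map_reduce(data, map_fn, reduce_fn):
    """Toy MapReduce: the Map stage runs independently over each record
    (embarrassingly parallel, no communication); the Reduce stage folds
    the partial results with an associative combiner."""
    mapped = [map_fn(x) for x in data]   # Map: could run on separate machines
    return reduce(reduce_fn, mapped)     # Reduce: combine partial results

# Example: a data-parallel sufficient statistic (sum of squares)
total = map_reduce([1, 2, 3, 4], lambda x: x * x, lambda a, b: a + b)
```

Because each `map_fn(x)` depends only on its own record, the list comprehension could be sharded across machines with no coordination, which is exactly what makes the Map stage embarrassingly parallel.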
What is this?
It's next to this
The Power of Dependencies
where the value is!

Label a Face and Propagate
grandma
Pairwise similarity is not enough
grandma
Who? Not similar enough to be sure.
Propagate Similarities & Co-occurrences for Accurate Predictions
grandma
grandma!!!
similarity edges, co-occurring faces: further evidence

Collaborative Filtering: Independent Case
Lord of the Rings
Star Wars IV
Star Wars I
Harry Potter
Pirates of the Caribbean
recommend

Collaborative Filtering: Exploiting Dependencies
recommend
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend?

Latent Topic Modeling (LDA)
Cat, Apple, Growth, Hat, Plant
Example Topics Discovered from Wikipedia
Machine Learning Pipeline
Data (images, docs, movie ratings) → Extract Features (faces, important words, side info) → Graph Formation (similar faces, shared words, rated movies) → Structured Machine Learning Algorithm (belief propagation, LDA, collaborative filtering) → Value from Data (face labels, doc topics, movie recommendations)
Parallelizing Machine Learning
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
Graph ingress is mostly data-parallel; the graph-structured computation is graph-parallel.

ML Tasks Beyond Data-Parallelism
Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics.
Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).

Example of Graph Parallelism: PageRank
What's the rank of this user? Her rank depends on the rank of who follows her, which in turn depends on the rank of who follows them. Loops in the graph mean we must iterate!
PageRank can be expressed as an iterative algorithm, where each vertex updates its rank as a discounted weighted average of its neighbors' ranks. For example, in this graph, vertex 5 updates its rank.

PageRank Iteration
Iterate until convergence ("my rank is the weighted average of my friends' ranks"):
R[i] = α + (1 - α) Σ_j w_ji R[j]
where α is the random reset probability and w_ji is the probability of transitioning (similarity) from j to i.

Properties of Graph-Parallel Algorithms
Dependency graph, iterative computation, local updates (my rank depends on my friends' ranks).
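The iteration above can be written directly in Python (a sequential sketch; `links` maps each vertex i to its in-neighbors j and `weights[(j, i)]` holds w_ji, both hypothetical names):

```python
def pagerank(links, weights, alpha=0.15, tol=1e-8, max_iter=100):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] until the
    largest per-vertex change falls below tol."""
    rank = {i: 1.0 for i in links}
    for _ in range(max_iter):
        new_rank = {
            i: alpha + (1 - alpha) * sum(weights[(j, i)] * rank[j] for j in in_nbrs)
            for i, in_nbrs in links.items()
        }
        if max(abs(new_rank[i] - rank[i]) for i in rank) < tol:
            return new_rank
        rank = new_rank
    return rank

# Tiny 2-vertex cycle: each vertex's rank depends on the other's
links = {0: [1], 1: [0]}
weights = {(0, 1): 1.0, (1, 0): 1.0}
ranks = pagerank(links, weights)
```

The loop in the graph is exactly why a single pass is not enough: each vertex's rank feeds back into its neighbors', so we sweep until a fixed point.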
Addressing Graph-Parallel ML
MapReduce handles the data-parallel tasks (cross validation, feature extraction, computing sufficient statistics); a graph-parallel abstraction is needed for graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), data-mining (PageRank, triangle counting), and collaborative filtering (tensor factorization).

Graph Computation: Synchronous vs. Asynchronous

Bulk Synchronous Parallel Model: Pregel (Giraph) [Valiant '90]
Compute, Communicate, Barrier.

BSP Systems Problem: Curse of the Slow Job
Every iteration waits at a barrier for the slowest machine; bulk synchronous parallel systems can be highly inefficient. The bulk synchronous parallel model is provably inefficient for some ML tasks.

Analyzing Belief Propagation
[Gonzalez, Low, G. '09]
Smart scheduling with a priority queue: focus computation where the influence is important.
The asynchronous parallel model (rather than BSP) is fundamental for efficiency.

Asynchronous Belief Propagation
Synthetic noisy image; cumulative vertex updates range from many updates to few updates. The algorithm identifies and focuses on the hidden sequential structure of the graphical model. Challenge = boundaries.

BSP ML Problem: Synchronous Algorithms Can Be Inefficient
Theorem: Bulk synchronous BP is O(#vertices) slower than asynchronous (Splash) BP.
An efficient parallel implementation was painful, painful, painful. Again, we want to avoid this type of problem-specific tedious labor.
The Need for a New Abstraction
MapReduce covers data-parallel tasks (cross validation, feature extraction, computing sufficient statistics), and BSP (e.g., Pregel) covers only part of the graph-parallel space; graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), data-mining (PageRank, triangle counting), and collaborative filtering (tensor factorization) need asynchronous, dynamic parallel computations.

The GraphLab Goals
Designed specifically for ML: graph dependencies, iterative, asynchronous, dynamic.
Simplifies the design of parallel programs: abstracts away hardware issues, automatic data synchronization, addresses multiple hardware architectures.
Know how to solve your ML problem on 1 machine? Get efficient parallel predictions.
Data Graph
Data is associated with vertices and edges.
Vertex data: user profile text, current interest estimates.
Edge data: similarity weights.
Graph: social network.

pagerank(i, scope) {
  // Get neighborhood data
  (R[i], w_ji, R[j]) <- scope;
  // Update the vertex data
  R[i] <- alpha + (1 - alpha) * sum_j(w_ji * R[j]);
  // Reschedule neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex.
Dynamic computation: update functions are applied (asynchronously) in parallel until convergence.
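A minimal sketch of this execution model: a scheduler applies a user-defined update function to vertices, and an update reschedules its neighbors when its value changed (a sequential stand-in for the parallel engine; `run_updates` and `smooth` are hypothetical names):

```python
from collections import deque

def run_updates(graph, values, update_fn):
    """Toy GraphLab-style engine: apply the update function to scheduled
    vertices until the schedule drains (dynamic computation)."""
    schedule = deque(graph)            # initially schedule every vertex
    scheduled = set(graph)
    while schedule:
        v = schedule.popleft()
        scheduled.discard(v)
        changed = update_fn(v, graph, values)
        if changed:                    # reschedule neighbors if needed
            for nbr in graph[v]:
                if nbr not in scheduled:
                    scheduled.add(nbr)
                    schedule.append(nbr)

# Example update: relax each vertex toward the average of its neighbors
def smooth(v, graph, values):
    new = sum(values[n] for n in graph[v]) / len(graph[v])
    if abs(new - values[v]) > 1e-6:
        values[v] = new
        return True
    return False

g = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
vals = {0: 0.0, 1: 3.0, 2: 6.0}
run_updates(g, vals, smooth)           # vals converge toward a common value
```

Computation stops on its own once no update changes anything, which is the "until convergence" behavior; a real engine would run many such updates in parallel under a consistency model.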
Many schedulers are available to prioritize computation.

Ensuring Race-Free Code
How much can computation overlap? Is there a need for consistency?

Consistency in Collaborative Filtering
Netflix data, 8 cores: comparing consistent vs. inconsistent updates; GraphLab guarantees consistent updates. A user-tunable consistency level trades off parallelism and consistency. (Full Netflix data, 8 cores: without consistency, highly connected movies produce bad intermediate results.)

The GraphLab Framework
Graph-based data representation, update functions (user computation), scheduler, consistency model.
Implemented algorithms: Bayesian tensor factorization, Gibbs sampling, dynamic block Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, K-Means, SVD, LDA, linear solvers, the Splash sampler, alternating least squares, and many others.

GraphLab on CoEM (Never-Ending Learner Project)
Hadoop: 95 cores, 7.5 hrs.
GraphLab: 16 cores, 30 min; 15x faster with 6x fewer CPUs!
Distributed GraphLab: 32 EC2 machines, 80 secs; 0.3% of the Hadoop time.

The Cost of the Wrong Abstraction
(Log scale!) GraphLab 1 provided exciting scaling performance, but thus far we couldn't scale up to the Altavista Webgraph (2002): 1.4B vertices, 6.7B edges.

Natural Graphs
[Image from WikiCommons]

Assumptions of Graph-Parallel Abstractions
Idealized structure: small neighborhoods, low-degree vertices, vertices with similar degree, easy to partition.
Natural graphs: large neighborhoods, high-degree vertices, power-law degree distribution, difficult to partition.
Natural Graphs: Power Law
Altavista Web Graph: 1.4B vertices, 6.7B edges.
The top 1% of vertices is adjacent to 53% of the edges!
Power-law slope ≈ 2.

High-Degree Vertices Are Common
Netflix: users and movies (popular movies).
Social networks: people (e.g., Obama).
LDA: docs and words (common words), hyperparameters.

Problem: High-Degree Vertices Limit Parallelism
Sequential vertex updates touch a large fraction of the graph (GraphLab 1); many messages are produced (Pregel); edge information is too large for a single machine; asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).

Problem: High-Degree Vertices Mean High Communication for Distributed Updates
The data transmitted across the network is O(# cut edges), and natural graphs do not have low-cost balanced cuts [Leskovec et al. '08, Lang '04].
Popular partitioning tools (Metis, Chaco, ...) perform poorly [Abou-Rjeili et al. '06]: they are extremely slow and require substantial memory.

Random Partitioning
Both GraphLab 1 and Pregel proposed random (hashed) partitioning for natural graphs.
For p machines, the expected fraction of cut edges is 1 - 1/p:
10 machines → 90% of edges cut; 100 machines → 99% of edges cut!

In Summary
GraphLab 1 and Pregel are not well suited for natural graphs.
Poor performance on high-degree vertices; low-quality partitioning.

GraphLab 2 solutions:
1. Distribute a single vertex update: move computation to data; parallelize high-degree vertices.
2. Vertex partitioning: a simple online approach that effectively partitions large power-law graphs.
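The cut percentages quoted above follow from a one-line argument: under random placement an edge is uncut only when both endpoints hash to the same machine, which happens with probability 1/p. A quick check (a sketch; `expected_edges_cut` is a hypothetical helper):

```python
def expected_edges_cut(p):
    """With random (hashed) vertex placement over p machines, each endpoint
    lands on any machine with probability 1/p, so both endpoints coincide
    with probability 1/p; the expected fraction of edges cut is 1 - 1/p."""
    return 1.0 - 1.0 / p

# Matches the slide: 10 machines -> 90% cut, 100 machines -> 99% cut
ten = expected_edges_cut(10)
hundred = expected_edges_cut(100)
```

This is why edge cuts degrade as the cluster grows: adding machines drives the cut fraction toward 100%.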
Factorized Vertex Updates
Split the update into three phases:
Gather (data-parallel over edges): a parallel sum over the scope of the vertex.
Apply: locally apply the accumulated value to the vertex.
Scatter (data-parallel over edges): update the neighbors.

Factorized PageRank

gather(scope, edge) {
  // Compute the neighbor's contribution
  return edge.weight * edge.source.value
}
merge(accum1, accum2) {
  // Add contributions
  return accum1 + accum2
}
apply(scope, accum) {
  // Update the vertex's rank
  scope.value = ALPHA + (1 - ALPHA) * accum
}
scatter(scope, edge) {
  // Schedule the neighbor if the change is large
  if (abs(scope.value - old_value) > EPSILON)
    signal(edge.target)
}
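The four phases above can be exercised with plain functions driven by a small sequential loop (a sketch; GraphLab's real engine runs gather and scatter data-parallel over edges, and `update`, `apply_phase`, and the `in_edges` list of `(j, w_ji)` pairs are hypothetical names):

```python
ALPHA, EPSILON = 0.15, 1e-6

def gather(value, weight):
    """Gather: one in-edge's contribution to the rank sum."""
    return weight * value

def merge(accum1, accum2):
    """Merge: combine partial gather results (a parallel sum)."""
    return accum1 + accum2

def apply_phase(old_value, accum):
    """Apply: compute the vertex's new rank from the accumulated sum."""
    return ALPHA + (1 - ALPHA) * accum

def scatter(old_value, new_value):
    """Scatter: decide whether neighbors should be signalled."""
    return abs(new_value - old_value) > EPSILON

def update(i, in_edges, rank):
    """One factorized vertex update over in-edges (j, w_ji)."""
    acc = 0.0
    for j, w_ji in in_edges:
        acc = merge(acc, gather(rank[j], w_ji))
    new = apply_phase(rank[i], acc)
    signal_neighbors = scatter(rank[i], new)
    rank[i] = new
    return signal_neighbors

rank = {0: 0.0, 1: 1.0}
signalled = update(0, [(1, 1.0)], rank)
```

Because merge is associative, the gather loop can be split across machines and the partial sums combined, which is what lets a single high-degree vertex update run in parallel.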
Triangle Counting
For each vertex in the graph, count the number of triangles containing it.
Measures both the popularity of the vertex and the cohesiveness of the vertex's community:
more triangles = stronger community; fewer triangles = weaker community.

Factorized Triangle Counting
Gather: add each neighbor to a list (the gather phase creates a neighbor list for each vertex).
Merge: union the lists.
Apply: save the neighbor list.
Scatter: compare the neighbor lists of adjacent vertices and count the number of shared neighbors (each is a triangle).
Two adjacent vertices exchange "I'm neighbors with: ..." lists; the work breaks down into reading neighbor lists, computing intersections, and writing results.
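The scheme above can be sketched with set intersections (a sequential sketch mirroring the gather/scatter phases; `triangle_counts` is a hypothetical name):

```python
def triangle_counts(graph):
    """Per-vertex triangle counts: build each vertex's neighbor set (gather),
    then intersect it with each neighbor's set (scatter).  A triangle at v
    is found via both of its other corners, hence the division by 2."""
    nbrs = {v: set(adj) for v, adj in graph.items()}
    return {v: sum(len(nbrs[v] & nbrs[u]) for u in nbrs[v]) // 2 for v in graph}

# 4-clique: every vertex participates in 3 triangles
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
counts = triangle_counts(g)
```

The expensive part is exactly the neighbor-list intersections, which is why high-degree vertices (long lists) dominate the running time.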
Triangle Counting on Twitter
Distinguishes popular people from popular people with strong communities.

Factorized Belief Propagation
Gather: accumulate the product of in-messages.
Apply: update the central belief.
Scatter: compute out-messages and schedule neighbors as needed.

Collaborative Filtering (via Alternating Least Squares)
Goal: discover latent categories for users and movies.

Factorized Collaborative Filtering Updates
Gather: sum, over movies, the product of ratings and factor weights (and a little more info).
Apply: compute the user's new factor weights.
Iterate over users and movies.

Multicore Performance
Multicore PageRank (25M vertices, 355M edges): GraphLab 1, GraphLab 2 Factorized, Pregel (implemented in GraphLab), and GraphLab 2 Factorized + Caching.

Factorized updates split gather and scatter across machines, giving a significant decrease in communication: O(1) data transmitted over the network per update.

Minimizing Communication in GraphLab 2: Vertex Cuts
Communication is linear in the number of machines each vertex spans, and a vertex-cut minimizes the number of machines per vertex. Percolation theory suggests that power-law graphs can be split by removing only a small set of vertices [Albert et al. 2000]: small vertex cuts are possible!

Random Vertex Cuts vs. Edge Cuts

Constructing Vertex Cuts
Goal: parallel graph partitioning on ingress. GraphLab 2 provides three simple approaches:
Random edge placement: edges are placed randomly by each machine; good theoretical guarantees.
Greedy edge placement with coordination: edges are placed using a shared objective; better theoretical guarantees.
Oblivious greedy edge placement: edges are placed using a local objective.
Beyond random vertex cuts!
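Random and oblivious-greedy edge placement can be contrasted with a small sketch (a toy comparison on a star graph, not GraphLab's actual ingress code; all helper names are hypothetical, and the coordinated variant is omitted):

```python
import random

def replication_factor(placement, edges):
    """Average number of machines each vertex spans; in a vertex-cut,
    communication is linear in this number."""
    spans = {}
    for (u, v), m in zip(edges, placement):
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)

def random_placement(edges, p, rng):
    """Random edge placement: assign each edge to one of p machines."""
    return [rng.randrange(p) for _ in edges]

def greedy_placement(edges, p):
    """Oblivious greedy: prefer a machine that already holds one of the
    edge's endpoints (real implementations also balance load against
    replication, which this sketch ignores)."""
    load, spans, placement = [0] * p, {}, []
    for u, v in edges:
        candidates = (spans.get(u, set()) | spans.get(v, set())) or set(range(p))
        m = min(candidates, key=lambda i: load[i])
        placement.append(m)
        load[m] += 1
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return placement

# A star graph (one high-degree hub), the hard case for edge cuts:
edges = [(0, i) for i in range(1, 101)]
rand_rf = replication_factor(random_placement(edges, 8, random.Random(0)), edges)
greedy_rf = replication_factor(greedy_placement(edges, 8), edges)
```

On the star, random placement spreads the hub over nearly all machines while greedy keeps related edges together, so `greedy_rf` is lower: exactly the replication that the greedy ingress strategies aim to minimize.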
From the Abstraction to a System

GraphLab Version 2.1 API (C++)
Toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering.
Engine: synchronous engine, asynchronous engine, fault tolerance, distributed graph, map/reduce ingress.
Platform: Linux cluster services (Amazon AWS), MPI/TCP-IP comms, PThreads, Boost, Hadoop/HDFS.
Substantially simpler than the original GraphLab: the synchronous engine is under 600 lines of code.
Triangle Counting in the Twitter Graph
40M users, 1.2B edges; total: 34.8 billion triangles.
Hadoop [1]: 1536 machines, 423 minutes.
GraphLab: 64 machines (1024 cores), 1.5 minutes.
[1] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW '11: Proceedings of the 20th International Conference on World Wide Web, 2011.

LDA Performance
All English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens.
State-of-the-art LDA sampler (Alex Smola, 100 machines): 150 million tokens per second.
GraphLab sampler (64 cc2.8xlarge EC2 nodes): 100 million tokens per second, using only 200 lines of code and 4 human hours.
PageRank
40M webpages, 1.4 billion links. Scaling to 100 iterations so that the costs are in dollars and not cents:
Hadoop [Kang et al. '11]: 5.5 hrs, $180.
Twister (in-memory MapReduce) [Ekanayake et al. '10]: 1 hr, $41.
GraphLab: 8 min, $12.
Comparable numbers are hard to come by since everyone uses different datasets, but we try to equalize as much as possible, giving the competition the advantage when in doubt.
Numbers: Hadoop ran on a Kronecker graph of 1.1B edges on 50 M45 machines; from available numbers, each machine is approximately 8 cores with 6 GB RAM, roughly a c1.xlarge instance, putting the total cost at $33 per hour ("Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation," U Kang, Brendan Meeder, Christos Faloutsos). Twister ran on the ClueWeb dataset of 50M pages and 1.4B edges, using 64 nodes of 4 cores each with 16 GB RAM per node, roughly an m2.xlarge to m2.2xlarge instance, putting the total cost at $28-$56 per hour (Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," The First International Workshop on MapReduce and its Applications (MAPREDUCE'10), HPDC 2010). GraphLab ran on 64 cc1.4xlarge instances at $83.2 per hour.

How Well Does GraphLab Scale?
Yahoo Altavista Web Graph (2002), one of the largest publicly available webgraphs: 1.4B webpages, 6.6 billion links.
64 HPC nodes: 1024 cores (2048 hyperthreads), 4.4 TB RAM.
11 minutes: 1B links processed per second, with 30 lines of user code.
Release 2.1 is available now under the Apache 2 License.

GraphLab easily incorporates external toolkits, automatically detecting and building them.

GraphLab Toolkits

Graph Processing
Extract knowledge from graph structure.
Find communities, identify important individuals, detect vulnerabilities.
Algorithms: triangle counting, PageRank, k-cores. Coming soon: max-flow, matching, connected components, label propagation.
http://en.wikipedia.org/wiki/Social_network_analysis

Collaborative Filtering
Understanding people's shared interests.
Target advertising; improve the shopping experience.
Algorithms: ALS, weighted ALS, SGD, biased SGD. Proposed: SVD++, sparse ALS, tensor factorization.
http://glamproduction.wordpress.com/360-cross-platform-pitch-draft/
http://ozgekaraoglu.edublogs.org/2009/07/31/two-gr8-projects-to-collaborate/

Graphical Models
Probabilistic analysis for correlated data: improved predictions, quantified uncertainty, extracted relationships.
Algorithms: loopy belief propagation, max-product LP. Coming soon: Gibbs sampling, parameter learning, L1 structure learning, M3Net, kernel belief propagation.

Computer Vision (CloudCV)
Making sense of pictures.
Recognizing people, medical imaging, enhancing images.
Algorithms: image stitching, feature extraction. Coming soon: person/object detectors, interactive segmentation, face recognition.
Image: http://www.cs.brown.edu/courses/cs143/

Clustering
Identify groups of related data.
Group customers and products, community detection, identify outliers.
Algorithms: K-Means++. Coming soon: structured EM, hierarchical clustering, nonparametric *-means.
Image: http://jordi.pro/netbiz/2012/05/connecting-people-a-must-for-clusters/

Topic Modeling
Extract meaning from raw text.
Improved search, summarize textual data, find related documents.
Algorithms: LDA Gibbs sampler. Coming soon: CVB0 for LDA, LSA/LSI, correlated topic models, trending topic models.
Image: http://knowledgeblog.org/files/2011/03/jiscmrd_wordle.png

GraphChi: Going Small with GraphLab
Solve huge problems on small or embedded devices? The key is to exploit non-volatile memory (starting with SSDs and hard drives).
GraphChi: disk-based GraphLab, built on a novel Parallel Sliding Windows algorithm.
Fast! It solves tasks as large as current distributed systems, minimizes non-sequential disk accesses (efficient on both SSDs and hard drives), and runs with parallel, asynchronous execution.

Triangle Counting in the Twitter Graph (40M users, 1.2B edges; total: 34.8 billion triangles)
Hadoop [Suri & Vassilvitskii '11]: 1536 machines, 423 minutes.
GraphLab: 64 machines (1024 cores), 1.5 minutes.
GraphChi: 59 minutes on 1 Mac Mini!

Demo: Streaming Graph Updates
A stream of Twitter social-graph updates: ingest 100,000 graph updates per second while simultaneously computing PageRank on a Mac Mini, sustaining a throughput of 200K updates/second.

Release 2.1 is available now: http://graphlab.org (documentation, code, tutorials, with more on the way).
GraphChi 0.1 is available now: http://graphchi.org
Carnegie Mellon