GraphLab: A New Parallel Framework for Machine Learning
A Distributed Abstraction for Large-Scale Machine Learning
Carlos Guestrin, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Haijie Gu
Carnegie Mellon

Needless to Say, We Need Machine Learning for Big Data
48 Hours of YouTube Video a Minute
24 Million Wikipedia Pages
750 Million Facebook Users
6 Billion Flickr Photos
"Data [is] a new class of economic asset, like currency or gold."
How will we design and implement parallel learning systems?

Big Learning: A Shift Towards Parallelism
GPUs, Multicore, Clusters, Clouds, Supercomputers
ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication. The resulting code is difficult to maintain, extend, and debug. Graduate students: avoid these problems by using high-level abstractions.

Data Parallelism (MapReduce)
CPU 1, CPU 2, CPU 3, CPU 4
Solve a huge number of independent subproblems. MapReduce has two parts: a Map stage and a Reduce stage. The Map stage represents embarrassingly parallel computation; that is, each computation is independent and can be performed on a different machine without any communication.

MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics.
Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).
Is there more to Machine Learning?
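The two stages can be sketched in a few lines of Python (an illustrative sketch of the programming model, not a distributed implementation; `map_fn` and `reduce_fn` are hypothetical names):

```python
from functools import reduce

def map_reduce(data, map_fn, reduce_fn):
    """Toy MapReduce: the Map stage runs independently over each record
    (embarrassingly parallel, no communication); the Reduce stage folds
    the partial results with an associative combiner."""
    mapped = [map_fn(x) for x in data]   # Map: could run on separate machines
    return reduce(reduce_fn, mapped)     # Reduce: combine partial results

# Example: a data-parallel sufficient statistic (sum of squares)
total = map_reduce([1, 2, 3, 4], lambda x: x * x, lambda a, b: a + b)
```

Because each `map_fn(x)` depends only on its own record, the list comprehension could be sharded across machines with no coordination, which is exactly what makes the Map stage embarrassingly parallel.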
What is this?
It's next to this
The Power of Dependencies
where the value is!

Label a Face and Propagate
grandma
Pairwise similarity is not enough
grandma
Who? Not similar enough to be sure.
Propagate Similarities & Co-occurrences for Accurate Predictions
grandma
grandma!!!
similarity edges, co-occurring faces: further evidence

Collaborative Filtering: Independent Case
Lord of the Rings
Star Wars IV
Star Wars I
Harry Potter
Pirates of the Caribbean
recommend

Collaborative Filtering: Exploiting Dependencies
recommend
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend?

Latent Topic Modeling (LDA)
Cat, Apple, Growth, Hat, Plant
Example Topics Discovered from Wikipedia
Machine Learning Pipeline
Data (images, docs, movie ratings) → Extract Features (faces, important words, side info) → Graph Formation (similar faces, shared words, rated movies) → Structured Machine Learning Algorithm (belief propagation, LDA, collaborative filtering) → Value from Data (face labels, doc topics, movie recommendations)
Parallelizing Machine Learning
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
Graph ingress is mostly data-parallel; the graph-structured computation is graph-parallel.

ML Tasks Beyond Data-Parallelism
Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics.
Graph-Parallel: graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).

Example of Graph Parallelism: PageRank
What's the rank of this user? Her rank depends on the rank of who follows her, which in turn depends on the rank of who follows them. Loops in the graph mean we must iterate!
PageRank can be expressed as an iterative algorithm, where each vertex updates its rank as a discounted weighted average of its neighbors' ranks. For example, in this graph, vertex 5 updates its rank.

PageRank Iteration
Iterate until convergence ("my rank is the weighted average of my friends' ranks"):
R[i] = α + (1 - α) Σ_j w_ji R[j]
where α is the random reset probability and w_ji is the probability of transitioning (similarity) from j to i.

Properties of Graph-Parallel Algorithms
Dependency graph, iterative computation, local updates (my rank depends on my friends' ranks).
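The iteration above can be written directly in Python (a sequential sketch; `links` maps each vertex i to its in-neighbors j and `weights[(j, i)]` holds w_ji, both hypothetical names):

```python
def pagerank(links, weights, alpha=0.15, tol=1e-8, max_iter=100):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] until the
    largest per-vertex change falls below tol."""
    rank = {i: 1.0 for i in links}
    for _ in range(max_iter):
        new_rank = {
            i: alpha + (1 - alpha) * sum(weights[(j, i)] * rank[j] for j in in_nbrs)
            for i, in_nbrs in links.items()
        }
        if max(abs(new_rank[i] - rank[i]) for i in rank) < tol:
            return new_rank
        rank = new_rank
    return rank

# Tiny 2-vertex cycle: each vertex's rank depends on the other's
links = {0: [1], 1: [0]}
weights = {(0, 1): 1.0, (1, 0): 1.0}
ranks = pagerank(links, weights)
```

The loop in the graph is exactly why a single pass is not enough: each vertex's rank feeds back into its neighbors', so we sweep until a fixed point.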
Addressing Graph-Parallel ML
MapReduce handles the data-parallel tasks (cross validation, feature extraction, computing sufficient statistics); a graph-parallel abstraction is needed for graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), data-mining (PageRank, triangle counting), and collaborative filtering (tensor factorization).

Graph Computation: Synchronous vs. Asynchronous

Bulk Synchronous Parallel Model: Pregel (Giraph) [Valiant '90]
Compute, Communicate, Barrier.

BSP Systems Problem: Curse of the Slow Job
Every iteration waits at a barrier for the slowest machine; bulk synchronous parallel systems can be highly inefficient. The bulk synchronous parallel model is provably inefficient for some ML tasks.

Analyzing Belief Propagation
[Gonzalez, Low, G. '09]
Smart scheduling with a priority queue: focus computation where the influence is important.
The asynchronous parallel model (rather than BSP) is fundamental for efficiency.

Asynchronous Belief Propagation
Synthetic noisy image; cumulative vertex updates range from many updates to few updates. The algorithm identifies and focuses on the hidden sequential structure of the graphical model. Challenge = boundaries.

BSP ML Problem: Synchronous Algorithms Can Be Inefficient
Theorem: Bulk synchronous BP is O(#vertices) slower than asynchronous (Splash) BP.
An efficient parallel implementation was painful, painful, painful. Again, we want to avoid this type of problem-specific tedious labor.
The Need for a New Abstraction
MapReduce covers data-parallel tasks (cross validation, feature extraction, computing sufficient statistics), and BSP (e.g., Pregel) covers only part of the graph-parallel space; graphical models (Gibbs sampling, belief propagation, variational opt.), semi-supervised learning (label propagation, CoEM), data-mining (PageRank, triangle counting), and collaborative filtering (tensor factorization) need asynchronous, dynamic parallel computations.

The GraphLab Goals
Designed specifically for ML: graph dependencies, iterative, asynchronous, dynamic.
Simplifies the design of parallel programs: abstracts away hardware issues, automatic data synchronization, addresses multiple hardware architectures.
Know how to solve your ML problem on 1 machine? Get efficient parallel predictions.
Data Graph
Data is associated with vertices and edges.
Vertex data: user profile text, current interest estimates.
Edge data: similarity weights.
Graph: social network.

pagerank(i, scope) {
  // Get neighborhood data
  (R[i], w_ji, R[j]) <- scope;
  // Update the vertex data
  R[i] <- alpha + (1 - alpha) * sum_j(w_ji * R[j]);
  // Reschedule neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex.
Dynamic computation: update functions are applied (asynchronously) in parallel until convergence.
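A minimal sketch of this execution model: a scheduler applies a user-defined update function to vertices, and an update reschedules its neighbors when its value changed (a sequential stand-in for the parallel engine; `run_updates` and `smooth` are hypothetical names):

```python
from collections import deque

def run_updates(graph, values, update_fn):
    """Toy GraphLab-style engine: apply the update function to scheduled
    vertices until the schedule drains (dynamic computation)."""
    schedule = deque(graph)            # initially schedule every vertex
    scheduled = set(graph)
    while schedule:
        v = schedule.popleft()
        scheduled.discard(v)
        changed = update_fn(v, graph, values)
        if changed:                    # reschedule neighbors if needed
            for nbr in graph[v]:
                if nbr not in scheduled:
                    scheduled.add(nbr)
                    schedule.append(nbr)

# Example update: relax each vertex toward the average of its neighbors
def smooth(v, graph, values):
    new = sum(values[n] for n in graph[v]) / len(graph[v])
    if abs(new - values[v]) > 1e-6:
        values[v] = new
        return True
    return False

g = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
vals = {0: 0.0, 1: 3.0, 2: 6.0}
run_updates(g, vals, smooth)           # vals converge toward a common value
```

Computation stops on its own once no update changes anything, which is the "until convergence" behavior; a real engine would run many such updates in parallel under a consistency model.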
Many schedulers are available to prioritize computation.

Ensuring Race-Free Code
How much can computation overlap? Is there a need for consistency?

Consistency in Collaborative Filtering
Netflix data, 8 cores: comparing consistent vs. inconsistent updates; GraphLab guarantees consistent updates. A user-tunable consistency level trades off parallelism and consistency. (Full Netflix data, 8 cores: without consistency, highly connected movies produce bad intermediate results.)

The GraphLab Framework
Graph-based data representation, update functions (user computation), scheduler, consistency model.
Implemented algorithms: Bayesian tensor factorization, Gibbs sampling, dynamic block Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, K-Means, SVD, LDA, linear solvers, the Splash sampler, alternating least squares, and many others.

GraphLab on CoEM (Never-Ending Learner Project)
Hadoop: 95 cores, 7.5 hrs.
GraphLab: 16 cores, 30 min; 15x faster with 6x fewer CPUs!
Distributed GraphLab: 32 EC2 machines, 80 secs; 0.3% of the Hadoop time.

The Cost of the Wrong Abstraction
(Log scale!) GraphLab 1 provided exciting scaling performance, but thus far we couldn't scale up to the Altavista Webgraph (2002): 1.4B vertices, 6.7B edges.

Natural Graphs
[Image from WikiCommons]

Assumptions of Graph-Parallel Abstractions
Idealized structure: small neighborhoods, low-degree vertices, vertices with similar degree, easy to partition.
Natural graphs: large neighborhoods, high-degree vertices, power-law degree distribution, difficult to partition.
Natural Graphs: Power Law
Altavista Web Graph: 1.4B vertices, 6.7B edges.
The top 1% of vertices is adjacent to 53% of the edges!
Power-law slope ≈ 2.

High-Degree Vertices Are Common
Netflix: users and movies (popular movies).
Social networks: people (e.g., Obama).
LDA: docs and words (common words), hyperparameters.

Problem: High-Degree Vertices Limit Parallelism
Sequential vertex updates touch a large fraction of the graph (GraphLab 1); many messages are produced (Pregel); edge information is too large for a single machine; asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).

Problem: High-Degree Vertices Mean High Communication for Distributed Updates
The data transmitted across the network is O(# cut edges), and natural graphs do not have low-cost balanced cuts [Leskovec et al. '08, Lang '04].
Popular partitioning tools (Metis, Chaco, ...) perform poorly [Abou-Rjeili et al. '06]: they are extremely slow and require substantial memory.

Random Partitioning
Both GraphLab 1 and Pregel proposed random (hashed) partitioning for natural graphs.
For p machines, the expected fraction of cut edges is 1 - 1/p:
10 machines → 90% of edges cut; 100 machines → 99% of edges cut!

In Summary
GraphLab 1 and Pregel are not well suited for natural graphs.
Poor performance on high-degree vertices; low-quality partitioning.

GraphLab 2 solutions:
1. Distribute a single vertex update: move computation to data; parallelize high-degree vertices.
2. Vertex partitioning: a simple online approach that effectively partitions large power-law graphs.
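The cut percentages quoted above follow from a one-line argument: under random placement an edge is uncut only when both endpoints hash to the same machine, which happens with probability 1/p. A quick check (a sketch; `expected_edges_cut` is a hypothetical helper):

```python
def expected_edges_cut(p):
    """With random (hashed) vertex placement over p machines, each endpoint
    lands on any machine with probability 1/p, so both endpoints coincide
    with probability 1/p; the expected fraction of edges cut is 1 - 1/p."""
    return 1.0 - 1.0 / p

# Matches the slide: 10 machines -> 90% cut, 100 machines -> 99% cut
ten = expected_edges_cut(10)
hundred = expected_edges_cut(100)
```

This is why edge cuts degrade as the cluster grows: adding machines drives the cut fraction toward 100%.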
Factorized Vertex Updates
Split the update into three phases:
Gather (data-parallel over edges): a parallel sum over the scope of the vertex.
Apply: locally apply the accumulated value to the vertex.
Scatter (data-parallel over edges): update the neighbors.

Factorized PageRank

gather(scope, edge) {
  // Compute the neighbor's contribution
  return edge.weight * edge.source.value
}
merge(accum1, accum2) {
  // Add contributions
  return accum1 + accum2
}
apply(scope, accum) {
  // Update the vertex's rank
  scope.value = ALPHA + (1 - ALPHA) * accum
}
scatter(scope, edge) {
  // Schedule the neighbor if the change is large
  if (abs(scope.value - old_value) > EPSILON)
    signal(edge.target)
}
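The four phases above can be exercised with plain functions driven by a small sequential loop (a sketch; GraphLab's real engine runs gather and scatter data-parallel over edges, and `update`, `apply_phase`, and the `in_edges` list of `(j, w_ji)` pairs are hypothetical names):

```python
ALPHA, EPSILON = 0.15, 1e-6

def gather(value, weight):
    """Gather: one in-edge's contribution to the rank sum."""
    return weight * value

def merge(accum1, accum2):
    """Merge: combine partial gather results (a parallel sum)."""
    return accum1 + accum2

def apply_phase(old_value, accum):
    """Apply: compute the vertex's new rank from the accumulated sum."""
    return ALPHA + (1 - ALPHA) * accum

def scatter(old_value, new_value):
    """Scatter: decide whether neighbors should be signalled."""
    return abs(new_value - old_value) > EPSILON

def update(i, in_edges, rank):
    """One factorized vertex update over in-edges (j, w_ji)."""
    acc = 0.0
    for j, w_ji in in_edges:
        acc = merge(acc, gather(rank[j], w_ji))
    new = apply_phase(rank[i], acc)
    signal_neighbors = scatter(rank[i], new)
    rank[i] = new
    return signal_neighbors

rank = {0: 0.0, 1: 1.0}
signalled = update(0, [(1, 1.0)], rank)
```

Because merge is associative, the gather loop can be split across machines and the partial sums combined, which is what lets a single high-degree vertex update run in parallel.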
Triangle Counting
For each vertex in the graph, count the number of triangles containing it.
Measures both the popularity of the vertex and the cohesiveness of the vertex's community:
more triangles = stronger community; fewer triangles = weaker community.

Factorized Triangle Counting
Gather: add each neighbor to a list (the gather phase creates a neighbor list for each vertex).
Merge: union the lists.
Apply: save the neighbor list.
Scatter: compare the neighbor lists of adjacent vertices and count the number of shared neighbors (each is a triangle).
Two adjacent vertices exchange "I'm neighbors with: ..." lists; the work breaks down into reading neighbor lists, computing intersections, and writing results.
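The scheme above can be sketched with set intersections (a sequential sketch mirroring the gather/scatter phases; `triangle_counts` is a hypothetical name):

```python
def triangle_counts(graph):
    """Per-vertex triangle counts: build each vertex's neighbor set (gather),
    then intersect it with each neighbor's set (scatter).  A triangle at v
    is found via both of its other corners, hence the division by 2."""
    nbrs = {v: set(adj) for v, adj in graph.items()}
    return {v: sum(len(nbrs[v] & nbrs[u]) for u in nbrs[v]) // 2 for v in graph}

# 4-clique: every vertex participates in 3 triangles
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
counts = triangle_counts(g)
```

The expensive part is exactly the neighbor-list intersections, which is why high-degree vertices (long lists) dominate the running time.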
Triangle Counting on Twitter
Distinguishes popular people from popular people with strong communities.

Factorized Belief Propagation
Gather: accumulate the product of in-messages.
Apply: update the central belief.
Scatter: compute out-messages and schedule neighbors as needed.

Collaborative Filtering (via Alternating Least Squares)
Goal: discover latent categories for users and movies.

Factorized Collaborative Filtering Updates
Gather: sum, over movies, the product of ratings and factor weights (and a little more info).
Apply: compute the user's new factor weights.
Iterate over users and movies.

Multicore Performance
Multicore PageRank (25M vertices, 355M edges): GraphLab 1, GraphLab 2 Factorized, Pregel (implemented in GraphLab), and GraphLab 2 Factorized + Caching.

Factorized updates split gather and scatter across machines, giving a significant decrease in communication: O(1) data transmitted over the network per update.

Minimizing Communication in GraphLab 2: Vertex Cuts
Communication is linear in the number of machines each vertex spans, and a vertex-cut minimizes the number of machines per vertex. Percolation theory suggests that power-law graphs can be split by removing only a small set of vertices [Albert et al. 2000]: small vertex cuts are possible!

Random Vertex Cuts vs. Edge Cuts

Constructing Vertex Cuts
Goal: parallel graph partitioning on ingress. GraphLab 2 provides three simple approaches:
Random edge placement: edges are placed randomly by each machine; good theoretical guarantees.
Greedy edge placement with coordination: edges are placed using a shared objective; better theoretical guarantees.
Oblivious greedy edge placement: edges are placed using a local objective.
Beyond random vertex cuts!
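Random and oblivious-greedy edge placement can be contrasted with a small sketch (a toy comparison on a star graph, not GraphLab's actual ingress code; all helper names are hypothetical, and the coordinated variant is omitted):

```python
import random

def replication_factor(placement, edges):
    """Average number of machines each vertex spans; in a vertex-cut,
    communication is linear in this number."""
    spans = {}
    for (u, v), m in zip(edges, placement):
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)

def random_placement(edges, p, rng):
    """Random edge placement: assign each edge to one of p machines."""
    return [rng.randrange(p) for _ in edges]

def greedy_placement(edges, p):
    """Oblivious greedy: prefer a machine that already holds one of the
    edge's endpoints (real implementations also balance load against
    replication, which this sketch ignores)."""
    load, spans, placement = [0] * p, {}, []
    for u, v in edges:
        candidates = (spans.get(u, set()) | spans.get(v, set())) or set(range(p))
        m = min(candidates, key=lambda i: load[i])
        placement.append(m)
        load[m] += 1
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return placement

# A star graph (one high-degree hub), the hard case for edge cuts:
edges = [(0, i) for i in range(1, 101)]
rand_rf = replication_factor(random_placement(edges, 8, random.Random(0)), edges)
greedy_rf = replication_factor(greedy_placement(edges, 8), edges)
```

On the star, random placement spreads the hub over nearly all machines while greedy keeps related edges together, so `greedy_rf` is lower: exactly the replication that the greedy ingress strategies aim to minimize.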
From the Abstraction to a System

GraphLab Version 2.1 API (C++)
Toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering.
Engine: synchronous engine, asynchronous engine, fault tolerance, distributed graph, map/reduce ingress.
Platform: Linux cluster services (Amazon AWS), MPI/TCP-IP comms, PThreads, Boost, Hadoop/HDFS.
Substantially simpler than the original GraphLab: the synchronous engine is under 600 lines of code.
Triangle Counting in the Twitter Graph
40M users, 1.2B edges; total: 34.8 billion triangles.
Hadoop [1]: 1536 machines, 423 minutes.
GraphLab: 64 machines (1024 cores), 1.5 minutes.
[1] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW '11: Proceedings of the 20th International Conference on World Wide Web, 2011.

LDA Performance
All English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens.
State-of-the-art LDA sampler (Alex Smola, 100 machines): 150 million tokens per second.
GraphLab sampler (64 cc2.8xlarge EC2 nodes): 100 million tokens per second, using only 200 lines of code and 4 human hours.
PageRank
40M webpages, 1.4 billion links. Scaling to 100 iterations so that the costs are in dollars and not cents:
Hadoop [Kang et al. '11]: 5.5 hrs, $180.
Twister (in-memory MapReduce) [Ekanayake et al. '10]: 1 hr, $41.
GraphLab: 8 min, $12.
Comparable numbers are hard to come by since everyone uses different datasets, but we try to equalize as much as possible, giving the competition the advantage when in doubt.
Numbers: Hadoop ran on a Kronecker graph of 1.1B edges on 50 M45 machines; from available numbers, each machine is approximately 8 cores with 6 GB RAM, roughly a c1.xlarge instance, putting the total cost at $33 per hour ("Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation," U Kang, Brendan Meeder, Christos Faloutsos). Twister ran on the ClueWeb dataset of 50M pages and 1.4B edges, using 64 nodes of 4 cores each with 16 GB RAM per node, roughly an m2.xlarge to m2.2xlarge instance, putting the total cost at $28-$56 per hour (Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," The First International Workshop on MapReduce and its Applications (MAPREDUCE'10), HPDC 2010). GraphLab ran on 64 cc1.4xlarge instances at $83.2 per hour.

How Well Does GraphLab Scale?
Yahoo Altavista Web Graph (2002), one of the largest publicly available webgraphs: 1.4B webpages, 6.6 billion links.
64 HPC nodes: 1024 cores (2048 hyperthreads), 4.4 TB RAM.
11 minutes: 1B links processed per second, with 30 lines of user code.
Release 2.1 is available now under the Apache 2 License.

GraphLab easily incorporates external toolkits, automatically detecting and building them.

GraphLab Toolkits

Graph Processing
Extract knowledge from graph structure.
Find communities, identify important individuals, detect vulnerabilities.
Algorithms: triangle counting, PageRank, k-cores. Coming soon: max-flow, matching, connected components, label propagation.
http://en.wikipedia.org/wiki/Social_network_analysis

Collaborative Filtering
Understanding people's shared interests.
Target advertising; improve the shopping experience.
Algorithms: ALS, weighted ALS, SGD, biased SGD. Proposed: SVD++, sparse ALS, tensor factorization.
http://glamproduction.wordpress.com/360-cross-platform-pitch-draft/
http://ozgekaraoglu.edublogs.org/2009/07/31/two-gr8-projects-to-collaborate/

Graphical Models
Probabilistic analysis for correlated data: improved predictions, quantified uncertainty, extracted relationships.
Algorithms: loopy belief propagation, max-product LP. Coming soon: Gibbs sampling, parameter learning, L1 structure learning, M3Net, kernel belief propagation.

Computer Vision (CloudCV)
Making sense of pictures.
Recognizing people, medical imaging, enhancing images.
Algorithms: image stitching, feature extraction. Coming soon: person/object detectors, interactive segmentation, face recognition.
Image: http://www.cs.brown.edu/courses/cs143/

Clustering
Identify groups of related data.
Group customers and products, community detection, identify outliers.
Algorithms: K-Means++. Coming soon: structured EM, hierarchical clustering, nonparametric *-means.
Image: http://jordi.pro/netbiz/2012/05/connecting-people-a-must-for-clusters/

Topic Modeling
Extract meaning from raw text.
Improved search, summarize textual data, find related documents.
Algorithms: LDA Gibbs sampler. Coming soon: CVB0 for LDA, LSA/LSI, correlated topic models, trending topic models.
Image: http://knowledgeblog.org/files/2011/03/jiscmrd_wordle.png

GraphChi: Going Small with GraphLab
Solve huge problems on small or embedded devices? The key is to exploit non-volatile memory (starting with SSDs and hard drives).
GraphChi: disk-based GraphLab, built on a novel Parallel Sliding Windows algorithm.
Fast! It solves tasks as large as current distributed systems, minimizes non-sequential disk accesses (efficient on both SSDs and hard drives), and runs with parallel, asynchronous execution.

Triangle Counting in the Twitter Graph (40M users, 1.2B edges; total: 34.8 billion triangles)
Hadoop [Suri & Vassilvitskii '11]: 1536 machines, 423 minutes.
GraphLab: 64 machines (1024 cores), 1.5 minutes.
GraphChi: 59 minutes on 1 Mac Mini!

Demo: Streaming Graph Updates
A stream of Twitter social-graph updates: ingest 100,000 graph updates per second while simultaneously computing PageRank on a Mac Mini, sustaining a throughput of 200K updates/second.

Release 2.1 is available now: http://graphlab.org (documentation, code, tutorials, with more on the way).
GraphChi 0.1 is available now: http://graphchi.org
Carnegie Mellon