Dale Wong - Spark GraphX Demo
TRANSCRIPT
Nature-Inspired Algorithm for AdTech Model Exploration
• Pages are linked by similarity, forming a network of branches
• Ads are like butterflies, drifting towards attractors
• The flight of the butterflies is a function of local attraction to pages, plus some randomness to escape local minima
• System converges to ads hovering around relevant pages
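A minimal sketch of one flight step, in plain Scala rather than GraphX. It is not the speaker's code: the names `nextPage`, `attraction` and `exploreProb` are hypothetical, and the rule shown (random hop with some probability, otherwise move to the most attractive candidate) is just one simple way to combine local attraction with randomness.

```scala
import scala.util.Random

object ButterflyStep {
  // Assumed model: pages are identified by Long ids, and attraction(page)
  // says how strongly a page attracts this particular ad (higher is better).
  def nextPage(
      current: Long,
      neighbors: Seq[Long],
      attraction: Long => Double,
      exploreProb: Double,
      rng: Random
  ): Long = {
    if (neighbors.isEmpty) {
      current
    } else if (rng.nextDouble() < exploreProb) {
      // Random hop: lets the ad leave a page that is only locally best.
      neighbors(rng.nextInt(neighbors.length))
    } else {
      // Greedy hop: stay put unless a neighbor is strictly more attractive.
      neighbors.foldLeft(current) { (best, p) =>
        if (attraction(p) > attraction(best)) p else best
      }
    }
  }
}
```

With `exploreProb` set to 0 the ad climbs greedily and can get stuck; raising it trades convergence speed for exploration.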
Similarity Graph
• vertex = page
• edge = similarity
val allPairs = pages.cartesian(pages).filter{ case (a, b) => a._1 < b._1 }
val similarPairs = allPairs.filter{ case (page1, page2) => page1._2.intersect(page2._2).length >= 1 }
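To see what the two filters do, here is a small sketch of the same logic on an in-memory collection instead of an RDD. The `(id, words)` shape of a page is an assumption inferred from the `_1` and `_2` accesses in the slide; `SimilarPairs` and `similarPairs` are illustrative names.

```scala
object SimilarPairs {
  // Assumed record shape: (page id, list of words on the page).
  type Page = (Long, Seq[String])

  def similarPairs(pages: Seq[Page]): Seq[(Long, Long)] =
    for {
      a <- pages
      b <- pages
      if a._1 < b._1                   // keep each unordered pair once, drop self-pairs
      if a._2.intersect(b._2).nonEmpty // at least one shared word
    } yield (a._1, b._1)
}
```

For example, pages 1 ("data", "scientist") and 2 ("data", "engineering") share a word and become an edge, while a page with no shared words stays unconnected. Comparing every pair is quadratic in the number of pages, which is worth keeping in mind at the data sizes quoted in this deck.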
Data Set
• Kaggle 2012 Challenge: Click-Thru Rate Prediction
• Actual data provided by a Chinese search company
• CSV files
• 26M search queries
  • each query has its list of words
  • e.g. “data scientist”
• 4M ads
  • each ad has its list of words
  • e.g. “Insight Data Engineering Program”
Butterfly Simulation is a Good Fit for Spark GraphX
• Many parallel computations of localized operations
• ad migration
• attraction propagation
• select ad for page request
Spark GraphX Google Pregel API
MapReduce for each vertex:
1. Send Messages (for each edge)
• send msgs to neighbors
2. Merge Messages
• merge msgs to same vertex
3. Vertex Program
• process incoming msgs
(Diagram labels: Vertex Program, Send Message, Merge Messages)
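The three steps can be modeled in a few lines of plain Scala. This is a local, single-machine sketch of one superstep, not the GraphX API itself: `MiniPregel`, `Edge` and `superstep` are hypothetical names, and only the function names `sendMsg`, `mergeMsg` and `vprog` are borrowed from the parameters of GraphX's Pregel operator.

```scala
object MiniPregel {
  final case class Edge(src: Long, dst: Long)

  // One superstep over an in-memory graph.
  def superstep[V, M](
      vertices: Map[Long, V],
      edges: Seq[Edge],
      sendMsg: (Edge, V, V) => Seq[(Long, M)], // 1. runs once per edge
      mergeMsg: (M, M) => M,                   // 2. combines msgs for one vertex
      vprog: (Long, V, M) => V                 // 3. updates a vertex from its merged msg
  ): Map[Long, V] = {
    // 1. Send: every edge sees both endpoint values and emits (target, msg) pairs.
    val sent: Seq[(Long, M)] =
      edges.flatMap(e => sendMsg(e, vertices(e.src), vertices(e.dst)))
    // 2. Merge: reduce all messages addressed to the same vertex into one.
    val inbox: Map[Long, M] =
      sent.groupBy(_._1).map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }
    // 3. Vertex program: only vertices that received a message are updated.
    vertices.map { case (id, v) =>
      id -> inbox.get(id).map(m => vprog(id, v, m)).getOrElse(v)
    }
  }
}
```

Note that the send step is written per edge, which is the point of the "(for each edge)" annotation: a vertex never iterates over its neighbors directly.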
Need to Adapt Programming Model
• Adapt vertex-centric algorithm to edge-centric API
• Replicate each vertex’s data onto its neighbors
• Replication implemented as a Pregel cycle in an initialization phase
• Localizes vertex calculations
• With a Spark GraphX cluster, network bandwidth is more of a concern than storage
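A sketch of what the replication pass might look like, again in plain Scala rather than GraphX. The `State` type and the `replicate` function are assumptions made for illustration; the talk does not show how the replicated data is stored.

```scala
object ReplicateNeighbors {
  // Assumed vertex state: own data plus a local copy of each neighbor's data.
  final case class State[D](own: D, neighborCopies: Map[Long, D])

  def replicate[D](
      data: Map[Long, D],
      edges: Seq[(Long, Long)]
  ): Map[Long, State[D]] = {
    // Send: each edge ships the src data to dst and the dst data to src.
    val msgs: Seq[(Long, Map[Long, D])] = edges.flatMap { case (src, dst) =>
      Seq(dst -> Map(src -> data(src)), src -> Map(dst -> data(dst)))
    }
    // Merge: union the per-neighbor maps addressed to the same vertex.
    val inbox: Map[Long, Map[Long, D]] =
      msgs.groupBy(_._1).map { case (id, ms) =>
        id -> ms.map(_._2).foldLeft(Map.empty[Long, D])(_ ++ _)
      }
    // Vertex program: store the merged copies next to the vertex's own data.
    data.map { case (id, d) =>
      id -> State(d, inbox.getOrElse(id, Map.empty[Long, D]))
    }
  }
}
```

After this pass, a vertex-centric calculation such as choosing where an ad should move can read `neighborCopies` locally. The cost is extra storage per vertex, which matches the trade-off stated in the last bullet.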