Hadoop Graph Processing with Apache Giraph
Post on 22-Sep-2014
DESCRIPTION
PayPal provides an online money transfer network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender, receiver, and transaction. In this talk, we will describe a novel network inference approach, built on Apache Giraph, that calculates a transaction risk score which also includes the risk profile of neighboring senders and receivers. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.
TRANSCRIPT
![Page 1: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/1.jpg)
June, 2013
Jay Tang
GRAPH MINING WITH APACHE GIRAPH
![Page 2: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/2.jpg)
Confidential and Proprietary2
• Introduction
• Big Data problem
• Graph mining platform
• Use case
• Lessons
• Future work
AGENDA
![Page 3: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/3.jpg)
• Director of Big Data Platform & Analytics, PayPal
− Hadoop, Graph mining, Real-time analytics, ML, text mining
• 20 years of software experience in the valley focused on data
• Member of original Hadoop team @Yahoo
• Built data warehouse, relational database, OLAP product @Yahoo, Oracle/Hyperion, IBM Informix, DB2
ABOUT ME
![Page 4: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/4.jpg)
BIG DATA PROBLEM
![Page 5: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/5.jpg)
• Enable Online, Offline, and Mobile payment
• 128M customers worldwide
• $160B payment volume processed annually
• Major retail locations accepting PayPal: 20K today, 2M by end of 2013
• PayPal Here launching in US and international markets
Petabyte Data Problem & Growing
BIG DATA PROBLEM @ PAYPAL
![Page 6: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/6.jpg)
• Detect and prevent fraud
• Assess credit risk
• Deliver relevant offers to our customers
• Improve user experience
• Provide better insights to our merchants
BIG DATA POWERS PAYPAL ANALYTICS
![Page 7: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/7.jpg)
GRAPH MINING PLATFORM
![Page 8: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/8.jpg)
BIG DATA STACK
DataCloud
![Page 9: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/9.jpg)
Traditional data processing abstraction -- TABLE
• Rows
• Columns
• Data Types
DATA ABSTRACTION
![Page 10: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/10.jpg)
• Internet & WWW
• Social network
• PayPal payment network – accounts & transactions
GRAPH IS EVERYWHERE
![Page 11: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/11.jpg)
• Think like a vertex
• Two basic operations
− Fusion: aggregate information from neighbors to a set of entities
− Diffusion: propagate information from a vertex to neighbors
GRAPH COMPUTING
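The two operations above can be sketched on a toy graph. This is only an illustration, not code from the talk: the graph, the `diffuse` and `fuse` function names, and the use of `sum` as the aggregation are all assumptions.

```python
from collections import defaultdict

# Hypothetical toy graph: vertex -> list of neighbor vertices.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}


def diffuse(graph, values):
    """Diffusion: every vertex sends its current value to each neighbor."""
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    return inbox


def fuse(graph, inbox, combine=sum):
    """Fusion: every vertex aggregates the messages it received."""
    return {vertex: combine(inbox.get(vertex, [])) for vertex in graph}


values = {"a": 1.0, "b": 2.0, "c": 3.0}
fused = fuse(GRAPH, diffuse(GRAPH, values))
```

Each vertex only ever looks at its own value and its own inbox, which is what "think like a vertex" refers to.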
![Page 12: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/12.jpg)
THINK LIKE A VERTEX - FUSION
![Page 13: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/13.jpg)
THINK LIKE A VERTEX - DIFFUSION
![Page 14: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/14.jpg)
• Which graph mining engine to use?
− GraphLab
− Apache Giraph
− Apache Hama
• Hadoop compatible
− Data is on Hadoop
− Leverage existing cluster infrastructure
− Integration with Hadoop
• Ease of deployment and update
• Community
GRAPH MINING ENGINE
![Page 15: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/15.jpg)
• Apache open-source implementation of Google Pregel on Hadoop
• Send messages from a vertex to any other vertex
• In-memory scalable system
− Map-only jobs, Zookeeper, Netty
BSP & GIRAPH
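In the Pregel-style bulk synchronous parallel (BSP) model, computation proceeds in supersteps: each vertex reads the messages sent to it in the previous superstep, updates its value, and sends new messages, with a barrier between supersteps. The single-process simulation below is a rough sketch of that loop using the classic "propagate the maximum value" example. It is not Giraph code; `run_max_value` and the sample graph are made up for illustration.

```python
def run_max_value(graph, values, max_supersteps=30):
    """Spread the largest vertex value through the graph, superstep by superstep."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    superstep = 0
    while superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in graph:
            messages = inbox[v]
            if superstep > 0 and not messages:
                continue  # vertex has halted and nothing woke it up
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            # Send only in the first superstep or when the value changed;
            # otherwise the vertex effectively votes to halt.
            if superstep == 0 or changed:
                for neighbor in graph[v]:
                    outbox[neighbor].append(new_value)
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        superstep += 1
        if not any(inbox.values()):
            break  # no messages in flight, so every vertex has halted
    return values, superstep


# Hypothetical three-vertex chain 1 - 2 - 3.
chain = {1: [2], 2: [1, 3], 3: [2]}
final_values, supersteps = run_max_value(chain, {1: 3, 2: 6, 3: 1})
```

The job ends when no vertex has messages left to process, which is the same termination idea Pregel-style engines use.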
![Page 16: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/16.jpg)
GRAPH MINING USE CASE
![Page 17: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/17.jpg)
• Stop fraudsters from stealing money from PayPal payment network
• Sophisticated risk models running in real-time based on
− Online data
− Offline data
• Risk profile traditionally based on a variety of data
− Account
− Transaction -- frequency, amount, history
− IP
− Email domain
RISK DETECTION & MITIGATION
![Page 18: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/18.jpg)
RISK COMPUTATION
Current TX Details
Risk Models
Approve
Decline
History Data
![Page 19: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/19.jpg)
• PayPal data are connected
• Form multiple communities that have hidden inferences
• Discover the inferences via a graph approach
• Build a system to extract the inferences
GRAPH MINING CONNECTED DATA
![Page 20: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/20.jpg)
GRAPH VIEW OF DATA
User1
User2
Merchant
BUY
BUY
P2P Money Transfer
![Page 21: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/21.jpg)
GRAPH VIEW OF DATA
Account 1
IP1 IP2
Account 2
IP3
![Page 22: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/22.jpg)
GRAPH MINING DATA PIPELINE
Pre Processing (MapReduce)
Graph Processing (Giraph)
Post Processing (MapReduce)
![Page 23: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/23.jpg)
• Input data is raw transaction data
• Custom MapReduce jobs to pre-process data into graph model
• Output is an adjacency list in JSON format
− Easy to consume in Java and by humans
− Use gson library
• Post processing – output format conversion
GRAPH DATA PIPELINE
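The pre-processing step can be pictured as grouping raw transaction records by sender and writing one JSON line per vertex. The sketch below is an assumption-heavy illustration: the slides do not give the actual schema, so the `[id, value, [[target, amount], ...]]` layout, the `to_adjacency_json` name, and the sample records are hypothetical, and it uses Python's standard `json` module in place of the gson library mentioned above.

```python
import json
from collections import defaultdict


def to_adjacency_json(transactions):
    """Turn (sender, receiver, amount) records into one JSON line per vertex."""
    edges = defaultdict(list)
    for sender, receiver, amount in transactions:
        edges[sender].append([receiver, amount])
        # Make sure receivers with no outgoing edges still get a line.
        edges.setdefault(receiver, [])
    # Assumed line layout: [vertex id, initial vertex value, [[target, amount], ...]]
    return [
        json.dumps([vertex, 0.0, out_edges])
        for vertex, out_edges in sorted(edges.items())
    ]


# Hypothetical raw transactions.
raw = [("u1", "m1", 25.0), ("u2", "m1", 10.0), ("u1", "u2", 5.0)]
lines = to_adjacency_json(raw)
```

In the pipeline described here this grouping runs as a MapReduce job over the raw data; the in-memory version only shows the shape of the output.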
![Page 24: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/24.jpg)
• Customers/Accounts linked via transactions
• Compute risk = intrinsic risk + risk propagated from peers
• Send risk message to peers
• Iterate until convergence
GRAPH PROCESSING
Cus1
Cus2
Transaction T1
Transaction T0
Transaction T2
Transaction T3
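The update rule on this slide (risk = intrinsic risk + risk propagated from peers, repeated until convergence) might look roughly like the following. The talk does not specify how peer risk is combined, so the averaging, the `weight` factor of 0.5, the tolerance, and the sample scores are all assumptions chosen so that the iteration converges.

```python
def propagate_risk(peers, intrinsic, weight=0.5, tolerance=1e-6, max_iterations=100):
    """Iterate risk = intrinsic + weight * mean(peer risk) until values settle."""
    risk = dict(intrinsic)
    for iteration in range(1, max_iterations + 1):
        updated = {}
        for account, neighbors in peers.items():
            incoming = [risk[n] for n in neighbors]  # risk messages from peers
            peer_risk = sum(incoming) / len(incoming) if incoming else 0.0
            updated[account] = intrinsic[account] + weight * peer_risk
        delta = max(abs(updated[a] - risk[a]) for a in peers)
        risk = updated
        if delta < tolerance:
            return risk, iteration
    return risk, max_iterations


# Hypothetical accounts linked by transactions, with made-up intrinsic scores.
peers = {"cus1": ["cus2"], "cus2": ["cus1"]}
intrinsic = {"cus1": 0.8, "cus2": 0.1}
risk, iterations = propagate_risk(peers, intrinsic)
```

With these inputs the low-risk account ends up with a noticeably higher score because of its risky peer, which is the kind of signal the per-transaction approach cannot see. The resulting scores are unnormalized.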
![Page 25: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/25.jpg)
GRAPH PROCESSING
[Diagram: Account 1 and Account 2 with IP nodes IP1, IP2, IP3]
![Page 26: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/26.jpg)
LESSONS LEARNED
![Page 27: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/27.jpg)
• Giraph is an emerging technology
− Incubation in 2012
− Rapidly evolving
− 0.1 and 0.2 are not compatible
− Lack of knowledge and documentation
• Build internal git repo
• Read code and join mailing list
• Port code from 0.1 to 0.2
• Use Giraph 1.0 released on May 6 2013
GIRAPH
![Page 28: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/28.jpg)
• Must guarantee minimum number of Mappers
• Capacity scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
• Fair scheduler
− Set the queue's minimum mapper count above what the Giraph job needs
− Turn on pre-emption
− Set pre-emption wait time to a small interval – 20 sec
HADOOP ENVIRONMENT INTEGRATION
![Page 29: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/29.jpg)
• Memory constraint in a shared Hadoop environment
− 1.2B edges and 300M nodes
− Single purpose POC cluster mapper memory = 10 GB
− Shared R&D cluster mapper memory = 3 GB
• Reducing memory consumption is key
− Convert String to long for graph processing
− Convert back to String in post-processing for downstream applications
− Cap the number of messages passed, based on
  − distance from current vertex
  − message payload data values
MEMORY SCALABILITY
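Two of the memory tactics above, replacing string IDs with integers and capping how far a message travels, can be sketched as follows. The helper names, the sample IDs, and the `MAX_HOPS` value are hypothetical; at the scale described here the ID mapping would itself be a pre-processing job and not an in-memory dictionary.

```python
def build_id_map(string_ids):
    """Assign each distinct string ID a compact integer, in first-seen order."""
    to_long = {}
    for sid in string_ids:
        if sid not in to_long:
            to_long[sid] = len(to_long)
    to_string = {num: sid for sid, num in to_long.items()}
    return to_long, to_string


MAX_HOPS = 2  # hypothetical cap on distance from the originating vertex


def forward(message):
    """Return the message to relay, or None once it has travelled MAX_HOPS."""
    origin, hops, payload = message
    if hops >= MAX_HOPS:
        return None
    return (origin, hops + 1, payload)


# Hypothetical edges between account and IP vertices.
edges = [("acct-001", "ip-10.0.0.1"), ("acct-002", "ip-10.0.0.1")]
to_long, to_string = build_id_map(v for edge in edges for v in edge)
encoded = [(to_long[s], to_long[t]) for s, t in edges]      # used during graph processing
decoded = [(to_string[s], to_string[t]) for s, t in encoded]  # restored in post-processing
```

Integer IDs take a fixed eight bytes as a Java `long`, whereas string IDs carry per-object overhead on every vertex, edge, and message.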
![Page 30: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/30.jpg)
• Giraph-based data engine to produce enriched data set
• Leverage Giraph on YARN
• Scalability in the number of workers
FUTURE WORK
![Page 31: Hadoop Graph Processing with Apache Giraph](https://reader033.vdocuments.net/reader033/viewer/2022061104/541f2fd17bef0ab16e8b4681/html5/thumbnails/31.jpg)
Q&A
WE ARE HIRING