Real time fraud detection at 1+M scale on hadoop stack
Ishan Chhabra, Nitin Aggarwal (Rocketfuel Inc)

Upload: hadoop-summit

Post on 07-Jan-2017


TRANSCRIPT

Page 1:

Real time fraud detection at 1+M scale on hadoop stack

Ishan Chhabra, Nitin Aggarwal (Rocketfuel Inc)

Page 2:

Agenda
• Rocketfuel & the advertising auction process
• Various kinds of fraud
• Problem statement
• Helios: architecture
• Implementation in the Hadoop ecosystem
• Details about the HDFS spout and datacube
• Key takeaways

Page 3:

Rocketfuel Inc

• AdTech firm that enables marketers with AI & Big Data

• Scores 120+ billion ad auctions in a day

• Handles 1-2 million TPS at peak traffic

Page 4:

Auction Process

(1) Request Ad
(2) Auction request
(3) $$ + HTML
(4a) Serve Ad
(4b) Notification
(5) Record impression
(6) Render Ad
(7) Ad HTML

Page 5:

Exchange - Rocketfuel discrepancy

(4b) Notification vs. (5) Record impression

count(4b) != count(5)

Page 6:

Rocketfuel - Advertiser discrepancy

(5) Record impression

(6) Render Ad

count(5) != count(6)

Page 7:

Common causes
• Fraud
  – Bot networks and malware
  – Hidden ad slots
• Human error
  – Ad JavaScript site- or browser-specific issues
  – Bugs in Ad JavaScript
  – 3rd-party JavaScript interactions in the Ad or site

Page 8:

Need for real time
• Micro-patterns that change frequently
• Latency has a big business impact; delays in reacting lead to loss of money
• Discrepancies often arise from breakages and sudden, unexpected changes

Page 9:

Goal: Significantly reduce money loss from both ends by reacting to these micro-patterns in near real time

Page 10:

Data flow

[Diagram: bidding sites (x2 at each stage) feeding the analytics site]

Page 11:

Data flow

• Bids & notifications (batched and delayed)
• Impressions (near real time)

[Diagram labels: Bidding Site, Analytics Site]

Page 12:

Problem statement
• 3 streams with various delays (2 from HDFS, 1 from Kafka)
• Join and aggregate
• Filter among 2^n feature combinations to identify the top culprits (OLAP cube)
• Feed back into bidding
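The join-and-aggregate step above can be sketched in miniature. This is a hedged, in-memory Python model; the production pipeline keeps join state in HBase and runs on Storm, and the stream and field names here are invented for illustration:

```python
from collections import defaultdict

def join_streams(bids, notifications, impressions):
    """Join three event streams on a shared request id.

    Each stream is an iterable of (request_id, payload) pairs; streams
    arrive with different delays, so the real pipeline keeps this join
    state in HBase rather than in memory.
    """
    joined = defaultdict(dict)
    for stream_name, stream in (("bid", bids),
                                ("notification", notifications),
                                ("impression", impressions)):
        for request_id, payload in stream:
            joined[request_id][stream_name] = payload
    return joined

# A notification without a matching impression is a discrepancy candidate.
view = join_streams([("r1", {}), ("r2", {})],
                    [("r1", {}), ("r2", {})],
                    [("r1", {})])
discrepant = [rid for rid, events in view.items()
              if "notification" in events and "impression" not in events]
```

The aggregation and 2^n feature filtering then run over these joined views, as the later slides describe.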

Page 13:

Lambda architecture

Logs

Storm & HBase on YARN (Slider)

Serving Infra(Bidders and Ad-servers)

Near real-time pipeline

Batch pipeline

Page 14:

Helios: abstraction for real-time learning

• Real-time processing of data streams from sources like Kafka and HDFS, with efficient joins

• Processing of joined event views to generate different analytics, using HBase and MapReduce

• OLAP support
• Joins with dimensional data; different use-cases

Page 15:

Logs

Storm Cluster(Slider and YARN)

HBase Cluster(Slider and YARN)

Serving Infra(Bidders and Ad-servers)

Helios architecture

OLAP

Metrics

Page 16:

Step 1a: Ingesting events from Kafka

Logs

Storm Cluster(Slider and YARN)

Serving Infra(Bidders and Ad-servers)

Page 17:

Processing Kafka events in real time

• Relies on log streams written to Kafka by Scribe
• Kafka topic with 200+ partitions
• Data produced and written via Scribe from more than 3K nodes
• Uses the upstream Kafka spout to read data
  – Spout granularity is at the record level
  – Uses ZooKeeper extensively for bookkeeping

Page 18:

Processing Kafka events in real time

• Topology statistics:
  – Runs on YARN as an application, so easily scalable
    • Container memory: 2700m
  – Runs with 25 workers (5 executors/worker)
  – Supervisor JVM opts:
    • -Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m
  – Worker JVM opts:
    • -Xmx1800m -Xms1800m
  – Processes nearly 100K events per second

Page 19:

Step 1b: Ingesting events from HDFS

Logs

Storm Cluster(Slider and YARN)

Serving Infra(Bidders and Ad-servers)

Page 20:

Processing HDFS events in real time

• Relies on log streams written to HDFS by Scribe
• WAN limitations impose high compression needs
• DistCp, rather than Kafka
• Uses an in-house Storm spout to read streams from HDFS

Page 21:

Processing bid-logs in real time
Storm topology statistics:
• Runs on YARN as an application via Slider (easily scalable)
  – Container memory: 2700m
• Currently runs with 350 workers (~10 executors/worker)
• Supervisor JVM opts:
  – -Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m
• Worker JVM opts:
  – -Xmx1800m -Xms1800m
• Processes nearly 1.5-2.0 million events per second (100+B events per day)
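A quick back-of-envelope check of the throughput figures above. Only the 350-worker count and the 1.5-2.0M events/s range come from the slide; taking the midpoint rate is an assumption:

```python
workers = 350
events_per_sec = 1.75e6          # assumed midpoint of the quoted 1.5-2.0M/s
seconds_per_day = 86_400

# ~1.5e11 events, consistent with the "100+B events per day" figure
events_per_day = events_per_sec * seconds_per_day

# ~5K events/s handled per worker at the midpoint rate
per_worker_rate = events_per_sec / workers
```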

Page 22:

HDFS Spout Architecture

• Master-slave architecture
• Spout granularity is at the file level, with record-level offset bookkeeping
• Uses ZooKeeper extensively for bookkeeping
  – Curator and its recipes make life a lot easier
• Heavily influenced by the Kafka spout

Page 23:

HDFS Spout Architecture

[Diagram: the spout leader publishes work that spout workers pick up, tracked via the un-assigned, locked, checkpoint, and done nodes, plus the offset and offset-lock nodes]

Page 24:

HDFS Spout Architecture
• Assignment Manager (AM):
  – Elected via a leader-election algorithm
  – Polls HDFS periodically to identify new files, based on timestamps and partitioned paths
  – Publishes files to be processed as work tasks in ZooKeeper (ZK)
  – Manages time and path offsets for cleaning up done nodes
  – Creates periodic done-markers on HDFS

Page 25:

HDFS Spout Architecture
• Worker (W):
  – Selects work tasks from those available in ZK when done with its current work, using ephemeral-node locking
  – Checkpoints files by saving the record offset in ZK, to preserve work
  – Creates a done node in ZK after processing the file

Page 26:

HDFS Spout Architecture
Bookkeeping node hierarchy:
• Pluggable backend: the current implementation uses ZK
• Work life cycle
  – unassigned: file added here by the AM
  – locked: created by a worker on selecting work
  – checkpoint: periodic checkpointing happens here
  – processed: created by a worker on completion
• Offset management
  – offset: stores the path and time offset of HDFS
  – offset-lock: ephemeral lock for offset updates

Page 27:

HDFS Spout Architecture

• Spout failures
  – Slaves: work is made available again by the master
  – Master: one of the slaves becomes master via leader election and gives up its slave duties
• Spouts contend for work assignment via ZK ephemeral nodes
• Leverages the partitioned data directories and done-markers model used in the organization

Page 28:

Comparison with the official HDFS spout (STORM-1199)

Official HDFS spout (STORM-1199):
• Uses HDFS for bookkeeping
• Moves or renames source files
• All-slave architecture; all spouts contend for failed work
• No leverage of partitioned data
• Kerberos support

In-house implementation:
• Uses ZK for bookkeeping
• No changes to source files
• Master-slave architecture with leader election
• Leverages partitioned data and done-markers
• No Kerberos support

Page 29:

Step 2: Join via HBase

Logs

Storm Cluster(Slider and YARN)

HBase Cluster(Slider and YARN)

Donemarkers

Page 30:

HBase for joining streams of data
• Uses request-id as the key to join different streams
• Different column qualifiers for different event streams
• HBase cluster configuration
  – Runs on YARN as a service via Slider
  – Region servers: 40 instances, with 4G memory each
  – Optimized for writes, with a large MemStore
  – Compactions tuned to avoid unnecessary merging of files, as they expire quickly (low retention)
    • Date-based compactions are available in HBase 2.0
• Write throughput: 1M+ TPS
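The row layout described above (request-id row key, one column qualifier per stream) can be mocked with a plain dict. The real writes are HBase Puts; the family and qualifier names here are invented:

```python
# In-memory mock of the HBase join table: row key = request-id,
# one column qualifier per event stream under a single column family.
table = {}

def put(request_id, qualifier, value):
    """Stand-in for an HBase Put of one event into its join row."""
    table.setdefault(request_id, {})["events:" + qualifier] = value

# The three streams land at different times but share the request-id key,
# so each write simply fills in another qualifier on the same row.
put("req-42", "bid", b"bid-payload")
put("req-42", "notification", b"notif-payload")
put("req-42", "impression", b"imp-payload")

# A row with all three qualifiers is fully joined; the hourly MR scan
# over an HBase snapshot reads such rows downstream.
fully_joined = len(table["req-42"]) == 3
```

Because a row is complete whenever its last stream arrives, out-of-order and delayed streams need no special handling at write time.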

Page 31:

Observations from running Storm at scale
• ZeroMQ was more stable than Netty in version 0.9.x
  – Many Netty optimizations are available in 0.10.x
• Local-shuffle mode is helpful for large data volumes
• Heartbeat intervals need tuning
  – (task|worker|supervisor).heartbeat.frequency.secs
  – Pacemaker: available in 1.0
• Code-sync intervals need tuning
  – Distributed cache: available in 1.0

Page 32:

Step 3: Scan joined view and populate OLAP

OLAP

Metrics

Donemarkers

Event Streams

Start MR Job

Page 33:

OLAP with multi-dimensional data
• Developed a MapReduce-backed workflow
  – Cron-triggered hourly jobs based on done-markers
  – Scans data from HBase using snapshots
  – Semantics for hour boundaries
  – Event metric reporting

Page 34:

OLAP with multi-dimensional data
• Modular API for processing records
  – Pluggable architecture for different use-cases
  – OLAP implemented as a first-class use-case
• Uses the datacube library (Urban Airship) for generating OLAP data
  – Configurable metric reporting

Page 35:

OLAP with multi-dimensional data
Datacube for OLAP
• Library developed at Urban Airship
• About the API
  – Define dimensions and rollups for the cube
  – IO library for writing measures to the cube
  – Pluggable databases: HBase, in-memory map
  – ID service: optimization for encoding values via ID substitution
  – Support for bulk-loading and backfilling
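The dimensions-and-rollups idea can be sketched without the library. Hedged: the dimension names below are invented, and datacube itself wires rollups into HBase increments rather than an in-memory counter; this model only shows how one event fans out into 2^n cube cells:

```python
from itertools import combinations
from collections import Counter

def rollups(event, dims):
    """Yield one cube cell per subset of dimensions (datacube-style
    rollups): 2^n cells for n dimensions, including the all-totals cell."""
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            yield tuple((d, event[d]) for d in subset)

cube = Counter()
dims = ("exchange", "site", "browser")  # illustrative dimensions
for event in [{"exchange": "x1", "site": "s1", "browser": "chrome"},
              {"exchange": "x1", "site": "s2", "browser": "chrome"}]:
    for cell in rollups(event, dims):
        cube[cell] += 1
```

Querying any feature combination (the "2^n feature combinations" from the problem statement) is then a single cell lookup rather than a scan.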

Page 36:

OLAP with multi-dimensional data
New features (forked)
• Reverse lookups for scans
• New InputFormat for MR jobs
• Prefix hashes (data and lookups) for load distribution
• Optimized DB performance by using the AsyncHBase library for efficient reads/writes
MR job statistics
• Uses HBase snapshots
• MR job runs every hour (runtime: 5-15 mins)
• An hour is closed with delays of 30-60 minutes (on average), given log rotation and shipping (Scribe) latencies
Page 37:

Step 4: Scan OLAP cube for top feature vectors

OLAP

Metrics

Donemarkers

Start MR Job

Feature Vectors

Page 38:

OLAP with multi-dimensional data
Serialize the OLAP view
• A customizable MapReduce job scans the OLAP data (backed by HBase) and writes it to HDFS
• Different jobs can use this easily accessible data from HDFS for processing, and upload computed feedback stats to stores like MySQL
MR job statistics
• MR job runs every hour (runtime: 2-5 mins)

Page 39:

DevOps automation
• Monitoring service
• Topology submission service

Page 40:

Key takeaways
• The Hadoop ecosystem offers a productive stack for high-velocity, real-time learning problems
• YARN allows one to easily experiment with and tweak vertical-to-horizontal scalability ratios

Page 41:

THANKS! ANY QUESTIONS?

Reach us [email protected]@rocketfuel.com