Fraud Detection Using Hadoop
Post on 22-Jan-2018
Real Time Fraud Detection Patterns and reference architectures
Ted Malaska // PSA
Gwen Shapira // Software Engineer
Overview
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
©2014 Cloudera, Inc. All rights reserved.
Gwen Shapira
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap
Hello
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
  – HDFS, MapReduce, YARN, HBase, Spark, Avro,
  – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
  – And working on a Sentry patch
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Review of the Problem
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
How Do We React
• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
Each Broker has many partitions
[Diagram: brokers, each holding Partition 0, Partition 1, and Partition 2]
Producers load balance between partitions
[Diagram: a client producing to brokers, each holding Partition 0, Partition 1, and Partition 2]
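The load balancing on this slide can be pictured with a small sketch. It assumes a plain round-robin policy, which is only one possible strategy; the real policy depends on the partitioner the producer is configured with. The object and method names are made up for illustration, and nothing here is the Kafka client API.

```scala
// Toy model of a producer spreading messages over partitions; not the Kafka client API.
object RoundRobinProducer {
  // Assumed policy: message i goes to partition i mod numPartitions.
  def assign(messages: Seq[String], numPartitions: Int): Map[Int, Seq[String]] =
    messages.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (partition, pairs) => partition -> pairs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val messages = Seq("m0", "m1", "m2", "m3", "m4", "m5")
    val byPartition = assign(messages, numPartitions = 3)
    // Print partitions in order; each one receives every third message.
    byPartition.toSeq.sortBy(_._1).foreach { case (p, ms) =>
      println(s"Partition $p: ${ms.mkString(", ")}")
    }
  }
}
```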
Consumers
[Diagram: a Kafka cluster topic with Partition A, Partition B, and Partition C (each a file), read by consumers in Consumer Group X and Consumer Group Y; each partition shows a separate offset for group X and for group Y]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
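A rough way to see what "offsets per consumer group" means is to model one partition as an ordered log with one read position per group. The sketch below is a toy under that assumption; `Partition`, `poll`, and the group names are illustrative and are not the Kafka consumer API.

```scala
import scala.collection.mutable

// Toy model of per-group offsets on one partition; not the Kafka consumer API.
object GroupOffsets {
  final class Partition(messages: Vector[String]) {
    // One offset per consumer group, starting at 0.
    private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

    // Return up to `max` messages for this group and advance only that group's offset.
    def poll(group: String, max: Int): Vector[String] = {
      val start = offsets(group)
      val batch = messages.slice(start, start + max)
      offsets(group) = start + batch.size
      batch
    }
  }

  def main(args: Array[String]): Unit = {
    val partition = new Partition(Vector("e0", "e1", "e2", "e3"))
    println(partition.poll("X", 3)) // group X reads from the start, in order
    println(partition.poll("Y", 1)) // group Y has its own offset and also starts at e0
    println(partition.poll("X", 3)) // group X continues where it left off
  }
}
```

Because each group advances independently, two applications can read the same topic at different speeds without affecting each other.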
Short Intro to Flume
Flume Agent:
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr
Flume and/or Kafka
[Diagram: Upstream -> Flume Source -> Interceptor -> Selector -> Flume Channel -> Flume Sink -> Downstream; three of the Flume components are labeled "Can Be Kafka"]
Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
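The interceptor roles listed above can be sketched as small functions applied in order. This is a minimal illustration in plain Scala, not Flume's actual interface (`org.apache.flume.interceptor.Interceptor`); the `Event` shape, the "merchant,card,amount" body format, and the three interceptors are assumptions made for the example.

```scala
// Toy model of an interceptor chain; Flume's real Interceptor interface is not used here.
object InterceptorChain {
  // Hypothetical event shape: headers plus a text body.
  final case class Event(headers: Map[String, String], body: String)

  // An interceptor either passes a (possibly modified) event on, or drops it.
  type Interceptor = Event => Option[Event]

  // Mask: hide all but the last four digits of any 16-digit card number.
  val maskCard: Interceptor = e =>
    Some(e.copy(body = e.body.replaceAll("\\d{12}(\\d{4})", "************$1")))

  // Extract: copy the first comma-separated field into a header.
  val extractMerchant: Interceptor = e =>
    Some(e.copy(headers = e.headers + ("merchant" -> e.body.takeWhile(_ != ','))))

  // Filter: drop events with an empty body.
  val dropEmpty: Interceptor = e => if (e.body.trim.isEmpty) None else Some(e)

  // Run interceptors in order, stopping as soon as one drops the event.
  def run(chain: Seq[Interceptor], event: Event): Option[Event] =
    chain.foldLeft(Option(event))((acc, i) => acc.flatMap(i))

  def main(args: Array[String]): Unit = {
    val chain = Seq(dropEmpty, maskCard, extractMerchant)
    println(run(chain, Event(Map.empty, "acme,4111111111111111,25.00")))
    println(run(chain, Event(Map.empty, "   "))) // dropped by the filter
  }
}
```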
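Spark Streaming Example

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count words arriving on a local socket in one-second micro-batches
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped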
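Spark Streaming Example (batch equivalent for comparison)

import org.apache.spark.{SparkConf, SparkContext}

// The same word count as a batch job; `path` is the input file location, defined elsewhere
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// RDDs have no print(); bring the results to the driver first
wordCounts.collect().foreach(println)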
Spark Streaming
[Diagram: a DStream as a sequence of RDDs. In each batch (pre-first batch, first batch, second batch) a receiver turns source data into an RDD, and each batch's RDD goes through Filter, Count, and Print in a single pass]
Spark Streaming
[Diagram: the same DStream pipeline (Source, Receiver, RDD, then Filter and Count in a single pass per batch), with stateful RDDs (Stateful RDD 1, Stateful RDD 2) carried from one batch to the next]
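The idea behind the stateful RDDs in this diagram is that each micro-batch is combined with state carried over from earlier batches. Spark Streaming exposes this on pair DStreams as `updateStateByKey`; the sketch below only imitates the idea with plain Scala collections and no Spark dependency, and the card ids are made-up sample data.

```scala
// Toy model of carrying state across micro-batches; plain collections, no Spark.
object RunningCounts {
  // Merge one batch's events into the counts carried over from earlier batches.
  def update(state: Map[String, Int], batch: Seq[String]): Map[String, Int] =
    batch.foldLeft(state) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("card-1", "card-2", "card-1"), // first batch
      Seq("card-2", "card-3")            // second batch
    )
    // scanLeft keeps every intermediate state, one per batch.
    val states = batches.scanLeft(Map.empty[String, Int])(update)
    states.tail.zipWithIndex.foreach { case (s, i) =>
      println(s"State after batch ${i + 1}: $s")
    }
  }
}
```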
Spark Streaming and HBase
[Diagram: the driver holds the configs. On each worker node, an executor keeps the configs and an HConnection in static space, shared by the tasks running in that executor]
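The "static space" in the diagram can be approximated in Scala with a singleton object, which exists once per JVM and so once per executor. The sketch below is an assumption-laden illustration: `FakeConnection` stands in for HBase's `HConnection`, and the `quorum` config key is hypothetical.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Toy model of "static space" in an executor JVM; FakeConnection stands in for HConnection.
object StaticConnection {
  val connectionsOpened = new AtomicInteger(0)

  // Hypothetical connection type, used only for illustration.
  final class FakeConnection(val quorum: String) {
    connectionsOpened.incrementAndGet()
    def put(rowKey: String, value: String): String = s"put $rowKey=$value via $quorum"
  }

  // A Scala object is one instance per JVM, so every task in the executor shares it.
  object ConnectionHolder {
    private var conn: Option[FakeConnection] = None

    // Create the connection on first use from the configs shipped with the task.
    def get(configs: Map[String, String]): FakeConnection = synchronized {
      conn.getOrElse {
        val created = new FakeConnection(configs.getOrElse("quorum", "localhost"))
        conn = Some(created)
        created
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val configs = Map("quorum" -> "zk1,zk2,zk3") // hypothetical config key and value
    // Simulate several tasks running in the same executor.
    (1 to 4).foreach { task =>
      println(ConnectionHolder.get(configs).put(s"row-$task", "ok"))
    }
    println(s"Connections opened: ${connectionsOpened.get()}")
  }
}
```

The point of the pattern is that the expensive connection is opened once per executor rather than once per task or per record.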
Real-Time Event Processing Approach
[Architecture diagram. Components: Clients ("Swipe here!"), Web App, Flume Agents, Local Cache, HBase / Memory, Kafka, Spark Streaming, HDFSEventSink, SolR Sink, Hadoop Cluster I, Hadoop Cluster II (Storage: HDFS; Processing: Hive/Impala, Map/Reduce, Spark), SolR (Search). Flow labels: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Review of NRT Changes and Counters; Automated & Manual Analytical Adjustments and Pattern Detection]
Focus on NRT First
[Architecture diagram from "Real-Time Event Processing Approach" repeated, labeled "NRT Event Processing with Context"]
Streaming Architecture – NRT Event Processing
[Diagram: Flume sources feed a Kafka "Initial Events" topic. A Flume source with a Kafka consumer reads that topic and hands events to a Flume interceptor containing the event processing logic, local memory, and an HBase client backed by HBase. A Kafka producer writes to a Kafka "Answer" topic]
• Able to respond within tens of milliseconds
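To make the "local memory plus HBase client" arrangement concrete, here is a minimal sketch of event processing logic that answers from an in-process cache and falls back to a backing store on a miss. Everything in it is assumed for illustration: `ProfileStore` stands in for HBase, the `Profile` fields are invented, and the threshold rule is a placeholder rather than the fraud logic from the talk.

```scala
import scala.collection.mutable

// Toy model of the interceptor's logic; ProfileStore stands in for HBase and the rule is made up.
object NrtEventProcessing {
  final case class Profile(avgAmount: Double, count: Long)
  final case class Swipe(cardId: String, amount: Double)

  // Hypothetical backing store; counts reads so cache hits are visible.
  final class ProfileStore(initial: Map[String, Profile]) {
    private val rows = mutable.Map(initial.toSeq: _*)
    var reads = 0
    def get(cardId: String): Option[Profile] = { reads += 1; rows.get(cardId) }
    def put(cardId: String, p: Profile): Unit = rows.update(cardId, p)
  }

  final class Processor(store: ProfileStore, threshold: Double) {
    private val localMemory = mutable.Map.empty[String, Profile]

    // Answer from local memory when possible, fall back to the store, then update both.
    def process(s: Swipe): Boolean = {
      val profile = localMemory.get(s.cardId)
        .orElse(store.get(s.cardId))
        .getOrElse(Profile(s.amount, 0L))
      // Illustrative rule: flag a swipe far above the card's running average.
      val suspicious = profile.count > 0 && s.amount > threshold * profile.avgAmount
      val n = profile.count + 1
      val updated = Profile(profile.avgAmount + (s.amount - profile.avgAmount) / n, n)
      localMemory.update(s.cardId, updated)
      store.put(s.cardId, updated)
      suspicious
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new ProfileStore(Map("card-1" -> Profile(20.0, 10L)))
    val processor = new Processor(store, threshold = 5.0)
    println(processor.process(Swipe("card-1", 25.0)))  // close to the running average
    println(processor.process(Swipe("card-1", 500.0))) // far above the running average
    println(s"Store reads: ${store.reads}")            // the second swipe is served from local memory
  }
}
```

Keeping the hot profiles in local memory is what makes a response time in the tens of milliseconds plausible, since most events never wait on a remote read.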
Partitioned NRT Event Processing
[Diagram: the NRT event processing pipeline (Flume sources, Kafka "Initial Events" topic, Flume source with Kafka consumer, Flume interceptor with event processing logic, local memory, and HBase client, Kafka producer, Kafka "Answer" topic), with the topic split into Partition A, Partition B, and Partition C and each producer using a partitioner]
• Custom partitioner
• Better use of local memory
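One way a custom partitioner can improve the use of local memory is by sending every event for the same key to the same partition, so a single processor sees all of a card's events and its cache stays relevant. The sketch below assumes the key is a card id and uses a simple hash; it does not implement Kafka's actual partitioner interface, and `partitionFor` is a made-up name.

```scala
// Toy model of a key-based partitioner; Kafka's real partitioner interface is not used here.
object KeyPartitioner {
  // Same key, same partition: non-negative hash modulo the partition count.
  def partitionFor(cardId: String, numPartitions: Int): Int =
    Math.floorMod(cardId.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val swipes = Seq("card-7", "card-42", "card-7", "card-99", "card-42")
    // Repeated card ids always land on the same partition.
    swipes.foreach { id =>
      println(s"$id -> partition ${partitionFor(id, 3)}")
    }
  }
}
```

`Math.floorMod` is used instead of `%` because `hashCode` can be negative, and a negative partition index would be invalid.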
Micro Batching
[Architecture diagram from "Real-Time Event Processing Approach" repeated, labeled "Micro Batching"]
Complex Topologies
Spark 1.3: Kafka Initial Events Topic -> Kafka Direct Connection -> Spark Streaming (DAG Topologies)
• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing
Spark 1.2: Kafka Initial Events Topic -> Kafka Receivers -> Spark Streaming (DAG Topologies)
• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery
MicroBatch Bad-Input Handling
[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13 (incoming events topic, bad events topic, resolved events topic, results topic), connected through DAG topologies]
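The split between a results topic and a bad events topic can be sketched as a parse step that never throws: each record either parses or is set aside with its raw text intact. In this toy version plain sequences stand in for the Kafka topics, and the "cardId,amount" wire format and `Swipe` type are assumptions made for the example.

```scala
import scala.util.Try

// Toy model of splitting a micro-batch into good and bad events; lists stand in for Kafka topics.
object BadInputRouting {
  final case class Swipe(cardId: String, amount: Double)

  // Hypothetical wire format: "cardId,amount". Left keeps the raw record for later repair.
  def parse(raw: String): Either[String, Swipe] =
    raw.split(",", -1) match {
      case Array(id, amt) if id.nonEmpty =>
        Try(amt.trim.toDouble).toOption match {
          case Some(a) => Right(Swipe(id, a))
          case None    => Left(raw)
        }
      case _ => Left(raw)
    }

  // Returns (bad events, parsed events) so one bad record never fails the whole batch.
  def route(batch: Seq[String]): (Seq[String], Seq[Swipe]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(raw) => raw }, parsed.collect { case Right(s) => s })
  }

  def main(args: Array[String]): Unit = {
    val (bad, good) = route(Seq("card-1,12.50", "garbage", "card-2,abc", "card-3,7"))
    println(s"To bad events topic: $bad")
    println(s"Continue processing: $good")
  }
}
```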
Ingestion
[Architecture diagram from "Real-Time Event Processing Approach" repeated, labeled "Ingestion"]
Ingestion
[Diagram: a Kafka cluster topic with Partition A, Partition B, and Partition C, read by three sets of Flume sinks (three sinks each): Flume HDFS Sink -> HDFS, Flume SolR Sink -> SolR, Flume HBase Sink -> HBase]
Reflective Thoughts
[Architecture diagram from "Real-Time Event Processing Approach" repeated, labeled "Research and Searching"]