Fraud Detection Using Hadoop

Real Time Fraud Detection Patterns and reference architectures
Ted Malaska // PSA
Gwen Shapira // Software Engineer

Upload: hadooparchbook

Post on 22-Jan-2018


Overview

•  Intro
•  Review Problem
•  Quick overview of key technology
•  High level architecture
•  Deep Dive into NRT Processing
•  Completing the Puzzle – Micro-batch, Ingest and Batch

©2014 Cloudera, Inc. All rights reserved.

Gwen Shapira

•  15 years of moving data
•  Formerly consultant
•  Now Cloudera Engineer:
    –  Sqoop Committer
    –  Kafka
    –  Flume
•  @gwenshap

Hello

•  Ted Malaska (PSA at Cloudera)
•  Hadoop for ~5 years
•  Contributed to
    –  HDFS, MapReduce, YARN, HBase, Spark, Avro,
    –  Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
    –  And working on a Sentry patch
•  Co-author of O'Reilly's Hadoop Application Architectures
•  Worked with about 70 companies in 8 countries
•  Marvel fan boy
•  Runner

The Problem

Credit Card Transaction Fraud

Ikea Meat Balls

Coupon Fraud

Video Game Strategy

Health Insurance Fraud

Review of the Problem

•  Typical Atomic Card Fraud Detection
•  Ikea Meat Ball
•  Multi Coupons Combinations
•  OP or Negative Video Games Strategies
•  Ad Serving
•  Health Insurance Fraud
•  Kid Coming Home From School

How Do We React

•  Human Brain at Tennis
    –  Muscle Memory
    –  Reaction Thought
    –  Reflective Meditation

Overview of Key Technologies

Kafka

The Basics

•  Messages are organized into topics
•  Producers push messages
•  Consumers pull messages
•  Kafka runs in a cluster. Nodes are called brokers

Topics, Partitions and Logs


Each partition is a log


Each Broker has many partitions

[Diagram: several brokers, each holding copies of Partition 0, Partition 1, and Partition 2]

Producers load balance between partitions

[Diagram: a client producer sends messages to Partition 0, Partition 1, and Partition 2, which are spread across the brokers]

Consumers

[Diagram: a Kafka cluster topic with Partition A, Partition B, and Partition C (each a file), read by the consumers in Consumer Group X and Consumer Group Y; each partition tracks one offset for group X and one for group Y]

•  Order is retained within a partition, but not across partitions
•  Offsets are kept per consumer group
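
The per-group offset idea can be sketched in plain Scala. This is only an in-memory model under stated assumptions, not the Kafka client API: `PartitionLog`, `append`, `poll`, and `offsetOf` are hypothetical names introduced for illustration. It shows one append-only log for a partition and an independent read position for each consumer group.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of one Kafka partition; not the Kafka client API.
object PartitionLogSketch {
  final class PartitionLog {
    private val messages = mutable.ArrayBuffer.empty[String]
    // One offset per consumer group (group name -> next offset to read).
    private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

    /** Appends a message and returns the offset it was written at. */
    def append(message: String): Int = {
      messages += message
      messages.size - 1
    }

    /** Returns up to `max` unread messages for `group` and advances that group's offset. */
    def poll(group: String, max: Int): Seq[String] = {
      val from = offsets(group)
      val batch = messages.slice(from, from + max).toList
      offsets(group) = from + batch.size
      batch
    }

    /** Next offset the given group would read. */
    def offsetOf(group: String): Int = offsets(group)
  }

  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    Seq("swipe-1", "swipe-2", "swipe-3").foreach(log.append)
    println(log.poll("groupX", 2)) // groupX reads the first two messages
    println(log.poll("groupY", 3)) // groupY starts from offset 0 independently
    println(log.poll("groupX", 2)) // groupX resumes where it left off
  }
}
```

Because each group has its own offset, two applications can read the same partition at different speeds without affecting each other.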

Flume

Short Intro to Flume

Flume Agent:
•  Sources: Twitter, logs, JMS, webserver, Kafka
•  Interceptors: Mask, re-format, validate…
•  Selectors: DR, critical
•  Channels: Memory, file, Kafka
•  Sinks: HDFS, HBase, Solr

Flume and/or Kafka

[Diagram: Flume pipeline from Upstream through Flume Source, Interceptor, Selector, Flume Channel, and Flume Sink to Downstream, with three of the components labeled "Can Be Kafka"]

Interceptors

•  Mask fields
•  Validate information against external source
•  Extract fields
•  Modify data format
•  Filter or split events
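
The interceptor duties listed above (mask, validate, filter) can be illustrated with a small plain-Scala sketch. This is not the Flume `Interceptor` API: the `Event` case class, the `card` and `merchant` header names, and the `knownMerchants` set (standing in for the external validation source) are all assumptions made for the example.

```scala
// Hypothetical event model and interceptor steps; this is not the Flume Interceptor API.
object InterceptorSketch {
  final case class Event(headers: Map[String, String], body: String)

  /** Masks all but the last four characters of a card number. */
  def maskCard(card: String): String =
    ("*" * math.max(0, card.length - 4)) + card.takeRight(4)

  /**
   * Applies the mask, validate, and filter steps to one event.
   * `knownMerchants` stands in for an external validation source.
   * Returns None when the event should be dropped.
   */
  def intercept(event: Event, knownMerchants: Set[String]): Option[Event] =
    for {
      card     <- event.headers.get("card")
      merchant <- event.headers.get("merchant")
      if knownMerchants.contains(merchant)
    } yield event.copy(headers = event.headers.updated("card", maskCard(card)))

  def main(args: Array[String]): Unit = {
    val merchants = Set("ikea")
    val ok  = Event(Map("card" -> "4111111111111111", "merchant" -> "ikea"), "meatballs")
    val bad = Event(Map("card" -> "4111111111111111", "merchant" -> "unknown"), "???")
    println(intercept(ok, merchants))  // kept, with the card number masked
    println(intercept(bad, merchants)) // dropped: merchant fails validation
  }
}
```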


Spark Streaming

Spark Streaming Example

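import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one for the receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()             // nothing runs until the context is started
ssc.awaitTermination()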

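Spark Streaming Example

The same word count as a batch Spark job, for comparison:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)  // path: input file location; 2 = minimum partitions
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // RDDs have no print(); collect to the driver first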

Spark Streaming

[Diagram: a DStream across the pre-first, first, and second batches. In every batch a source feeds a receiver, which produces an RDD; the RDDs of completed batches go through a single pass of Filter, Count, Print.]

[Diagram: the same DStream with state. The Filter and Count output of each batch is combined with a stateful RDD (Stateful RDD 1, then Stateful RDD 2) before Print.]
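
The idea of carrying state from one micro-batch to the next can be sketched without Spark. The code below is a plain-Scala illustration with hypothetical names (`updateState`, `StatefulCountSketch`), not Spark code: each batch of words is folded into the running counts left by earlier batches. Spark Streaming's `updateStateByKey` plays a similar role on a real DStream.

```scala
object StatefulCountSketch {
  /** Merges one micro-batch of words into the running counts carried from earlier batches. */
  def updateState(state: Map[String, Int], batch: Seq[String]): Map[String, Int] =
    batch.foldLeft(state) { (acc, word) => acc.updated(word, acc.getOrElse(word, 0) + 1) }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("swipe", "swipe", "refund"), Seq("swipe"), Seq.empty[String])
    // scanLeft exposes the state after every batch, starting from the empty pre-first-batch state.
    val states = batches.scanLeft(Map.empty[String, Int])(updateState)
    states.foreach(println)
  }
}
```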


Spark Streaming and HBase

[Diagram: the driver distributes configs to the worker nodes. On each worker node, an executor keeps the configs and one HConnection in static space, shared by the tasks running in that executor.]

High Level Architecture


Real-Time Event Processing Approach

[Diagram: reference architecture]
•  Entry points: clients (swipe here!), web app, Kafka
•  Hadoop Cluster I: Flume agents with a local cache, HBase / memory, Spark Streaming
•  Hadoop Cluster II: storage (HDFS), processing (Hive/Impala, MapReduce, Spark), search (Solr)
•  Flows: fetching and updating profiles; adjusting NRT stats; HDFSEventSink; Solr sink; batch-time adjustments
•  Feedback: automated and manual analytical adjustments and pattern detection; automated and manual review of NRT changes and counters

NRT Processing


Focus on NRT First

[Diagram: the reference architecture, with the "NRT Event Processing with Context" portion highlighted]

Streaming Architecture – NRT Event Processing

[Diagram: Flume sources write to the Kafka initial events topic. A Flume source acting as Kafka consumer feeds a Flume interceptor that runs the event processing logic, using local memory and an HBase client backed by HBase; a Kafka producer writes to the Kafka answer topic.]

Able to respond within 10s of milliseconds
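
One way to picture the event processing logic is a profile lookup that tries local memory first and falls back to the HBase client only on a miss. The sketch below is plain Scala under loud assumptions: `ProfileStore` stands in for an HBase client (no HBase API is used), and the `Profile` fields, the write-through update, and the "more than 5x the running average" rule are hypothetical choices made only to keep the example small.

```scala
import scala.collection.mutable

// Hypothetical sketch: `ProfileStore` stands in for an HBase client; no HBase API is used.
object NrtProcessingSketch {
  final case class Profile(avgAmount: Double, txCount: Long)

  trait ProfileStore {
    def get(cardId: String): Option[Profile]
    def put(cardId: String, profile: Profile): Unit
  }

  /** In-memory stand-in for the remote store; counts reads to make cache hits visible. */
  final class InMemoryStore extends ProfileStore {
    private val rows = mutable.Map.empty[String, Profile]
    var reads = 0
    def get(cardId: String): Option[Profile] = { reads += 1; rows.get(cardId) }
    def put(cardId: String, profile: Profile): Unit = rows.update(cardId, profile)
  }

  final class EventProcessor(store: ProfileStore, threshold: Double = 5.0) {
    private val localMemory = mutable.Map.empty[String, Profile]

    /** Returns true when the amount looks suspicious, then folds it into the profile. */
    def process(cardId: String, amount: Double): Boolean = {
      // Local memory first; the store is only consulted on a miss.
      val profile = localMemory.get(cardId)
        .orElse(store.get(cardId))
        .getOrElse(Profile(amount, 0L))
      val suspicious = profile.txCount > 0 && amount > profile.avgAmount * threshold
      val n = profile.txCount + 1
      val updated = Profile(profile.avgAmount + (amount - profile.avgAmount) / n, n)
      localMemory.update(cardId, updated)
      store.put(cardId, updated) // write-through so the profile survives a restart
      suspicious
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new InMemoryStore
    val processor = new EventProcessor(store)
    println(processor.process("card-1", 10.0))
    println(processor.process("card-1", 12.0))
    println(processor.process("card-1", 500.0))
    println(s"store reads: ${store.reads}")
  }
}
```

Keeping the hot profiles in local memory is what makes a response time in the tens of milliseconds plausible; every store round trip avoided is latency saved.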


Partitioned NRT Event Processing

[Diagram: producers, each with a partitioner, write to the Kafka initial events topic, which is split into Partition A, Partition B, and Partition C. A Flume source acting as Kafka consumer feeds a Flume interceptor with event processing logic, local memory, and an HBase client backed by HBase; a Kafka producer writes to the Kafka answer topic.]

Custom Partitioner
•  Better use of local memory
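
A custom partitioner helps local memory because events with the same key always land on the same partition, and therefore on the same processor and its cache. The sketch below is a plain-Scala illustration, not the Kafka `Partitioner` interface; `partitionFor` and the choice of the card ID as the key are assumptions.

```scala
object PartitionerSketch {
  /** Maps a key to a partition in [0, numPartitions); the same key always gets the same partition. */
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    // floorMod keeps the result non-negative even when hashCode is negative.
    Math.floorMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val events = Seq("card-1" -> 20.0, "card-2" -> 35.5, "card-1" -> 900.0)
    val byPartition = events.groupBy { case (card, _) => partitionFor(card, 3) }
    // Both card-1 events end up in the same group, so one processor sees the full history.
    byPartition.foreach { case (p, evs) => println(s"partition $p: $evs") }
  }
}
```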

Completing the Puzzle

Micro Batching

[Diagram: the reference architecture, with the "Micro Batching" portions highlighted]

Complex Topologies

1.2: Kafka Receivers
[Diagram: Kafka initial events topic, read by Kafka receivers into Spark Streaming DAG topologies]
•  Lets Kafka manage offsets
•  Uses HDFS for initial RDD recovery

1.3: Kafka Direct Connection
[Diagram: Kafka initial events topic, read over a Kafka direct connection into Spark Streaming DAG topologies]
•  Manages offsets
•  Stores offsets in the RDD
•  No longer needs HDFS for initial RDD checkpointing

MicroBatch Bad-Input Handling

[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13 (incoming events, bad events, resolved events, and results), connected by DAG topologies]
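
The bad-input pattern can be sketched as a routing step inside each micro-batch: events that parse go on toward the results topic, and events that do not are set aside for the bad events topic instead of failing the batch. The plain-Scala code below is an illustration under assumptions; the `cardId,amount` line format and the names `Transaction`, `parse`, and `route` are hypothetical, and the returned sequences merely stand in for writes to Kafka topics.

```scala
import scala.util.Try

object BadInputSketch {
  final case class Transaction(cardId: String, amount: Double)

  /** Parses "cardId,amount"; returns Left(raw line) when the input is malformed. */
  def parse(line: String): Either[String, Transaction] =
    line.split(",", -1) match {
      case Array(card, amount) if card.nonEmpty =>
        Try(amount.trim.toDouble).toOption
          .map(a => Transaction(card, a))
          .toRight(line)
      case _ => Left(line)
    }

  /** Splits one micro-batch into (bad events, parsed events). */
  def route(batch: Seq[String]): (Seq[String], Seq[Transaction]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(raw) => raw }, parsed.collect { case Right(tx) => tx })
  }

  def main(args: Array[String]): Unit = {
    val (bad, good) = route(Seq("c1,10.5", "garbage", "c2,abc", "c3,7"))
    println(s"bad events topic: $bad")
    println(s"results topic:    $good")
  }
}
```

Events fixed by hand or by a repair job could then be published to the resolved events topic and fed through the same `parse` step again.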


Ingestion

[Diagram: the reference architecture, with the "Ingestion" portions highlighted]

Ingestion

[Diagram: a Kafka cluster topic with Partition A, Partition B, and Partition C feeds three groups of Flume sinks: HDFS sinks writing to HDFS, Solr sinks writing to Solr, and HBase sinks writing to HBase]

Reflective Thoughts

[Diagram: the reference architecture, with the "Research and Searching" portion highlighted]