complex event processing: use cases & flinkcep library (flink.tw meetup 2016/07/19)
TRANSCRIPT
Complex Event Processing: Use Cases & FlinkCEP Library
Gordon Tai - @tzulitai
July 19, 2016 @ Flink.tw Meetup
00 This Talk is About ...
● How FlinkCEP got me interested in Flink
● CEP use cases & applications
  ○ Use case study #1: tracking an order process
  ○ Use case study #2: advertisement targeting
● A look at the API
1
● 戴資力 (Gordon)
● Data Engineer @ VMFive
● Java, Scala
● Using Flink as a user on VMFive’s Adtech platform
● Enjoys working on distributed computing systems
● Works on Flink during free time
● Contributor: Flink Kinesis Consumer connector
00 Me & Flink
2
Tale of a Data Engineer trying to figure out how to build up a streaming analytics pipeline ...
1. First lesson: non-trivial streaming applications are never stateless
2. Second lesson: stateful streaming topologies are a pain
3
1. Exactly-once state updates on failures for correctness
2. Idempotence w.r.t. external state stores
3. Out-of-order events
4. Aggregating on time windows
5. Rapid application development
Applications I was working on:
Streaming aggregation for reporting &
conversion patterns for alerting
4
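The second challenge above, idempotent updates against an external state store, can be sketched without any framework: remember which event IDs have already been applied, so a replayed event changes nothing. This is an illustrative toy (an in-memory set stands in for whatever dedup bookkeeping a real store would need), not the actual VMFive code:

```scala
import scala.collection.mutable

// Toy idempotent counter: re-delivering the same event (e.g. on
// replay after failover) must not change the result.
class IdempotentCounter {
  private val applied = mutable.Set[String]() // processed event IDs
  private var count: Long = 0L

  // Set#add returns true only the first time an ID is seen,
  // so duplicates are silently ignored.
  def update(eventId: String): Long = {
    if (applied.add(eventId)) count += 1
    count
  }
}
```

With exactly-once state handling in the engine itself (as Flink provides), this kind of per-update bookkeeping becomes unnecessary.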
TL;DR: It isn’t fun. At all.
● Reference:
Building a Stream Processing System for Playable Ads Data at VMFive @ HadoopCon 2015
● Redis was used as an external state store
● All state updates had to be idempotent
● Exactly-once & replay on failover implemented with Storm’s tuple acking mechanism
5
● Generate derived events when a specified pattern on raw events occurs in a data stream
  ○ if A and then B → infer complex event C
● Goal: identify meaningful event patterns and respond to them as quickly as possible
● Demands that the stream processor provide robust state handling & out-of-order event support while keeping latency low and throughput high
01 Complex Event Processing
6
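The “if A and then B → infer complex event C” idea can be sketched without any framework. The toy matcher below scans a finite event sequence and emits a complex event whenever an "A" is eventually followed by a "B" (relaxed contiguity, i.e. other events may occur in between); all names here are illustrative:

```scala
// Toy CEP illustration (not FlinkCEP): infer complex event "C"
// whenever an "A" is later followed by a "B" in the raw stream.
case class RawEvent(kind: String, tStamp: Long)
case class ComplexEvent(name: String, tStamp: Long)

def inferAB(stream: Seq[RawEvent]): Seq[ComplexEvent] = {
  var pendingA: Option[RawEvent] = None
  val out = Seq.newBuilder[ComplexEvent]
  for (e <- stream) e.kind match {
    case "A" => pendingA = Some(e)           // remember the partial match
    case "B" if pendingA.isDefined =>        // pattern completed
      out += ComplexEvent("C", e.tStamp)
      pendingA = None
    case _ => ()                             // unrelated events are skipped
  }
  out.result()
}
```

A real CEP engine additionally handles keyed streams, time windows, out-of-order events, and fault-tolerant match state, which is exactly what FlinkCEP layers on top of Flink.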
02 Apache Flink CEP Library
● Built upon Flink’s DataStream API
● Allows users to define patterns, apply them to event streams, and generate new event streams based on the pattern matches
● Exploits Flink’s exactly-once semantics for definite correctness
7
eCommerce Order Process Tracking
Use case study #1
** Note: the illustrations & content in this section are from Data Artisans’ presentation: Streaming Analytics & CEP - Two Sides of the Same Coin?
03 Order Tracking Data Model
● Order(orderId, tStamp, “received”) extends Event
● Shipment(orderId, tStamp, “shipped”) extends Event
● Delivery(orderId, tStamp, “delivered”) extends Event
8
04 Real-Time Warnings for SLAs
New inferred events:
● ProcessSucc(orderId, tStamp, duration)
● ProcessWarn(orderId, tStamp)
● DeliverySucc(orderId, tStamp, duration)
● DeliveryWarn(orderId, tStamp)
9
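The two slides above could be written as plain Scala case classes like the following; the field types (Long IDs and millisecond timestamps) are assumptions for illustration:

```scala
// Raw events from the order pipeline
sealed trait Event { def orderId: Long; def tStamp: Long; def status: String }
case class Order(orderId: Long, tStamp: Long, status: String = "received") extends Event
case class Shipment(orderId: Long, tStamp: Long, status: String = "shipped") extends Event
case class Delivery(orderId: Long, tStamp: Long, status: String = "delivered") extends Event

// Inferred (complex) events for SLA monitoring
case class ProcessSucc(orderId: Long, tStamp: Long, duration: Long)
case class ProcessWarn(orderId: Long, tStamp: Long)
case class DeliverySucc(orderId: Long, tStamp: Long, duration: Long)
case class DeliveryWarn(orderId: Long, tStamp: Long)
```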
05 Glimpse at the FlinkCEP API

val processingPattern = Pattern
  .begin[Event]("orderReceived").subtype(classOf[Order])
  .followedBy("orderShipped").where(_.status == "shipped")
  .within(Time.hours(1))

val processingPatternStream = CEP.pattern(input.keyBy("orderId"), processingPattern)

val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] =
  processingPatternStream.select {
    (pP, timestamp) => // Timeout handler
      ProcessWarn(pP("orderReceived").orderId, timestamp)
  } {
    fP => // Select function
      ProcessSucc(
        fP("orderReceived").orderId,
        fP("orderShipped").tStamp,
        fP("orderShipped").tStamp - fP("orderReceived").tStamp)
  }
10
06 Glimpse at the FlinkCEP API

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val input: DataStream[Event] = env.addSource(new FlinkKafkaConsumer09(...))
val processingPattern = Pattern.begin(...)...
val processingPatternStream = CEP.pattern(input.keyBy("orderId"), processingPattern)
val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] = processingPatternStream.select(...)
procResult.addSink(new RedisSink(...))
// .addSink(new FlinkKafkaProducer09(...))
// .addSink(new ElasticsearchSink(...))
// .map(new MapFunction{...})
// ... anything you’d like to continue to do with the inferred event stream
env.execute()
11
07 Glimpse at the FlinkCEP API

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val input: DataStream[Event] = env
  .addSource(new FlinkKafkaConsumer09(...))
  .assignTimestampsAndWatermarks(new CustomExtractor)
val processingPattern = Pattern.begin(...)...
val processingPatternStream = CEP.pattern(input.keyBy("orderId"), processingPattern)
val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] = processingPatternStream.select(...)
procResult.addSink(new RedisSink(...))
// .addSink(new FlinkKafkaProducer09(...))
// .addSink(new ElasticsearchSink(...))
// .map(new MapFunction{...})
env.execute()
12
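The CustomExtractor above is not shown in the slides. A common way to implement a timestamp/watermark assigner is with a bounded out-of-orderness policy: the watermark trails the highest timestamp seen by a fixed slack. A minimal sketch of that logic in plain Scala (the class name and the 10-second bound are assumptions, and the real Flink interface has more to it):

```scala
// Sketch of bounded-out-of-orderness watermarking: events may arrive
// up to maxOutOfOrdernessMs late and still be considered on time.
class BoundedOutOfOrderness(maxOutOfOrdernessMs: Long = 10000L) {
  private var maxSeenTimestamp = Long.MinValue

  // Called per event: remember the highest event timestamp seen so far.
  def extractTimestamp(eventTimestamp: Long): Long = {
    maxSeenTimestamp = math.max(maxSeenTimestamp, eventTimestamp)
    eventTimestamp
  }

  // The watermark tells downstream operators (e.g. the CEP operator's
  // within() windows) how far event time has provably progressed.
  def currentWatermark: Long = maxSeenTimestamp - maxOutOfOrdernessMs
}
```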
08 Combining Stream SQL & CEP
● Further reading: Streaming Analytics & CEP - Two Sides of the Same Coin?
13
Ad Targeting based on User Attribution
Use case study #2
** Note: the content in this section is heavily based on my experience at VMFive
14
09 Ad Targeting 101
● What an ad server does, in a nutshell → determine an appropriate advertisement, chosen from an advertisement campaign pool, for each incoming ad request
[Diagram: Ad Server ↔ Campaign Pool]
(1) request advertisement
(2) return appropriate advertisement info from campaign pool
● “appropriate”: fulfills the targeting rules of each campaign
15
10 Ad Targeting Rule Types
● Fundamental campaign targeting rule types:
  ○ Target users’ current location, ex. users in Taipei
  ○ Target specific user device type, ex. tablet or phone
  ○ ...
● Advanced campaign targeting rule types:
  ○ Target user’s past location trace, ex. in Taipei for the past 7 days
  ○ Target users entering / departing countries
  ○ Target users with specific attribution, ex. viewed
  ○ ...
16
11 Ad Targeting Rule Types
● Fundamental campaign targeting rule types:
  ○ Target users’ current location, ex. users in Taipei
  ○ Target specific user device type, ex. tablet or phone
  ○ ...
  → Do not require event aggregation; the rules can be matched simply based on info at request time
● Advanced campaign targeting rule types:
  ○ Target user’s past location trace, ex. in Taipei for the past 7 days
  ○ Target users entering / departing countries
  ○ Target users with specific attribution, ex. viewed
  ○ ...
  → Require aggregation of historical events; aggregating at request time will be far too slow
  → Require inferring complex events from patterns in the raw event stream → CEP to the rescue!
16
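To see why advanced rules need historical aggregation, consider “user in Taipei for the past 7 days”: it cannot be answered from a single request, only from the user’s accumulated location events. A toy check over a finite event history (all names and the event shape are illustrative):

```scala
// Toy check for an advanced rule: has this user been in the given
// city for the entire past 7 days, according to their event history?
case class LocationEvent(uid: String, city: String, tStampMs: Long)

val sevenDaysMs = 7L * 24 * 60 * 60 * 1000

def inCityForPast7Days(events: Seq[LocationEvent], uid: String,
                       city: String, nowMs: Long): Boolean = {
  // Only this user's events inside the 7-day window matter.
  val recent = events.filter(e => e.uid == uid && e.tStampMs >= nowMs - sevenDaysMs)
  recent.nonEmpty && recent.forall(_.city == city)
}
```

Doing this scan per ad request would be far too slow, which is why the architecture below pre-computes rule fulfillment with CEP and only does a cache lookup at request time.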
12 Basic Ad Targeting Architecture
[Diagram: Web Service ↔ Ad Server (Ad Targeter + Event Logger); Campaign Pool / Targeting Cache; Data Warehouse; ad campaigns are registered into the pool]
(1) initial connection
17
12 Basic Ad Targeting Architecture
[Diagram: Web Service ↔ Ad Server (Ad Targeter + Event Logger); Campaign Pool / Targeting Cache; Data Warehouse]
(2) fetch ad
17
12 Basic Ad Targeting Architecture
[Diagram: Web Service ↔ Ad Server (Ad Targeter + Event Logger); Campaign Pool / Targeting Cache; Data Warehouse with Raw Logs; Event Bus Service feeding batch / streaming reporting & analytics services, ...]
(3) event tracking
18
13 Advanced Ad Targeting Architecture
[Diagram: Web Service ↔ Ad Server (Ad Targeter + Event Logger); Campaign Pool / Targeting Cache; Data Warehouse with Raw Logs; Event Bus Service feeding batch / streaming reporting & analytics services; new additions: Rules Service and CEP]
19
13 Advanced Ad Targeting Architecture
[Diagram: Data Warehouse with Raw Logs; Event Bus Service feeding batch / streaming jobs; Rules Service holding CEP-Rule Templates (Entry / Depart, User Attribution, ...); CEP jobs; Rule Fulfillment Cache (Redis)]
(1) Inject a rule to start matching on event stream
(2) Return Rule ID
(3) Submit CEP topology
20
13 Advanced Ad Targeting Architecture
[Diagram: Data Warehouse with Raw Logs; Event Bus Service feeding batch / streaming jobs; Rules Service holding CEP-Rule Templates (Entry / Depart, User Attribution, ...); CEP jobs; Rule Fulfillment Cache (Redis)]
(4) When a CEP pattern is fulfilled, write to cache: UID → RuleID
(5) Lookup whether a UID has fulfilled a RuleID
21
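Steps (4) and (5) above reduce to two operations on the fulfillment cache: the CEP job writes UID → RuleID on a match, and the ad server does a membership lookup at request time. A sketch of that interaction, with an in-memory map standing in for the Redis cache from the slides:

```scala
import scala.collection.mutable

// Sketch of the rule-fulfillment cache (Redis in production;
// an in-memory map stands in here for illustration).
class RuleFulfillmentCache {
  private val fulfilled = mutable.Map[String, mutable.Set[String]]()

  // (4) CEP pattern fulfilled → record UID → RuleID
  def markFulfilled(uid: String, ruleId: String): Unit =
    fulfilled.getOrElseUpdate(uid, mutable.Set()) += ruleId

  // (5) Ad server lookup at request time: cheap membership check,
  // no event aggregation needed on the hot path.
  def hasFulfilled(uid: String, ruleId: String): Boolean =
    fulfilled.get(uid).exists(_.contains(ruleId))
}
```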
13 Advanced Ad Targeting Architecture
[Diagram: full architecture — Web Service ↔ Ad Server (Ad Targeter + Event Logger); Campaign Pool / Targeting Cache; Data Warehouse with Raw Logs; Event Bus Service; reporting & analytics services; Rules Service; CEP]
(1) register rule for campaign
(2) lookup whether user fulfills a rule
22
14 Some Discussion
● Why a fixed pool of CEP-Rule Templates?
  ○ Prevents rogue rules from matching, ex. rules that would consume excessive resources
  ○ It’s a lot less work and complication ;)
● Would be very nice to have a freestyle rule service
  ○ Pattern matching across an organization’s different event streams
  ○ For BI, there will be arbitrarily complex events / patterns that analysts want to monitor
● Further study for a similar use case: King’s RBEA
  ○ RBEA: Rule-Based Event Aggregator
  ○ https://techblog.king.com/rbea-scalable-real-time-analytics-king/
  ○ http://data-artisans.com/rbea-scalable-real-time-analytics-at-king/
23
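A “fixed pool of templates” can be as simple as a closed set of parameterized rule types: the Rules Service only instantiates pre-vetted patterns with campaign-supplied parameters, never arbitrary user-written patterns. A sketch under that assumption (all names, the template shape, and the example rule are illustrative, not the VMFive implementation):

```scala
// Sketch of a fixed CEP-rule template pool: a sealed trait closes the
// set of rule types, so only pre-vetted patterns can be instantiated.
case class LocEvent(uid: String, city: String)

sealed trait RuleTemplate { def matches(history: Seq[LocEvent]): Boolean }

// Template: "user's last N location events are all in the given city".
// Campaigns pick the parameters; they cannot define new match logic.
case class StayedInCity(city: String, lastN: Int) extends RuleTemplate {
  def matches(history: Seq[LocEvent]): Boolean = {
    val recent = history.takeRight(lastN)
    recent.size == lastN && recent.forall(_.city == city)
  }
}
```

Because the template set is closed, the service can reason about each type’s worst-case resource usage up front, which is exactly the “no rogue rules” argument from the slide.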
Closing
XX Closing
● Complex Event Processing is an emerging way to draw insights from data streams, and demands exactly-once semantics from the underlying stream processor for correctness
● FlinkCEP builds on the DataStream API to make this possible and easy
24