data days 2014 - matt weiden

SKETCHY: A COMPLEX EVENT PROCESSING NETWORK FOR SPAM DETECTION. Matt Weiden / SoundCloud Ltd.


Post on 25-Jun-2015


TRANSCRIPT

Page 1: Data Days 2014 - Matt Weiden

SKETCHY: A COMPLEX EVENT PROCESSING NETWORK FOR SPAM DETECTION.

Matt Weiden / SoundCloud Ltd.

Page 2: Data Days 2014 - Matt Weiden

WHO?

My name is Matt Weiden. Pleased to meet you.

• Backend Engineer, SoundCloud’s Trust, Safety & Security Team
• Previously Cognitive Science, BCI research

Contributors

• Rany Keddo
• Michael Brückner
• Astera Schneeweisz
• Others

Page 3: Data Days 2014 - Matt Weiden

WHAT? INFERENCE FROM RELATED STREAMS OF DATA

The problem: How quickly and efficiently can we draw aggregate inferences from large streams of related events?

What inferences could we make? How quickly and efficiently can we make them?

[Figure: streams of posts, views, and follows plotted over time]
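One way to picture an "aggregate inference" over such streams is a sliding-window count per user and event type, updated as each event arrives. The sketch below is only an illustration: the `WindowedCounter` class, the 60-second window, and the sample events are assumptions, not something taken from the talk or from Sketchy. It also assumes timestamps arrive in non-decreasing order for each key.

```python
from collections import defaultdict, deque


class WindowedCounter:
    """Counts events per (user, kind) over a sliding time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # (user, kind) -> event times

    def observe(self, user: str, kind: str, ts: float) -> int:
        """Record one event and return the current count inside the window."""
        times = self._timestamps[(user, kind)]
        times.append(ts)
        # Drop events that have fallen out of the window.
        while times and times[0] <= ts - self.window_seconds:
            times.popleft()
        return len(times)


# Hypothetical sample stream: (user, event kind, timestamp in seconds).
counter = WindowedCounter(window_seconds=60)
events = [("alice", "post", 0), ("alice", "post", 10), ("bob", "follow", 20),
          ("alice", "post", 30), ("alice", "post", 100)]
counts = [counter.observe(user, kind, ts) for user, kind, ts in events]
```

Each event is touched once when it arrives and once when it expires, so the aggregate stays current without rescanning the stream.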

Page 4: Data Days 2014 - Matt Weiden

DRINKING FROM A FIREHOSE.

Performing this for a whole site might take a little more thought.

Page 5: Data Days 2014 - Matt Weiden

WHAT (MORE SPECIFICALLY)? INFERENCE FROM RELATED STREAMS OF DATA

How quickly and efficiently can we draw aggregate inferences from large streams of related events?

Page 6: Data Days 2014 - Matt Weiden

HOW? EVENT-DRIVEN ARCHITECTURE

Event-Driven Architecture (EDA)

• Near real time
• Only process the data once*
• Operate on
  • incremental sub-goal results
  • ‘Complex Events’ by adding ‘Context’
• Asynchronous, pipelined parallelism
• Broadcast reusable events and complex events

Page 7: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Event Processing Networks (EPNs) implement EDA

• Represented as a directed acyclic graph of
  • Event producers
  • Event processing agents (EPAs)
    • enrich events
    • transform events into complex events
    • detect patterns
  • Event consumers
  • Event channels
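The four building blocks listed above can be sketched in a few lines. This is a minimal, synchronous illustration and not Sketchy's API: the `Channel` class, the `make_agent` helper, and the "length" enrichment are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class Channel:
    """Event channel: routes each published event to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._subscribers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers:
            handler(event)


def make_agent(transform: Callable[[Event], Event], out: Channel) -> Handler:
    """Event processing agent: transforms an event and emits the result."""
    def agent(event: Event) -> None:
        out.publish(transform(event))
    return agent


raw, enriched = Channel(), Channel()
received: List[Event] = []

# Agent: enrich each message event with its length (hypothetical logic).
raw.subscribe(make_agent(lambda e: {**e, "length": len(str(e["text"]))}, enriched))
# Consumer: collect the enriched events.
enriched.subscribe(received.append)

# Producer: introduce events into the network.
for text in ["hello", "buy now!!!"]:
    raw.publish({"text": text})
```

Because agents only know the channels they read from and write to, the graph can be rewired without touching the business logic inside each agent.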

Page 8: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Sketchy is an EPN that implements EDA

• Prevents text and social graph spam at SoundCloud
• Open-source
• Modular
  • written as a flexible library, adaptable
  • many common components available out of the box
• Battle tested
  • ingests many sensitive event types at SoundCloud

Page 9: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Event producers introduce events into a network

• Represented as a directed graph of
  • Event producers
  • Event channels
  • Event processing agents (EPAs)
    • enrich events
    • transform events into complex events
    • detect patterns
  • Event consumers

[Diagram: a Producer feeding Event Channels A, B, and C]

Page 10: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Event channels route events through the network

• Represented as a directed graph of
  • Event producers
  • Event channels
  • Event processing agents (EPAs)
    • enrich events
    • transform events into complex events
    • detect patterns
  • Event consumers

[Diagram: Producer or EPA 1 and Producer or EPA 2 → Event Channel → Consumer or EPA 3 and Consumer or EPA 4]

Page 11: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Event processing agents contain business logic

• Represented as a directed graph of
  • Event producers
  • Event channels
  • Event processing agents (EPAs)
    • enrich events
    • transform events into complex events
    • detect patterns
  • Event consumers

[Diagram: an Event Processing Agent with Event Channels A and B on each side, plus DB 1 and a cache]

Page 12: Data Days 2014 - Matt Weiden

HOW? EVENT PROCESSING NETWORK

Event consumers act on processing in the network

• Represented as a directed graph of
  • Event producers
  • Event channels
  • Event processing agents (EPAs)
    • enrich events
    • transform events into complex events
    • detect patterns
  • Event consumers

[Diagram: Event Channels A, B, and C feeding a Consumer]

Page 13: Data Days 2014 - Matt Weiden

HOW? DO EPNs ACHIEVE EDA’s GOALS?

• Asynchronous, pipelined parallelism

The node-to-node flow allows parallel, asynchronous computation.

[Diagram: Producer or EPA 1 → Event Channel → Consumer or EPA 3]
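Pipelined parallelism of this kind can be sketched with one thread per node and a queue per channel, so that a downstream stage works on one event while the upstream stage is already handling the next. The `stage` helper, the `STOP` sentinel, and the two text-cleaning steps are assumptions made for the illustration; they are not taken from Sketchy.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(fn, inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
    """Run fn over every item from inbox on its own thread."""
    def run() -> None:
        while True:
            item = inbox.get()
            if item is STOP:
                outbox.put(STOP)  # propagate shutdown downstream
                return
            outbox.put(fn(item))
    thread = threading.Thread(target=run)
    thread.start()
    return thread


# Three channels connect a producer, two processing stages, and a consumer.
a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
threads = [stage(str.strip, a, b), stage(str.lower, b, c)]

# Producer: push events and then signal the end of the stream.
for text in ["  Hello ", " SPAM  "]:
    a.put(text)
a.put(STOP)

# Consumer: drain the last channel until the sentinel arrives.
results = []
while (item := c.get()) is not STOP:
    results.append(item)
for thread in threads:
    thread.join()
```

Each stage is a single thread reading a FIFO queue, so event order is preserved from end to end.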

Page 14: Data Days 2014 - Matt Weiden

HOW? DO EPNs ACHIEVE EDA’s GOALS?

• Asynchronous, pipelined parallelism

Source: http://www2.engr.arizona.edu/~ece462/Lec03-pipe/

Page 15: Data Days 2014 - Matt Weiden

HOW? DO EPNs ACHIEVE EDA’s GOALS?

• Build ‘Complex Events’ by putting events into the context in which they occur

This is possible by aggregating events and/or summarizing them with data from external sources.

[Diagram: abstract example of a complex event being created. An Event Processing Agent combines events E1, E2, E3, and E4 with context from DB 1 and a cache to produce EVENT5.]
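The idea of turning simple events into a complex event by adding context can be sketched as an agent that keeps a per-user history and emits a summary once enough events have accumulated. Everything named here is hypothetical: the `ContextAgent` class, the threshold of four, and the "burst" event type are illustrative, and an in-memory dictionary stands in for the external database or cache.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ContextAgent:
    """Keeps per-user context and emits a complex event once enough
    simple events have accumulated."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # Stand-in for the external cache or database holding context.
        self._context: Dict[str, List[dict]] = defaultdict(list)

    def process(self, event: dict) -> Optional[dict]:
        history = self._context[event["user"]]
        history.append(event)
        if len(history) < self.threshold:
            return None  # not enough context yet
        complex_event = {
            "type": "burst",
            "user": event["user"],
            "count": len(history),
            "events": list(history),  # copy before the context is reset
        }
        history.clear()  # start a fresh context after emitting
        return complex_event


agent = ContextAgent(threshold=4)
outputs = [agent.process({"user": "u1", "id": i}) for i in range(1, 5)]
```

The first three calls return `None`; the fourth returns a single event summarizing all four, which downstream agents can consume without re-reading the originals.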

Page 16: Data Days 2014 - Matt Weiden

HOW? DO EPNs ACHIEVE EDA’s GOALS?

• Build ‘Complex Events’ by putting events into the context in which they occur

In Sketchy, the bulk agent stores a text fingerprint context in memcached.

[Diagram: for messages M1 to M4, bulkStatisticsAgent stores each fingerprint in memcached; bulkDetectorAgent finds similar fingerprints (Jaccard distance) for MSG 4 and emits a "Bulk!" complex event.]
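A rough sketch of fingerprint-based bulk detection follows. The slide only says that fingerprints are stored and compared by Jaccard distance; the fingerprint scheme used here (sets of character 3-grams), the thresholds, and the `BulkDetector` class are assumptions, and a plain list stands in for memcached.

```python
from typing import FrozenSet, List


def fingerprint(text: str, k: int = 3) -> FrozenSet[str]:
    """Set of lowercase character k-grams; an assumed fingerprint scheme."""
    normalized = " ".join(text.lower().split())
    return frozenset(normalized[i:i + k] for i in range(len(normalized) - k + 1))


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means the sets are identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


class BulkDetector:
    """Stores fingerprints and flags a message once enough stored
    fingerprints lie within max_distance of it."""

    def __init__(self, max_distance: float = 0.3, min_similar: int = 3) -> None:
        self.max_distance = max_distance
        self.min_similar = min_similar
        self._seen: List[FrozenSet[str]] = []  # stand-in for memcached

    def is_bulk(self, text: str) -> bool:
        fp = fingerprint(text)
        similar = sum(
            1 for other in self._seen
            if jaccard_distance(fp, other) <= self.max_distance
        )
        self._seen.append(fp)
        return similar >= self.min_similar


detector = BulkDetector()
flags = [detector.is_bulk("Check out my page") for _ in range(4)]
unrelated = detector.is_bulk("totally different words here")
```

With these settings the fourth copy of the same message is the first one flagged, while the unrelated message is not. A linear scan over all stored fingerprints is only workable for small examples; a production system would need an index or a bounded store.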

Page 17: Data Days 2014 - Matt Weiden

HOW? DO EPNs ACHIEVE EDA’s GOALS?

• Broadcast events and complex events wherever their reuse is possible

The event channel can send messages in this fashion.

[Diagram: Producer or EPA 1 and Producer or EPA 2 → Event Channel → Consumer or EPA 3 and Consumer or EPA 4]

A common use case in a Sketchy network.

[Diagram: messageCreateIngester, junkStatisticsAgent, junkDetectorAgent, signalEmitterAgent, rateLimiterAgent]
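The payoff of broadcasting is that an expensive result is computed once and reused by every interested agent. The sketch below illustrates that point only. The wiring between the agents named on the slide is not recoverable from the transcript, so the functions `score_junk`, `emit_signal`, and `limit_rate`, and the "three exclamation marks" rule, are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
calls = {"score_junk": 0}  # how many times the expensive step ran


def score_junk(message: Event) -> Event:
    """Hypothetical detector: builds a complex event once per message."""
    calls["score_junk"] += 1
    text = str(message["text"])
    return {**message, "junk": text.count("!") >= 3}


def broadcast(event: Event, subscribers: List[Callable[[Event], None]]) -> None:
    """Deliver the same complex event to every subscriber."""
    for subscriber in subscribers:
        subscriber(event)


signals: List[object] = []
rate_limited: List[object] = []


def emit_signal(event: Event) -> None:
    """Hypothetical consumer: report the user behind a junk message."""
    if event["junk"]:
        signals.append(event["user"])


def limit_rate(event: Event) -> None:
    """Hypothetical consumer: mark the same user for rate limiting."""
    if event["junk"]:
        rate_limited.append(event["user"])


for message in [{"user": "u1", "text": "nice track"},
                {"user": "u2", "text": "FREE followers!!!"}]:
    broadcast(score_junk(message), [emit_signal, limit_rate])
```

Two messages lead to two scoring calls, even though two consumers act on each result.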

Page 18: Data Days 2014 - Matt Weiden

SKETCHY@SOUNDCLOUD

[Diagram: messageCreateIngester, junkStatisticsAgent, junkDetectorAgent, signalEmitterAgent, rateLimiterAgent]

Page 19: Data Days 2014 - Matt Weiden

MOVE SKETCHY’S LOGIC TO TWITTER’S STORM?

Storm is a framework for building EPNs at scale

                      Storm                               Sketchy’s Network Components
Language              Scala                               Scala
Parallelism           Multiple workers on multiple hosts  Multiple workers on single host
Deployment            ‘Nimbus’ & Zookeeper                Bazooka
Messaging Guarantees  atLeastOnce, atMostOnce             Not yet
Hadoop Integration    Yes                                 No
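The difference between the two messaging guarantees in the table can be sketched without any framework. This is not Storm's API: the functions `at_most_once` and `at_least_once` and both handler classes are hypothetical, and a handler returning `True` stands in for an acknowledgement.

```python
from typing import Callable, List


def at_most_once(event: str, handler: Callable[[str], bool]) -> None:
    """Send once and never retry: events can be lost, never duplicated."""
    handler(event)


def at_least_once(event: str, handler: Callable[[str], bool],
                  max_attempts: int = 5) -> None:
    """Resend until acknowledged: events can be duplicated, but are not
    lost as long as an attempt succeeds within max_attempts."""
    for _ in range(max_attempts):
        if handler(event):  # True means the handler acknowledged
            return
    raise RuntimeError(f"no acknowledgement for {event!r}")


class DropsFirst:
    """Handler that loses the first delivery before processing it."""

    def __init__(self) -> None:
        self.processed: List[str] = []
        self._drops = 1

    def __call__(self, event: str) -> bool:
        if self._drops > 0:
            self._drops -= 1
            return False
        self.processed.append(event)
        return True


class LosesFirstAck:
    """Handler that processes every delivery but loses its first ack."""

    def __init__(self) -> None:
        self.processed: List[str] = []
        self._acks_to_lose = 1

    def __call__(self, event: str) -> bool:
        self.processed.append(event)
        if self._acks_to_lose > 0:
            self._acks_to_lose -= 1
            return False
        return True


lost, delivered, duplicated = DropsFirst(), DropsFirst(), LosesFirstAck()
at_most_once("e1", lost)         # dropped and never retried
at_least_once("e1", delivered)   # retried after the drop
at_least_once("e1", duplicated)  # processed twice because an ack was lost
```

Under at-least-once delivery, agents that update counters or caches need to tolerate duplicates, for example by making their updates idempotent.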

Page 20: Data Days 2014 - Matt Weiden

LEARN MORE

• Event Processing Networks
  • Sharon and Etzion, “Event Processing Network, A Conceptual Model,” VLDB, 2007
• Sketchy
  • https://github.com/soundcloud/sketchy-core
• Storm
  • Toshniwal et al., “Storm@Twitter,” SIGMOD, 2014
  • https://storm.incubator.apache.org

Page 21: Data Days 2014 - Matt Weiden

THANK YOU. QUESTIONS?

Matt Weiden / SoundCloud Ltd. @mweiden, [email protected]