#twitterrealtime - real time processing @twitter

Post on 09-Jan-2017


TWITTER IS REAL TIME

WHAT IS REAL TIME?

REAL TIME PIPELINE

REAL TIME COMPONENTS

REAL TIME USE CASES

ETL BI PRODUCT SAFETY TRENDS

ML MEDIA OPS ADS

20 PB

2 Trillion Events/Day

100 ms e2e latency

400 Real Time Jobs

DLOG & HERON are Open Sourced

WE ARE HIRING!

Messaging

Data Infrastructure

Core Services

Search Infrastructure

Traffic

Real Time Compute

Compute Platform

Platform Engineering

Kernel

#LoveWhereYouWork

Learn more at careers.twitter.com

Hadoop

Core Data Libraries

Data Applications

Core Metrics

- Easy operations

- Small technology portfolio

- Quick development Iteration

- Diverse use cases

Bookkeeper

WriteProxy

ReadProxy

client

Bookkeeper

WriteProxy

ReadProxy

Publisher / Subscriber

Read Write

DistributedLog

Metadata

Self Serve

20 PB

2 Trillion Events

100 ms e2e latency

- Event

A discrete, self-contained piece of data

- Stream

A persistent, unordered collection of events with a time-based retention

- Partition

A portion of a stream carrying a proportional share of the overall capacity

- Subscriber

A collection of processes collectively consuming a copy of the stream
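The four concepts above fit together mechanically: events hash to partitions, and a subscriber's processes divide the partitions among themselves. A minimal sketch (the hashing scheme and round-robin assignment are illustrative assumptions, not EventBus's actual implementation):

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(event_key: str) -> int:
    """Map an event to one partition of the stream by hashing its key (illustrative)."""
    digest = hashlib.md5(event_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def assign_partitions(num_partitions: int, num_processes: int):
    """A subscriber is a group of processes collectively consuming one copy of the
    stream: each process owns a disjoint subset of the partitions."""
    return {p: p % num_processes for p in range(num_partitions)}

assignment = assign_partitions(NUM_PARTITIONS, 2)
events = ["tweet:1", "tweet:2", "like:7", "rt:9"]
for e in events:
    p = partition_for(e)
    owner = assignment[p]
    # every event lands on exactly one partition, which is owned by one process
```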

Bookkeeper

WriteProxy

ReadProxy

Publisher / Subscriber

Read Write

DistributedLog

Metadata

Self Serve

Flow Control

Stream Configuration

Partition Ownership

DistributedLog

(E => Future[Unit])

Offset Tracking

Offset Store

Metadata

DL Read Proxy

@DistributedLog

http://distributedlog.io

Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming>

Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar

STORM/HERON TERMINOLOGY

- TOPOLOGY

Directed acyclic graph

Vertices = computation, edges = streams of data tuples

- SPOUTS

Sources of data tuples for the topology

Examples - Kafka / DistributedLog / MySQL / Postgres

- BOLTS

Process incoming tuples and emit outgoing tuples

Examples - filtering/aggregation/join/arbitrary function
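As a toy in-process analogue of the spout/bolt roles just defined, a spout can be modeled as a tuple source and bolts as transformations chained off it. This is only a sketch of the dataflow idea; a real Storm/Heron topology runs these as distributed tasks wired by the framework:

```python
def word_spout():
    """Spout: a source of data tuples for the topology."""
    for word in ["heron", "storm", "heron", "spam", "heron"]:
        yield word

def filter_bolt(tuples, drop):
    """Bolt: processes incoming tuples and emits outgoing tuples (filtering)."""
    for t in tuples:
        if t != drop:
            yield t

def count_bolt(tuples):
    """Bolt: terminal aggregation over its incoming stream."""
    counts = {}
    for t in tuples:
        counts[t] = counts.get(t, 0) + 1
    return counts

# spout -> filter bolt -> count bolt, one edge per stream of tuples
counts = count_bolt(filter_bolt(word_spout(), drop="spam"))
# counts == {"heron": 3, "storm": 1}
```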

STORM/HERON TOPOLOGY

BOLT 1

BOLT 2

BOLT 3

BOLT 4

BOLT 5

SPOUT 1

SPOUT 2

WHY HERON?

● SCALABILITY and PERFORMANCE PREDICTABILITY

● IMPROVE DEVELOPER PRODUCTIVITY

● EASE OF MANAGEABILITY

TOPOLOGY ARCHITECTURE

TopologyMaster

ZK CLUSTER

Stream Manager

I1 I2 I3 I4

Stream Manager

I1 I2 I3 I4

Logical Plan, Physical Plan and Execution State

Sync Physical Plan

CONTAINER CONTAINER

Metrics Manager

Metrics Manager

HERON ARCHITECTURE

Topology 1

TOPOLOGY SUBMISSION

Scheduler

Topology 2

Topology 3

Topology N

HERON SAMPLE TOPOLOGIES

Large amount of data produced every day

Large cluster

Several hundred topologies deployed

Several million messages every second

HERON @TWITTER

1 stage 10 stages

3x reduction in cores and memory

Heron has been in production for 2 years

STRAGGLERS

Stragglers are the norm in multi-tenant distributed systems

● BAD/SLOW HOST

● EXECUTION SKEW

● INADEQUATE PROVISIONING

APPROACHES TO HANDLE STRAGGLERS


● SENDERS TO STRAGGLER DROP DATA

● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER

● DETECT STRAGGLERS AND RESCHEDULE THEM
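The trade-off between the first two strategies can be seen with a little arithmetic. A toy model, assuming made-up rates (a sender producing 100 tuples/tick against a straggler consuming only 40 tuples/tick):

```python
SENDER_RATE, STRAGGLER_RATE, TICKS = 100, 40, 10

# Strategy 1: senders to the straggler drop whatever it cannot absorb.
delivered_drop = STRAGGLER_RATE * TICKS          # tuples the straggler processed
dropped = (SENDER_RATE - STRAGGLER_RATE) * TICKS # tuples lost to keep latency low

# Strategy 2: senders slow down to the straggler's speed (back pressure):
# nothing is lost, but the same workload takes proportionally longer.
total_tuples = SENDER_RATE * TICKS
ticks_needed = total_tuples / STRAGGLER_RATE     # 25 ticks instead of 10
```

Dropping preserves throughput at the cost of data loss; slowing down preserves the data at the cost of end-to-end progress, which is why sustained back pressure (discussed below) needs operator attention.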


SLOW DOWN SENDERS STRATEGY

[Diagram: four containers, each with a Stream Manager and instances (S1, B2, B3, B4); when an instance lags, its Stream Manager signals the other Stream Managers, which slow down the S1 spouts]

BACK PRESSURE IN PRACTICE

● IN MOST SCENARIOS BACK PRESSURE RECOVERS

Without any manual intervention

● SOMETIMES USERS PREFER DROPPING DATA

Care about only latest data

● SUSTAINED BACK PRESSURE

Irrecoverable GC cycles

Bad or faulty host

ENVIRONMENTS SUPPORTED

STORM API

PRE-1.0.0

POST 1.0.0

SUMMINGBIRD FOR HERON

CURIOUS TO LEARN MORE…

INTERESTED IN HERON?

CONTRIBUTIONS ARE WELCOME!

https://github.com/twitter/heron

http://heronstreaming.io

HERON IS OPEN SOURCED

FOLLOW US @HERONSTREAMING

● 100K+ Advertisers, $2B+ revenue/year

● 300M+ Users

● Impressions/Engagements

○ Tens of billions of events daily

Use Heron & EventBus:

● Prediction

● Serving

● Analytics

● Online learning: models require real-time data

○ On-going training for existing ads

■ CTR, conversions, RTs, Likes

○ On-going training for user data

■ Interests change, targeting must stay relevant

○ New ads arrive constantly

● Consumes 150 GB/second from EventBus streams

Ad Server

● Reads Prediction models

● Finalizes Ad selection

● Writes 56 GB/second to EventBus

○ Served impressions

○ Spend events

Callback Service

● Receives engagements from clients

● Writes engagements to EventBus

○ Consumed by Prediction and Analytics

Advertiser Dashboard keeps advertisers informed in real-time

For Ads:

● Impressions

● Engagements

● Spend rate

● Uniques

For Users:

● Geolocation

● Gender

● Age

● Followers

● Keywords

● Interests

Offline layer (hours)

● Engagement log

● Billing pipeline

● 14 TB/hour

Online layer (seconds)

● Heron topologies read 1M events/sec from EventBus, provide real-time analytics

Advertiser Dashboard

● Ad-hoc queries for desired time range

● View performance of ads in real-time
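The offline/online split above is the classic lambda-architecture pattern: a batch layer that is hours behind plus a speed layer covering only recent events, merged at query time. A minimal sketch, with all names and numbers illustrative:

```python
# Batch view: impressions aggregated up to the last batch run (hours old).
batch_view = {"ad_42": 10_000}

# Real-time view: impressions counted by streaming jobs since that run.
realtime_view = {"ad_42": 350}

def impressions(ad_id: str) -> int:
    """Serve a dashboard query by merging the batch and real-time views."""
    return batch_view.get(ad_id, 0) + realtime_view.get(ad_id, 0)
```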

http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html

(~6 hrs)

#RealTime processing helps us scale our Ads business:

● Prediction - Online learning

○ Ads

○ Users

● Analytics - Advertisers get real-time visibility into ad performance

This enables us to provide high ROI for Advertisers.

Image Credits:

http://images.clipartpanda.com/cycle-clipart-bike_red.png

http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png

http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png

Observation

● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter

● Spam campaigns come in large batches

● Despite randomized tweaks, enough similarity among spammy entities is preserved

Requirement

● Real-time: a competitive game with spammers, i.e. “detect” vs. “mutate”

● Generic: need to support all common feature representations

Crest is a generic online similarity clustering system

● Inputs are a stream of entities

● The similarity clustering system groups similar entities together (according to a predefined similarity metric)

● Outputs are the clusters and their entity members

Built on top of Heron: https://github.com/twitter/heron

● Locality sensitive hashing

A probabilistic, similarity-preserving random projection method

Entity1 => hashValue1 (010010001110010100101001000011)

Entity2 => hashValue2 (000111001110010101100110100100)

Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
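One standard way to realize this property, sketched here under the assumption that entities are dense feature vectors, is sign-of-random-projection hashing: each signature bit is the sign of the dot product with a random hyperplane, so nearby vectors agree on most bits. The dimensions, seed, and bit count are illustrative:

```python
import random

random.seed(7)
DIM, NUM_BITS = 16, 64

# One random hyperplane per signature bit.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BITS)]

def signature(vec):
    """Entity => bit signature: sign of the projection onto each hyperplane."""
    return [1 if sum(p_i * v_i for p_i, v_i in zip(p, vec)) >= 0 else 0
            for p in planes]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

base = [random.gauss(0, 1) for _ in range(DIM)]
near = [v + 0.01 * random.gauss(0, 1) for v in base]  # a slight "mutation"
far = [random.gauss(0, 1) for _ in range(DIM)]        # unrelated entity

# Sim(entity, entity') is preserved: the near-duplicate agrees with base
# on far more signature bits than the unrelated vector does.
```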

● No “Pair-wise” similarity calculation

● Similarity match based on “signature band” collision

Cut signatures into bands:

01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands × 5 sigs/band)

Two entities become similarity candidates if they collide on at least one band.

(i.e. need to match all signatures within some band)

1. Given entity features, calculate signatures and cut them into bands

2. Match against all existing clusters in the cluster store that collide on at least one band

3. Find the closest cluster

Incoming Entity: 01001 00011 10010 10010 10010 00011

Known Cluster1: 01011 00011 01010 10111 11110 10011

Known Cluster2: 01101 01011 01000 10010 10010 01111
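Applying the three matching steps to the example signatures above (the band size of 5 follows the slide; the Hamming-distance tie-breaker is an assumed choice of "closest"):

```python
def bands(sig, band_size=5):
    """Cut a bit-string signature into fixed-size bands."""
    return [sig[i:i + band_size] for i in range(0, len(sig), band_size)]

def collides(sig_a, sig_b):
    """Candidate match: the two signatures share at least one identical band."""
    return any(a == b for a, b in zip(bands(sig_a), bands(sig_b)))

def hamming(sig_a, sig_b):
    return sum(x != y for x, y in zip(sig_a, sig_b))

entity   = "01001 00011 10010 10010 10010 00011".replace(" ", "")
cluster1 = "01011 00011 01010 10111 11110 10011".replace(" ", "")
cluster2 = "01101 01011 01000 10010 10010 01111".replace(" ", "")

# Step 2: both known clusters collide with the entity on at least one band
candidates = [c for c in (cluster1, cluster2) if collides(entity, c)]

# Step 3: the closest cluster wins (cluster2, at Hamming distance 7 vs. 8)
closest = min(candidates, key=lambda c: hamming(entity, c))
```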

1. Count signatures for each band

2. Use Count-Min Sketch to find the hot signatures

3. Send entities with hot signatures for clustering
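Step 2 can be sketched with a standard Count-Min Sketch, which approximately counts band signatures in bounded memory and never undercounts, so "hot" signatures (a spam batch hitting the same band) can be flagged. The width, depth, hashing scheme, and threshold here are all illustrative assumptions:

```python
import hashlib

WIDTH, DEPTH = 64, 4

class CountMinSketch:
    def __init__(self):
        self.table = [[0] * WIDTH for _ in range(DEPTH)]

    def _cells(self, key):
        # One independent hash per row (salted MD5 stands in for a hash family).
        for row in range(DEPTH):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(h, 16) % WIDTH

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum never undercounts.
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for _ in range(500):
    cms.add("hot-band-sig")          # a large spam batch sharing one band signature
for i in range(50):
    cms.add(f"cold-sig-{i}")         # background traffic

HOT_THRESHOLD = 100                  # illustrative cut-off
is_hot = cms.estimate("hot-band-sig") >= HOT_THRESHOLD
```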

1. Group entities by band signatures

2. Run in-memory clustering algorithm when the group is big enough

3. Save the cluster in cluster key-value store

1. Real-time: streamlined data processing flow

2. Scalability: flexible grouping and shuffling (Application / Signature)

3. Maintenance: separate bolts for system optimizations (Memory, GC, CPU, etc.)

● Crest: a similarity clustering system based on locality-sensitive hashing

● Detects spam in real time, built on top of a Heron topology

● Generic interface, clustering “everything” happening in Twitter
