#twitterrealtime - real time processing @twitter
TRANSCRIPT
TWITTER IS REAL TIME
WHAT IS REAL TIME?
REAL TIME PIPELINE
REAL TIME COMPONENTS
REAL TIME USE CASES
ETL, BI, PRODUCT SAFETY, TRENDS,
ML, MEDIA, OPS, ADS
20 PB
2 Trillion Events/Day
100 ms e2e latency
400 Real Time Jobs
DLOG & HERON are Open Sourced
WE ARE HIRING!
Messaging
Data Infrastructure
Core Services
Search Infrastructure
Traffic
Real Time Compute
Compute Platform
Platform Engineering
Kernel
#LoveWhereYouWork
Learn more at careers.twitter.com
Hadoop
Core Data Libraries
Data Applications
Core Metrics
- Easy operations
- Small technology portfolio
- Quick development iteration
- Diverse use cases
[Diagram: EventBus on DistributedLog - publishers write through the WriteProxy into BookKeeper, subscriber clients read through the ReadProxy, with stream metadata and self-serve configuration alongside.]
20 PB
2 Trillion Events
100 ms e2e latency
- Event
A discrete, self-contained, piece of data
- Stream
A persistent, unordered collection of events, each with a timestamp
- Partition
A portion of a stream with a proportional amount of the overall capacity
- Subscriber
A collection of processes collectively consuming a copy of the stream
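These definitions suggest a simple data model. Below is a minimal sketch in Java (16+ for records); every type name is invented for illustration and this is not EventBus's actual API:

```java
// Hypothetical data model for the terms above; names are invented for
// illustration and are not EventBus's actual API.
import java.util.List;

public class EventModel {
    // Event: a discrete, self-contained piece of data.
    record Event(long timestamp, byte[] payload) {}

    // Partition: a portion of a stream carrying a proportional share of
    // the stream's overall capacity.
    record Partition(String streamName, int id) {}

    // Stream: a persistent collection of events, split into partitions.
    record Stream(String name, List<Partition> partitions) {}

    // Subscriber: a group of processes collectively consuming one copy of
    // the stream; partitions are divided among the group's members.
    record Subscriber(String groupName, Stream stream) {}
}
```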
[Diagram: subscriber internals - the DL Read Proxy feeds the client, which handles flow control, stream configuration, partition ownership, and offset tracking against an offset store and metadata; the application supplies a processing function (E => Future[Unit]).]
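A hedged sketch of how these pieces fit together on the consumer side, using a Java analogue of the Scala handler type (E => Future[Unit]); OffsetStore and ReadProxyClient are invented stand-ins, not EventBus's real interfaces:

```java
// Hypothetical subscriber loop: the application supplies an async handler,
// while the consumer tracks offsets and commits them once processing
// succeeds. OffsetStore and ReadProxyClient are invented stand-ins.
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

interface OffsetStore {
    void commit(int partition, long offset);  // persist consumer progress
}

interface ReadProxyClient {
    byte[] read(int partition, long offset);  // fetch the event at an offset
}

class PartitionConsumer {
    private final ReadProxyClient proxy;
    private final OffsetStore offsets;
    private final Function<byte[], CompletableFuture<Void>> handler;

    PartitionConsumer(ReadProxyClient proxy, OffsetStore offsets,
                      Function<byte[], CompletableFuture<Void>> handler) {
        this.proxy = proxy;
        this.offsets = offsets;
        this.handler = handler;
    }

    // Runs until the process is stopped.
    void consume(int partition, long startOffset) {
        long offset = startOffset;
        while (true) {
            byte[] event = proxy.read(partition, offset);
            // Flow-control point: wait for the handler before reading more.
            handler.apply(event).join();
            offsets.commit(partition, offset);  // commit only after success
            offset++;
        }
    }
}
```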
@DistributedLog - http://distributedlog.io
Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming>
Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar
STORM/HERON TERMINOLOGY
- TOPOLOGY
Directed acyclic graph: vertices are computations, edges are streams of data tuples
- SPOUTS
Sources of data tuples for the topology
Examples - Kafka/DistributedLog/MySQL/Postgres
- BOLTS
Process incoming tuples and emit outgoing tuples
Examples - filtering/aggregation/join/arbitrary function
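For instance, a minimal filtering bolt written against the Storm-compatible API might look like the sketch below (package names assume the post-1.0.0 org.apache.storm API; Heron also accepts the older backtype.storm names). This is an illustrative sketch, not Twitter's production code:

```java
// A minimal filtering bolt: pass along only tuples containing a hashtag.
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FilterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String text = tuple.getStringByField("text");
        // Filter: only emit tuples containing a hashtag downstream.
        if (text != null && text.contains("#")) {
            collector.emit(tuple, new Values(text));
        }
        collector.ack(tuple);  // acknowledge so the tuple is not replayed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("text"));
    }
}
```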
STORM/HERON TOPOLOGY
[Diagram: an example topology DAG - SPOUT 1 and SPOUT 2 feed a network of BOLT 1 through BOLT 5.]
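Wiring a DAG like the one above is done with a TopologyBuilder. The sketch below reuses the FilterBolt from the previous snippet and invents a trivial FixedSpout in place of a real Kafka/DistributedLog spout; component names and parallelism values are illustrative:

```java
// Wiring the example DAG with the Storm-compatible TopologyBuilder.
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ExampleTopology {
    // Minimal spout emitting a constant tuple, standing in for a real source.
    public static class FixedSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values("hello #heron"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("text"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Two spouts feed the bolt network, as in the diagram.
        builder.setSpout("spout1", new FixedSpout(), 1);
        builder.setSpout("spout2", new FixedSpout(), 1);
        // Each setBolt + grouping call adds one edge of the DAG.
        builder.setBolt("bolt1", new FilterBolt(), 2).shuffleGrouping("spout1");
        builder.setBolt("bolt2", new FilterBolt(), 2)
               .shuffleGrouping("spout1")
               .shuffleGrouping("spout2");
        builder.setBolt("bolt3", new FilterBolt(), 2).shuffleGrouping("bolt1");
        builder.setBolt("bolt4", new FilterBolt(), 2)
               .shuffleGrouping("bolt2")
               .shuffleGrouping("bolt3");
        builder.setBolt("bolt5", new FilterBolt(), 2).shuffleGrouping("bolt4");
        StormSubmitter.submitTopology("example", new Config(), builder.createTopology());
    }
}
```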
WHY HERON?
● SCALABILITY and PERFORMANCE PREDICTABILITY
● IMPROVE DEVELOPER PRODUCTIVITY
● EASE OF MANAGEABILITY
TOPOLOGY ARCHITECTURE
TopologyMaster
ZKCLUSTER
Stream Manager
I1 I2 I3 I4
Stream Manager
I1 I2 I3 I4
Logical Plan, Physical Plan and Execution State
Sync Physical Plan
CONTAINER CONTAINER
Metrics Manager
Metrics Manager
HERON ARCHITECTURE
[Diagram: topology submission goes to a scheduler, which runs Topology 1 through Topology N as independent jobs.]
HERON SAMPLE TOPOLOGIES
Large amounts of data produced every day
Large cluster, several hundred topologies deployed
Several million messages every second
HERON @TWITTER
Topologies range from 1 stage to 10 stages
3x reduction in cores and memory
Heron has been in production for 2 years
STRAGGLERS
Stragglers are the norm in multi-tenant distributed systems
● BAD/SLOW HOST
● EXECUTION SKEW
● INADEQUATE PROVISIONING
APPROACHES TO HANDLE STRAGGLERS
● SENDERS TO STRAGGLER DROP DATA
● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
● DETECT STRAGGLERS AND RESCHEDULE THEM
SLOW DOWN SENDERS STRATEGY
[Diagram: spout S1 and bolts B2, B3, B4 run across containers, each with its own Stream Manager; when B3 becomes a straggler, its Stream Manager signals the others, which slow spout S1 down until B3 catches up.]
BACK PRESSURE IN PRACTICE
● IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
● SOMETIMES USERS PREFER TO DROP DATA
They care only about the latest data
● SUSTAINED BACK PRESSURE
Irrecoverable GC cycles
Bad or faulty host
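One common knob for bounding in-flight data with the Storm-compatible API is max spout pending, which caps un-acked tuples so a slow bolt throttles the spout instead of letting queues grow without bound. A minimal sketch (the value 1000 is illustrative, and Heron layers its own Stream Manager back pressure on top of this):

```java
// Cap the number of un-acked tuples per spout task; a straggling bolt then
// slows the spout down rather than overflowing downstream queues.
import org.apache.storm.Config;

public class BackPressureConfig {
    public static Config build() {
        Config conf = new Config();
        conf.setMaxSpoutPending(1000);  // illustrative bound per spout task
        return conf;
    }
}
```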
ENVIRONMENTS SUPPORTED
STORM API
PRE-1.0.0
POST-1.0.0
SUMMINGBIRD FOR HERON
CURIOUS TO LEARN MORE…
INTERESTED IN HERON?
CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
● 100K+ Advertisers, $2B+ revenue/year
● 300M+ Users
● Impressions/Engagements
○ Tens of billions of events daily
Use Heron & EventBus:
● Prediction
● Serving
● Analytics
● Online learning: models require real-time data
○ Ongoing training for existing ads
■ CTR, conversions, RTs, Likes
○ Ongoing training for user data
■ Interests change, targeting must stay relevant
○ New ads arrive constantly
● Consumes 150 GB/second from EventBus streams
Ad Server
● Reads Prediction models
● Finalizes Ad selection
● Writes 56 GB/second to EventBus
○ Served impressions
○ Spend events
Callback Service
● Receives engagements from clients
● Writes engagements to EventBus
○ Consumed by Prediction and Analytics
Advertiser Dashboard keeps advertisers informed in real-time
For Ads:
● Impressions
● Engagements
● Spend rate
● Uniques
For Users:
● Geolocation
● Gender
● Age
● Followers
● Keywords
● Interests
Offline layer (hours)
● Engagement log
● Billing pipeline
● 14 TB/hour
Online layer (seconds)
● Heron topologies read 1M events/sec from EventBus and provide real-time analytics
Advertiser Dashboard
● Ad-hoc queries for desired time range
● View performance of ads in real-time
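As a hedged illustration of the online layer, a bolt of the following shape could keep per-ad impression counts; the field name "adId" and the running-total policy are invented for illustration, not the actual dashboard pipeline:

```java
// Hypothetical per-ad impression counter for a real-time analytics topology.
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ImpressionCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Long> counts;

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<>();
    }

    @Override
    public void execute(Tuple tuple) {
        String adId = tuple.getStringByField("adId");
        long count = counts.merge(adId, 1L, Long::sum);
        // Emit the running total; a real topology would window and batch.
        collector.emit(tuple, new Values(adId, count));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("adId", "count"));
    }
}
```

Routing impressions with fieldsGrouping on "adId" would keep each ad's count on a single bolt instance.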
Lambda architecture (batch layer latency ~6 hrs) - image: http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
#RealTime processing helps us scale our Ads business:
● Prediction - Online learning
○ Ads
○ Users
● Analytics - Advertisers get real-time visibility into ad performance
This enables us to provide high ROI for Advertisers.
Image Credits:
http://images.clipartpanda.com/cycle-clipart-bike_red.png
http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png
http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
Observation
● The Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter
● Spam campaigns come in large batches
● Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
● Real-time: a continual contest with spammers, i.e. “detect” vs. “mutate”
● Generic: need to support all common feature representations
Crest is a generic online similarity clustering system
● Inputs are a stream of entities
● The clustering system groups similar entities together (according to a predefined similarity metric)
● Outputs are the clusters and their entity members
Built on top of Heron - https://github.com/twitter/heron
● Locality sensitive hashing
A probabilistic, similarity-preserving random projection method
Entity1 => hashValue1 (010010001110010100101001000011)
Entity2 => hashValue2 (000111001110010101100110100100)
Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
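A minimal sketch of one standard LSH family (random hyperplane projection for cosine similarity; the slides do not say which family Crest uses). Each signature bit records which side of a random hyperplane the feature vector falls on, so similar vectors tend to share bits:

```java
// Random hyperplane LSH: similar feature vectors get similar bit signatures.
import java.util.BitSet;
import java.util.Random;

public class LshSignature {
    private final double[][] hyperplanes;  // one random hyperplane per bit

    public LshSignature(int numBits, int dims, long seed) {
        Random rnd = new Random(seed);  // shared seed: every entity must be
        hyperplanes = new double[numBits][dims];  // hashed with the same planes
        for (int b = 0; b < numBits; b++)
            for (int d = 0; d < dims; d++)
                hyperplanes[b][d] = rnd.nextGaussian();
    }

    public BitSet signature(double[] features) {
        BitSet sig = new BitSet(hyperplanes.length);
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int d = 0; d < features.length; d++)
                dot += hyperplanes[b][d] * features[d];
            if (dot >= 0) sig.set(b);  // sign of projection = one signature bit
        }
        return sig;
    }

    // Fraction of matching bits approximates the entities' similarity.
    public static double similarity(BitSet a, BitSet b, int numBits) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);
        return 1.0 - (double) diff.cardinality() / numBits;
    }
}
```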
● No pair-wise similarity calculation
Similarity match is based on “signature band” collision
Cut signatures into bands:
01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band)
Two entities become similarity candidates if they collide on at least one band
(i.e. all signatures within some band must match)
1. Given entity features, calculate signatures and cut them into bands
2. Match against all existing clusters in the cluster store that collide on at least one band
3. Find the closest cluster
Incoming Entity: 01001 00011 10010 10010 10010 00011
Known Cluster1: 01011 00011 01010 10111 11110 10011
Known Cluster2: 01101 01011 01000 10010 10010 01111
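A small sketch of the band-collision test, using the example signatures above:

```java
// Band-collision matching: two signatures are candidates if at least one
// whole 5-bit band is identical.
import java.util.List;

public class BandMatcher {
    // Cut a signature string like "01001 00011 ..." into its bands.
    static List<String> bands(String signature) {
        return List.of(signature.trim().split("\\s+"));
    }

    // Candidate if any band is identical in both signatures.
    static boolean isCandidate(String sigA, String sigB) {
        List<String> a = bands(sigA), b = bands(sigB);
        for (int i = 0; i < a.size(); i++)
            if (a.get(i).equals(b.get(i))) return true;
        return false;
    }

    public static void main(String[] args) {
        String entity   = "01001 00011 10010 10010 10010 00011";
        String cluster1 = "01011 00011 01010 10111 11110 10011"; // band 2 collides
        String cluster2 = "01101 01011 01000 10010 10010 01111"; // bands 4 and 5 collide
        System.out.println(isCandidate(entity, cluster1)); // true
        System.out.println(isCandidate(entity, cluster2)); // true
    }
}
```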
1. Count occurrences of each band signature
2. Use Count-Min Sketch to find the hot signatures
3. Send entities with hot signatures for clustering
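A minimal, self-contained Count-Min Sketch in the same spirit: a small fixed-size table of counters that over-estimates (never under-estimates) frequencies, so band signatures whose estimate crosses a threshold can be forwarded for clustering. Width, depth, and the hotness threshold below are illustrative:

```java
// Count-Min Sketch for spotting hot band signatures.
import java.util.Random;

public class CountMinSketch {
    private final long[][] table;
    private final int[] hashSeeds;
    private final int width;

    public CountMinSketch(int depth, int width, long seed) {
        this.width = width;
        this.table = new long[depth][width];
        this.hashSeeds = new Random(seed).ints(depth).toArray();
    }

    private int bucket(String key, int row) {
        int h = key.hashCode() ^ hashSeeds[row];
        return Math.floorMod(h, width);  // handles negative hashes
    }

    // Increment the key's counter in every row; return the frequency
    // estimate, which is the minimum count across rows.
    public long add(String bandSignature) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            long c = ++table[row][bucket(bandSignature, row)];
            min = Math.min(min, c);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1 << 16, 42);
        for (int i = 0; i < 100; i++) {
            // Forward the entity for clustering once its band looks "hot".
            if (cms.add("00011") >= 50) { /* emit to the clustering bolt */ }
        }
    }
}
```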
1. Group entities by band signatures
2. Run an in-memory clustering algorithm when the group is big enough
3. Save the cluster in the cluster key-value store
1. Real-time: streamlined data processing flow
2. Scalability: flexible grouping and shuffling (Application / Signature)
3. Maintenance: separate bolts for system optimizations (Memory, GC, CPU, etc.)
● Crest: a similarity clustering system based on locality-sensitive hashing
● Detects spam in real time, built on top of a Heron topology
● Generic interface, clustering “everything” happening on Twitter