Real Time Data Streaming using Kafka & Storm
LivePerson Case Study: Real Time Data Streaming
March 20th 2014
Ran Silberman
About me
● Technical Leader of Data Platform in LivePerson
● Bird watcher and amateur bird photographer
Pharaoh Eagle-Owl / Bubo ascalaphus
This is what the people from the previous slide were looking at…
Amir Silberman
Agenda
● Why we chose Kafka + Storm
● How implementation was done
● Measures of success
● Two examples of use
● Tips from our experience
Data in LivePerson
[Diagram labels: Visitor in Site; Chat Window; Agent console; LivePerson SaaS Server; Login; Monitor; Rules, Intelligence, Decision; Chat; Invite; DATA; BIG DATA]
Legacy Data flow in LivePerson
[Diagram labels: RealTime servers; ETL; Sessionize; Modeling; Schema View; BI DWH (Oracle); Real-Time data; Historical data]
Why Kafka + Storm?
● Need to scale out and plan for future scale
○ Limit for scale should not be technology
○ Let the limit be cost of (commodity) hardware
● What Data platforms can be implemented quickly?
○ Open source - fast evolving and community
○ Micro-services - do only what you ought to do!
● Are there risks in this choice?
○ Yes! The technology is not mature enough
○ But there is no other mature technology that can address our needs!
Long-eared Owl / Asio otus
Amir Silberman
Legacy Data flow in LivePerson
[Diagram labels: RealTime servers; ETL; Sessionize; Modeling; Schema View; BI DWH (Oracle); Customers]
1. Move to Hadoop
[Diagram labels: RealTime servers; HDFS; Hadoop; ETL; Sessionize; Modeling; Schema View; MR Job transfers data to BI DWH; BI DWH (Vertica); Customers]
2. Move to Kafka
[Diagram labels: RealTime servers; Kafka (Topic-1); HDFS; Hadoop; MR Job transfers data to BI DWH; BI DWH (Vertica); Customers]
3. Integrate with new producers
[Diagram labels: RealTime servers; New RealTime servers; Kafka (Topic-1, Topic-2); HDFS; Hadoop; MR Job transfers data to BI DWH; BI DWH (Vertica); Customers]
4. Add Real-time BI
[Diagram labels: RealTime servers; New RealTime servers; Kafka (Topic-1, Topic-2); HDFS; Hadoop; MR Job transfers data to BI DWH; BI DWH (Vertica); Storm Topology; Analytics DB; Customers]
Architecture
[Diagram labels: Real-time servers; Kafka; Storm; Cassandra / CouchBase; Real Time Processing]
Some Numbers: Cyber Monday 2013
● Flow rate into Kafka: 33 MB/sec
● Flow rate from Kafka: 20 MB/sec
● Total daily data in Kafka: 17 billion events
● 4 topologies reading all events
● Dashboards
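To put these figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the slide does not say whether 33 MB/sec is a peak or an average, so the average event size below assumes the inflow rate is sustained for a whole day, and that 1 MB means 10**6 bytes.

```python
# Back-of-the-envelope figures derived from the numbers on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

EVENTS_PER_DAY = 17_000_000_000   # from the slide
INFLOW_MB_PER_SEC = 33            # from the slide
OUTFLOW_MB_PER_SEC = 20           # from the slide

# Average event rate over a full day.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# ASSUMPTION: the inflow rate is sustained all day and 1 MB = 10**6 bytes.
bytes_per_day = INFLOW_MB_PER_SEC * 10**6 * SECONDS_PER_DAY
avg_event_bytes = bytes_per_day / EVENTS_PER_DAY

# How much of the written volume is read back per second.
read_to_write_ratio = OUTFLOW_MB_PER_SEC / INFLOW_MB_PER_SEC

print(f"average events/sec : {avg_events_per_sec:,.0f}")
print(f"average event size : {avg_event_bytes:.0f} bytes (under the assumption above)")
print(f"read/write ratio   : {read_to_write_ratio:.2f}")
```

If the 33 MB/sec figure is a peak rather than an average, the real average event size would be smaller than this estimate.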
Eurasian Wryneck / Jynx torquilla
Amir Silberman
Two use cases
1. Visitor list
2. Agent State
1st Storm Use Case: “Visitors List”
Use case:
● Show list of visitors in the “Agent Console”
● Collect data about visitor in real time
● Visitor stickiness in streaming process
Visitors List Topology
1st Storm Use Case: “Visitors List”
Selected Analytics DB - Couchbase
● Document Store - for complex documents
● Searchable - possible to search by different attributes
● High throughput - Read & Write
First Storm Topology – Visitor Feed
“Visitor List” Topology: Analytics DB: Couchbase - Document store
[Diagram: Kafka events stream -> Kafka Spout -> (emit) Parse Avro into tuple -> (emit) Analyze relevant events -> (emit) Write event to Visitor document -> Add/Update in Couchbase]
Visitors List - Storm considerations
● Complex calculations before sending to DB
○ Ignore delayed events
○ Reorder events before storing
● Document cached in memory
● Fields Grouping to bolt that writes to CouchBase
● High parallelism in bolt that writes to CouchBase
Visitors List Topology
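The considerations listed for the Visitors List topology can be illustrated with a small Python sketch. This is not Storm code and not the topology from the talk. It assumes hypothetical event fields (`visitor_id`, `ts`, `page`) and a hypothetical number of writer tasks. It routes each event to a writer task by hashing the visitor id, which is the idea behind fields grouping: one visitor always lands on the same task, so that task can keep the visitor document cached in memory. It also ignores events that arrive later than a newer event already applied. Reordering through a buffer is left out for brevity.

```python
import zlib
from dataclasses import dataclass, field

NUM_WRITER_TASKS = 4  # hypothetical parallelism of the writer bolt


def task_for(visitor_id: str, num_tasks: int = NUM_WRITER_TASKS) -> int:
    """Stable hash, so the same visitor always reaches the same task."""
    return zlib.crc32(visitor_id.encode("utf-8")) % num_tasks


@dataclass
class WriterTask:
    # In-memory cache of visitor documents owned by this task.
    docs: dict = field(default_factory=dict)

    def handle(self, event: dict) -> bool:
        """Apply an event to the cached document; return False if it was ignored."""
        doc = self.docs.setdefault(event["visitor_id"], {"last_ts": -1, "pages": []})
        if event["ts"] <= doc["last_ts"]:
            return False  # delayed event: a newer one was already applied
        doc["last_ts"] = event["ts"]
        doc["pages"].append(event["page"])
        return True


tasks = [WriterTask() for _ in range(NUM_WRITER_TASKS)]

# Hypothetical events; the third one is delayed (ts=1 arrives after ts=2).
events = [
    {"visitor_id": "v1", "ts": 1, "page": "/home"},
    {"visitor_id": "v1", "ts": 2, "page": "/pricing"},
    {"visitor_id": "v1", "ts": 1, "page": "/late"},
    {"visitor_id": "v2", "ts": 5, "page": "/home"},
]
results = [tasks[task_for(e["visitor_id"])].handle(e) for e in events]
print(results)
```

Because routing depends only on the visitor id, adding more writer tasks raises write parallelism without two tasks ever updating the same document.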
European Roller / Coracias garrulus
Amir Silberman
2nd Storm Use Case: “Agent State”
Use case:
● Show Agent activity on “Agent Console”
● Count Agent statistics
● Display graphs
Agent Status Topology
2nd Storm Use Case: “Agent State”
Selected Analytics DB - Cassandra
● Wide Column Store DB
● Highly available, without a single point of failure
● High throughput
● Optimized for counters
“Agent Status” Topology: Analytics DB: Cassandra - Wide column store
[Diagram: Kafka events stream -> Kafka Spout -> (emit) Parse Avro into tuple -> (emit) Analyze relevant events -> (emit) Send events -> Add to Cassandra]
Data visualization using Highcharts
Agent Status - Storm considerations
● Counters stored by topology
● Calculations done after reading from DB
● Delayed events should not be ignored
● Order of events does not matter
● Using Highcharts for data visualization
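Why the order of events does not matter here can be shown with a short Python sketch. It assumes hypothetical counter names (`chats`, `chat_seconds`) and agent ids, and uses an in-memory `Counter` as a stand-in for the counter store. Counter increments are commutative, so late or reordered events still produce the same totals, and derived values such as an average are calculated only when reading.

```python
import random
from collections import Counter


def apply_events(events):
    """Increment one counter per (agent, metric); stand-in for the counter store."""
    counters = Counter()
    for agent_id, metric, amount in events:
        counters[(agent_id, metric)] += amount
    return counters


def avg_chat_seconds(counters, agent_id):
    """Calculation done after reading the raw counters."""
    chats = counters[(agent_id, "chats")]
    return counters[(agent_id, "chat_seconds")] / chats if chats else 0.0


# Hypothetical events: (agent id, counter name, increment).
events = [
    ("agent-1", "chats", 1),
    ("agent-1", "chat_seconds", 120),
    ("agent-1", "chats", 1),
    ("agent-1", "chat_seconds", 60),
    ("agent-2", "chats", 1),
]

in_order = apply_events(events)

shuffled = list(events)
random.Random(7).shuffle(shuffled)       # simulate out-of-order arrival
out_of_order = apply_events(shuffled)

print(in_order == out_of_order)
print(avg_chat_seconds(in_order, "agent-1"))
```

This is also why delayed events should not be dropped in this use case: ignoring them would make the counts wrong, while applying them late is harmless.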
Spur-winged Lapwing / Vanellus spinosus
Amir Silberman
3rd Storm Use Case: Data Auditing
Use case:
● Need to be able to tell whether events arrived
○ Were there any missing events?
○ Were there any duplicated events?
○ How long did it take for events to arrive?
● Event content is not important - only the count of events
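A count-based audit of this kind can be sketched in a few lines of Python. The bucket keys (producer name plus a time window) and the sample counts are hypothetical; the talk does not describe the actual audit format. The sketch compares how many events each producer reported sending with how many were counted on arrival.

```python
def audit(produced: dict, received: dict) -> dict:
    """Compare per-bucket event counts; return {bucket: (status, difference)}."""
    report = {}
    for bucket in sorted(set(produced) | set(received)):
        sent = produced.get(bucket, 0)
        got = received.get(bucket, 0)
        if got < sent:
            status = "missing"
        elif got > sent:
            status = "duplicated"
        else:
            status = "ok"
        report[bucket] = (status, abs(sent - got))
    return report


# Hypothetical counts, keyed by (producer, time window).
produced = {("server-a", "12:00"): 1000, ("server-a", "12:05"): 900, ("server-b", "12:00"): 500}
received = {("server-a", "12:00"): 1000, ("server-a", "12:05"): 880, ("server-b", "12:00"): 503}

for bucket, (status, diff) in audit(produced, received).items():
    print(bucket, status, diff)
```

Counts alone cannot reveal a bucket in which the same number of events were lost and duplicated, so a sketch like this gives a lower bound on problems rather than an exact diagnosis.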
3rd Storm Use Case: Data Auditing
[Diagram labels, with step numbers 1-4 whose placement is not recoverable: Realtime server; Kafka Topics; Auditing Topic; Storm Sync topology; Audit-loader topology; MySql; Hadoop; HDFS; audit job; Auditor]
“Sync Audit” Topology: Sync messages between two topics
[Diagram: Kafka events stream -> Kafka Spout -> (emit) Parse Avro into tuple -> (emit) Analyze relevant events -> (emit) Send events -> Add to Kafka Audit topic]
“Load Audit” Topology: Analytics DB: MySql - RDBMS
[Diagram: Kafka Audit topic -> Kafka Spout -> (emit) Parse Avro into tuple -> (emit) Analyze relevant events -> (emit) Send events -> Add to MySql]
Auditing Report
“Load Audit” Topology:
● Stores statistics of events count
● SQL type DB
● Used for Auditing and other statistics
● Requires metadata in events header
Challenges:
● High network traffic
● Writing to Kafka is faster than reading
● All topologies read all events
● How to avoid resource starvation in Storm
Subalpine Warbler / Sylvia cantillans
Amir Silberman
Optimizations of Kafka
● Increase Kafka consuming rate by adding partitions
● Run on physical machines with RAID
● Set retention according to actual need
● Monitor data flow!
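One way to reason about retention is to estimate how much disk a given retention period implies. The Python sketch below uses the 33 MB/sec inflow figure from the Cyber Monday slide. The retention periods and the replication factor are hypothetical values chosen for illustration, and the estimate ignores compression, index files, and free-space headroom.

```python
def retained_gb(inflow_mb_per_sec: float, retention_hours: float,
                replication_factor: int = 1) -> float:
    """Rough disk footprint (GB, 1 GB = 1000 MB) of data kept for the retention period."""
    seconds = retention_hours * 3600
    return inflow_mb_per_sec * seconds * replication_factor / 1000


# 33 MB/sec is from the slides; retention and replication values are hypothetical.
for hours in (24, 72, 168):
    print(f"{hours:>3} h retention: {retained_gb(33, hours, replication_factor=3):,.1f} GB")
```

Estimates like this make the trade-off concrete: longer retention gives consumers more time to recover from downtime, at a disk cost that grows linearly.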
Optimizations of Storm
● Number of Kafka spouts = total number of partitions
● Set “Isolation mode” for important topologies
● Validate that network cards can carry the network traffic
● Set Storm cluster on high CPU machines
● Monitor servers' CPU & memory (Graphite)
● Assess the minimum number of cores that each topology needs
○ Use “top” -> “load” to find server load
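The first tip, matching the number of Kafka spouts to the number of partitions, can be motivated with a small Python sketch. The round-robin assignment here is an illustrative assumption, not the actual algorithm of any Kafka spout implementation. The point is only that a partition is read by one spout task at a time, so extra tasks sit idle and too few tasks must each read several partitions.

```python
def assign_partitions(num_partitions: int, num_spout_tasks: int) -> dict:
    """Round-robin partitions over spout tasks (illustrative assignment only)."""
    assignment = {task: [] for task in range(num_spout_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_spout_tasks].append(partition)
    return assignment


def idle_tasks(assignment: dict) -> int:
    """Number of spout tasks that received no partition."""
    return sum(1 for partitions in assignment.values() if not partitions)


# Hypothetical topic with 8 partitions and three spout parallelism settings.
for spouts in (4, 8, 12):
    a = assign_partitions(8, spouts)
    busiest = max(len(p) for p in a.values())
    print(f"{spouts:>2} spouts: busiest task reads {busiest} partition(s), {idle_tasks(a)} idle")
```

With equal counts every task reads exactly one partition, which is the balanced case the tip aims for.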
Demo
● Agent Console - https://z1.le.liveperson.net/
71394613 / [email protected]
● My Site - http://birds-of-israel.weebly.com/
Questions?
Little Owl / Athene noctua
Amir Silberman
Thank you!
Ruff / Philomachus pugnax
Amir Silberman