Big Data and NoSQL in REAL TIME: Facebook and Twitter Examples (Ron Zavner)

Upload: ron-zavner

Post on 18-Nov-2014


Category:

Technology



DESCRIPTION

Explains the challenge of real-time analytics in Big Data and NoSQL applications, with examples from Facebook and Twitter.

TRANSCRIPT

Page 1: Big data and noSQL in real time

Big Data and NoSQL in REAL TIME: Facebook and Twitter Examples

Ron Zavner

Page 2: Big data and noSQL in real time


Agenda

• Our real time world…
• Flavors of Big Data
• Facebook messaging and real time analytics system
• Twitter analytics system
• Winning architecture?

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 3: Big data and noSQL in real time

What is Real Time?


Page 4: Big data and noSQL in real time

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


Page 5: Big data and noSQL in real time

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY


Page 6: Big data and noSQL in real time


The Two Vs of Big Data

Velocity Volume

Page 7: Big data and noSQL in real time

The Flavors of Big Data Analytics

Counting Correlating Research


Page 8: Big data and noSQL in real time

Analytics – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics
• Countries and cities
• Gender
• Age groups
• Device types
• …


Page 9: Big data and noSQL in real time

Analytics – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Page 10: Big data and noSQL in real time

Analytics – Research

• Sentiment analysis: “Obama is popular”
• Trends: “People like to tweet after watching American Idol”
• Spam patterns: How can you tell when a user spams?


Page 11: Big data and noSQL in real time

It’s All about Timing

• Event driven / stream processing: high resolution, every tweet gets counted
• Ad-hoc querying: medium resolution
• Long running batch jobs (ETL, map/reduce): low resolution (trends & patterns)


This is what we’re here to discuss

Page 12: Big data and noSQL in real time

FACEBOOK REAL-TIME ANALYTICS SYSTEM


Page 13: Big data and noSQL in real time


Store 135+ Billion Messages A Month


Page 14: Big data and noSQL in real time


The actual analytics…

Like button analytics

Comments box analytics


Page 15: Big data and noSQL in real time


Goals

• Show why plugins are valuable
• Make the data more actionable
• Make the data more timely
• Remove points of failure
• Handle massive load: 200K events per second


Page 16: Big data and noSQL in real time


Technology Evaluation

• MySQL DB Counters
• In-Memory Counters
• MapReduce
• Cassandra
• HBase


Page 17: Big data and noSQL in real time

The solution…

[Architecture diagram. Components: Facebook logs (three sources), Scribe, HDFS, PTail, Puma, HBase. Labels: Real Time, Long Term, Batch, 1.5 Sec.]

10,000 writes/sec per server
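
The "Batch" and "1.5 Sec" labels on this slide suggest that events are accumulated in memory for a short window before being written to the store. The slide does not show how this is done, so the following is only a minimal sketch under that assumption. `BatchedCounter`, its `store` dictionary (a stand-in for a durable store such as HBase), and the explicit `now` timestamps are all hypothetical names chosen for illustration.

```python
from collections import Counter


class BatchedCounter:
    """Accumulates increments in memory and writes them out once per window."""

    def __init__(self, store, window_seconds=1.5):
        self.store = store                  # hypothetical stand-in for a durable store
        self.window_seconds = window_seconds
        self.pending = Counter()            # in-memory deltas not yet written
        self.window_start = None            # timestamp of the first event in the window
        self.writes = 0                     # number of store writes performed

    def record(self, key, now):
        """Count one event for `key`; flush if the window has elapsed."""
        if self.window_start is None:
            self.window_start = now
        self.pending[key] += 1
        if now - self.window_start >= self.window_seconds:
            self.flush()

    def flush(self):
        """Write one aggregated delta per key, then start a new window."""
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
            self.writes += 1
        self.pending.clear()
        self.window_start = None


store = {}
counter = BatchedCounter(store)
for i in range(1000):                       # 1000 events within one second
    counter.record("like:example-url", now=i * 0.001)
counter.flush()
print(store, counter.writes)
```

The point of the sketch is that many events for the same key collapse into a single write per window, which is one way a system could absorb a high event rate with a much lower write rate.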

Page 18: Big data and noSQL in real time

Keep Things In Memory

Facebook keeps 80% of its data in Memory (Stanford research)

RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 19: Big data and noSQL in real time

TWITTER REAL-TIME ANALYTICS SYSTEM


Page 20: Big data and noSQL in real time


Twitter Reach – Here’s One Use Case

Page 21: Big data and noSQL in real time


Let’s start with some statistics ….

Twitter in Numbers (March 2011)

Source: http://blog.twitter.com/2011/03/numbers.html

Page 22: Big data and noSQL in real time


It takes a week for users to send 1 billion Tweets.


Page 23: Big data and noSQL in real time


On average, 140 million tweets get sent every day.


Page 24: Big data and noSQL in real time


The highest throughput to date is 6,939 tweets/sec.
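
To put the peak figure in context, the daily and weekly numbers quoted on the preceding slides can be converted to per-second rates. This is a back-of-the-envelope check using only the figures from the slides; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000        # "140 million tweets get sent every day"
tweets_per_week = 1_000_000_000     # "a week for users to send 1 billion Tweets"
peak_per_second = 6_939             # "highest throughput to date"

average_per_second = tweets_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second
implied_per_day = tweets_per_week / 7

print(f"average rate:    {average_per_second:,.0f} tweets/sec")
print(f"peak / average:  {peak_to_average:.1f}x")
print(f"weekly figure implies {implied_per_day:,.0f} tweets/day")
```

The average works out to roughly 1,620 tweets/sec, so the recorded peak is a little over four times the average load. The weekly figure implies about 143 million tweets per day, which agrees with the daily figure.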


Page 25: Big data and noSQL in real time


460,000 new accounts are created daily.


Page 26: Big data and noSQL in real time


5% of the users generate 75% of the content.


Twitter in Numbers

Source: http://www.sysomos.com/insidetwitter/

Page 27: Big data and noSQL in real time

Challenge – Word Count

[Diagram: Tweets → Count? → Word:Count]

• Hottest topics
• URL mentions
• etc.

Page 28: Big data and noSQL in real time

Word Count - Analyze the Problem

• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few 10s of thousands of words (or hundreds of thousands if we include URLs)
• System needs to linearly scale
• System needs to be fault tolerant

Page 29: Big data and noSQL in real time


Use EDA (Event Driven Architecture)

Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
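
The tokenizer, filterer, and counter stages named on this slide can be sketched as a chain of small functions, each consuming the output of the previous one. This is only an illustration in plain Python, not the implementation behind the slide. The stop-word list, the token pattern, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical filter list; the slide does not say what the filterer removes.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in"}


def tokenize(raw_tweets):
    """Raw -> Tokenized: split each tweet into lowercase tokens."""
    for tweet in raw_tweets:
        for token in re.findall(r"[\w#@']+", tweet.lower()):
            yield token


def filter_tokens(tokens):
    """Tokenized -> Filtered: drop stop words and single characters."""
    for token in tokens:
        if token not in STOP_WORDS and len(token) > 1:
            yield token


def count(tokens):
    """Filtered -> Counter: aggregate one counter per word."""
    counts = Counter()
    for token in tokens:
        counts[token] += 1
    return counts


def word_count(raw_tweets):
    return count(filter_tokens(tokenize(raw_tweets)))


print(word_count(["Storm is fast", "storm and HBase"]).most_common())
```

Because each stage only reads events from the previous stage, the stages can be moved onto separate processes or machines and connected by queues without changing their logic.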

Page 30: Big data and noSQL in real time

Sharding (Partitioning)

Tokenizer 1 Filterer 1

Tokenizer 2 Filterer 2

Tokenizer 3 Filterer 3

Tokenizer n Filterer n

Counter Updater 1

Counter Updater 2

Counter Updater 3

Counter Updater n
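
One common way to partition the counter updaters is to route each word to a shard chosen by a hash of the word, so that every update for a given word lands on the same updater. The slide does not state the routing rule, so the sketch below is an assumption. `ShardedCounters` and its methods are hypothetical names, and in-process `Counter` objects stand in for separate counter-updater nodes.

```python
import zlib
from collections import Counter


class ShardedCounters:
    """Routes each word to one of n counter shards by a stable hash."""

    def __init__(self, num_shards):
        self.shards = [Counter() for _ in range(num_shards)]

    def shard_for(self, word):
        # crc32 gives the same value in every process, unlike the built-in
        # hash() for strings, which is randomized per interpreter run.
        return zlib.crc32(word.encode("utf-8")) % len(self.shards)

    def increment(self, word, delta=1):
        self.shards[self.shard_for(word)][word] += delta

    def get(self, word):
        return self.shards[self.shard_for(word)][word]

    def total(self):
        """Merge all shards, e.g. to answer a 'hottest topics' query."""
        merged = Counter()
        for shard in self.shards:
            merged.update(shard)
        return merged


counters = ShardedCounters(num_shards=4)
for word in ["storm", "hbase", "storm", "puma"]:
    counters.increment(word)
print(counters.get("storm"), counters.total().most_common(1))
```

Since a word always maps to the same shard, no two shards ever hold a counter for the same word, and adding shards spreads the write load. Note that changing `num_shards` remaps words, so a real system would need a rebalancing strategy that this sketch does not cover.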

Page 31: Big data and noSQL in real time


Computing Reach with Event Streams

Page 32: Big data and noSQL in real time

Twitter Storm


Page 33: Big data and noSQL in real time


Twitter Storm

Page 34: Big data and noSQL in real time


Storm Overview

Page 35: Big data and noSQL in real time


Storm Cluster

Page 36: Big data and noSQL in real time


Streaming word count with Storm

Page 37: Big data and noSQL in real time


Storm Limitation

Spouts, Bolt, Topologies

Storage, Data Persistency, Querying

Page 38: Big data and noSQL in real time


Winner is… Storm & in-memory data grids

• Event driven / flow
• Reliable
• Storage
• Data Persistency
• Querying

Page 40: Big data and noSQL in real time

[email protected]


Q&A