JOSA TechTalks - Real-Time and Big Data

REAL-TIME AND BIG DATA Mahmoud M. Jalajel

Upload: jordan-open-source-association

Post on 17-Jul-2015


TRANSCRIPT

Page 1: JOSA TechTalks - Real-Time and Big Data

REAL-TIME AND BIG DATA

Mahmoud M. Jalajel

Page 2: JOSA TechTalks - Real-Time and Big Data

OUTLINE

• Intro: Real-time with Big Data

• The Lambda Architecture

• The Relay Model

Page 3: JOSA TechTalks - Real-Time and Big Data

WHY SOLVE FOR REAL-TIME

• Real-time offers more business value

• Live Web Analytics

• Recommendations

• Real-time = (semi-) real-time

• Event to index ~ single digit minutes

• Query duration ~ single digit seconds

Page 4: JOSA TechTalks - Real-Time and Big Data

REAL-TIME IMPLEMENTATION

• Incremental Implementation

• Stream processing / No full data context

• A real-time implementation is:

• Far more useful

• Faster

• Easily adaptable to batch mode
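As a minimal sketch of the points above, an incremental computation keeps only a small running state and never needs the full data context. The `RunningMean` class and `batch_mean` helper are hypothetical names chosen for illustration; they are not from the talk.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningMean:
    """Incremental mean: keeps a count and a total, never the full data."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        # Process one event at a time; earlier events are not needed.
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(values: Iterable[float]) -> float:
    """Batch mode: the same incremental code fed a finite collection."""
    state = RunningMean()
    for value in values:
        state.update(value)
    return state.mean
```

The batch function reuses the streaming update unchanged, which is one way to read "easily adaptable to batch mode".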

Page 5: JOSA TechTalks - Real-Time and Big Data

REAL-TIME IN HADOOP

MongoDB Query Time (optimized, single-node) | Hive Query Time (5 nodes)

2 Seconds | 15 Minutes

Hangs, crashes, starts begging for mercy then commits suicide and weepingly dies | A few hours

Page 6: JOSA TechTalks - Real-Time and Big Data

LAMBDA ARCHITECTURE

Created by: Nathan Marz

lambda-architecture.net

Page 7: JOSA TechTalks - Real-Time and Big Data

LAMBDA ARCHITECTURE

Page 8: JOSA TechTalks - Real-Time and Big Data

BASIC ASSUMPTIONS

1. Query = Function(All Data)

2. Data are immutable timely facts

3. Append-Only (CRUD becomes CR)

4. Human Fault-Tolerance
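The four assumptions can be illustrated with a small sketch. The `Fact` record, the `current_city` query, and the sample data are hypothetical and chosen only for illustration: facts are immutable and timestamped, the dataset is only ever appended to, and a query is a pure function over all the data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a recorded fact cannot be modified
class Fact:
    timestamp: int
    user: str
    city: str


def current_city(all_data: List[Fact], user: str) -> Optional[str]:
    """Query = Function(All Data): derive the answer from every fact."""
    facts = [f for f in all_data if f.user == user]
    if not facts:
        return None
    return max(facts, key=lambda f: f.timestamp).city


# Append-only master dataset: a move is a new fact, not an update.
master: List[Fact] = []
master.append(Fact(timestamp=1, user="alice", city="Amman"))
master.append(Fact(timestamp=2, user="alice", city="Irbid"))
```

Because nothing is overwritten, a faulty query or a buggy derived view can be fixed and recomputed from the untouched facts, which is one reading of human fault-tolerance.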

Page 9: JOSA TechTalks - Real-Time and Big Data

THE BATCH LAYER

• Accepts stream of data

• Appends to master dataset

• Uses: HDFS

Page 10: JOSA TechTalks - Real-Time and Big Data

THE SERVING LAYER

• Precomputes different views

• Works on full dataset

• Refreshes regularly offline

• Batch views are usually stored in a key-value store
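A rough sketch of precomputing a batch view, assuming a page-view counting use case. The event shape, the `build_batch_view` name, and the use of a plain dict as a stand-in for a key-value store are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (page URL, timestamp).
Event = Tuple[str, int]


def build_batch_view(master_dataset: Iterable[Event]) -> Dict[str, int]:
    """Recompute the view from scratch over the full dataset."""
    return dict(Counter(url for url, _ in master_dataset))


master_dataset = [("/home", 1), ("/about", 2), ("/home", 3)]

# A dict stands in for the key-value store that serves the view.
batch_view = build_batch_view(master_dataset)
```

Each offline refresh would rebuild `batch_view` from the whole master dataset and swap it in, so queries only ever do key lookups.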

Page 11: JOSA TechTalks - Real-Time and Big Data
Page 12: JOSA TechTalks - Real-Time and Big Data
Page 13: JOSA TechTalks - Real-Time and Big Data

CHECKPOINT

• Typical Hadoop Setup

• Slow, inefficient

• Outdated, usually lagging by hours or days

• Although accurate for surveyed data

• Costly to re-run. Real-time is not an option

Page 14: JOSA TechTalks - Real-Time and Big Data

THE SPEED LAYER

• Works with recent data

• Complements results

• Incremental implementation
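One possible shape for a speed-layer view is sketched below, again assuming page-view counts. The `RealtimeView` class, its time bucketing, and the `expire_before` method are hypothetical design choices for illustration, not something the slides specify.

```python
from collections import defaultdict
from typing import DefaultDict, Dict


class RealtimeView:
    """Incrementally updated counts covering recent data only."""

    def __init__(self, bucket_seconds: int = 3600) -> None:
        self.bucket_seconds = bucket_seconds
        # bucket index -> {url: count}
        self.buckets: DefaultDict[int, Dict[str, int]] = defaultdict(dict)

    def record(self, url: str, timestamp: int) -> None:
        # Incremental update: touch one counter per incoming event.
        counts = self.buckets[timestamp // self.bucket_seconds]
        counts[url] = counts.get(url, 0) + 1

    def count(self, url: str) -> int:
        return sum(c.get(url, 0) for c in self.buckets.values())

    def expire_before(self, timestamp: int) -> None:
        # Drop whole buckets that end before the given time, for example
        # once a finished batch run already covers them.
        cutoff = timestamp // self.bucket_seconds
        for bucket in [b for b in self.buckets if b < cutoff]:
            del self.buckets[bucket]
```

Bucketing by time keeps expiry cheap: whole buckets are dropped once the batch views have caught up, so the speed layer stays small.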

Page 15: JOSA TechTalks - Real-Time and Big Data
Page 16: JOSA TechTalks - Real-Time and Big Data

THE FULL PICTURE

Query Merging
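Query merging can be sketched as combining a batch view with a real-time view at read time. The sketch assumes an additive metric (counts) and that the two views cover disjoint time ranges; `merged_count` and the sample views are hypothetical.

```python
from typing import Mapping


def merged_count(
    batch_view: Mapping[str, int],
    realtime_view: Mapping[str, int],
    key: str,
) -> int:
    """Answer a query by adding the batch result and the recent delta."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Batch view: everything up to the last batch run.
batch_view = {"/home": 120, "/about": 30}
# Real-time view: only events that arrived after that run.
realtime_view = {"/home": 4, "/pricing": 1}
```

Non-additive metrics such as unique visitors need a different merge step than plain addition.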

Page 17: JOSA TechTalks - Real-Time and Big Data

EXAMPLE TECHNOLOGIES

Page 18: JOSA TechTalks - Real-Time and Big Data

DRUID EXAMPLE

Page 19: JOSA TechTalks - Real-Time and Big Data

REVIEWING LA

PROs

• Modular

• Flexible

• Self-Auditing

• Proven components

CONs

• Complex

• Maintainability

• Query Merging

Page 20: JOSA TechTalks - Real-Time and Big Data

THE RELAY MODEL

Page 21: JOSA TechTalks - Real-Time and Big Data

RELAY MODEL

Query Merging

Page 22: JOSA TechTalks - Real-Time and Big Data

THE WORKFLOW

Page 23: JOSA TechTalks - Real-Time and Big Data

REVIEWING RM

PROs

• Coherent, Simpler than LA

• Extensible to full LA

• Cheaper

CONs

• Master Data Storage

• Query flexibility

Page 24: JOSA TechTalks - Real-Time and Big Data

WHY NOT HADOOP NOW?

• Too much time, no capacity

• Too soon or too late

• Too expensive

• Hammer/nail problem

Page 25: JOSA TechTalks - Real-Time and Big Data

CONCLUSIONS

• Think big data, now!

• No need to invest years of development to perfect a big data system.

• Start now! Gradually grow system requirements and engineering skill set

• Select scalable components

Page 26: JOSA TechTalks - Real-Time and Big Data

Mahmoud Jalajel – @mjalajel

Questions?

Page 27: JOSA TechTalks - Real-Time and Big Data

APPENDIX

Page 28: JOSA TechTalks - Real-Time and Big Data

Apache Kafka

Page 29: JOSA TechTalks - Real-Time and Big Data

Apache Storm

Page 30: JOSA TechTalks - Real-Time and Big Data

Apache Storm with external systems