ashish thusoo evolution of big data architectures

Evolution of Big Data Architectures Architecture Summit, Aug 2012 Ashish Thusoo

Upload: drewz-lin

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Ashish thusoo   evolution of big data architectures

Evolution of Big Data Architectures

Architecture Summit, Aug 2012

Ashish Thusoo

Page 2: Ashish thusoo   evolution of big data architectures

Outline

Demand for Big Data

Architectural Trade Offs and Evolution

Where next?

Page 3: Ashish thusoo   evolution of big data architectures

The Changing Planet

3 Technology Drivers

Devices

Infrastructure

Applications

Page 4: Ashish thusoo   evolution of big data architectures

Evolution: Devices

Page 5: Ashish thusoo   evolution of big data architectures

Evolution: Devices

Key Capabilities

Connected

Location Aware

Sensory & Powerful

Page 6: Ashish thusoo   evolution of big data architectures

Evolution: Devices

Page 7: Ashish thusoo   evolution of big data architectures

Evolution: Connectivity

Mobile Subscription Density 2004

Page 8: Ashish thusoo   evolution of big data architectures

Evolution: Connectivity

Mobile Subscription Density 2010

Page 9: Ashish thusoo   evolution of big data architectures

Evolution: Bandwidth

Page 10: Ashish thusoo   evolution of big data architectures

Evolution: Applications

Salient Traits

Cloud based

Web scale

Page 11: Ashish thusoo   evolution of big data architectures

Explosion in Data

Big Data

Volume

Velocity

Variety

Page 12: Ashish thusoo   evolution of big data architectures

Big Data: Volume

Volume:

2011: digital universe of 1.8 zettabytes

2009 - 2020: projected growth to 35 zettabytes

Page 13: Ashish thusoo   evolution of big data architectures

Big Data: Velocity

Velocity

340 million tweets per day

72 hours of video uploaded every minute on YouTube

2.9 million emails a second

Page 14: Ashish thusoo   evolution of big data architectures

Big Data: Variety

Variety

Video

Pictures

Application Logs

etc. etc...

Page 15: Ashish thusoo   evolution of big data architectures

Disruptive Architectures

Page 16: Ashish thusoo   evolution of big data architectures

Disruptions in Data Arch

Change in Focus (1990s -> 2000s)

Performance -> Scalability & Availability

Rigid/Structured -> Flexible/Semistructured

Page 17: Ashish thusoo   evolution of big data architectures

Scalability & Availability

Page 18: Ashish thusoo   evolution of big data architectures

Towards Scalability

Problem

10K ops/sec -> 1M ops/sec

TB of data -> PB of data

Page 19: Ashish thusoo   evolution of big data architectures

Towards Scalability

Solution: SHARDING (Divide and Conquer)

[Figure: binary records divided into shards]

Page 20: Ashish thusoo   evolution of big data architectures

Towards Scalability

How do we quickly route a record to a shard?

[Figure: a function fn( ) routes each binary record to a shard]

fn( )

- Consistent Hashing

- Mapping Table
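The consistent hashing option for fn( ) can be sketched roughly as below. This is a minimal illustration, not the routing code of any particular system: the class name `ConsistentHashRing`, the helper `stable_hash`, the shard names and the choice of 64 ring points per shard are all assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a large integer that is the same on every run."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Route record keys to shards by walking clockwise on a hash ring."""

    def __init__(self, shards, points_per_shard=64):
        # Each shard gets several points on the ring to even out the load.
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(points_per_shard)
        )
        self._positions = [pos for pos, _ in self._ring]

    def shard_for(self, record_key: str) -> str:
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._positions, stable_hash(record_key))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
bigger_ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2", "shard-3"])

# Hypothetical record keys; count how many change shard when one shard is added.
keys = [f"user:{n}" for n in range(1000)]
moved = sum(ring.shard_for(k) != bigger_ring.shard_for(k) for k in keys)
```

Adding a shard only inserts new points on the ring, so a key either keeps its shard or moves to the new one. A plain `hash(key) % num_shards` function would instead reshuffle most keys whenever the shard count changes.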

Page 21: Ashish thusoo   evolution of big data architectures

Towards Scalability

What happens if part of the record is in one shard and part in another?


Page 22: Ashish thusoo   evolution of big data architectures

Towards Scalability

Keep it Simple: Application deals with atomicity & consistency semantics


Page 23: Ashish thusoo   evolution of big data architectures

Towards Availability

What if my shard is down? Where do I put my record?

[Figure: shards of binary records, with one shard down (marked X?)]

Page 24: Ashish thusoo   evolution of big data architectures

Towards Availability

Let's just replicate the shards and pray that one is available :)

[Figure: replicated shards of binary records, with one replica down (marked X)]

Page 25: Ashish thusoo   evolution of big data architectures

Towards Availability

Replication strategies

What should be the number of replicas?

How to rebuild a replica?

How to propagate a record to a replica?
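One possible answer to the three questions above is sketched here as a toy in-memory store. Everything in it is an assumption for illustration: the class `ReplicatedStore`, two replicas per key, placement on consecutive shards, writing to every live replica, and rebuilding a recovered shard by copying from its peers. Real systems make different choices on each point.

```python
class ReplicatedStore:
    """Toy store: every key lives on `replicas` consecutive shards."""

    def __init__(self, num_shards=4, replicas=2):
        self.shards = [dict() for _ in range(num_shards)]
        self.down = set()          # indexes of shards that are unavailable
        self.replicas = replicas

    def _replica_ids(self, key):
        # Hypothetical placement rule: a primary shard plus the next ones.
        first = sum(key.encode("utf-8")) % len(self.shards)
        return [(first + i) % len(self.shards) for i in range(self.replicas)]

    def put(self, key, value):
        live = [s for s in self._replica_ids(key) if s not in self.down]
        if not live:
            raise RuntimeError("no replica available for " + key)
        for s in live:             # propagate the record to every live replica
            self.shards[s][key] = value

    def get(self, key):
        for s in self._replica_ids(key):
            if s not in self.down and key in self.shards[s]:
                return self.shards[s][key]
        raise KeyError(key)

    def rebuild(self, shard_id):
        # Refill a recovered shard by copying the keys it should hold from peers.
        self.down.discard(shard_id)
        for peer_id, peer in enumerate(self.shards):
            if peer_id == shard_id or peer_id in self.down:
                continue
            for key, value in peer.items():
                if shard_id in self._replica_ids(key):
                    self.shards[shard_id][key] = value


store = ReplicatedStore()
store.put("user:1", "alice")
primary = store._replica_ids("user:1")[0]
store.down.add(primary)                      # the primary shard fails
value_during_outage = store.get("user:1")    # served by the other replica
store.put("user:1", "bob")                   # write lands on the live replica only
store.rebuild(primary)                       # recovered shard catches up from its peer
```

The sketch ignores concurrent writers and conflicting versions, which is where most of the real difficulty in replication lies.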

Page 26: Ashish thusoo   evolution of big data architectures

1990s vs 2000s

Different Focus:

1990s (Raw Performance)

Optimal I/O structures

Cache Sensitive Algorithms

2000s (Scalability, Availability)

Sharding

Replication

Page 27: Ashish thusoo   evolution of big data architectures

Flexibility/Semi-structure

Page 28: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Problem

Does structure in a database make it slower to write applications (sprint vs waterfall model)?

What if my data is not records and tables?

Page 29: Ashish thusoo   evolution of big data architectures

Towards Flexibility

How does knowing my record structure help my data system?

Helps to optimize execution plans

Helps to optimize my storage layouts

Trade off?

Application change means database schema change, rebuilding indexes etc. etc.

Page 30: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Most of my operations are simple lookups, range lookups and updates

Since the execution is simple we don’t need all the structure

Keep enough structure to support fast gets and puts

Page 31: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Solution: Key-Value Stores (NoSQL)

[Figure: table with KEY and VALUE columns]

- Sorted HashMaps

- Sorted Files
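A key-value store that keeps just enough structure for gets, puts and range lookups might look like the following in-memory sketch. The class `SortedKeyValueStore` and the sample keys are hypothetical; the sorted key list stands in for the sorted hashmaps and sorted files named above.

```python
import bisect


class SortedKeyValueStore:
    """Keys are kept sorted so a range lookup is a binary search plus a slice."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> opaque value; the store never inspects it

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


kv = SortedKeyValueStore()
kv.put("user:003", {"name": "carol"})
kv.put("user:001", {"name": "alice"})
kv.put("user:002", b"any bytes at all")   # values need not share a schema
in_range = kv.range("user:001", "user:003")
```

The store knows nothing about what a value contains, so the application can change its record layout without a schema change or an index rebuild.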

Page 32: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Need to update related “values” of a key (Some Atomicity)

[Figure: table with KEY and VALUE columns]

Page 33: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Need to update related “values” of a key (Some Atomicity)

[Figure: table with KEY, TAG and VALUE columns]

TAG = COLUMN FAMILY
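The idea of tagging the values of a key, and updating several of them together, can be sketched as follows. The class `ColumnFamilyStore`, the single lock and the tag names are assumptions for illustration only; the point is that atomicity is limited to the values under one key.

```python
import threading


class ColumnFamilyStore:
    """Toy row store: key -> {tag (column family) -> value}."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, updates):
        """Apply all tag -> value updates for one key as a single step."""
        with self._lock:
            row = dict(self._rows.get(key, {}))   # copy, modify, then swap in
            row.update(updates)
            self._rows[key] = row

    def get(self, key, tag):
        with self._lock:
            return self._rows.get(key, {}).get(tag)


store = ColumnFamilyStore()
store.put("user:1", {"profile": "alice", "settings": "dark-mode"})
store.put("user:1", {"settings": "light-mode", "stats": "3 logins"})
```

A reader never sees one of the two updated tags without the other, but nothing here coordinates updates across different keys.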

Page 34: Ashish thusoo   evolution of big data architectures

Towards Flexibility

gets and puts are fine for online applications BUT..

What about Analytics?

Transformations can be really complicated...

Page 35: Ashish thusoo   evolution of big data architectures

Towards Flexibility

Is there a simple construct that can solve a number of analytics queries?

Of course: SORT

And it can be parallelized too
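As a small illustration of how far a sort goes, the sketch below answers a group-by, a top-N and a distinct query over the same sorted data. The `events` list and its (country, bytes served) layout are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical page-view log: (country, bytes_served)
events = [("us", 120), ("in", 80), ("us", 30), ("de", 50), ("in", 20)]

# Sorting brings equal keys together, so one pass can aggregate each group.
events_sorted = sorted(events, key=itemgetter(0))
bytes_per_country = {
    country: sum(size for _, size in group)
    for country, group in groupby(events_sorted, key=itemgetter(0))
}

# Sorting the aggregates answers a "top N" question.
top_two = sorted(bytes_per_country.items(), key=itemgetter(1), reverse=True)[:2]

# Walking the sorted keys once answers a "distinct" question.
distinct_countries = [c for c, _ in groupby(events_sorted, key=itemgetter(0))]
```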

Page 36: Ashish thusoo   evolution of big data architectures

Towards Flexibility

MAP/REDUCE (Scalable Parallel Pluggable SORT)

[Figure: binary input records flow through Mappers and then Reducers]

m{ }: user defined map function

r{ }: user defined reduce function
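The map, sort and reduce phases can be imitated in a single process to show where the user defined functions plug in. This is a sketch and not the Hadoop API: `map_reduce`, `map_fn`, `reduce_fn`, the byte-sum partitioner and the word-count job are all assumptions chosen for brevity.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """User defined map function: emit (word, 1) for every word."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User defined reduce function: combine all values seen for one key."""
    return word, sum(counts)


def map_reduce(records, m, r, num_reducers=2):
    # Map phase: records are independent, so real systems run this in parallel.
    partitions = defaultdict(list)
    for record in records:
        for key, value in m(record):
            # Partition by key so that one reducer sees every value for a key.
            partitions[sum(key.encode("utf-8")) % num_reducers].append((key, value))

    # Sort phase, then reduce phase, one partition per reducer.
    results = []
    for part in partitions.values():
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            results.append(r(key, [value for _, value in group]))
    return dict(results)


word_counts = map_reduce(["big data", "big clusters", "data data"], map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` change from job to job; the partition and sort steps in between stay fixed.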

Page 37: Ashish thusoo   evolution of big data architectures

Towards Flexibility

MAP/REDUCE and Failures

[Figure: Mappers and Reducers processing binary records, with one failure marked X]

Page 38: Ashish thusoo   evolution of big data architectures

1990s vs 2000s

Different Focus:

1990s (Raw Performance)

Structure important for speed optimizations

Stream everything through Query plan

2000s (Sprint mode of application development)

Support dev efficiency and data variety

Checkpointing for restartability
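Checkpointing for restartability can be illustrated with a small task runner that saves each task's output as soon as it finishes and retries only the task that failed. The function `run_with_checkpoints`, the in-memory `checkpoints` dict and the two sample tasks are hypothetical; a real system would persist the checkpoints to disk or a distributed file system.

```python
def run_with_checkpoints(tasks, checkpoints, max_attempts=3):
    """Run each task once, reusing saved output and retrying only failures."""
    for task_id, task in tasks.items():
        if task_id in checkpoints:
            continue                      # finished in an earlier run: skip it
        for attempt in range(1, max_attempts + 1):
            try:
                checkpoints[task_id] = task()   # save output as soon as it exists
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise                 # give up; earlier checkpoints survive
    return checkpoints


calls = {"map-0": 0, "map-1": 0}


def map_0():
    calls["map-0"] += 1
    return [("big", 1)]


def flaky_map_1():
    # Hypothetical task that fails on its first attempt and succeeds on retry.
    calls["map-1"] += 1
    if calls["map-1"] == 1:
        raise RuntimeError("worker lost")
    return [("data", 1)]


saved = run_with_checkpoints({"map-0": map_0, "map-1": flaky_map_1}, {})
```

The healthy task runs once and the failed one runs twice, in contrast to a streamed query plan where a failure usually means restarting the whole query.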

Page 39: Ashish thusoo   evolution of big data architectures

Where now?

Page 40: Ashish thusoo   evolution of big data architectures

The New Meets The Old

Disruption?

Well we still need SQL

We still need to make these work with other components

Guess what? Efficiency is also important at scale

Page 41: Ashish thusoo   evolution of big data architectures

Where Does New Fail?

Transactions?

Moving money from one account to another

Graphs?

Networks everywhere

How to do second order analysis on graphs?

Page 42: Ashish thusoo   evolution of big data architectures

Thank You!

Page 43: Ashish thusoo   evolution of big data architectures