

How to survive the Data Deluge: Petabyte scale Cloud Computing

Gianmarco De Francisci Morales

IMT Institute for Advanced Studies Lucca

CSE PhD XXIV Cycle

18 Jan 2010


Outline

• Part 1: Introduction

• What, Why and History

• Part 2: Technology overview

• Current systems and comparison

• Part 3: Research directions

• Ideas for future improvements


Part 1: Introduction


How would you sort...

• ... 1GB of data?

• ... 100GB of data?

• ... 10TB of data?

• Scale matters!

• Because More Isn't Just More, More Is Different
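
One way to see why scale changes the problem: once the data no longer fits in memory, sorting turns into an external merge sort (sort chunks, spill them to disk, merge the sorted runs). The following is a minimal single-machine sketch, assuming integer records and a hypothetical `chunk_size`; the function name is illustrative and not taken from the slides.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Sort an iterable too large for memory: sort chunks, spill, merge."""
    run_paths = []
    chunk = []

    def spill(sorted_chunk):
        # Write one sorted run to a temporary file, one integer per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in sorted_chunk)
        run_paths.append(path)

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(sorted(chunk))
            chunk = []
    if chunk:
        spill(sorted(chunk))

    def read_run(path):
        # Stream a run back from disk without loading it whole.
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge lazily merges the already-sorted runs.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)
```

At 10TB even this is not enough on one machine, because a single disk becomes the bottleneck; the runs have to be spread over many nodes.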


The Petabyte Age

What is scalability?

• The ability of a system to accept increased volume without impacting profits

• Scale-free systems

• Scale-up vs Scale-out

• Types of parallel architectures:

• Shared memory, Shared disk, Shared nothing


What if you need...

• ... to store and analyze 10TB of data per day?

• Parallel is a must, but not enough

• Usual approaches fail at this scale because of secondary effects

• Operational costs

• Faults
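
A rough back-of-the-envelope calculation shows why parallelism is a must at this volume. The sketch below assumes a sequential throughput of 100 MB/s per disk and perfect parallelism across disks; both are illustrative assumptions, not figures from the slides.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def hours_to_scan(data_bytes, disks, mb_per_s_per_disk=100):
    """Hours to read the data once, assuming perfect parallelism across disks."""
    throughput = disks * mb_per_s_per_disk * MB  # bytes per second
    return data_bytes / throughput / 3600


for disks in (1, 10, 100):
    print(disks, "disk(s):", round(hours_to_scan(10 * TB, disks), 1), "hours")
```

Under these assumptions a single disk needs more than a day just to read one day's data once, so it can never catch up.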


What is fault tolerance?

• System operates properly in spite of the failure of some of its components

• High Availability

• Real world need

• Software has bugs

• Hardware fails
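
Why failures become the norm at scale can be illustrated with a small calculation. The sketch assumes independent node failures and a hypothetical per-node failure probability of 0.1% per day; the numbers are made up for illustration.

```python
def p_any_failure(nodes, p_node):
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1 - (1 - p_node) ** nodes


# Hypothetical failure rate: 0.1% per node per day.
for nodes in (1, 100, 1000):
    print(nodes, "node(s):", round(p_any_failure(nodes, 0.001), 3))
```

With these assumed numbers, a 1000-node cluster sees at least one failure on most days, so the system has to keep working through failures rather than treat them as exceptional.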


Why data?

• The world is drowning in data: Data Deluge

• Data sources:

• Web 2.0 (user generated content)

• Scientific experiments

• Physics (particle accelerators)

• Astronomy (satellite images)

• Biology (genomic maps)

• Can you think of others?


“Data is not information, information is not knowledge, knowledge is not wisdom.” Clifford Stoll


DBMS evolution

• ‘60s CODASYL

• ‘70s Relational DBMS

• ‘80s Object-Oriented DBMS (back to navigational access)

• ‘80s & ‘90s Parallel DBMS

• Not much has happened since the ‘70s

• The fundamental model and the code lines are still the same


DBMS yesterday

• Business transaction processing (OLTP)

• Relational model

• SQL


DBMS today

• Different markets (OLTP, OLAP, Stream, etc.)

• Stored Procedures & User Defined Functions

• Parallel DBMS (Teradata, Vertica, etc.)

• Not enough flexibility

• Limited fault-tolerance and scalability


Why cloud?

• Parallel computing is dead

• Amdahl’s law: SpUp(N) = 1 / ((1-Pa)+Pa/N)

• Long live parallel computing

• Gustafson’s law: SpUp(N) = PG*N + (1-PG)

• Physical limits

• Manycore

• Money
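
The two laws quoted above can be compared numerically. In this sketch `p` stands for the parallel fraction (Pa in Amdahl's formula, PG in Gustafson's) and `n` for the number of processors; the function names and the sample values are illustrative.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: the serial part (1 - p) bounds the speedup."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Problem size grows with n: the parallel part scales linearly."""
    return p * n + (1.0 - p)


for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2), round(gustafson_speedup(0.95, n), 2))
```

With a 95% parallel fraction, Amdahl's speedup can never exceed 20 regardless of `n`, while Gustafson's keeps growing because the workload grows with the machine. This is the sense in which parallel computing is both "dead" and alive.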


Parallel computing evolution

• Parallel (single)

• Cluster (intra-site)

• Grid (inter-site)

• Cloud (scale-free)

• What’s next?


Parallel computing yesterday

• CPU bound problems

• Tightly coupled

• Use of MPI or PVM

• Move data among computing nodes

• Use of NAS/SAN

• Expensive and does not scale (shared disk)


Parallel computing today

• I/O bound problems (often)

• Move computing near data

• Focus on scalability and fault tolerance

• Simple!

• Shared-nothing architecture on commodity hardware

• Data streaming


Wrap-up

• Main motivations

• Scalability

• Money

• Focus on BIG data

• BIG = need to stop & think because of its size

• Common issues with PDBMS (load balancing, data skew)


Part 2: Technology overview


What is Cloud Computing?

• Did anyone notice I skipped the definition?

• Buzzword!

• IaaS (EC2, S3)

• PaaS (App Engine, Azure Services Platform)

• SaaS (Salesforce, OnLive, virtually any Web App)

• Scale free computing architecture


Who is involved?


Software stacks

• High Level Languages: Sawzall (Google); Pig Latin (Yahoo); DryadLINQ, Scope (Microsoft); Hive, Cascading (Others)

• Computation: MapReduce (Google); Hadoop (Yahoo); Dryad (Microsoft)

• Data Abstraction: BigTable (Google); HBase, PNUTS (Yahoo); Cassandra, Voldemort (Others)

• Distributed Data: GFS (Google); HDFS (Yahoo); Cosmos (Microsoft); CloudStore, Dynamo (Others)

• Coordination: Chubby (Google); Zookeeper (Yahoo)


Comparison with PDBMS

• CAP Theorem

• BASE vs ACID

• Computing on large data vs Handling large data

• OLAP vs OLTP

• User Defined Functions vs Select-Project-Join

• Nested vs Flat data model


Comparison with PDBMS

• Start small (no upfront schema, flexible, agile)

• Grow big (optimize common patterns)

• "MapReduce: A major step backwards" (DeWitt, Stonebraker)

• "If the only tool you have is a hammer, you tend to see every problem as a nail" Abraham Maslow

• SQL and Relational Model are not the answer


Wrap-up

• A lot of hype

• But also activity

• Industry is leading the trend, has cutting edge software

• Different approaches

• Most focus on MapReduce

• Shift toward higher level abstractions
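
For readers who have not seen the programming model, MapReduce can be simulated in a few lines on a single machine. This is a toy sketch of the map, shuffle and reduce steps using word count; `run_mapreduce`, `map_fn` and `reduce_fn` are illustrative names, not the API of Hadoop or of any other system listed above.

```python
from collections import defaultdict


def map_fn(doc):
    # Emit one (word, 1) pair per word occurrence.
    for word in doc.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    # Sum all counts emitted for the same word.
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group every mapped value by its key.
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce: one call per distinct key, in key order.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


print(run_mapreduce(["a b a", "B c"], map_fn, reduce_fn))
```

A real system runs the map and reduce calls on many nodes and handles partitioning, scheduling and faults; higher level languages hide even these two functions behind declarative operators.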


Wrap-up

• NoSQL movement

• No Relational Model

• No ACID

• No Join


Part 3: Research Directions


• Extensions

• Models

• High velocity analytics

• Hybrid systems

• Optimizations


Extensions

• Map-Reduce-Merge: simplified relational data processing on large clusters. H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD 2007.

• Goal: implement relational operators efficiently

• How: new final phase that merges 2 key-value lists

• Issues: very low level and hard to use; needs integration into a high level language
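
The idea of a final phase that merges two key-value lists can be illustrated with a sort-merge equi-join. This is only a sketch of the general technique, assuming both lists are already sorted by key; `merge_join` is a hypothetical helper, not the interface defined in the paper.

```python
def merge_join(left, right):
    """Sort-merge equi-join of two key-sorted lists of (key, value) pairs."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

For example, `left` could be the reduced output of one dataset and `right` that of another, both keyed by the join attribute.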


Models

• A new computation model for rack-based computing. F. Afrati and J. Ullman. Unpublished.

• Goal: I/O cost characterization

• Issues: only theoretical analysis; no existing reference system

• Future: best algorithms for the model; model adaptation to real systems


Models

• A model of computation for MapReduce. H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010.

• Goal: theoretical computability characterization of MapReduce algorithms

• Result: algorithmic design technique for MapReduce

• Future: develop algorithms in this class; find relationships with other classes


High velocity analytics

• Interactive analysis of web-scale data. C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009.

• Goal: speed up general queries for big data

• How: pre-computed templates to fill at run-time

• Future: which templates are useful for interactive analysis? How to help the user formulate templates (sampling?)


High velocity analytics

• MapReduce online. T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009.

• Goal: speed up turnaround of MapReduce jobs

• How: operator pipelining, online aggregation

• Issues: limited inter-job pipelining (data only); inter-job aggregation problematic (scratch data)
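
Online aggregation in general means reporting a running estimate while the input is still being consumed, instead of waiting for the job to finish. The sketch below illustrates that idea for a mean; it is a generic illustration with hypothetical names, not the mechanism implemented in the report.

```python
def online_mean(stream, report_every):
    """Yield (records_seen, running_mean) snapshots while consuming the stream."""
    total = 0.0
    count = 0
    for x in stream:
        total += x
        count += 1
        if count % report_every == 0:
            # Early estimate based on the records seen so far.
            yield count, total / count
    if count and count % report_every != 0:
        # Final snapshot: the exact answer over the whole input.
        yield count, total / count


for seen, estimate in online_mean([2, 4, 6, 8, 10], report_every=2):
    print(seen, estimate)
```

The early snapshots are only meaningful as estimates if records arrive in an order that is not correlated with their values.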


Hybrid systems

• HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009.

• Goal: advantages of both DB and MapReduce

• How: integrate a DBMS (PostgreSQL) in Hadoop, Hive as interface

• Issues: it is better to reuse principles than technology


Optimizations

• The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. J. Lin. In LSDS-IR, 2009.

• Goal: data distribution effects on MapReduce; parallel query/pairwise similarity as case study

• How: balance input data (split long posting lists)

• Issues: very specific for the problem/algorithm
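
The balancing idea can be sketched as follows: cap the size of any single unit of work by cutting long posting lists into bounded pieces, so that one very frequent term no longer creates a straggler. This is a simplified illustration with hypothetical names; how partial results for the same term are recombined depends on the algorithm and is not shown.

```python
def split_posting_lists(index, max_len):
    """Split each posting list into pieces of at most max_len entries."""
    tasks = []
    for term, postings in index.items():
        for start in range(0, len(postings), max_len):
            tasks.append((term, postings[start:start + max_len]))
    return tasks


# Toy index: one very frequent term and one rare term.
index = {"the": list(range(10)), "zipf": [1, 2]}
for term, piece in split_posting_lists(index, max_len=4):
    print(term, piece)
```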


Other ideas

• Sampling and result estimation

• A good enough result is often acceptable

• Semantic clues

• Leverage properties of M/R functions (associativity, commutativity)

• Properties of the input may speed up the computation
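
As an example of leveraging function properties: when the reduce operation is associative and commutative, each partition can be reduced locally first and only the partial results need to be combined. The sketch below assumes at least one non-empty partition; `combine_then_reduce` is an illustrative name.

```python
from functools import reduce
import operator


def combine_then_reduce(partitions, op):
    """Reduce each partition locally, then reduce the partial results."""
    # Valid only if `op` is associative and commutative.
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


partitions = [[1, 2, 3], [4, 5], [6]]
print(combine_then_reduce(partitions, operator.add))
print(combine_then_reduce(partitions, max))
```

Local pre-aggregation moves far less data between nodes than shipping every record to a single reducer. An operation such as subtraction does not qualify, since its result depends on how the records are grouped.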


Wrap-up

• New and active field

• Many opportunities for research

• Crossroads of Distributed Systems and Databases

• Answer the plea not to "reinvent the wheel"


How to survive the Data Deluge: Petabyte scale Cloud Computing

• Integrate DB principles into Cloud systems

• Enable interactive and approximate analytics

• Evolve beyond the MapReduce paradigm


Questions?
