data management research forecast: hot with an increasing chance of clouds

17
Data Management Research Forecast: Hot With an Increasing Chance of Clouds.. . Michael Carey Information Systems Group CS Department UC Irvine

Upload: deiondre

Post on 23-Feb-2016

27 views

Category:

Documents


0 download

DESCRIPTION

Data Management Research Forecast: Hot With an Increasing Chance of Clouds.. . Michael Carey Information Systems Group CS Department UC Irvine. Cloud-Related DB Bandwagons. MapReduce and Hadoop Parallel programming for dummies But now Hive, Pig, Scope, Jaql , … - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

Data Management Research Forecast:Hot With an Increasing

Chance of Clouds...Michael Carey

Information Systems GroupCS Department

UC Irvine

Page 2: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

2

Cloud-Related DB Bandwagons• MapReduce and Hadoop– Parallel programming for dummies– But now Hive, Pig, Scope, Jaql, … MapReduce is the new runtime

• DFSs and HDFS (and CSS)– Scalable, self-managed, Really Big Files– But now BigTable, HBase, … HDFS (or CSS) is the new file storage

• Key-value stores– Charter members of the “NoSQL movement”– Includes S3, Dynamo, HBase, Cassandra, … Key-value stores are the new record managers

Page 3: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

Perhaps Our Brains Are Clouded!

• Some of the possible symptoms include…– A number of us are focused on tweaking Hadoop– Most of us are compiling our declarative data analysis

languages into Hadoop jobs (map/reduce/combine)– Some of us are building “cloud database systems” as a layer

on top of “givens” such as HDFS– Some other possible signs of “cloud dementia” include…• We keep forgetting where we put our schemas• We don’t remember working on parallel DBs in the 80’s• We’re about to learn some hard lessons all over again…

Thou Shalt Worship DFS& BigTable

Thou Shalt MapReduce!

3

Page 4: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

4

DB-ers: Let’s Do This Stuff “Right”!• In my opinion:– Yes, the OS/DS folks out-scaled us (oops!)– But, we’d be “nuts” to simply build on their

current (i.e., first-generation) foundation• Let’s bring our DB experience to the party…!– Recall important past architectural lessons• I’ll give some quick examples of this…

– Plus, it can be “windy” in the stratosphere• I’ll give one quick example of this too…

Page 5: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

5

Current Data Layers

HDFS

HBase

Cloud DBMS?

Ordered byte streamsAppend onlyRandomly partitioned

Sets of records/columnsKey-based retrievalRange partitioned

Some data model?Some query language?

Page 6: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

6

Page 7: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

7

Page 8: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

8

!!

Page 9: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

9

Let’s Do This Stuff “Right”! (cont.)

• Identify the lessons, then do it “right”– Open-source S/W on commodity H/W– Non-monolithic software components– Equal opportunity data access (external sources)– Fault-tolerant query execution (for data analysis)– Little pre-planning or DBA-type work required– Tolerant of flexible / nested / absent schemas– Types and declarative languages (duh…! )

Page 10: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

10

What If We’d Meant To Do This?• What is the “right” basis for analyzing and

managing the data of the future?– Storage and data distribution layers?– Query runtime layer (and division of labor)?– ….

• Let’s explore how to build new information management systems to live in the cloud that…– Scale to thousands of nodes (and beyond)– Make effective use of shared (and elastic) resources– Execute queries in the face of partial failures– Seamlessly support external data access– Don’t require five-star wizard administrators– ….

Page 11: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

11

In Short…

• Ask not what cloud software can do for you, but what you can do for cloud software…!– We are, after all, the data experts!– We just have to remember our roots and bring our experiences and

past lessons to bear on cloud issues• We’re asking this question at UCI (and UCSD/UCR):– ASTERIX: A new data platform to ingest, store, manage, index, query,

analyze, and publish vast quantities of semi-structured information– Hyracks: A new foundation for scalable data analysis – both a

runtime system for ASTERIX and an extensible parallel data analysis platform in its own right

– Open-source sharing planned for both (eventually)

Page 12: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

12

• Semistructured data management– Core work exists– XML & XQuery, JSON, …– Time to parallelize and scale out

• Parallel database systems– Research quiesced in mid-1990’s– Renewed industrial interest– Time to scale up and de-schema-tize

• Data-intensive computing– MapReduce and Hadoop quite popular– Language efforts even more popular (Pig, Hive, Jaql, …)– Ripe for parallel DB ideas (e.g., for query processing) and support

for stored, indexed data sets

The ASTERIX ProjectSemistructured

Data Management

Parallel Database Systems

Data-Intensive

Computing

Page 13: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

13

ASTERIX Project Overview

Disk

MainMemory

Disk

CPU(s)

ADMData

MainMemory

Disk

CPU(s)

ADMData

ADMData

Hi-Speed Interconnect

Data loads & feeds from external sources (XML,

JSON, …)

AQL queries & scripting

requests and programs

Data publishing to

external sources and

appsASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semistructured information…

(ADM = ASTERIX

Data Model;AQL =

ASTERIX Query

Language)

MainMemory

CPU(s)

Page 14: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

14

ASTERIX User Model• Data model (ADM) with

open and closed types• Query language (AQL)

for nested and semi-structured querying

• Support for both stored and external datasets

for $user in dataset(‘User’)where some $i in $user.interests satisfies $i =“movies”return {“name”: $user.name };

Page 15: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

15

Hyracks Runtime• Partitioned-parallel platform for data-intensive computing• Job = dataflow DAG of operators and connectors

– Operators consume/produce partitions of data– Connectors repartition/route data between operators

• Hyracks vs. the “competition”– Based on time-tested parallel database principles– vs. Hadoop: More flexible model, less “pessimistic”implementation– vs. Dryad: Supports data as a first-class citizen

Page 16: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

16

ASTERIX Research Issue Sampler• Semistructured data modeling

– Open/closed types, type evolution, relationships, ….– Efficient physical storage scheme(s)

• Scalable storage and indexing– Self-managing scalable partitioned datasets– Ditto for indexes (hash, range, spatial, fuzzy; combos)

• Large scale parallel query processing– Division of labor between compiler and runtime– Query plan decision-making timing and basis– Model-independent complex object algebra (AQUA)– Fuzzy matching as well as exact-match queries

• Multiuser workload management (scheduling)– Uniformly cited: Facebook, Yahoo!, eBay, Teradata, ….

Page 17: Data Management Research Forecast: Hot With an Increasing Chance of Clouds

17

Closing Thoughts• IMO, this is the most exciting time to be a data

management researcher in the last 20-30 years!– Data coming out our ears (and other openings) thanks to

WWW, Twitter, FaceBook, ... (so-called “data exhaust”)– Data models and query languages totally up for grabs (witness

the “NoSQL movement”)– System layers and architectures also up for grabs ()

• Many interesting challenges to Cloud our thinking– Opportunities for DB, OS, and PL research to converge– Important to learn from the past as we plan for the future– Resource management at scale is one of the biggies– Consistency models / programming models are another