Post on 04-Oct-2018

  • 1

    Tackling The Challenges of Big Data Big Data Storage

    Samuel Madden Professor and Director of Big Data at CSAIL

    Massachusetts Institute of Technology

    Tackling The Challenges of Big Data Big Data Storage NoSQL, NewSQL

    Introduction

    2014 Massachusetts Institute of Technology

    What Does a Traditional Database Provide?

    Record-oriented persistent storage
      e.g., bank accounts, shopping carts, employee records
    Carefully structured data (schemas)
      account: (customer, balance, interest_rate, ...)
    Powerful Query Language (SQL)
      SELECT balance FROM account WHERE cust = 'Madden'
    Transactions (ACID Semantics)
      Group of statements executed together
      All or nothing
        *E.g., Transfer $100 from account A to B
      Even in a distributed setting (at some complexity)
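
The all-or-nothing behavior of a transaction can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the column is named `cust` to match the slide's query, the starting balances are made up, and the `crash_midway` flag is a hypothetical stand-in for a failure between the two updates.

```python
import sqlite3

def open_bank():
    """Create an in-memory database with an account table like the slide's."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (cust TEXT PRIMARY KEY, balance REAL, interest_rate REAL)"
    )
    # Starting balances are illustrative values, not from the slides.
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?)",
        [("A", 500.0, 0.01), ("B", 200.0, 0.01)],
    )
    conn.commit()
    return conn

def balance(conn, cust):
    row = conn.execute(
        "SELECT balance FROM account WHERE cust = ?", (cust,)
    ).fetchone()
    return row[0]

def transfer(conn, src, dst, amount, crash_midway=False):
    """Move money as one transaction; returns True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE cust = ?",
                (amount, src),
            )
            if crash_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE cust = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

conn = open_bank()
transfer(conn, "A", "B", 100, crash_midway=True)
print(balance(conn, "A"), balance(conn, "B"))  # 500.0 200.0 (nothing applied)
transfer(conn, "A", "B", 100)
print(balance(conn, "A"), balance(conn, "B"))  # 400.0 300.0 (both applied)
```

The failed transfer leaves both balances untouched because the debit is rolled back together with the credit that never ran.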

  • 2

    A Thousand Flowers Bloom

    Traditional properties not a good fit for all apps
    Big Data: Proliferation of Storage Systems
    New Capabilities Needed
      Very high throughput operation (millions of users)
      Distributed across many machines
      Highly available, even with network failures
        *Eventually Consistent vs ACID
      Different programming interfaces or data models

    Requirement: High Throughput

    Modern large web apps run thousands of transactions per second:
      Ad Serving
      Email Services
      Shopping Carts
      Financial Trades
    Conventional database systems not engineered for such rates

    Requirement: Distributed, Highly Available

    For web applications, availability is critically important
      May even be willing to sacrifice data quality
        E.g., OK if some search results are missing
    Traditional databases favor ACID consistency over availability
    Availability through multi-node replication
      Failover in the event of outage

  • 3

    Requirement: New Programming Models

    Not all data is relational, e.g., JSON
    Sometimes SQL is complex, slow
    SQL is schema first; wild data may not conform to a schema, or schema may be hard to describe

    {
      sku: "00e8da9d",
      type: "Film",
      ...,
      asin: "B000P0J0AQ",
      shipping: { ... },
      pricing: { ... },
      details: {
        title: "The Matrix",
        director: [ "Andy Wachowski", "Larry Wachowski" ],
        writer: [ "Andy Wachowski", "Larry Wachowski" ],
        ...,
        aspect_ratio: "1.66:1"
      },
    }

    A JSON Document
    Source: http://docs.mongodb.org/ecosystem/use-cases/product-catalog/
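
As a small illustration of working with such a document without declaring a schema first, the sketch below parses a trimmed copy of it with Python's standard `json` module. The fields the slide elides ("...", shipping, pricing) are left out rather than guessed, and the `genre` field in the last lookup is a hypothetical name used only to show a missing field.

```python
import json

# Trimmed copy of the slide's product document; elided fields are omitted.
raw = """
{
  "sku": "00e8da9d",
  "type": "Film",
  "asin": "B000P0J0AQ",
  "details": {
    "title": "The Matrix",
    "director": ["Andy Wachowski", "Larry Wachowski"],
    "writer": ["Andy Wachowski", "Larry Wachowski"],
    "aspect_ratio": "1.66:1"
  }
}
"""

doc = json.loads(raw)

# Nested values are reached by walking the structure; no table definition needed.
print(doc["details"]["title"])          # The Matrix
print(len(doc["details"]["director"]))  # 2

# A field this document does not carry is simply absent.
print(doc.get("genre"))                 # None
```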

    Rest of This Module

    A survey of a few ideas and systems
      Impossible to be exhaustive
    NoSQL
      Non-tabular data
      Understanding eventual consistency
    NewSQL
      H-Store: high throughput main memory relational database

    THANK YOU

  • 4

    Tackling The Challenges of Big Data Big Data Storage NoSQL, NewSQL

    Alternative Data Models

    What is a Data Model?

    A way of representing data in a database
      Classic Approach: Relational
    Why are data models significant?
      Representation dictates how programs access data
      SQL: most common relational language
        *Allows complex queries over table structure

    CourseID  Title          ProfID  Time      Room
    6.01      Intro to EECS  1       8:30 MW   123
    6.02      Intro to EECS  1       9:30 TR   145
    6.033     Systems        2       9:30 MW   154
    6.814     Databases      3       10:30 TR  138

    Relational Representation of Course Catalog
    ProfID: Reference, or Relationship

    SELECT profName FROM classes, profs
    WHERE profs.ProfID = classes.ProfID      -- join
    AND classes.ProfID IN (                  -- subquery
      SELECT ProfID
      FROM classes
      GROUP BY ProfID
      HAVING count(*) > 1                    -- aggregate
    )

    A SQL Query to Find Profs Who Teach More than 1 Class
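
To try the query, here is a minimal sketch using Python's built-in `sqlite3` module. The `classes` rows follow the course catalog table; the `profs` table is not shown on the slide, so its contents (which name goes with which ProfID) are assumptions for illustration. `GROUP BY` is written before `HAVING`, as SQL requires, and `DISTINCT` is added here so a professor with two classes is listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profs (ProfID INTEGER PRIMARY KEY, profName TEXT);
CREATE TABLE classes (
  CourseID TEXT PRIMARY KEY, Title TEXT, ProfID INTEGER, Time TEXT, Room INTEGER
);
-- Assumed name-to-ProfID mapping; the slide does not show the profs table.
INSERT INTO profs VALUES (1, 'Madden'), (2, 'Zeldovich'), (3, 'Smith');
INSERT INTO classes VALUES
  ('6.01',  'Intro to EECS', 1, '8:30 MW',  123),
  ('6.02',  'Intro to EECS', 1, '9:30 TR',  145),
  ('6.033', 'Systems',       2, '9:30 MW',  154),
  ('6.814', 'Databases',     3, '10:30 TR', 138);
""")

QUERY = """
SELECT DISTINCT profName
FROM classes, profs
WHERE profs.ProfID = classes.ProfID          -- join
  AND classes.ProfID IN (                    -- subquery
    SELECT ProfID
    FROM classes
    GROUP BY ProfID
    HAVING count(*) > 1                      -- aggregate
  )
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Madden',)]
```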

    NoSQL Data Models

    Relational: SQL
    NoSQL: Non-relational?
      Key Value Stores
      Document Stores

  • 5

    Key Value Stores (Dynamo, Riak, Cassandra, ...)

    Data is a mapping from keys to arbitrary values
    Values don't conform to any particular structure
    Programming Language is get/put
      get(6.01) → "Intro to EECS 1 by Professor Madden"
      put(6.005, "Software Engineering by Prof. Jones")
    Limited multi-record transactional consistency

    Key    Value
    6.01   Intro to EECS 1 by Professor Madden
    6.02   Intro to EECS 2 by Professor Stonebraker
    6.033  Systems by Professor Zeldovich
    6.814  Databases by Professor Smith

    Key Value Representation of Course Catalog
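
The get/put interface can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the API of Dynamo, Riak, or Cassandra; the class name `KeyValueStore` is made up for illustration.

```python
class KeyValueStore:
    """Toy in-memory key value store: keys map to opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store never inspects the value; any object is accepted.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key has never been written.
        return self._data.get(key)


catalog = KeyValueStore()
catalog.put("6.01", "Intro to EECS 1 by Professor Madden")
catalog.put("6.02", "Intro to EECS 2 by Professor Stonebraker")
catalog.put("6.033", "Systems by Professor Zeldovich")
catalog.put("6.814", "Databases by Professor Smith")
catalog.put("6.005", "Software Engineering by Prof. Jones")

print(catalog.get("6.01"))   # Intro to EECS 1 by Professor Madden
print(catalog.get("6.999"))  # None
```

Because the value is an uninterpreted string, finding "all courses taught by Madden" would require scanning and parsing every value in application code.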

    Document Stores (Mongo, CouchDB, ...)

    "KeyValue++"
    Data maps from keys to (XML/JSON) documents
    Can look up documents by key
    Also some ability to search contents of documents
    Typically, no joins or multi-document updates

    Key    Value
    6.01   {title: "Intro To EECS", prof: "Madden", room: 123}
    6.02   {title: "Intro To EECS 2", professor: "Stonebraker"}
    6.033  {title: "Systems", room: 145}
    6.814  {title: "Databases", room: 154, professor: "Smith"}

    Document Representation of Course Catalog
    Different fields in docs
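
A rough sketch of the "KeyValue++" idea: documents are still fetched by key, but the store can also look inside them. The `DocumentStore` class and its `find` method are hypothetical names for illustration and do not mirror the Mongo or CouchDB APIs.

```python
class DocumentStore:
    """Toy document store: key -> dict, plus a simple field search."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = dict(doc)

    def get(self, key):
        return self._docs.get(key)

    def find(self, field, value):
        # Documents lacking the field are skipped, not treated as errors.
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


catalog = DocumentStore()
catalog.put("6.01", {"title": "Intro To EECS", "prof": "Madden", "room": 123})
catalog.put("6.02", {"title": "Intro To EECS 2", "professor": "Stonebraker"})
catalog.put("6.033", {"title": "Systems", "room": 145})
catalog.put("6.814", {"title": "Databases", "room": 154, "professor": "Smith"})

print(catalog.get("6.033"))                # {'title': 'Systems', 'room': 145}
print(catalog.find("room", 154))           # ['6.814']
print(catalog.find("professor", "Smith"))  # ['6.814']
print(catalog.find("professor", "Madden")) # [] because 6.01 uses the field "prof"
```

The last lookup shows the cost of having different fields in different documents: nothing enforces that every course names its instructor the same way.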

    Why Key Value?

    Simple to program and implement
    Can be much faster than legacy SQL databases
      No complicated, unpredictable SQL queries
    Easy to distribute across many machines
      Since no multi-record consistency

    Note: Same data can be represented in all models
    Next: Implications of lack of consistency
    Then: Can we achieve similar benefits in relational model?
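
One way to see why key value data distributes easily: each get or put touches exactly one key, so a key can be routed to a machine by hashing it. The sketch below assumes simple hash-modulo placement, which is an illustrative choice and not how any particular system named in these slides places data; `ShardedStore` is a hypothetical name, and each "node" is just a local dict.

```python
import hashlib


class ShardedStore:
    """Toy key value store spread over several nodes by hashing the key."""

    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]  # each dict stands in for a machine

    def _node_for(self, key):
        # A stable hash so the same key always routes to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


store = ShardedStore(num_nodes=3)
store.put("6.01", "Intro to EECS 1 by Professor Madden")
store.put("6.02", "Intro to EECS 2 by Professor Stonebraker")
store.put("6.033", "Systems by Professor Zeldovich")
store.put("6.814", "Databases by Professor Smith")

print(store.get("6.033"))                    # Systems by Professor Zeldovich
print(sum(len(node) for node in store.nodes))  # 4
```

No operation ever needs two nodes to agree, which is exactly what a multi-record transaction would require.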

  • 6

    THANK YOU

    Tackling The Challenges of Big Data Big Data Storage

    NoSQL, NewSQL Understanding Eventual Consistency

    NoSQL Non Transactional

    Earlier: many key value stores aren't transactional
      Allows them to run more ops per second
    Implications:
      On a single machine, can't guarantee "all or nothing"
        *E.g., DB crashes in "Transfer $100 from A to B"
      On multiple machines, replica inconsistency
        *"Eventual consistency" means replicas will sync up, but until they do, may contain different data
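
The single-machine implication can be illustrated with a transfer written as two independent puts. This is a toy sketch: the dict stands in for a non-transactional store, the balances are made up, and the `Crash` exception is a hypothetical stand-in for the database dying between the two writes.

```python
class Crash(Exception):
    """Stand-in for the database failing partway through."""


def transfer(store, src, dst, amount, crash_midway=False):
    # Two separate writes with nothing tying them together.
    store[src] = store[src] - amount
    if crash_midway:
        raise Crash("failed between the two puts")
    store[dst] = store[dst] + amount


accounts = {"A": 500, "B": 200}
try:
    transfer(accounts, "A", "B", 100, crash_midway=True)
except Crash:
    pass

print(accounts)                 # {'A': 400, 'B': 200}
print(sum(accounts.values()))   # 600, so $100 has disappeared
```

With no rollback, the debit survives the crash while the credit never happens.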

  • 7

    Why Replicate?

    Replication is used to ensure availability

    [Figure: a user exchanging query/answer messages with Replica 1, Replica 2, and Replica 3]
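
A minimal sketch of how replication keeps reads available: if one replica is down, the query is retried on another. The `Replica` class, the `query_any` helper, and the try-in-order policy are assumptions for illustration only.

```python
class Replica:
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)  # each replica holds a full copy
        self.up = True

    def query(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def query_any(replicas, key):
    """Try replicas in order and return (answer, name of replica that served it)."""
    for replica in replicas:
        try:
            return replica.query(key), replica.name
        except ConnectionError:
            continue  # fail over to the next copy
    raise ConnectionError("no replica available")


data = {"6.01": "Intro to EECS 1 by Professor Madden"}
replicas = [Replica(f"Replica {i}", data) for i in (1, 2, 3)]

print(query_any(replicas, "6.01")[1])  # Replica 1
replicas[0].up = False                 # simulate an outage
print(query_any(replicas, "6.01")[1])  # Replica 2
```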

    What About Updates?

    [Figure: a user sending an update to each of Replica 1, Replica 2, and Replica 3]

    Updates with Unavailable Replicas

    Option 1: Wait Until Failed Replica Comes Back
      Problem: Can't update, so the system is unavailable
    Option 2: Just Keep Going (Eventual Consistency)
      Problem: What to do when failed node comes back?
      What if some reads are sent to failed node?
    Option 3: Majority Write/Majority Read + Versions
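
Option 2 and its two problems can be sketched as follows. The slides leave open what to do when the failed node returns; this toy assumes each write carries a version number and that a background `sync` step lets every replica adopt the highest-versioned value, which is one possible policy and not necessarily what any given system does. All names here are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (version, value)


def write(replicas, key, value, version):
    """Just keep going: apply the write to whichever replicas are reachable."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = (version, value)


def sync(replicas):
    """Assumed repair step: live replicas adopt the newest version of each key."""
    newest = {}
    for replica in replicas:
        if replica.up:
            for key, (version, value) in replica.data.items():
                if key not in newest or version > newest[key][0]:
                    newest[key] = (version, value)
    for replica in replicas:
        if replica.up:
            replica.data.update(newest)


r1, r2, r3 = Replica("Replica 1"), Replica("Replica 2"), Replica("Replica 3")
replicas = [r1, r2, r3]

write(replicas, "X", 0, version=0)  # everyone starts with X:0
r3.up = False                       # Replica 3 fails
write(replicas, "X", 1, version=1)  # update proceeds without it
r3.up = True                        # Replica 3 comes back

print(r3.data["X"][1])  # 0, a read sent here is stale
sync(replicas)
print(r3.data["X"][1])  # 1, replicas have converged
```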

  • 8

    Majority Read/Majority Write

    [Figure: Replicas 1, 2, and 3 each hold X:0; the user sends "Update X:1" to all three]

    Do not write if fewer than a majority of replicas are available

    Majority Read/Majority Write

    [Figure: after the user's "Update X:1", Replicas 1 and 2 hold X:1; Replica 3 still holds X:0]

    Majority Read/Majority Write

    [Figure: Replicas 1 and 2 hold X:1, Replica 3 holds X:0; the user sends "Read X" to two replicas and receives X:1 and X:0]

    If all reads and all writes go to a majority, guaranteed to see most recent version
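
The guarantee follows because any two majorities of the same replica set share at least one member, so every majority read includes at least one replica that saw the latest majority write. A minimal sketch under assumptions: a single value X, integer version numbers chosen by the writer, and hypothetical class names `Replica` and `QuorumRegister`.

```python
from itertools import combinations


class Replica:
    def __init__(self):
        self.up = True
        self.version = 0
        self.value = 0  # X:0 initially, as in the figure


class QuorumRegister:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1

    def _live(self):
        return [r for r in self.replicas if r.up]

    def write(self, value):
        live = self._live()
        if len(live) < self.majority:
            raise RuntimeError("fewer than a majority available; refusing to write")
        # A live majority always contains the latest version, so bump from its max.
        new_version = max(r.version for r in live) + 1
        for r in live:
            r.version, r.value = new_version, value
        return new_version

    def read(self, chosen=None):
        chosen = self._live()[: self.majority] if chosen is None else list(chosen)
        if len(chosen) < self.majority:
            raise RuntimeError("fewer than a majority available; refusing to read")
        # Among the answers, trust the one with the highest version.
        return max(chosen, key=lambda r: r.version).value


reg = QuorumRegister(n=3)
reg.replicas[2].up = False  # Replica 3 misses the update
reg.write(1)                # Replicas 1 and 2 now hold X:1
reg.replicas[2].up = True   # Replica 3 returns, still holding X:0

# Every possible pair of replicas (a majority of 3) yields the newest value.
print([reg.read(pair) for pair in combinations(reg.replicas, 2)])  # [1, 1, 1]
```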

  • 9

    Majority Read/Write vs Eventual Consistency

    Approach                                 Pros  Cons
    Wait For Failed Replicas on Update
    Eventual Consistency (Just Keep Going)
    Majority Read /
