PNUTS: Yahoo!’s Hosted Data Serving Platform

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni

Yahoo! Research
With some additions by S. Sudarshan


2

How do I build a cool new web app?

Option 1:
  Code it up!
  Make it live!
  Scale it later
  It gets posted to Slashdot
  Scale it now!
  Flickr, Twitter, MySpace, Facebook, …

3

How do I build a cool new web app?

Option 2: Make it industrial strength!
  Evaluate scalable database backends
  Evaluate scalable indexing systems
  Evaluate scalable caching systems
  Architect data partitioning schemes
  Architect data replication schemes
  Architect monitoring and reporting infrastructure
  Write application
  Go live
  Realize it doesn’t scale as well as you hoped
  Rearchitect around bottlenecks
  1 year later – ready to go!

4

Example: social network updates

Brian

Sonja Jimi Brandon Kurt

What are my friends up to?

Sonja:

Brandon:

5

Example: social network updates

16 Mike <ph..
6 Jimi <ph..
8 Mary <re..
12 Sonja <ph..
15 Brandon <po..
17 Bob <re..

<photo><title>Flower</title><url>www.flickr.com</url></photo>

6

What do we need from our DBMS?

Web applications need:
  Scalability
    And the ability to scale linearly
  Geographic scope
  High availability

Web applications typically have:
  Simplified query needs
    No joins, aggregations
  Relaxed consistency needs
    Applications can tolerate stale or reordered data

7

What is PNUTS?

8

What is PNUTS?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure

9

Query model

Per-record operations
  Get
  Set
  Delete

Multi-record operations
  Multiget
  Scan
  Getrange

Web service (RESTful) API
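The operations listed on this slide can be pictured with a small in-memory sketch. This is not the PNUTS API: the `Table` class, its method names, and the half-open range convention in `getrange` are assumptions made for illustration, and the real system exposes these operations through a RESTful web service.

```python
from bisect import bisect_left, insort


class Table:
    """In-memory stand-in for one table (hypothetical, for illustration only)."""

    def __init__(self):
        self._rows = {}   # primary key -> record
        self._keys = []   # sorted primary keys, used by the ordered operations

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = record

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect_left(self._keys, key))

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        return [(k, self._rows[k]) for k in self._keys]

    def getrange(self, low, high):
        """Records with low <= key < high, in key order (assumed convention)."""
        lo = bisect_left(self._keys, low)
        hi = bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


parts = Table()
parts.set("A", {"StockNumber": 42342, "Status": "E"})
parts.set("B", {"StockNumber": 42521, "Status": "W"})
parts.set("C", {"StockNumber": 66354, "Status": "W"})
parts.delete("B")
```

Note that nothing here joins or aggregates records, which matches the simplified query needs described earlier in the deck.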

10

Detailed architecture

Data-path components
Storage units
Routers
Tablet controller
REST API
Clients
Message broker

11

Detailed architecture

Storage units
Routers
Tablet controller
REST API
Clients
Local region
Remote regions
YMB

12

Tablet splitting and balancing

Each storage unit has many tablets (horizontal partitions of the table)
Tablets may grow over time
  Overfull tablets split
Storage unit may become a hotspot
  Shed load by moving tablets to other servers
[Figure labels: Storage unit, Tablet]
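A rough sketch of the two mechanisms on this slide, splitting and load shedding, follows. The split threshold, the midpoint split, and the use of tablet count as the measure of load are all assumptions for illustration; the slides do not say how PNUTS decides when to split or which tablet to move.

```python
MAX_TABLET_ROWS = 4  # hypothetical split threshold


def split_tablet(tablet):
    """Split an overfull tablet (a sorted list of keys) at its midpoint."""
    mid = len(tablet) // 2
    return tablet[:mid], tablet[mid:]


def split_overfull(storage_unit):
    """Return the unit's tablets with every overfull tablet split once."""
    result = []
    for tablet in storage_unit:
        if len(tablet) > MAX_TABLET_ROWS:
            result.extend(split_tablet(tablet))
        else:
            result.append(tablet)
    return result


def shed_load(units):
    """Move one tablet from the unit with the most tablets to the one with the fewest."""
    hot = max(units, key=lambda name: len(units[name]))
    cold = min(units, key=lambda name: len(units[name]))
    if len(units[hot]) - len(units[cold]) > 1:
        units[cold].append(units[hot].pop())
    return units


units = {
    "SU1": [["a", "b", "c", "d", "e", "f"], ["g", "h"], ["i"]],
    "SU2": [],
}
units["SU1"] = split_overfull(units["SU1"])  # the six-key tablet becomes two
units = shed_load(units)                     # one tablet moves to SU2
```

A production balancer would more plausibly weigh request rate than tablet count, since a hotspot is about traffic and not size.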

13

Query processing

16

Range queries

Router interval map:
  MIN-Cantaloupe        SU1
  Cantaloupe-Lime       SU3
  Lime-Strawberry       SU2
  Strawberry-MAX        SU1

Tablets (held on Storage unit 1, Storage unit 2, Storage unit 3):
  Apple, Avocado, Banana, Blueberry
  Cantaloupe, Grape, Kiwi, Lemon
  Lime, Mango, Orange
  Strawberry, Tomato, Watermelon

Query sent to the router: Grapefruit…Pear?
  Split by the router into: Grapefruit…Lime? and Lime…Pear?
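The router's job in this example can be sketched as a lookup in a sorted list of tablet boundaries. The function name `route_range`, the use of an empty string for MIN, and the half-open ranges are assumptions for illustration, not the router's actual implementation.

```python
from bisect import bisect_right

# Interval map from the slide: each tablet starts at a boundary key and is
# served by one storage unit. "" stands in for MIN (an assumption).
BOUNDARIES = ["", "Cantaloupe", "Lime", "Strawberry"]
UNITS = ["SU1", "SU3", "SU2", "SU1"]


def route_range(low, high):
    """Split the scan [low, high) into one sub-range per tablet it touches."""
    pieces = []
    i = bisect_right(BOUNDARIES, low) - 1  # index of the tablet containing `low`
    while i < len(BOUNDARIES) and BOUNDARIES[i] < high:
        start = max(low, BOUNDARIES[i])
        # The last tablet has no upper boundary (MAX), represented as None.
        end = BOUNDARIES[i + 1] if i + 1 < len(BOUNDARIES) else None
        stop = high if end is None or high < end else end
        pieces.append((UNITS[i], start, stop))
        i += 1
    return pieces


plan = route_range("Grapefruit", "Pear")
# plan == [("SU3", "Grapefruit", "Lime"), ("SU2", "Lime", "Pear")]
```

The result matches the slide: the Grapefruit to Pear query becomes one sub-query for the Cantaloupe-Lime tablet and one for the Lime-Strawberry tablet.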

17

Updates

[Figure: update path through routers, message brokers, and three storage units (SU), with numbered steps]
1 Write key k
2 Write key k
3 Write key k
4
5 SUCCESS
6 Write key k
7 Sequence # for key k
8 Sequence # for key k

18

Yahoo Message Bus

Distributed publish-subscribe service
Guarantees delivery once a message is published
  Logging at site where message is published, and at other sites when received
Guarantees messages published to a particular cluster will be delivered in same order at all other clusters
Record updates are published to YMB by master copy
  All replicas subscribe to the updates, and get them in same order for a particular record
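The ordering guarantee can be illustrated with a toy bus. This is only a sketch under simplifying assumptions: `MessageBus` and `Replica` are hypothetical names, delivery here is synchronous and in-process, and there is a single topic, whereas YMB is a distributed, asynchronous service.

```python
from collections import defaultdict


class MessageBus:
    """Toy publish-subscribe bus: logs each message, then delivers in log order."""

    def __init__(self):
        self.log = []                         # messages are logged before delivery
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        self.log.append((topic, message))
        for deliver in self.subscribers[topic]:
            deliver(message)


class Replica:
    def __init__(self, region):
        self.region = region
        self.records = {}  # key -> (sequence number, value)

    def apply(self, update):
        key, seq, value = update
        self.records[key] = (seq, value)


bus = MessageBus()
replicas = [Replica("west"), Replica("east")]
for replica in replicas:
    bus.subscribe("updates", replica.apply)

# The master copy assigns sequence numbers and publishes each update.
for seq, value in enumerate(["v1", "v2", "v3"], start=1):
    bus.publish("updates", ("k", seq, value))
```

Because every replica sees the same sequence of updates for key `k`, all replicas end in the same state even though none of them coordinates with the others.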

19

Asynchronous replication and consistency

20

Asynchronous replication

21

Consistency model

Goal: make it easier for applications to reason about updates and cope with asynchrony

What happens to a record with primary key “Brian”?

[Timeline: record inserted, then seven updates, then a delete; versions v. 1 through v. 8 over time, all in Generation 1]

22

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Read

Stale version
Current version

23

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Read up-to-date

Stale version
Current version

24

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Read-critical(required version): Read ≥ v.6

Stale version
Current version

25

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Write

Stale version
Current version

26

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Test-and-set-write(required version): Write if = v.7
ERROR

Stale version
Current version

27

Consistency model

[Timeline: v. 1 through v. 8, Generation 1]

Write if = v.7
ERROR

Stale version
Current version

Mechanism: per record mastership
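The read and write calls shown on the consistency slides can be sketched against a single record's version timeline. The class and method names below are hypothetical, a replica's lag is modeled simply as the version number it holds, and the current version is assumed to be v.8, which is what makes the "Write if = v.7" example fail.

```python
class VersionMismatch(Exception):
    """Raised when a test-and-set write names a version that is not current."""


class Record:
    def __init__(self, value):
        self.versions = [value]  # versions[i] holds version i + 1

    @property
    def current_version(self):
        return len(self.versions)

    def read_any(self, replica_version):
        """Plain read from a replica that may lag: returns whatever it holds."""
        return replica_version, self.versions[replica_version - 1]

    def read_critical(self, replica_version, required_version):
        """Return a version >= required_version, falling back to the current one."""
        if replica_version >= required_version:
            version = replica_version
        else:
            version = self.current_version
        return version, self.versions[version - 1]

    def read_up_to_date(self):
        return self.current_version, self.versions[-1]

    def write(self, value):
        self.versions.append(value)
        return self.current_version

    def test_and_set_write(self, required_version, value):
        if required_version != self.current_version:
            raise VersionMismatch(
                f"expected v.{required_version}, current is v.{self.current_version}"
            )
        return self.write(value)


brian = Record("v1 data")
for n in range(2, 9):  # updates producing v.2 through v.8
    brian.write(f"v{n} data")
```

With this state, `brian.read_any(5)` returns the stale v.5, `brian.read_critical(5, 6)` falls back to a version that satisfies the requirement, and `brian.test_and_set_write(7, ...)` raises `VersionMismatch`, mirroring the ERROR on the slide.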

28

Record and Tablet Mastership

Data in PNUTS is replicated across sites
Hidden field in each record stores which copy is the master copy
  Updates can be submitted to any copy
  Forwarded to master, applied in order received by master
Record also contains origin of last few updates
  Mastership can be changed by current master, based on this information
  Mastership change is simply a record update
Tablet mastership
  Required to ensure primary key consistency
  Can be different from record mastership
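One way to picture record mastership is a record that remembers where its recent updates came from and hands mastership to that region. The history length of three and the "all recent updates from one other region" rule are assumptions chosen for this sketch; the slides say only that the decision is based on the origin of the last few updates.

```python
from collections import deque

HISTORY = 3  # hypothetical: how many recent update origins a record remembers


class MasteredRecord:
    def __init__(self, master_region):
        self.master = master_region           # the hidden mastership field
        self.origins = deque(maxlen=HISTORY)  # origins of the last few updates
        self.applied = []                     # order in which the master applied updates

    def update(self, origin_region, value):
        """Accept an update at any copy; report whether it had to be forwarded."""
        forwarded = origin_region != self.master
        self.applied.append(value)            # applied in the order the master receives it
        self.origins.append(origin_region)
        self._maybe_move_master()
        return forwarded

    def _maybe_move_master(self):
        # Assumed policy: move mastership when every remembered update
        # came from the same non-master region.
        if len(self.origins) == HISTORY and len(set(self.origins)) == 1:
            region = self.origins[0]
            if region != self.master:
                self.master = region  # a mastership change is itself a record update


record = MasteredRecord("west")
forwarded = [record.update("east", n) for n in range(4)]
```

The first three updates from the east are forwarded to the west master; after the third, mastership moves east, so the fourth update is applied locally.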

29

Other Features

Per-record transactions
Copying a tablet (e.g., on failure)
  Request copy
  Publish checkpoint message
  Get copy of tablet as of when checkpoint is received
  Apply later updates
Tablet split
  Has to be coordinated across all copies
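The tablet copy steps can be sketched by replaying an ordered message stream that contains a checkpoint marker. This is a single-process illustration under stated assumptions: `CHECKPOINT`, `copy_tablet`, and the list-based stream are hypothetical stand-ins for the checkpoint message and the message broker.

```python
CHECKPOINT = object()  # marker message published into the ordered update stream


def apply_update(tablet, update):
    key, value = update
    tablet[key] = value


def copy_tablet(source_stream):
    """Replay the source's ordered stream and build a copy of the tablet.

    Returns (snapshot taken at the checkpoint, finished copy, source tablet).
    """
    source = {}
    snapshot = None
    copy = None
    for message in source_stream:
        if message is CHECKPOINT:
            snapshot = dict(source)  # tablet contents as of the checkpoint
            copy = dict(snapshot)    # the new copy starts from that snapshot
            continue
        apply_update(source, message)
        if copy is not None:
            apply_update(copy, message)  # apply updates that follow the checkpoint
    return snapshot, copy, source


stream = [("a", 1), ("b", 2), CHECKPOINT, ("a", 3), ("c", 4)]
snapshot, copy, source = copy_tablet(stream)
```

Because the checkpoint travels in the same ordered stream as the updates, the copy knows exactly which updates are already reflected in the snapshot and which still need to be applied.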

30

Query Processing

Range scan can span tablets
  Only one tablet scanned at a time
  Client may not need all results at once
    Continuation object returned to client to indicate where range scan should continue
Notification
  One pub-sub topic per tablet
  Client knows about tables, does not know about tablets
    Automatically subscribed to all tablets, even as tablets are added/removed.
  Usual problem with pub-sub: undelivered notifications, handled in usual way
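A range scan with a continuation can be sketched as a function that reads one tablet per call and returns a token saying where to resume. The tablet contents reuse the fruit example from the range query slide; representing the continuation as the index of the next tablet is an assumption for illustration, and a real continuation would more plausibly carry a key so that it survives tablet splits.

```python
from bisect import bisect_right

TABLETS = [
    ["Apple", "Avocado", "Banana", "Blueberry"],
    ["Cantaloupe", "Grape", "Kiwi", "Lemon"],
    ["Lime", "Mango", "Orange"],
    ["Strawberry", "Tomato", "Watermelon"],
]
FIRST_KEYS = [tablet[0] for tablet in TABLETS]


def scan(low, high, continuation=None):
    """Scan one tablet per call.

    Returns (keys in [low, high) from that tablet, continuation or None).
    """
    if continuation is None:
        # Start at the tablet whose key range contains `low`.
        index = max(bisect_right(FIRST_KEYS, low) - 1, 0)
    else:
        index = continuation
    hits = [key for key in TABLETS[index] if low <= key < high]
    nxt = index + 1
    more = nxt < len(TABLETS) and FIRST_KEYS[nxt] < high
    return hits, (nxt if more else None)


batches = []
token = None
while True:
    keys, token = scan("Grapefruit", "Pear", token)
    batches.append(keys)
    if token is None:
        break
```

A client that needs only the first few results can stop after the first call and discard the token, which is the point of returning results one tablet at a time.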

31

Experiments

32

Experimental setup

Production PNUTS code
  Enhanced with ordered table type
Three PNUTS regions
  2 west coast, 1 east coast
  5 storage units, 2 message brokers, 1 router
  West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
  East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
Workload
  1200-3600 requests/second
  0-50% writes
  80% locality

33

Inserts

Inserts required
  75.6 ms per insert in West 1 (tablet master)
  131.5 ms per insert into the non-master West 2, and
  315.5 ms per insert into the non-master East.

34

10% writes by default

35

Scalability

[Figure: average latency (ms, axis 0 to 160) versus number of storage units (axis 1 to 6); series: hash table, ordered table]

36

Request skew

[Figure: average latency (ms, axis 0 to 100) versus Zipf parameter (axis 0 to 1); series: hash table, ordered table]

37

Size of range scans

[Figure: average latency (ms, axis 0 to 8000) versus fraction of table scanned (axis 0 to 0.12); series: 30 clients, 300 clients]

38

Related work

Distributed and parallel databases
  Especially query processing and transactions
BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
Distributed filesystems
  Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables
  Chord, Pastry, …
Database replication
  Master-slave, epidemic/gossip, synchronous…

39

Conclusions and ongoing work

PNUTS is an interesting research product
  Research: consistency, performance, fault tolerance, rich functionality
  Product: make it work, keep it (relatively) simple, learn from experience and real applications
Ongoing work
  Indexes and materialized views
  Bundled updates
  Batch query processing

40

Thanks!

research.yahoo.com