
  • Learning to Scale Out By Scaling Down

    The FAWN Project

    David Andersen, Vijay Vasudevan, Michael Kaminsky*, Michael A. Kozuch*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Iulian Moraru, Sang Kil Cha, Hyeontaek Lim, Bin Fan, Reinhard Munz, Nathan Wan, Jack Ferris, Hrishikesh Amur**, Wolfgang Richter, Michael Freedman***, Wyatt Lloyd***, Padmanabhan Pillai*, Dong Zhou

    Carnegie Mellon University, *Intel Labs Pittsburgh, **Georgia Tech, ***Princeton University

  • Power limits computing

    2

    [Figure: a server at 100% load draws 1000W, and datacenter infrastructure roughly doubles that to 2000W (PUE in 2005: 2–3; by 2012: ~1.1; "leave it to industry"). At 20% load the same server still draws 750W instead of a proportional 200W (proportionality). FAWN nodes doing the same work draw about 300W (efficiency).]

  • Gigahertz is not free

    [Figure: instructions/sec/W (millions, i.e., instructions/Joule) vs. instructions/sec (millions, log scale from 1 to 100,000) for a custom ARM mote, an XScale 800MHz, an Atom Z500, and a Xeon 7350; the fastest chip delivers the fewest instructions per Joule. Speed and power calculated from specification sheets; power includes "system overhead" (e.g., Ethernet).]

  • The Memory Wall

    [Figure: access latency in nanoseconds (log scale, 0.1 ns to 10^8 ns) vs. year, 1980–2005, for disk seek (~1 ms), DRAM access, and CPU cycle (~1 ns). Bridging the gap: caching, speculation, etc.]

    [Diagram: transistors have the soul of a capacitor. Charge here moves charge carriers here, which lets current flow.]

  • Gigahertz hurts

    Remember: Memory capacity costs you

  • “Wimpy” Nodes

    7

    1.6 GHz dual-core Atom, 32–160 GB flash SSD, only 1 GB of DRAM!

  • “Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code” - Kathy Yelick

    8

  • The FAWN Quad of Pain

    [Diagram: wimpy nodes and bigger clusters bring four pain points: parallelization, load balancing, memory capacity, and hardware specificity.]

  • It’s not just masochism

    All systems will face this challenge over time

    [Figures: Moore's Law vs. Dennard scaling, from Danowitz, Kelley, Mao, Stevenson, and Horowitz: CPU DB.]

  • FAWN: it started with a key-value store

    11

  • Key-value storage systems

    • Critical infrastructure service
    • Performance-conscious
    • Random-access, read-mostly, hard to cache

    12

  • Small record, random access

    Select name,photo from users where uid=513542;
    Select name,photo from users where uid=818503;
    Select name,photo from users where uid=468883;
    Select name,photo from users where uid=124111;
    Select name,photo from users where uid=474488;
    Select name,photo from users where uid=42223;
    Select name,photo from users where uid=124566;
    Select name,photo from users where uid=097788;
    Select name,photo from users where uid=357845;
    Select wallpost from posts where pid=13821828188;
    Select wallpost from posts where pid=89888333522;
    Select wallpost from posts where pid=12314144887;
    Select wallpost from posts where pid=738838402;

    13

  • FAWN-DS and -KV: Key-value Storage System

    Goal: improve Queries/Joule

    14

    Prototype node: 500MHz CPU, 256MB DRAM, 4GB CompactFlash

    Unique challenges:
    • Wimpy CPUs, limited DRAM
    • Flash poor at small random writes
    • Sustain performance during membership changes

  • Avoiding random writes

    15

    [Diagram: the in-DRAM Hash Index (buckets of KeyFrag | Valid | Offset, indexed by bits of the 160-bit key) points into the append-only Data Log in flash (log entries of Key | Len | Data). A Put(K1,V) appends the value to the log and updates the index; all writes to flash are sequential.]

    Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.

    3.2 Understanding Flash Storage

    Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads—but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

    1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

    2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

    3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

    Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].

    These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.

    3.3 The FAWN Data Store

    FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

    FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

    ¹We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.

    /* KEY = 0x93df7317294b99e3e049, 16 index bits */
    INDEX = KEY & 0xffff;            /* = 0xe049; */
    KEYFRAG = (KEY >> 16) & 0x7fff;  /* = 0x19e3; */
    for i = 0 to NUM_HASHES do
        bucket = hash[i](INDEX);
        if bucket.valid && bucket.keyfrag == KEYFRAG &&
           readKey(bucket.offset) == KEY then
            return bucket;
        end if
        {Check next chain element...}
    end for
    return NOT_FOUND;

    Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

    Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

    Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

    Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

    The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
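    To make the lookup and append paths above concrete, here is a minimal C sketch of the FAWN-DS idea: an append-only log (an in-memory buffer standing in for flash) plus a compact DRAM index holding only a 15-bit key fragment, a valid bit, and an offset per entry. The fixed 160-bit keys, the linear-probing collision handling (the paper uses hash chaining), the padded 8-byte bucket, and names such as fawn_put, fawn_get, and LOG_CAPACITY are illustrative assumptions, not the original implementation.

    #include <stdint.h>
    #include <string.h>

    #define KEY_LEN      20                /* 160-bit keys */
    #define INDEX_BITS   16                /* 2^16 hash buckets */
    #define NUM_BUCKETS  (1u << INDEX_BITS)
    #define LOG_CAPACITY (1u << 20)        /* 1 MiB demo "flash" log */

    /* Conceptually the paper's six-byte bucket (padded to eight bytes here):
       bit 15 of keyfrag_valid is the valid bit, the low 15 bits are the key
       fragment, and offset points at the entry in the log. */
    struct bucket {
        uint16_t keyfrag_valid;
        uint32_t offset;
    };

    /* On-"flash" log entry header: full key and value length, value follows. */
    struct log_entry {
        uint8_t  key[KEY_LEN];
        uint32_t len;
    };

    static struct bucket index_tbl[NUM_BUCKETS];
    static _Alignas(8) uint8_t data_log[LOG_CAPACITY];  /* C11 alignment */
    static uint32_t log_tail;              /* next sequential append offset */

    /* Index bits and key fragment come from the low-order end of the key,
       mirroring Figure 3 above. */
    static uint16_t index_bits(const uint8_t *key) {
        return (uint16_t)((key[KEY_LEN - 2] << 8) | key[KEY_LEN - 1]);
    }
    static uint16_t keyfrag(const uint8_t *key) {
        return (uint16_t)(((key[KEY_LEN - 4] << 8) | key[KEY_LEN - 3]) & 0x7fff);
    }

    /* Put: append the full entry to the log, then point the index at it. */
    int fawn_put(const uint8_t *key, const void *val, uint32_t len) {
        if (log_tail + sizeof(struct log_entry) + len > LOG_CAPACITY)
            return -1;                     /* demo log full */
        struct log_entry *e = (struct log_entry *)(data_log + log_tail);
        memcpy(e->key, key, KEY_LEN);
        e->len = len;
        memcpy((uint8_t *)e + sizeof *e, val, len);

        uint32_t slot = index_bits(key);
        for (uint32_t probes = 0; probes < NUM_BUCKETS; probes++) {
            struct bucket *b = &index_tbl[slot];
            int occupied = b->keyfrag_valid & 0x8000;
            int same_key = occupied &&
                (b->keyfrag_valid & 0x7fff) == keyfrag(key) &&
                memcmp(((struct log_entry *)(data_log + b->offset))->key,
                       key, KEY_LEN) == 0;
            if (!occupied || same_key) {   /* empty slot, or overwrite of same key */
                b->keyfrag_valid = (uint16_t)(0x8000 | keyfrag(key));
                b->offset = log_tail;
                /* keep the next entry 8-byte aligned */
                log_tail = (log_tail + (uint32_t)sizeof *e + len + 7u) & ~7u;
                return 0;
            }
            slot = (slot + 1) & (NUM_BUCKETS - 1);   /* linear probe */
        }
        return -1;                         /* index full */
    }

    /* Get: match the 15-bit fragment in DRAM, then verify the full key by
       reading it from the log (one "flash" read in the common case). */
    const void *fawn_get(const uint8_t *key, uint32_t *len_out) {
        uint32_t slot = index_bits(key);
        for (uint32_t probes = 0; probes < NUM_BUCKETS; probes++) {
            struct bucket *b = &index_tbl[slot];
            if (!(b->keyfrag_valid & 0x8000))
                return NULL;               /* empty slot: key not present */
            if ((b->keyfrag_valid & 0x7fff) == keyfrag(key)) {
                struct log_entry *e = (struct log_entry *)(data_log + b->offset);
                if (memcmp(e->key, key, KEY_LEN) == 0) {   /* full-key check */
                    *len_out = e->len;
                    return (const uint8_t *)e + sizeof *e;
                }
            }
            slot = (slot + 1) & (NUM_BUCKETS - 1);
        }
        return NULL;
    }

    As in Figure 3, a fragment match is only a hint; the full key read from the log confirms the hit, so a lookup costs one log read in the common case and every write is a sequential append.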

  • Research Example

    • Developed a DRAM-efficient system to find a record's location on flash ("partial-key hashing"), 2008–9
    • We've continued this since then:
      • Partial-key cuckoo hashing, 2011
      • Optimistic concurrent cuckoo hashing, 2012

    16
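    The cuckoo-hashing follow-ups are only named above, so the following is a hedged sketch of the partial-key trick they build on (as in MemC3 and the cuckoo filter): each slot stores a small tag, and a key's alternate bucket is derived from its current bucket and that tag, so entries can be displaced during inserts without re-reading the full key. The bucket count, the 4-way slots, the FNV-1a hash, the 8-bit tags, and the eviction policy are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_CBUCKETS     1024          /* power of two */
    #define SLOTS_PER_BUCKET 4
    #define MAX_KICKS        500

    struct cslot { uint8_t tag; const void *item; };   /* tag 0 means empty */
    static struct cslot ctable[NUM_CBUCKETS][SLOTS_PER_BUCKET];

    static uint32_t fnv1a(const void *p, size_t n) {   /* illustrative hash */
        const uint8_t *b = (const uint8_t *)p;
        uint32_t h = 2166136261u;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }
    static uint8_t  make_tag(uint32_t h)  { uint8_t t = (uint8_t)(h >> 24); return t ? t : 1; }
    static uint32_t bucket_of(uint32_t h) { return h & (NUM_CBUCKETS - 1); }
    static uint32_t alt_bucket(uint32_t b, uint8_t tag) {
        /* b2 = b1 XOR hash(tag): computable from either bucket plus the tag alone */
        return (b ^ fnv1a(&tag, 1)) & (NUM_CBUCKETS - 1);
    }

    const void *cuckoo_lookup(const void *key, size_t keylen) {
        uint32_t h = fnv1a(key, keylen);
        uint8_t tag = make_tag(h);
        uint32_t b[2];
        b[0] = bucket_of(h);
        b[1] = alt_bucket(b[0], tag);
        for (int i = 0; i < 2; i++)
            for (int s = 0; s < SLOTS_PER_BUCKET; s++)
                if (ctable[b[i]][s].tag == tag)
                    return ctable[b[i]][s].item;   /* candidate only */
        return NULL;
    }

    int cuckoo_insert(const void *key, size_t keylen, const void *item) {
        uint32_t h = fnv1a(key, keylen);
        uint8_t tag = make_tag(h);
        uint32_t b = bucket_of(h);
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            uint32_t cand[2] = { b, alt_bucket(b, tag) };
            for (int i = 0; i < 2; i++)
                for (int s = 0; s < SLOTS_PER_BUCKET; s++)
                    if (ctable[cand[i]][s].tag == 0) {
                        ctable[cand[i]][s].tag = tag;
                        ctable[cand[i]][s].item = item;
                        return 0;
                    }
            /* Both buckets full: displace an existing entry. Because its tag
               is stored, its alternate bucket is computable without its key. */
            int vs = kicks % SLOTS_PER_BUCKET;
            struct cslot victim = ctable[b][vs];
            ctable[b][vs].tag = tag;
            ctable[b][vs].item = item;
            tag  = victim.tag;
            item = victim.item;
            b    = alt_bucket(b, tag);
        }
        return -1;   /* table too full; a real implementation would resize */
    }

    A tag match is only a candidate: the caller verifies the full key on the returned item, just as the FAWN-DS index verifies the full key against the log.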

  • Evaluation Takeaways

    • 2008: FAWN-based system 6x more efficient than traditional systems
    • Partial-key hashing enabled a memory-efficient DRAM index for flash-resident data
    • Can create a high-performance, predictable storage service for small key-value pairs

    17

  • And then we moved to Atom + SSD

    18

    Old node: Geode 500MHz, 256MB DRAM, 4GB CF card (~2k IOPS)
    New node: Atom 1.6 GHz single-core, 2GB DRAM, 120GB SSD (~60k IOPS)
    Improvement: 6x (CPU), 8x (DRAM), 30–60x (IOPS)

  • [Diagram: a FAWN-KV front end over multiple FAWN-DS backend stores; the backends are later replaced by SILT, a backend store hyper-optimized for low DRAM and large flash.]

    (Roadmap: FAWN-DS, FAWN-KV, SILT, Small Cache, Cuckoo)

  • Systems begat algorithms:

    "Practical Batch-Updatable External Hashing with Sorting", H. Lim et al., ALENEX 2012

    (Recently heard that Bing uses several state-of-the-art, memory-efficient indexes)

    20

  • And now... load imbalance

    • Distributed key-value system

    21

    [Diagram: a front end routes requests to backends (Backend1, Backend2, ... up to Backend88 in the figure), each an Atom CPU with an SSD:
    1. get(key)
    2. BackendID = hash(key)
    3. val = lookup(key)
    4. return val
    Each backend serves 10,000 queries/sec; the SLA is 850,000 queries/sec.]
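    The four-step request path in the diagram is simple enough to sketch directly. NUM_BACKENDS, the djb2 hash, and the backend_lookup() stub below are illustrative assumptions; the slides state only that BackendID = hash(key). The point of the next slides is that this static hash partitioning lets a skewed or adversarial workload overload a single backend.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_BACKENDS 8   /* assumed cluster size for the sketch */

    /* Stand-in for the RPC to a backend node; the real node would answer
       from its local FAWN-DS (or later SILT) store. */
    static const void *backend_lookup(int backend_id, const void *key, size_t keylen) {
        (void)backend_id; (void)key; (void)keylen;
        return NULL;
    }

    static uint32_t hash_key(const void *key, size_t keylen) {   /* djb2, illustrative */
        const uint8_t *b = (const uint8_t *)key;
        uint32_t h = 5381u;
        while (keylen--) h = h * 33u + *b++;
        return h;
    }

    /* The slide's four steps: get(key) arrives at the front end, the key is
       hashed to a backend ID, that backend does the lookup, and the value is
       returned to the client. */
    const void *frontend_get(const void *key, size_t keylen) {
        int backend_id = (int)(hash_key(key, keylen) % NUM_BACKENDS);  /* step 2 */
        return backend_lookup(backend_id, key, keylen);                /* steps 3-4 */
    }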

  • Measured throughput on the FAWN testbed

    22

    [Figure: overall throughput (KQPS, 0–1000) vs. n, the number of nodes (10–100), for uniform, Zipf (1.01), and adversarial key distributions.]

  • How many items to cache?

    23

    [Diagram: queries enter through the front end, now with a cache, and fan out to Backend1, Backend2, ..., Backend8.]

    [Figure: overall throughput (KQPS, 0–1000) vs. n, the number of nodes (10–100), for uniform, Zipf (1.01), and adversarial key distributions.]

  • A small, fast cache is enough!

    24

    [Diagram: the front-end cache sits in front of Backend1, Backend2, ..., Backend8.]

    We prove that, for n nodes:
    • Only the O(n log n) most popular entries need to be cached
    • With 100 backend nodes, that is only about 4,000 items in the cache. Tiny!
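    A hedged sketch of how the result is used: the front end keeps a tiny cache of the hottest keys and consults it before hashing to a backend, as in the path sketched earlier. The cache-size helper, the constant factor c, the direct-mapped layout, and the naive fill policy are illustrative assumptions; the slide states only the O(n log n) bound and the roughly 4,000-item figure for 100 backends.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    /* Cache-size bound from the slide: O(n log n) entries for n backends.
       The constant factor c is an assumption, not given on the slide. */
    size_t frontend_cache_entries(size_t n_backends, double c) {
        return (size_t)ceil(c * (double)n_backends * log((double)n_backends));
    }

    /* Toy direct-mapped cache of hot keys, checked before hashing to a
       backend. A real front end would keep the hottest O(n log n) items
       with a proper replacement policy; everything below is illustrative. */
    #define CACHE_SLOTS 4096
    #define MAX_KEY     64

    struct hot_entry { char key[MAX_KEY]; const void *val; int used; };
    static struct hot_entry hot_cache[CACHE_SLOTS];

    static size_t cache_slot(const char *key) {
        size_t h = 5381u;
        while (*key) h = h * 33u + (uint8_t)*key++;
        return h % CACHE_SLOTS;
    }

    const void *cached_get(const char *key,
                           const void *(*backend_get)(const char *key)) {
        struct hot_entry *e = &hot_cache[cache_slot(key)];
        if (e->used && strcmp(e->key, key) == 0)
            return e->val;                   /* hot key served by the front end */
        const void *v = backend_get(key);    /* otherwise fall through to a backend */
        if (v && strlen(key) < MAX_KEY) {    /* naive fill; no popularity tracking */
            strcpy(e->key, key);
            e->val = v;
            e->used = 1;
        }
        return v;
    }

    For example, with n = 100 and a constant of, say, c = 9, c * n * ln(n) is about 4,100 entries, in the same ballpark as the slide's "about 4,000 items".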

  • Worst case? Now best case

    25

    [Figures: overall throughput (KQPS, 0–1000) vs. n (10–100 nodes) for uniform, Zipf (1.01), and adversarial workloads, before and after adding the front-end cache; with the cache, the adversarial workload is no longer the worst case.]

  • Thus...

    26

  • [Diagram: the full system: a "brawny" front-end server with an insanely fast cache of the O(N log N) hottest items, in front of "wimpy" SILT backend servers.]

    Roadmap: FAWN-DS, FAWN-KV, SILT, Small Cache, Cuckoo

    Techniques: multi-reader parallel cuckoo hashing, entropy-coded tries, partial-key cuckoo hashing, cuckoo filter

    Papers: [FAWN, SOSP 2009], [SILT, SOSP 2011], ["small cache", SoCC 2011], ["MemC3", NSDI 2013], [SOSP + ALENEX]

  • Highly parallel, lower-GHz, (memory-constrained?) systems: architectures, algorithms, and programming

    [Figures: Moore's Law vs. Dennard scaling.]