
  • Learning to Scale Out By Scaling Down

    The FAWN Project

    David Andersen, Vijay Vasudevan, Michael Kaminsky*, Michael A. Kozuch*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Iulian Moraru, Sang Kil Cha, Hyeontaek Lim, Bin Fan, Reinhard Munz, Nathan Wan, Jack Ferris, Hrishikesh Amur**, Wolfgang Richter, Michael Freedman***, Wyatt Lloyd***, Padmanabhan Pillai*, Dong Zhou

    Carnegie Mellon University, *Intel Labs Pittsburgh, **Georgia Tech, ***Princeton University

  • Power limits computing

    2

    [Figure: a server at 100% load draws 1000W, and datacenter infrastructure roughly doubles that to 2000W (PUE in 2005: 2–3; by 2012: ~1.1; "leave it to industry"). At 20% load the same server still draws 750W instead of a proportional 200W (proportionality). FAWN nodes doing the same work draw about 300W (efficiency).]

  • Gigahertz is not free

    [Figure: instructions/sec/W (millions, i.e., instructions/Joule) vs. instructions/sec (millions, log scale from 1 to 100,000) for a custom ARM mote, an XScale 800MHz, an Atom Z500, and a Xeon 7350; the fastest chip delivers the fewest instructions per Joule. Speed and power calculated from specification sheets; power includes "system overhead" (e.g., Ethernet).]

  • The Memory Wall

    [Figure: access latency in nanoseconds (log scale, 0.1 ns to 10^8 ns) vs. year, 1980–2005, for disk seek (~1 ms), DRAM access, and CPU cycle (~1 ns). Bridging the gap: caching, speculation, etc.]

    [Diagram: transistors have the soul of a capacitor. Charge here moves charge carriers here, which lets current flow.]

  • Gigahertz hurts

    Remember: Memory capacity costs you

  • “Wimpy” Nodes

    7

    1.6 GHz dual-core Atom, 32–160 GB flash SSD, only 1 GB of DRAM!

  • “Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code” - Kathy Yelick

    8

  • The FAWN Quad of Pain

    [Diagram: wimpy nodes and bigger clusters bring four pain points: parallelization, load balancing, memory capacity, and hardware specificity.]

  • It’s not just masochism

    All systems will face this challenge over time

    [Figures: Moore's Law vs. Dennard scaling, from Danowitz, Kelley, Mao, Stevenson, and Horowitz: CPU DB.]

  • FAWN: it started with a key-value store

    11

  • Key-value storage systems

    • Critical infrastructure service
    • Performance-conscious
    • Random-access, read-mostly, hard to cache

    12

  • Small record, random access

    Select name,photo from users where uid=513542;
    Select name,photo from users where uid=818503;
    Select name,photo from users where uid=468883;
    Select name,photo from users where uid=124111;
    Select name,photo from users where uid=474488;
    Select name,photo from users where uid=42223;
    Select name,photo from users where uid=124566;
    Select name,photo from users where uid=097788;
    Select name,photo from users where uid=357845;
    Select wallpost from posts where pid=13821828188;
    Select wallpost from posts where pid=89888333522;
    Select wallpost from posts where pid=12314144887;
    Select wallpost from posts where pid=738838402;

    13

  • FAWN-DS and -KV: Key-value Storage System

    Goal: improve Queries/Joule

    14

    Prototype node: 500MHz CPU, 256MB DRAM, 4GB CompactFlash

    Unique challenges:
    • Wimpy CPUs, limited DRAM
    • Flash poor at small random writes
    • Sustain performance during membership changes

  • Avoiding random writes

    15

    [Diagram: the in-DRAM Hash Index (buckets of KeyFrag | Valid | Offset, indexed by bits of the 160-bit key) points into the append-only Data Log in flash (log entries of Key | Len | Data). A Put(K1,V) appends the value to the log and updates the index; all writes to flash are sequential.]

    Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.

    3.2 Understanding Flash Storage

    Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads—but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:

    1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].

    2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.

    3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB–256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].

    Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].

    These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.

    3.3 The FAWN Data Store

    FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete.¹

    FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

    ¹We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.

    /* KEY = 0x93df7317294b99e3e049, 16 index bits */
    INDEX = KEY & 0xffff;            /* = 0xe049; */
    KEYFRAG = (KEY >> 16) & 0x7fff;  /* = 0x19e3; */
    for i = 0 to NUM_HASHES do
        bucket = hash[i](INDEX);
        if bucket.valid && bucket.keyfrag == KEYFRAG &&
           readKey(bucket.offset) == KEY then
            return bucket;
        end if
        {Check next chain element...}
    end for
    return NOT_FOUND;

    Figure 3: Pseudocode for hash bucket lookup in FAWN-DS.

    Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).

    Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.

    Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.

    The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
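    To make the lookup and append paths above concrete, here is a minimal C sketch of the FAWN-DS idea: an append-only log (an in-memory buffer standing in for flash) plus a compact DRAM index holding only a 15-bit key fragment, a valid bit, and an offset per entry. The fixed 160-bit keys, the linear-probing collision handling (the paper uses hash chaining), the padded 8-byte bucket, and names such as fawn_put, fawn_get, and LOG_CAPACITY are illustrative assumptions, not the original implementation.

    #include <stdint.h>
    #include <string.h>

    #define KEY_LEN      20                /* 160-bit keys */
    #define INDEX_BITS   16                /* 2^16 hash buckets */
    #define NUM_BUCKETS  (1u << INDEX_BITS)
    #define LOG_CAPACITY (1u << 20)        /* 1 MiB demo "flash" log */

    /* Conceptually the paper's six-byte bucket (padded to eight bytes here):
       bit 15 of keyfrag_valid is the valid bit, the low 15 bits are the key
       fragment, and offset points at the entry in the log. */
    struct bucket {
        uint16_t keyfrag_valid;
        uint32_t offset;
    };

    /* On-"flash" log entry header: full key and value length, value follows. */
    struct log_entry {
        uint8_t  key[KEY_LEN];
        uint32_t len;
    };

    static struct bucket index_tbl[NUM_BUCKETS];
    static _Alignas(8) uint8_t data_log[LOG_CAPACITY];  /* C11 alignment */
    static uint32_t log_tail;              /* next sequential append offset */

    /* Index bits and key fragment come from the low-order end of the key,
       mirroring Figure 3 above. */
    static uint16_t index_bits(const uint8_t *key) {
        return (uint16_t)((key[KEY_LEN - 2] << 8) | key[KEY_LEN - 1]);
    }
    static uint16_t keyfrag(const uint8_t *key) {
        return (uint16_t)(((key[KEY_LEN - 4] << 8) | key[KEY_LEN - 3]) & 0x7fff);
    }

    /* Put: append the full entry to the log, then point the index at it. */
    int fawn_put(const uint8_t *key, const void *val, uint32_t len) {
        if (log_tail + sizeof(struct log_entry) + len > LOG_CAPACITY)
            return -1;                     /* demo log full */
        struct log_entry *e = (struct log_entry *)(data_log + log_tail);
        memcpy(e->key, key, KEY_LEN);
        e->len = len;
        memcpy((uint8_t *)e + sizeof *e, val, len);

        uint32_t slot = index_bits(key);
        for (uint32_t probes = 0; probes < NUM_BUCKETS; probes++) {
            struct bucket *b = &index_tbl[slot];
            int occupied = b->keyfrag_valid & 0x8000;
            int same_key = occupied &&
                (b->keyfrag_valid & 0x7fff) == keyfrag(key) &&
                memcmp(((struct log_entry *)(data_log + b->offset))->key,
                       key, KEY_LEN) == 0;
            if (!occupied || same_key) {   /* empty slot, or overwrite of same key */
                b->keyfrag_valid = (uint16_t)(0x8000 | keyfrag(key));
                b->offset = log_tail;
                /* keep the next entry 8-byte aligned */
                log_tail = (log_tail + (uint32_t)sizeof *e + len + 7u) & ~7u;
                return 0;
            }
            slot = (slot + 1) & (NUM_BUCKETS - 1);   /* linear probe */
        }
        return -1;                         /* index full */
    }

    /* Get: match the 15-bit fragment in DRAM, then verify the full key by
       reading it from the log (one "flash" read in the common case). */
    const void *fawn_get(const uint8_t *key, uint32_t *len_out) {
        uint32_t slot = index_bits(key);
        for (uint32_t probes = 0; probes < NUM_BUCKETS; probes++) {
            struct bucket *b = &index_tbl[slot];
            if (!(b->keyfrag_valid & 0x8000))
                return NULL;               /* empty slot: key not present */
            if ((b->keyfrag_valid & 0x7fff) == keyfrag(key)) {
                struct log_entry *e = (struct log_entry *)(data_log + b->offset);
                if (memcmp(e->key, key, KEY_LEN) == 0) {   /* full-key check */
                    *len_out = e->len;
                    return (const uint8_t *)e + sizeof *e;
                }
            }
            slot = (slot + 1) & (NUM_BUCKETS - 1);
        }
        return NULL;
    }

    As in Figure 3, a fragment match is only a hint; the full key read from the log confirms the hit, so a lookup costs one log read in the common case and every write is a sequential append.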

  • Research Example

    • Developed a DRAM-efficient system to find a record's location on flash ("partial-key hashing"), 2008–9
    • We've continued this since then:
      • Partial-key cuckoo hashing, 2011
      • Optimistic concurrent cuckoo hashing, 2012

    16
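    The cuckoo-hashing follow-ups are only named above, so the following is a hedged sketch of the partial-key trick they build on (as in MemC3 and the cuckoo filter): each slot stores a small tag, and a key's alternate bucket is derived from its current bucket and that tag, so entries can be displaced during inserts without re-reading the full key. The bucket count, the 4-way slots, the FNV-1a hash, the 8-bit tags, and the eviction policy are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_CBUCKETS     1024          /* power of two */
    #define SLOTS_PER_BUCKET 4
    #define MAX_KICKS        500

    struct cslot { uint8_t tag; const void *item; };   /* tag 0 means empty */
    static struct cslot ctable[NUM_CBUCKETS][SLOTS_PER_BUCKET];

    static uint32_t fnv1a(const void *p, size_t n) {   /* illustrative hash */
        const uint8_t *b = (const uint8_t *)p;
        uint32_t h = 2166136261u;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }
    static uint8_t  make_tag(uint32_t h)  { uint8_t t = (uint8_t)(h >> 24); return t ? t : 1; }
    static uint32_t bucket_of(uint32_t h) { return h & (NUM_CBUCKETS - 1); }
    static uint32_t alt_bucket(uint32_t b, uint8_t tag) {
        /* b2 = b1 XOR hash(tag): computable from either bucket plus the tag alone */
        return (b ^ fnv1a(&tag, 1)) & (NUM_CBUCKETS - 1);
    }

    const void *cuckoo_lookup(const void *key, size_t keylen) {
        uint32_t h = fnv1a(key, keylen);
        uint8_t tag = make_tag(h);
        uint32_t b[2];
        b[0] = bucket_of(h);
        b[1] = alt_bucket(b[0], tag);
        for (int i = 0; i < 2; i++)
            for (int s = 0; s < SLOTS_PER_BUCKET; s++)
                if (ctable[b[i]][s].tag == tag)
                    return ctable[b[i]][s].item;   /* candidate only */
        return NULL;
    }

    int cuckoo_insert(const void *key, size_t keylen, const void *item) {
        uint32_t h = fnv1a(key, keylen);
        uint8_t tag = make_tag(h);
        uint32_t b = bucket_of(h);
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            uint32_t cand[2] = { b, alt_bucket(b, tag) };
            for (int i = 0; i < 2; i++)
                for (int s = 0; s < SLOTS_PER_BUCKET; s++)
                    if (ctable[cand[i]][s].tag == 0) {
                        ctable[cand[i]][s].tag = tag;
                        ctable[cand[i]][s].item = item;
                        return 0;
                    }
            /* Both buckets full: displace an existing entry. Because its tag
               is stored, its alternate bucket is computable without its key. */
            int vs = kicks % SLOTS_PER_BUCKET;
            struct cslot victim = ctable[b][vs];
            ctable[b][vs].tag = tag;
            ctable[b][vs].item = item;
            tag  = victim.tag;
            item = victim.item;
            b    = alt_bucket(b, tag);
        }
        return -1;   /* table too full; a real implementation would resize */
    }

    A tag match is only a candidate: the caller verifies the full key on the returned item, just as the FAWN-DS index verifies the full key against the log.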

  • Evaluation Takeaways

    • 2008: FAWN-based system 6x more efficient than traditional systems
    • Partial-key hashing enabled a memory-efficient DRAM index for flash-resident data
    • Can create a high-performance, predictable storage service for small key-value pairs

    17

  • And then we moved to Atom + SSD

    18

    Old node: Geode 500MHz, 256MB DRAM, 4GB CF card (~2k IOPS)
    New node: Atom 1.6 GHz single-core, 2GB DRAM, 120GB SSD (~60k IOPS)
    Improvement: 6x (CPU), 8x (DRAM), 30–60x (IOPS)

  • [Diagram: a FAWN-KV front end over multiple FAWN-DS backend stores; the backends are later replaced by SILT, a backend store hyper-optimized for low DRAM and large flash.]

    (Roadmap: FAWN-DS, FAWN-KV, SILT, Small Cache, Cuckoo)

  • Systems begat algorithms:

    "Practical Batch-Updatable External Hashing with Sorting", H. Lim et al., ALENEX 2012

    (Recently heard that Bing uses several state-of-the-art, memory-efficient indexes)

    20

  • And now... load imbalance

    • Distributed key-value system

    21

    [Diagram: a front end routes requests to backends (Backend1, Backend2, ... up to Backend88 in the figure), each an Atom CPU with an SSD:
    1. get(key)
    2. BackendID = hash(key)
    3. val = lookup(key)
    4. return val
    Each backend serves 10,000 queries/sec; the SLA is 850,000 queries/sec.]
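    The four-step request path in the diagram is simple enough to sketch directly. NUM_BACKENDS, the djb2 hash, and the backend_lookup() stub below are illustrative assumptions; the slides state only that BackendID = hash(key). The point of the next slides is that this static hash partitioning lets a skewed or adversarial workload overload a single backend.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_BACKENDS 8   /* assumed cluster size for the sketch */

    /* Stand-in for the RPC to a backend node; the real node would answer
       from its local FAWN-DS (or later SILT) store. */
    static const void *backend_lookup(int backend_id, const void *key, size_t keylen) {
        (void)backend_id; (void)key; (void)keylen;
        return NULL;
    }

    static uint32_t hash_key(const void *key, size_t keylen) {   /* djb2, illustrative */
        const uint8_t *b = (const uint8_t *)key;
        uint32_t h = 5381u;
        while (keylen--) h = h * 33u + *b++;
        return h;
    }

    /* The slide's four steps: get(key) arrives at the front end, the key is
       hashed to a backend ID, that backend does the lookup, and the value is
       returned to the client. */
    const void *frontend_get(const void *key, size_t keylen) {
        int backend_id = (int)(hash_key(key, keylen) % NUM_BACKENDS);  /* step 2 */
        return backend_lookup(backend_id, key, keylen);                /* steps 3-4 */
    }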

  • Measured throughput on the FAWN testbed

    22

    [Figure: overall throughput (KQPS, 0–1000) vs. n, the number of nodes (10–100), for uniform, Zipf (1.01), and adversarial key distributions.]

  • How many items to cache?

    23

    [Diagram: queries enter through the front end, now with a cache, and fan out to Backend1, Backend2, ..., Backend8.]

    [Figure: overall throughput (KQPS, 0–1000) vs. n, the number of nodes (10–100), for uniform, Zipf (1.01), and adversarial key distributions.]

  • A small, fast cache is enough!

    24

    [Diagram: the front-end cache sits in front of Backend1, Backend2, ..., Backend8.]

    We prove that, for n nodes:
    • Only the O(n log n) most popular entries need to be cached
    • With 100 backend nodes, that is only about 4,000 items in the cache. Tiny!
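    A hedged sketch of how the result is used: the front end keeps a tiny cache of the hottest keys and consults it before hashing to a backend, as in the path sketched earlier. The cache-size helper, the constant factor c, the direct-mapped layout, and the naive fill policy are illustrative assumptions; the slide states only the O(n log n) bound and the roughly 4,000-item figure for 100 backends.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    /* Cache-size bound from the slide: O(n log n) entries for n backends.
       The constant factor c is an assumption, not given on the slide. */
    size_t frontend_cache_entries(size_t n_backends, double c) {
        return (size_t)ceil(c * (double)n_backends * log((double)n_backends));
    }

    /* Toy direct-mapped cache of hot keys, checked before hashing to a
       backend. A real front end would keep the hottest O(n log n) items
       with a proper replacement policy; everything below is illustrative. */
    #define CACHE_SLOTS 4096
    #define MAX_KEY     64

    struct hot_entry { char key[MAX_KEY]; const void *val; int used; };
    static struct hot_entry hot_cache[CACHE_SLOTS];

    static size_t cache_slot(const char *key) {
        size_t h = 5381u;
        while (*key) h = h * 33u + (uint8_t)*key++;
        return h % CACHE_SLOTS;
    }

    const void *cached_get(const char *key,
                           const void *(*backend_get)(const char *key)) {
        struct hot_entry *e = &hot_cache[cache_slot(key)];
        if (e->used && strcmp(e->key, key) == 0)
            return e->val;                   /* hot key served by the front end */
        const void *v = backend_get(key);    /* otherwise fall through to a backend */
        if (v && strlen(key) < MAX_KEY) {    /* naive fill; no popularity tracking */
            strcpy(e->key, key);
            e->val = v;
            e->used = 1;
        }
        return v;
    }

    For example, with n = 100 and a constant of, say, c = 9, c * n * ln(n) is about 4,100 entries, in the same ballpark as the slide's "about 4,000 items".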

  • Worst case? Now best case

    25

    [Figures: overall throughput (KQPS, 0–1000) vs. n (10–100 nodes) for uniform, Zipf (1.01), and adversarial workloads, before and after adding the front-end cache; with the cache, the adversarial workload is no longer the worst case.]

  • Thus...

    26

  • [Diagram: the full system: a "brawny" front-end server with an insanely fast cache of the O(N log N) hottest items, in front of "wimpy" SILT backend servers.]

    Roadmap: FAWN-DS, FAWN-KV, SILT, Small Cache, Cuckoo

    Techniques: multi-reader parallel cuckoo hashing, entropy-coded tries, partial-key cuckoo hashing, cuckoo filter

    Papers: [FAWN, SOSP 2009], [SILT, SOSP 2011], ["small cache", SoCC 2011], ["MemC3", NSDI 2013], [SOSP + ALENEX]

  • Highly parallel, lower-GHz, (memory-constrained?) systems: architectures, algorithms, and programming

    [Figures: Moore's Law vs. Dennard scaling.]