TRANSCRIPT
-
Learning to Scale Out by Scaling Down
The FAWN Project
David Andersen, Vijay Vasudevan, Michael Kaminsky*, Michael A. Kozuch*, Amar Phanishayee, Lawrence Tan, Jason Franklin, Iulian Moraru, Sang Kil Cha, Hyeontaek Lim, Bin Fan, Reinhard Munz, Nathan Wan, Jack Ferris, Hrishikesh Amur**, Wolfgang Richter, Michael Freedman***, Wyatt Lloyd***, Padmanabhan Pillai*, Dong Zhou
Carnegie Mellon University, *Intel Labs Pittsburgh, **Georgia Tech, ***Princeton University
-
Power limits computing
2
-
[Bar chart: where the power goes. Servers at 100% load: ~1000W, roughly 2000W once infrastructure overhead is included (Infrastructure: PUE; 2005: 2–3, 2012: ~1.1; "leave it to industry"). At 20% load the same servers still draw ~750W, versus ~200W if they were energy-proportional (proportionality). FAWNs at 100% load: ~300W (efficiency).]
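(Aside on the chart above: PUE is the ratio of total facility power to IT power, so 1000 W of servers shows up as roughly 2.0 × 1000 W = 2000 W at the wall with a 2005-era PUE of 2–3, but only about 1.1 × 1000 W = 1100 W at a 2012-era PUE of ~1.1.)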
-
(Speed)
Gigahertz is not free
[Scatter plot: instructions/sec/W in millions (0–2500) versus instructions/sec in millions (log scale, 1–100,000), i.e., instructions per Joule. Points include a custom ARM mote, XScale 800 MHz, Atom Z500, and Xeon 7350; the slower cores deliver far more instructions per Joule.]
Speed and power calculated from specification sheets. Power includes "system overhead" (e.g., Ethernet).
-
The Memory Wall
[Log-scale plot: latency in nanoseconds (1E-01 to 1E+08) versus year, 1980-2005, for disk seek, DRAM access, and CPU cycle time, with reference lines at 1 ms, 1 μs, and 1 ns. The gap between CPU and memory keeps widening.]
Bridge gap: caching, speculation, etc.
Transistors have the soul of a capacitor
[Transistor diagram: charge here moves charge carriers here, which lets current flow.]
-
Gigahertz hurts
Remember: Memory capacity costs you
-
“Wimpy” Nodes
7
1.6 GHz dual-core Atom
32-160 GB flash SSD
Only 1 GB DRAM!
-
“Each decimal order of magnitude increase in parallelism requires a major redesign and rewrite of parallel code” - Kathy Yelick
8
-
The FAWN Quad of Pain
[Quadrant diagram. Causes: bigger clusters, wimpy nodes. Pains: parallelization, load balancing, memory capacity, hardware specificity.]
-
It’s not just masochism
All systems will face this challenge over time
[Figures: Moore's law and Dennard scaling trends, from Danowitz, Kelley, Mao, Stevenson, and Horowitz: CPU DB]
-
FAWN: It started with a key-value store
11
-
Key-value storage systems
• Critical infrastructure service
• Performance-conscious
• Random-access, read-mostly, hard to cache
12
-
Small record, random access
13
Select name,photo from users where uid=124111;
Select name,photo from users where uid=474488;
Select name,photo from users where uid=42223;
Select name,photo from users where uid=124566;
Select name,photo from users where uid=097788;
Select name,photo from users where uid=357845;
Select wallpost from posts where pid=89888333522;
Select wallpost from posts where pid=13821828188;
Select wallpost from posts where pid=12314144887;
Select wallpost from posts where pid=738838402;
-
FAWN-DS and FAWN-KV: Key-value Storage System
Goal: improve Queries/Joule
14
[Prototype node: 500MHz CPU, 256MB DRAM, 4GB CompactFlash]
Unique challenges:
• Wimpy CPUs, limited DRAM
• Flash poor at small random writes
• Sustain performance during membership changes
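(For reference, Queries/Joule is just sustained query rate divided by power draw: since 1 W = 1 J/s, a node serving q queries per second while drawing p watts delivers q/p queries per Joule.)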
-
Avoiding random writes
15
[Slide diagram: a Put of (K1,V) updates the in-DRAM Hash Index (a hashtable of KeyFrag | Valid | Offset entries, indexed by part of the 160-bit key) and appends a Key | Len | Data entry to the Data Log in flash; inserted values are appended to the end of the log.]
All writes to flash are sequential.

[The slide reproduces Figures 2 and 3 and Sections 3.2-3.3 of the FAWN paper:]

Figure 2: (a) FAWN-DS appends writes to the end of the Data Log. (b) Split requires a sequential scan of the data region, transferring out-of-range entries to the new store. (c) After the scan is complete, the datastore list is atomically updated to add the new store. Compaction of the original store will clean up out-of-range entries.

3.2 Understanding Flash Storage
Flash provides a non-volatile memory store with several significant benefits over typical magnetic hard disks for random-access, read-intensive workloads, but it also introduces several challenges. Three characteristics of flash underlie the design of the FAWN-KV system described throughout this section:
1. Fast random reads: (≪ 1 ms), up to 175 times faster than random reads on magnetic disk [35, 40].
2. Efficient I/O: Flash devices consume less than one Watt even under heavy load, whereas mechanical disks can consume over 10 W at load. Flash is over two orders of magnitude more efficient than mechanical disks in terms of queries/Joule.
3. Slow random writes: Small writes on flash are very expensive. Updating a single page requires first erasing an entire erase block (128 KB-256 KB) of pages, and then writing the modified block in its entirety. As a result, updating a single byte of data is as expensive as writing an entire block of pages [37].
Modern devices improve random write performance using write buffering and preemptive block erasure. These techniques improve performance for short bursts of writes, but recent studies show that sustained random writes still perform poorly on these devices [40].
These performance problems motivate log-structured techniques for flash filesystems and data structures [36, 37, 23]. These same considerations inform the design of FAWN's node storage management system, described next.

3.3 The FAWN Data Store
FAWN-DS is a log-structured key-value store. Each store contains values for the key range associated with one virtual ID. It acts to clients like a disk-based hash table that supports Store, Lookup, and Delete. (We differentiate datastore from database to emphasize that we do not provide a transactional or relational interface.)
FAWN-DS is designed specifically to perform well on flash storage and to operate within the constrained DRAM available on wimpy nodes: all writes to the datastore are sequential, and reads require a single random access. To provide this property, FAWN-DS maintains an in-DRAM hash table (Hash Index) that maps keys to an offset in the append-only Data Log on flash (Figure 2a). This log-structured design is similar to several append-only filesystems [42, 15], which avoid random seeks on magnetic disks for writes.

Figure 3 (pseudocode for hash bucket lookup in FAWN-DS):
/* KEY = 0x93df7317294b99e3e049, 16 index bits */
INDEX = KEY & 0xffff;            /* = 0xe049 */
KEYFRAG = (KEY >> 16) & 0x7fff;  /* = 0x19e3 */
for i = 0 to NUM_HASHES do
    bucket = hash[i](INDEX);
    if bucket.valid && bucket.keyfrag == KEYFRAG &&
       readKey(bucket.offset) == KEY then
        return bucket;
    end if
    /* Check next chain element... */
end for
return NOT_FOUND;

Mapping a Key to a Value. FAWN-DS uses an in-memory (DRAM) Hash Index to map 160-bit keys to a value stored in the Data Log. It stores only a fragment of the actual key in memory to find a location in the log; it then reads the full key (and the value) from the log and verifies that the key it read was, in fact, the correct key. This design trades a small and configurable chance of requiring two reads from flash (we set it to roughly 1 in 32,768 accesses) for drastically reduced memory requirements (only six bytes of DRAM per key-value pair).
Figure 3 shows the pseudocode that implements this design for Lookup. FAWN-DS extracts two fields from the 160-bit key: the i low-order bits of the key (the index bits) and the next 15 low-order bits (the key fragment). FAWN-DS uses the index bits to select a bucket from the Hash Index, which contains 2^i hash buckets. Each bucket is only six bytes: a 15-bit key fragment, a valid bit, and a 4-byte pointer to the location in the Data Log where the full entry is stored.
Lookup proceeds, then, by locating a bucket using the index bits and comparing the key against the key fragment. If the fragments do not match, FAWN-DS uses hash chaining to continue searching the hash table. Once it finds a matching key fragment, FAWN-DS reads the record off of the flash. If the stored full key in the on-flash record matches the desired lookup key, the operation is complete. Otherwise, FAWN-DS resumes its hash chaining search of the in-memory hash table and searches additional records. With the 15-bit key fragment, only 1 in 32,768 retrievals from the flash will be incorrect and require fetching an additional record.
The constants involved (15 bits of key fragment, 4 bytes of log pointer) target the prototype FAWN nodes described in Section 4.
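To make the write path above concrete, here is a minimal C sketch of the append-then-index idea: every Put is a sequential append to the Data Log plus a small in-DRAM Hash Index update. The bucket layout mirrors the six bytes of per-key state described above (15-bit key fragment, valid bit, 4-byte offset), but the names, fixed-size arrays, and linear chaining are illustrative simplifications, not the real FAWN-DS code.

#include <stdint.h>
#include <string.h>

#define INDEX_BITS  16
#define NUM_BUCKETS (1u << INDEX_BITS)

struct bucket {                 /* conceptually six bytes of DRAM per key */
    uint16_t keyfrag_valid;     /* top bit: valid; low 15 bits: key fragment */
    uint32_t offset;            /* offset of the full entry in the Data Log */
};

static struct bucket hash_index[NUM_BUCKETS];   /* in DRAM */
static uint8_t  data_log[1u << 20];             /* stand-in for flash */
static uint32_t log_tail;                       /* next free byte of the log */

/* Append (key, len, data) at the end of the log: a purely sequential write. */
static uint32_t log_append(const uint8_t key[20], const uint8_t *val, uint16_t len)
{
    if (log_tail + 22u + len > sizeof data_log)
        return UINT32_MAX;                      /* log full (sketch only) */
    uint32_t off = log_tail;
    memcpy(data_log + log_tail, key, 20);  log_tail += 20;
    memcpy(data_log + log_tail, &len, 2);  log_tail += 2;
    memcpy(data_log + log_tail, val, len); log_tail += len;
    return off;
}

void put(const uint8_t key[20], const uint8_t *val, uint16_t len)
{
    uint32_t off = log_append(key, val, len);
    if (off == UINT32_MAX)
        return;

    /* Index bits select the bucket; the next 15 bits become the in-DRAM
     * key fragment (cf. INDEX and KEYFRAG in Figure 3). */
    uint32_t low = ((uint32_t)key[16] << 24) | ((uint32_t)key[17] << 16) |
                   ((uint32_t)key[18] << 8)  |  (uint32_t)key[19];
    uint32_t idx = low & (NUM_BUCKETS - 1);

    /* Simplified collision handling: step to the next bucket (FAWN-DS uses
     * hash chaining as in Figure 3); assumes the table is not full. */
    while (hash_index[idx].keyfrag_valid & 0x8000u)
        idx = (idx + 1) & (NUM_BUCKETS - 1);

    hash_index[idx].keyfrag_valid = (uint16_t)(0x8000u | ((low >> 16) & 0x7fffu));
    hash_index[idx].offset = off;
}

A Lookup then reverses this: find a bucket whose fragment matches, read the full entry at bucket.offset from flash, and confirm the stored key, exactly as Figure 3 shows.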
-
Research Example
• Developed DRAM-efficient system to find location on flash
• (“Partial-key hashing”) 2008-9
• We've continued this since then:
  • Partial-key cuckoo hashing, 2011
  • Optimistic concurrent cuckoo hashing, 2012
16
-
Evaluation Takeaways
• 2008: FAWN-based system 6x more efficient than traditional systems
• Partial-key hashing enabled memory-efficient DRAM index for flash-resident data
• Can create high-performance, predictable storage service for small key-value pairs
17
-
And then we moved to Atom + SSD
18
CPU:      Geode, 500 MHz            ->  Atom, 1.6 GHz (single-core)   (~6x)
DRAM:     256 MB                    ->  2 GB                          (~8x)
Storage:  4 GB CF card, ~2k IOPS    ->  120 GB SSD, ~60k IOPS         (30-60x)
-
[Diagram: one FAWN-KV front end over several FAWN-DS backend stores]
-
[Diagram: the FAWN-KV front end now over several SILT backend stores]
SILT: backend store hyper-optimized for low DRAM and large flash
-
20
Systems begat algorithms:
“Practical Batch-Updatable External Hashing with Sorting”, H. Lim et al., ALENEX 2012
(Recently heard that Bing uses several state-of-the-art, memory-efficient indexes)
-
And now... Load imbalance
• Distributed key-value system
21
[Diagram: FrontEnd routing to backends Backend1, Backend2, …, Backend88; each backend is an Atom CPU plus an SSD serving about 10,000 queries/sec, against an SLA of 850,000 queries/sec for the system.]
1. get(key)
2. BackendID = hash(key)
3. val = lookup(key)
4. return val
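A minimal sketch of steps 1-4 above, in the same C-flavored style; hash_key() and the modulo placement are illustrative stand-ins (FAWN-KV actually maps key ranges to backends with consistent hashing), and backend_lookup() is a stub for the RPC to a backend's datastore.

#include <stdint.h>
#include <stddef.h>

#define NUM_BACKENDS 88

/* Stand-in hash (FNV-1a) over the 160-bit key, for this sketch only. */
static uint64_t hash_key(const uint8_t key[20])
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < 20; i++) { h ^= key[i]; h *= 1099511628211ull; }
    return h;
}

/* Stub for the RPC that asks one backend's datastore for the value. */
static const void *backend_lookup(unsigned backend_id, const uint8_t key[20],
                                  size_t *len)
{
    (void)backend_id; (void)key; *len = 0;
    return NULL;      /* a real front end would issue the lookup here */
}

/* Step 1: client calls get(key).
 * Step 2: the backend is chosen from the key alone, so a popular key always
 *         hits the same wimpy node -- the root of the load-imbalance problem.
 * Steps 3-4: that backend performs the lookup and the value is returned. */
const void *get(const uint8_t key[20], size_t *len)
{
    unsigned backend_id = (unsigned)(hash_key(key) % NUM_BACKENDS);
    return backend_lookup(backend_id, key, len);
}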
-
Measured tput on FAWN testbed
22
[Plot: overall throughput (KQPS, 0–1000) versus n, the number of nodes (10–100), for uniform, Zipf(1.01), and adversarial key distributions.]
-
[Diagram: FrontEnd, now with a small cache, routing queries to backends Backend1, Backend2, …]
23
How many items to cache?
[Plot: overall throughput (KQPS, 0–1000) versus n, the number of nodes (10–100), for uniform, Zipf(1.01), and adversarial key distributions.]
-
Small/fast cache is enough!
[Diagram: FrontEnd with a small cache in front of backends Backend1, Backend2, …]
24
We prove that, for n nodes:
- Only need to cache the O(n log n) most popular entries
- With 100 backend nodes, need only about 4,000 items in the cache. Tiny!
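(A rough check on that figure, assuming a modest constant hidden in the O(n log n) bound: for n = 100 backends, n log2 n ≈ 100 × 6.6 ≈ 660 entries, so a constant factor of roughly 6 yields the ~4,000-item cache quoted above.)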
-
Worst case? Now best case
25
[Two plots: overall throughput (KQPS, 0–1000) versus n, the number of nodes (10–100), for uniform, Zipf(1.01), and adversarial key distributions, before and after adding the small front-end cache; with the cache, the adversarial workload becomes the best case.]
-
Thus...
26
-
[Diagram: a "brawny" front-end server running an "insanely fast" cache (holding the O(N log N) hottest items; multi-reader parallel cuckoo hashing) in front of "wimpy" backend servers each running SILT. Technique labels: entropy-coded tries, partial-key cuckoo hashing, cuckoo filter.]
[“small cache”, SoCC 2011]  [“MemC3”, NSDI 2013]  [FAWN, SOSP 2009]  [SILT, SOSP 2011]  [SOSP + ALENEX]
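As a taste of the "partial-key cuckoo hashing" label above (the idea behind MemC3 and cuckoo filters), here is a small illustrative sketch, not the published code: each key gets a short fingerprint and two candidate buckets, and the alternate bucket is computable from the current bucket and the fingerprint alone, so items can be displaced without ever re-reading the full key. Sizes and the single-slot buckets are simplifications.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS (1u << 16)          /* power of two so XOR stays in range */

static uint16_t table[NUM_BUCKETS];     /* 0 = empty slot, else a 15-bit
                                         * fingerprint (real designs use
                                         * multi-slot buckets) */

static uint32_t hash_bytes(const void *p, size_t n)   /* FNV-1a, sketch only */
{
    const uint8_t *b = p;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *b++; h *= 16777619u; }
    return h;
}

static uint16_t fingerprint(const void *key, size_t n)
{
    uint16_t f = (uint16_t)((hash_bytes(key, n) >> 16) & 0x7fffu);
    return f ? f : 1;                   /* reserve 0 to mean "empty" */
}

/* The partial-key trick: bucket2 = bucket1 XOR hash(fingerprint). XOR makes
 * this an involution, so either bucket can find its partner without the key. */
static uint32_t alt_bucket(uint32_t b, uint16_t fp)
{
    return (b ^ hash_bytes(&fp, sizeof fp)) & (NUM_BUCKETS - 1);
}

bool cuckoo_insert(const void *key, size_t n)
{
    uint16_t fp = fingerprint(key, n);
    uint32_t b  = hash_bytes(key, n) & (NUM_BUCKETS - 1);

    for (int kicks = 0; kicks < 500; kicks++) {
        if (table[b] == 0) { table[b] = fp; return true; }
        uint16_t victim = table[b];     /* displace the resident fingerprint */
        table[b] = fp;
        fp = victim;
        b  = alt_bucket(b, fp);         /* move it to ITS other bucket */
    }
    return false;                       /* too full; a real table would resize */
}

bool cuckoo_contains(const void *key, size_t n)
{
    uint16_t fp = fingerprint(key, n);
    uint32_t b1 = hash_bytes(key, n) & (NUM_BUCKETS - 1);
    return table[b1] == fp || table[alt_bucket(b1, fp)] == fp;
}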
-
Highly parallel, lower-GHz (memory-constrained?): architectures, algorithms, and programming
[Figures: Moore's law and Dennard scaling trends]