Latency Reduction Techniques for Remote Memory Access in ANEMONE
Mark Lewandowski
Department of Computer Science
Florida State University
Outline
Introduction
Architecture / Implementation
  Adaptive NEtwork MemOry engiNE (ANEMONE)
  Reliable Memory Access Protocol (RMAP)
  Two-Level LRU Caching
  Early Acknowledgments
Experimental Results
Future Work
Related Work
Conclusions
Introduction
Virtual memory performance is bound by slow disks
The state of computers today lends itself to shared memory: Gigabit Ethernet is common, and machines on a LAN have lots of free memory
Improvements to ANEMONE yield higher performance than both disk and the original ANEMONE system
[Memory hierarchy diagram: Registers → Cache → Memory → ANEMONE → Disk]
Contributions
Pseudo Block Device (PBD): replaces NFS
Reliable Memory Access Protocol (RMAP)
Early Acknowledgments: shortcut communication path
Two-Level LRU-Based Caching: at the client and the Memory Engine
ANEMONE Architecture
ANEMONE (NFS) vs. improved ANEMONE:
Client: NFS swapping with a swap daemon cache → Pseudo Block Device (PBD) with a client cache
Memory Engine: no caching, must wait for the server to receive each page → engine cache with early ACKs
Memory Server: communicates with the Memory Engine in both systems
Architecture
[Architecture diagram: Client Module, RMAP Protocol, Engine Cache]
Pseudo Block Device
Provides a transparent interface between the swap daemon and ANEMONE
Requires no kernel modification
Handles READ/WRITE requests in order of arrival; no expensive elevator algorithm is needed
[Protocol stack diagram: Application (swap daemon) / Transport / RMAP beside IP / Ethernet]
Reliable Memory Access Protocol (RMAP)
A lightweight, reliable flow-control protocol that sits next to the IP layer to give the swap daemon quick access to pages
• Window-based protocol
• Requests are served as they arrive
• Messages:
  • REG/UNREG: register/unregister the client with the ANEMONE cluster
  • READ/WRITE: send/receive data to/from ANEMONE
  • STAT: retrieves statistics from the ANEMONE cluster
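The message set above can be sketched as a C header. The numeric values, field layout, and window check below are illustrative assumptions, not the actual RMAP wire format:

```c
#include <stdint.h>

/* Hypothetical RMAP message types, as listed on the slide. */
enum rmap_type {
    RMAP_REG,    /* register a client with the ANEMONE cluster */
    RMAP_UNREG,  /* unregister a client */
    RMAP_READ,   /* request a page from ANEMONE */
    RMAP_WRITE,  /* send a page to ANEMONE */
    RMAP_STAT    /* retrieve statistics from the cluster */
};

/* Minimal header for a window-based protocol: each message carries a
 * sequence number so the receiver can acknowledge it and the sender
 * can bound the number of outstanding requests. */
struct rmap_hdr {
    uint8_t  type;    /* one of enum rmap_type */
    uint32_t seq;     /* sequence number for windowing/ACKs */
    uint64_t offset;  /* page offset for READ/WRITE */
    uint32_t len;     /* payload length in bytes */
};

/* Sender-side window check: a new request may be issued only while
 * the number of unacknowledged messages stays within the window. */
int rmap_can_send(uint32_t next_seq, uint32_t last_acked, uint32_t window)
{
    return next_seq - last_acked <= window;
}
```

A small window (as noted later for early ACKs) directly limits how many requests can be in flight at once.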
Why do we need cache?
It is the natural counterpart of on-disk buffers
Caching reduces network traffic and decreases latency
Write latencies benefit the most: requests are buffered before they are sent over the wire
Basic Cache Structure
A FIFO queue is used to keep track of the LRU page; a hash table is used for fast page lookups
[Cache diagram: FIFO queue (head/tail) of cache_entry structures, indexed through a hash function into a hash table]
struct cache_entry {
    struct list_head queue;  /* links into the linked list that makes up the cache */
    unsigned long offset;    /* offset of the page */
    u8 *page;                /* the page data */
    int write;               /* non-zero for write requests */
    struct sk_buff *skb;     /* may or may not point to an sk_buff; if it does,
                              * the cache must call kfree_skb when the page is
                              * kicked out of memory (this avoids a memcpy) */
    int answered;            /* whether the request has already been answered */
};
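A user-space sketch of this layout, assuming a chained hash table plus a fixed-capacity FIFO of offsets for eviction order. The sizes, hash function, and eviction details are illustrative, not ANEMONE's (and a real cache would also handle duplicate offsets, omitted here):

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64    /* capacity before eviction kicks in (assumed) */
#define HASH_BUCKETS 128

struct centry {
    struct centry *next;    /* hash-bucket chain */
    unsigned long offset;   /* page offset (lookup key) */
    char page[16];          /* stand-in for the 4 KB page data */
};

static struct centry *buckets[HASH_BUCKETS];
static unsigned long fifo[CACHE_SLOTS];   /* ring of cached offsets, oldest first */
static int fifo_head, fifo_count;

static unsigned hash(unsigned long offset) { return offset % HASH_BUCKETS; }

/* Fast lookup: walk only one hash bucket, not the whole FIFO. */
struct centry *cache_lookup(unsigned long offset)
{
    struct centry *e;
    for (e = buckets[hash(offset)]; e; e = e->next)
        if (e->offset == offset)
            return e;
    return NULL;
}

static void cache_remove(unsigned long offset)
{
    struct centry **p = &buckets[hash(offset)];
    for (; *p; p = &(*p)->next) {
        if ((*p)->offset == offset) {
            struct centry *dead = *p;
            *p = dead->next;
            free(dead);
            return;
        }
    }
}

void cache_insert(unsigned long offset, const char *data)
{
    struct centry *e;
    if (fifo_count == CACHE_SLOTS) {          /* full: evict the oldest page */
        cache_remove(fifo[fifo_head]);
        fifo_head = (fifo_head + 1) % CACHE_SLOTS;
        fifo_count--;
    }
    e = malloc(sizeof(*e));
    e->offset = offset;
    strncpy(e->page, data, sizeof(e->page) - 1);
    e->page[sizeof(e->page) - 1] = '\0';
    e->next = buckets[hash(offset)];          /* push onto the hash chain */
    buckets[hash(offset)] = e;
    fifo[(fifo_head + fifo_count) % CACHE_SLOTS] = offset;  /* record arrival order */
    fifo_count++;
}
```

The FIFO gives O(1) eviction of the least recently inserted page while the hash table keeps lookups O(1), matching the split of duties on the slide.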
ANEMONE Cache Details
Client cache: 16 MB, write-back, memory allocated at load time
Engine cache: 80 MB, write-through, partial memory allocation at load time; sk_buffs are copied when they arrive at the Engine
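The two policies can be contrasted with a toy model that counts network sends in place of real RMAP traffic; the slot count and function names are hypothetical:

```c
/* Sketch of the two cache write policies above. */

static int net_sends;    /* pages actually put on the wire */
static int dirty[8];     /* write-back: per-slot dirty flags */

/* Engine-style write-through: every write goes to the wire at once,
 * so the cache never holds data its backing store lacks. */
void write_through(int slot) { (void)slot; net_sends++; }

/* Client-style write-back: only mark the slot dirty; the page is sent
 * later, when the slot is flushed (e.g., on eviction). */
void write_back(int slot) { dirty[slot] = 1; }

void flush(int slot)
{
    if (dirty[slot]) {
        net_sends++;     /* one send covers any number of buffered writes */
        dirty[slot] = 0;
    }
}
```

Repeated writes to the same page thus cost one network send under write-back but one send each under write-through, which is why the slide says write latencies benefit the most from the client cache.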
Early Acknowledgments
[Timing diagram: Client ↔ Memory Engine ↔ Memory Server]
• Reduces client wait time
• Can reduce write latency by up to 200 µs per write request
• Early ACK performance is limited by the small RMAP window size
• A small pool (~200) of sk_buffs is maintained for forward ACKing
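A toy latency model of the early-ACK path. The per-hop microsecond figures are made-up assumptions chosen only to illustrate the ordering, with the slide's ~200 µs savings used as the server-hop cost:

```c
#define ENGINE_HOP_US 150   /* client -> engine, ACK back (assumed) */
#define SERVER_HOP_US 200   /* engine -> server, ACK back (assumed) */

static int ack_pool = 200;  /* preallocated sk_buffs for forward ACKing */

/* The engine may early-ACK only while its sk_buff pool is non-empty. */
int try_early_ack(void)
{
    if (ack_pool > 0) { ack_pool--; return 1; }
    return 0;
}

/* With an early ACK the client waits only for the engine hop;
 * without one it also waits for the engine-to-server round trip. */
int write_wait_us(int early_ack)
{
    return early_ack ? ENGINE_HOP_US
                     : ENGINE_HOP_US + SERVER_HOP_US;
}
```

Under these assumed numbers the early ACK removes the server round trip from the client's critical path, matching the up-to-200 µs per-write savings reported above.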
Experimental Testbed
The experimental testbed is configured with 400,000 blocks of memory (4 KB pages, ~1.6 GB)
Experimental Description
Latency: 100,000 read/write requests, sequential and random
Application run times: Quicksort / POV-Ray, single and multiple processes, execution times
Cache performance: measured cache hit rates at the client and the Engine
[Latency graphs: Sequential Read, Sequential Write, Random Read, Random Write]
Single Process Performance
Increase the single-process size by 100 MB for each iteration
Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE
POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE
Multiple Process Performance
Increase the number of 100 MB processes by 1 for each iteration
Quicksort: 710% increase over disk, 117% increase over the original ANEMONE
POV-Ray: 835% increase over disk, 115% increase over the original ANEMONE
Client Cache Performance
Hits save ~500 µs each
POV-Ray's hit rate saves ~270 seconds for the 1200 MB test
Quicksort's hit rate saves ~45 seconds for the 1200 MB test
Swap-daemon prefetching interferes with cache hit rates
Engine Cache Performance
The cache hit rate levels out at ~10%
POV-Ray does not exceed 10% because it performs over 3x as many page swaps as Quicksort
The Engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test
Future Work
More extensive testing
Aggressive caching algorithms
Data compression
Page fragmentation
P2P
RDMA over Ethernet
Scalability and fault tolerance
Related Work
Global Memory System [feeley95]: implements a global memory-management algorithm over ATM; does not directly address virtual memory
Reliable Remote Memory Pager [markatos96] and Network RAM Disk [flouris99]: use TCP sockets
Samson [stark03]: uses Myrinet; does not perform caching
Remote Memory Model [comer91]: implements a custom protocol; guarantees in-order delivery
Conclusions
ANEMONE does not modify client OS or applications
Performance increases by up to 263% for single processes
Performance increases by up to 117% for multiple processes
Improved caching is a promising line of research, but more aggressive algorithms are required
Questions?
Appendix A: Quicksort Memory Access Patterns
Appendix B: POV-Ray Memory Access Patterns