Latency Reduction Techniques for Remote Memory Access in ANEMONE
Mark Lewandowski
Department of Computer Science
Florida State University
Outline
Introduction
Architecture / Implementation
  Adaptive NEtwork MemOry engiNE (ANEMONE)
  Reliable Memory Access Protocol (RMAP)
  Two-Level LRU Caching
  Early Acknowledgments
Experimental Results
Future Work
Related Work
Conclusions
Introduction
Virtual memory performance is bound by slow disks
The state of computers today lends itself to shared memory: Gigabit Ethernet is common, and machines on a LAN have lots of free memory
Improvements to ANEMONE yield higher performance than both disk and the original ANEMONE system
[Memory hierarchy diagram: Registers → Cache → Memory → ANEMONE → Disk]
Contributions
Pseudo Block Device (PBD): replaces NFS
Reliable Memory Access Protocol (RMAP)
Early Acknowledgments: shortcut communication path
Two-Level LRU-Based Caching: at the client and the Memory Engine
ANEMONE Architecture
ANEMONE (NFS) vs. improved ANEMONE:
Client: NFS swapping with a swap daemon cache → Pseudo Block Device (PBD) with a client cache
Memory Engine: no caching, must wait for the server to receive each page → engine cache with early ACKs
Memory Server: communicates with the Memory Engine in both systems
Architecture
[Architecture diagram: Client Module, RMAP Protocol, Engine Cache]
Pseudo Block Device
Provides a transparent interface between the swap daemon and ANEMONE
Requires no kernel modification
Handles READ/WRITE requests in order of arrival; no expensive elevator algorithm is needed
[Protocol stack diagram: Application (swap daemon) / Transport / RMAP beside IP / Ethernet]
Reliable Memory Access Protocol (RMAP)
A lightweight, reliable flow-control protocol that sits next to the IP layer to give the swap daemon quick access to pages
• Window-based protocol
• Requests are served as they arrive
• Messages:
  • REG/UNREG: register/unregister the client with the ANEMONE cluster
  • READ/WRITE: send/receive data to/from ANEMONE
  • STAT: retrieves statistics from the ANEMONE cluster
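The message set above can be sketched as a C header. The numeric values, field layout, and window check below are illustrative assumptions, not the actual RMAP wire format:

```c
#include <stdint.h>

/* Hypothetical RMAP message types, as listed on the slide. */
enum rmap_type {
    RMAP_REG,    /* register a client with the ANEMONE cluster */
    RMAP_UNREG,  /* unregister a client */
    RMAP_READ,   /* request a page from ANEMONE */
    RMAP_WRITE,  /* send a page to ANEMONE */
    RMAP_STAT    /* retrieve statistics from the cluster */
};

/* Minimal header for a window-based protocol: each message carries a
 * sequence number so the receiver can acknowledge it and the sender
 * can bound the number of outstanding requests. */
struct rmap_hdr {
    uint8_t  type;    /* one of enum rmap_type */
    uint32_t seq;     /* sequence number for windowing/ACKs */
    uint64_t offset;  /* page offset for READ/WRITE */
    uint32_t len;     /* payload length in bytes */
};

/* Sender-side window check: a new request may be issued only while
 * the number of unacknowledged messages stays within the window. */
int rmap_can_send(uint32_t next_seq, uint32_t last_acked, uint32_t window)
{
    return next_seq - last_acked <= window;
}
```

A small window (as noted later for early ACKs) directly limits how many requests can be in flight at once.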
Why do we need cache?
It is the natural counterpart of on-disk buffers
Caching reduces network traffic and decreases latency
Write latencies benefit the most: requests are buffered before they are sent over the wire
Basic Cache Structure
A FIFO queue is used to keep track of the LRU page; a hash table is used for fast page lookups
[Cache diagram: FIFO queue (head/tail) of cache_entry structures, indexed through a hash function into a hash table]
struct cache_entry {
    struct list_head queue;  /* links into the linked list that makes up the cache */
    unsigned long offset;    /* offset of the page */
    u8 *page;                /* the page data */
    int write;               /* non-zero for write requests */
    struct sk_buff *skb;     /* may or may not point to an sk_buff; if it does,
                              * the cache must call kfree_skb when the page is
                              * kicked out of memory (this avoids a memcpy) */
    int answered;            /* whether the request has already been answered */
};
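A user-space sketch of this layout, assuming a chained hash table plus a fixed-capacity FIFO of offsets for eviction order. The sizes, hash function, and eviction details are illustrative, not ANEMONE's (and a real cache would also handle duplicate offsets, omitted here):

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64    /* capacity before eviction kicks in (assumed) */
#define HASH_BUCKETS 128

struct centry {
    struct centry *next;    /* hash-bucket chain */
    unsigned long offset;   /* page offset (lookup key) */
    char page[16];          /* stand-in for the 4 KB page data */
};

static struct centry *buckets[HASH_BUCKETS];
static unsigned long fifo[CACHE_SLOTS];   /* ring of cached offsets, oldest first */
static int fifo_head, fifo_count;

static unsigned hash(unsigned long offset) { return offset % HASH_BUCKETS; }

/* Fast lookup: walk only one hash bucket, not the whole FIFO. */
struct centry *cache_lookup(unsigned long offset)
{
    struct centry *e;
    for (e = buckets[hash(offset)]; e; e = e->next)
        if (e->offset == offset)
            return e;
    return NULL;
}

static void cache_remove(unsigned long offset)
{
    struct centry **p = &buckets[hash(offset)];
    for (; *p; p = &(*p)->next) {
        if ((*p)->offset == offset) {
            struct centry *dead = *p;
            *p = dead->next;
            free(dead);
            return;
        }
    }
}

void cache_insert(unsigned long offset, const char *data)
{
    struct centry *e;
    if (fifo_count == CACHE_SLOTS) {          /* full: evict the oldest page */
        cache_remove(fifo[fifo_head]);
        fifo_head = (fifo_head + 1) % CACHE_SLOTS;
        fifo_count--;
    }
    e = malloc(sizeof(*e));
    e->offset = offset;
    strncpy(e->page, data, sizeof(e->page) - 1);
    e->page[sizeof(e->page) - 1] = '\0';
    e->next = buckets[hash(offset)];          /* push onto the hash chain */
    buckets[hash(offset)] = e;
    fifo[(fifo_head + fifo_count) % CACHE_SLOTS] = offset;  /* record arrival order */
    fifo_count++;
}
```

The FIFO gives O(1) eviction of the least recently inserted page while the hash table keeps lookups O(1), matching the split of duties on the slide.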
ANEMONE Cache Details
Client cache: 16 MB, write-back, memory allocated at load time
Engine cache: 80 MB, write-through, partial memory allocation at load time; sk_buffs are copied when they arrive at the Engine
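The two policies can be contrasted with a toy model that counts network sends in place of real RMAP traffic; the slot count and function names are hypothetical:

```c
/* Sketch of the two cache write policies above. */

static int net_sends;    /* pages actually put on the wire */
static int dirty[8];     /* write-back: per-slot dirty flags */

/* Engine-style write-through: every write goes to the wire at once,
 * so the cache never holds data its backing store lacks. */
void write_through(int slot) { (void)slot; net_sends++; }

/* Client-style write-back: only mark the slot dirty; the page is sent
 * later, when the slot is flushed (e.g., on eviction). */
void write_back(int slot) { dirty[slot] = 1; }

void flush(int slot)
{
    if (dirty[slot]) {
        net_sends++;     /* one send covers any number of buffered writes */
        dirty[slot] = 0;
    }
}
```

Repeated writes to the same page thus cost one network send under write-back but one send each under write-through, which is why the slide says write latencies benefit the most from the client cache.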
Early Acknowledgments
[Timing diagram: Client ↔ Memory Engine ↔ Memory Server]
• Reduces client wait time
• Can reduce write latency by up to 200 µs per write request
• Early ACK performance is limited by the small RMAP window size
• A small pool (~200) of sk_buffs is maintained for forward ACKing
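A toy latency model of the early-ACK path. The per-hop microsecond figures are made-up assumptions chosen only to illustrate the ordering, with the slide's ~200 µs savings used as the server-hop cost:

```c
#define ENGINE_HOP_US 150   /* client -> engine, ACK back (assumed) */
#define SERVER_HOP_US 200   /* engine -> server, ACK back (assumed) */

static int ack_pool = 200;  /* preallocated sk_buffs for forward ACKing */

/* The engine may early-ACK only while its sk_buff pool is non-empty. */
int try_early_ack(void)
{
    if (ack_pool > 0) { ack_pool--; return 1; }
    return 0;
}

/* With an early ACK the client waits only for the engine hop;
 * without one it also waits for the engine-to-server round trip. */
int write_wait_us(int early_ack)
{
    return early_ack ? ENGINE_HOP_US
                     : ENGINE_HOP_US + SERVER_HOP_US;
}
```

Under these assumed numbers the early ACK removes the server round trip from the client's critical path, matching the up-to-200 µs per-write savings reported above.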
Experimental Testbed
The experimental testbed is configured with 400,000 blocks of memory (4 KB pages, ~1.6 GB)
Experimental Description
Latency: 100,000 read/write requests, sequential and random
Application run times: Quicksort / POV-Ray, single and multiple processes, execution times
Cache performance: measured cache hit rates at the client and the Engine
[Latency graphs: Sequential Read, Sequential Write, Random Read, Random Write]
Single Process Performance
Increase the single-process size by 100 MB for each iteration
Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE
POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE
Multiple Process Performance
Increase the number of 100 MB processes by 1 for each iteration
Quicksort: 710% increase over disk, 117% increase over the original ANEMONE
POV-Ray: 835% increase over disk, 115% increase over the original ANEMONE
Client Cache Performance
Hits save ~500 µs each
POV-Ray's hit rate saves ~270 seconds for the 1200 MB test
Quicksort's hit rate saves ~45 seconds for the 1200 MB test
Swap-daemon prefetching interferes with cache hit rates
Engine Cache Performance
The cache hit rate levels out at ~10%
POV-Ray does not exceed 10% because it performs over 3x as many page swaps as Quicksort
The Engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test
Future Work
More extensive testing
Aggressive caching algorithms
Data compression
Page fragmentation
P2P
RDMA over Ethernet
Scalability and fault tolerance
Related Work
Global Memory System [feeley95]: implements a global memory-management algorithm over ATM; does not directly address virtual memory
Reliable Remote Memory Pager [markatos96] and Network RAM Disk [flouris99]: use TCP sockets
Samson [stark03]: uses Myrinet; does not perform caching
Remote Memory Model [comer91]: implements a custom protocol; guarantees in-order delivery
Conclusions
ANEMONE does not modify client OS or applications
Performance increases by up to 263% for single processes
Performance increases by up to 117% for multiple processes
Improved caching is a promising line of research, but more aggressive algorithms are required
Questions?
Appendix A: Quicksort Memory Access Patterns
Appendix B: POV-Ray Memory Access Patterns