Page 1:

Improving the Performance of Storage Servers

Yuanyuan Zhou

Princeton University

Page 2:

Traditional Storage

• Delivers limited performance
  – Locally attached
  – Little processing power
  – Small or no internal cache
  – Limited scale
  – Limited bandwidth
  – Simple storage interface

[Diagram: a database server (file server) attached directly to a disk array]

Page 3:

Modern Storage Servers

• Network attachable
• Increasing processing power
• Gigabytes of memory cache
• Gigabytes of bandwidth
• Clustering of storage
• Offloading application operations

• “Disks become super-computers” --Jim Gray

[Diagram: database and file servers, each with a processor and memory, connected over a storage area network to storage servers]

Page 4:

Impact of Storage Performance

• Storage I/O remains a bottleneck in many high-end or mid-size On-line Transaction Processing (OLTP) databases (Microsoft report & SOSP’95).

• Current technology trends
  – Processor speed increases 60% per year
  – Disk access time improves 7% per year

• Our goal: reduce I/O time

[Chart: normalized MS SQL execution time, split into computation and I/O]

Page 5:

Approaches to Improving I/O Performance

• Improving response time and throughput

• Minimizing I/O and communication overhead

Page 6:

My Solutions

• Effective hierarchy-aware storage caching
  – Improving response time and throughput

• Using user-level communication as the database-storage network
  – Minimizing I/O & communication overhead

[Diagram: database and file servers, each with a processor and memory, connected over a storage area network to storage servers]

Page 7:

Outline

• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary

• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary

Page 8:

Multi-level Server Cache Hierarchy

[Diagram: database/file clients (client cache: 64 MB – 128 MB) connect over a network to database/file servers (1st-level buffer cache: 4 GB – 32 GB), which connect over a network to storage servers (storage server cache: 1 GB – 64 GB)]

• Client caches are much smaller than the server caches, so there is ~no need for the inclusion property

Page 9:

Multi-level Server Caching

[Diagram: accesses go first to the database or file server cache (higher level), managed with Least Recently Used (LRU) replacement; hits are served there, and misses go to the storage server cache (lower level) in the storage system. Should the storage cache use LRU too?]

Page 10:

Analogy: Storage Box (Basement)

• Assumption for the analogy: item = box
• Question: do you keep the box?
• If you have a basement, you can keep all the boxes

[Illustration: a living room (higher level) above a basement (lower level) full of boxes, e.g. pizza and DELL boxes — the traditional client-server cache hierarchy]

Page 11:

Analogy: Storage Box (Closet)

• If you have just a closet, you may keep only the box for your holiday decorations!

[Illustration: a living room (higher level) with a closet (lower level) — the database-storage server cache hierarchy; hot accesses are served in the living room, while cold accesses and hot misses reach the closet]

Page 12:

But If You Use LRU for Your Closet…

• Your closet will be full of garbage!

[Illustration: the living room (higher level) keeps sending its most recent discards, e.g. pizza boxes, down to the lower level]

Page 13:

“Your cache ain’t nothin’ but trash”

• Storage server cache access patterns are not well understood
  – Most storage server caches still use LRU

• Muntz & Honeyman (USENIX ’92)
  – Cache hit ratios at lower-level file server caches are very low

• Willick et al. (ICDCS ’92)
  – FBR outperforms LRU for disk caches

Page 14:

Questions

• What is the access pattern at storage server caches?

• What are the properties of a good storage server cache replacement algorithm?

• What algorithms are good for storage server caches?

Page 15:

Storage Cache Access Traces

Database or file server miss traces:

                                     Oracle-1        Oracle-2        HP Disk       Auspex Server
  Description                        TPC-C, 100 GB   TPC-C, 100 GB   Cello, 1992   File server, 1993
                                     database        database
  Database or file cache size (MB)   128             16              30            8
  # Reads (millions)                 7.3             3.8             0.2           1.8
  # Writes (millions)                4.3             2.0             0.3           0.8
  # Database or file servers         Single          Single          Multiple      Multiple

Page 16:

Temporal Distances

• Temporal distance: the inter-reference gap from the previous reference to the same block

• Example (blocks A, B, C, D):

  access sequence:     A  B  C  A  D  B  C
  temporal distances:           3     4  4
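The distance computation in the example can be reproduced with a short single-pass sketch (the function name is my own):

```python
def temporal_distances(trace):
    """Temporal distance of each re-reference: the number of accesses
    since the previous reference to the same block.
    First references have no distance and are skipped."""
    last_seen = {}
    distances = []
    for position, block in enumerate(trace):
        if block in last_seen:
            distances.append((block, position - last_seen[block]))
        last_seen[block] = position
    return distances

# The example above: the second references to A, B, C are 3, 4, 4 accesses apart
print(temporal_distances(["A", "B", "C", "A", "D", "B", "C"]))
# → [('A', 3), ('B', 4), ('C', 4)]
```

Each re-reference is paired with the number of accesses since the previous reference to the same block; D is referenced once and so contributes no distance.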

Page 17:

Temporal Distance Distribution

[Charts: number of accesses vs. temporal distance (log scale, 1 to 16m) for the Auspex access trace (higher level) and the Auspex miss trace (lower level); the miss-trace distribution is shifted toward much larger distances, with a marked minDist]

Accesses to the storage server have poor temporal locality.

Notation: 1k = 1,000 references; 1m = 1,000,000 references

Page 18:

Why Poor Temporal Locality?

[Diagram: two successive accesses to block B. Between them, the database or file cache (a 16K-block LRU queue) must see more than 16K distinct block accesses before B is evicted and requested again; with an assumed 20% miss ratio, more than 3.2K of those accesses reach the storage cache in between]
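This filtering effect can be reproduced with a toy simulation (a hypothetical helper, not from the talk): pass a trace through an upper-level LRU cache and look at the miss stream that reaches storage.

```python
from collections import OrderedDict

def lru_miss_stream(trace, cache_size):
    """Filter a block trace through an upper-level LRU cache and return
    the miss stream seen by the lower-level (storage) cache."""
    cache = OrderedDict()              # keys kept in LRU -> MRU order
    misses = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)   # hit: absorbed by the upper level
        else:
            misses.append(block)       # miss: this access reaches storage
            cache[block] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the LRU block
    return misses

# Short-distance re-references to A never reach the storage cache:
print(lru_miss_stream(["A", "B", "A", "C", "A", "D"], cache_size=2))
# → ['A', 'B', 'C', 'D']
```

Only first references and long-distance re-references survive the upper level, which is why the storage cache sees mostly long temporal distances.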

Page 19:

Minimal Lifetime Property

[Charts: temporal distance distributions of the miss traces — Oracle-1 (128 MB client cache), Oracle-2 (16 MB client cache), HP Disk trace, and Auspex Server trace — each with a marked minDist]

A block should stay in cache for at least minDist time to be hit at its next reference.

Page 20:

What Blocks to Keep?

[Chart: Oracle-1 — cumulative percentage of accesses vs. percentage of blocks, by access frequency]

A large percentage of accesses are made to a small percentage of the data, though the skew is less pronounced than at the higher level.

Page 21:

Frequency-based Priority Property

[Charts: cumulative percentage of accesses vs. percentage of blocks for Oracle-1, Oracle-2, the HP Disk trace, and the Auspex Server trace]

Blocks should be prioritized based on their access frequencies.

Page 22:

Replacement Algorithm Properties

• Minimal lifetime
  – A block should stay in cache for at least minDist time

• Frequency-based priority
  – Blocks should be prioritized based on their access frequencies

• Temporal (aged) frequency
  – Reference counts accumulated long ago should carry less weight

Page 23:

Performance of Existing Algorithms

[Chart: cache hit ratios (%) on Oracle-1 for OPT, FBR, and LRU at storage cache sizes of 64, 128, 256, 512, and 1024 MB]

There is a big gap between the on-line algorithms and the off-line Optimal (OPT) algorithm.

Page 24:

Do They Satisfy the Properties?

No on-line algorithm satisfies all three properties:

         Minimal lifetime               Frequency-based priority   Temporal frequency
  OPT    Best                           Best                       Best
  LRU    Poor with small cache sizes    Poor                       Well
  FBR    Poor                           Well                       Well

Page 25:

Our Replacement Algorithm: Multi-Queue (MQ)

• Designed based on the three properties
  – Minimal lifetime: multiple LRU queues with different priorities, where lifetime = f(minDist)
  – Frequency-based priority: blocks are promoted to higher queues based on their reference counts
  – Aged frequency: blocks are demoted to lower queues when their lifetimes expire

[Diagram: LRU queues Q0–Q3 plus a history buffer. On an access to block B, its reference count goes from 7 to 8 and B is promoted, with B.expireTime = CurrentTime + lifetime; block D (count 6) is demoted because D.expireTime < CurrentTime; block C (count 1) is evicted from Q0 into the history buffer, which records block IDs and reference counts]
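The mechanics above can be sketched as a small simulator. This is an illustrative reconstruction with a fixed lifetime; the real implementation sets the lifetime adaptively and its details may differ.

```python
from collections import OrderedDict
from math import log2

class MultiQueue:
    """Sketch of Multi-Queue (MQ) replacement: m LRU queues plus a history
    buffer that remembers the reference counts of evicted blocks."""

    def __init__(self, capacity, m=8, lifetime=256):
        self.capacity = capacity
        self.m = m
        self.lifetime = lifetime                        # stands in for f(minDist)
        self.queues = [OrderedDict() for _ in range(m)] # block -> (count, expire)
        self.history = OrderedDict()                    # evicted block -> count
        self.time = 0
        self.size = 0

    def _queue_index(self, count):
        # Frequency-based priority: queue number grows with log2(reference count)
        return min(int(log2(count)), self.m - 1)

    def _adjust(self):
        # Aged frequency: demote the head of any queue whose lifetime expired
        for k in range(1, self.m):
            if self.queues[k]:
                block, (count, expire) = next(iter(self.queues[k].items()))
                if expire < self.time:
                    del self.queues[k][block]
                    self.queues[k - 1][block] = (count, self.time + self.lifetime)

    def _evict(self):
        # Evict the LRU block of the lowest non-empty queue into the history buffer
        for q in self.queues:
            if q:
                block, (count, _) = q.popitem(last=False)
                self.history[block] = count
                self.size -= 1
                return

    def access(self, block):
        """Returns True on a cache hit."""
        self.time += 1
        hit = False
        count = self.history.pop(block, 0)  # remembered frequency, if any
        for q in self.queues:
            if block in q:
                count, _ = q.pop(block)
                hit = True
                break
        if not hit:
            if self.size >= self.capacity:
                self._evict()
            self.size += 1
        count += 1
        # Promote by reference count; refresh expire time (minimal lifetime)
        self.queues[self._queue_index(count)][block] = (count, self.time + self.lifetime)
        self._adjust()
        return hit

# A toy run: with capacity 2, the frequently used block A survives evictions
mq = MultiQueue(capacity=2, lifetime=100)
hits = [mq.access(b) for b in ["A", "A", "B", "C", "B", "A"]]
# → [False, True, False, False, False, True]
```

Note how A, once promoted to a higher queue, is no longer the eviction victim even though B and C are touched more recently; under plain LRU it would have been evicted.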

Page 26:

Simulation Evaluation

• Trace-driven cache simulator
  – Write-through
  – Block size: 8 KB

• MQ settings
  – m = 8
  – Adaptive lifetime setting based on on-line statistics

Page 27:

Simulation Results

[Chart: cache hit ratios (%) on Oracle-1 for OPT, MQ, FBR, and LRU at storage cache sizes of 64, 128, 256, 512, and 1024 MB]

MQ performs better than the other on-line algorithms.

Page 28:

Why MQ Performs Better?

• MQ can selectively keep some blocks for a longer time

Oracle-1, storage cache size: 512 MB (64K entries); 29% of re-references have a temporal distance < 64K, 71% have a distance >= 64K:

               Temp. distance < 64K      Temp. distance >= 64K
  Algorithm    #hits      #misses        #hits      #misses
  MQ           1553K      293K           1919K      2645K
  FBR          1611K      235K           1146K      3418K
  LRU          1846K      0              407K       4157K

Page 29:

Implementation Results (Oracle)

Storage cache hit ratios (database cache size: 128 MB; database size: 100 GB):

  Storage Cache Size   MQ       LRU
  128 MB               19.85%   8.85%
  256 MB               31.42%   17.66%
  512 MB               44.34%   31.69%

• MQ has an effect similar to doubling the cache size under LRU

Page 30:

OLTP End Performance (Oracle)

[Chart: normalized execution time, split into computation and I/O, for LRU and MQ at storage cache sizes of 128 MB, 256 MB, and 512 MB]

• Compared to LRU, MQ reduces I/O time by 16~25% and improves overall performance by 9~11%.

Page 31:

Related Work

• Practice
  – LRU, MRU, LFU, SEQ, LFRU, 2Q, FBR, LRU-k, …

• Theory
  – LUP & RLUP analytical model
  – Competitive analysis

• Temporal locality metrics
  – LRU stack distance (1970)
  – Distance string (1976)
  – IRG model (1995)

• Multi-queue process scheduling

Page 32:

Summary

• Access pattern?
  – Long temporal distances; frequency distributed unevenly

• Properties?
  – Minimal lifetime, frequency-based priority, aged frequency

• What algorithms are good for storage caches?
  – MQ performs better than seven tested alternatives and has an effect similar to doubling the cache size under LRU.

• Details can be found in
  – Y. Zhou, J. F. Philbin and K. Li. The Multi-Queue Server Buffer Cache Replacement Algorithm. USENIX ’01.

Page 33:

Outline

• Effective hierarchy-aware storage caching
  – Problem
  – Access pattern & properties
  – MQ algorithm
  – Evaluation
  – Summary

• User-level communication for database storage
  – Background
  – Architecture & implementations
  – Results
  – Summary

Page 34:

I/O Related Host Overhead

• High-end or mid-size OLTP configurations
  – Tolerate disk access latency via asynchronous I/Os
  – Problem: high I/O-related processor overhead

• Reasons
  – OS overhead
  – Communication protocol overhead

• Our solution: use user-level communication for database storage

Page 35:

Analogy: Overhead Walkway

• An overhead walkway avoids the stairs & traffic of the Las Vegas Strip, so you can get to the casinos and win money faster!

[Illustration: SCSI or Fibre Channel is the “Strip”: requests cross kernel space on both the database side and the storage-server side. User-level communication is the overhead walkway: the user-space database talks directly to the user-space storage server]

Page 36:

User-level Communication

• High bandwidth, low latency, low overhead

• Main features
  – Bypasses the OS
  – Zero-copy transfers
  – Remote DMA (RDMA)

• University research
  – VMMC, U-Net, FM, AM, Memory Channel, Myrinet, …

• Industry standard
  – Virtual Interface (VI) Architecture

Page 37:

Using User-level Communication

• Intra-cluster communication
  – Scientific parallel applications
  – Application servers (Intel CSP, Compaq TruCluster)
  – Web servers

• Client-server communication
  – Databases (Oracle, DB2, MS SQL)
  – Direct Access File Systems (DAFS)

Page 38:

Our Goals

• User-level communication as a database-storage interconnect:
  – Is user-level communication effective in reducing the host overhead for database storage?
  – How should user-level communication be used to connect the database with storage?

Page 39:

VI-Attached Storage Server

[Diagram: databases, each with a VI client stub, connect over a VI network to storage servers; each storage server has a VI interface, a storage cache, and local disks]

• Database and storage communicate using VI
• The storage cache is managed using MQ

Page 40:

Client-stub Implementations

• Challenges
  – Application transparency
  – Taking advantage of VI
  – Storage API

• Implementations, in decreasing order of transparency
  – Kernel driver
  – DLL interceptor
  – Direct Storage Access (DSA)

Page 41:

Client Stub: Kernel Driver

• Fully transparent + standard API

• Plus
  – Supports all applications
  – Takes advantage of VI’s zero-copy and RDMA features

• Minus
  – High kernel overhead
  – Requires kernel-space VI

[Diagram: databases call the standard DLL, which goes through a kernel-space device driver to kernel-space VI]

Page 42:

Client Stub: DLL Interceptor

• User-level transparent + standard API

• Plus
  – No modification to databases
  – Takes advantage of user-level communication

• Minus
  – High overhead to satisfy standard I/O API semantics (example: triggering events for I/O completions)

[Diagram: databases call a DLL interceptor in front of the standard DLL; the interceptor uses user-space VI]

Page 43:

Client Stub: Direct Storage Access (DSA)

• Least transparent + new API

• DSA interface
  – Minimizes kernel involvement
  – Designed around VI features, e.g. polling for I/O completions

• Plus
  – Fully takes advantage of VI

• Minus
  – Requires small modifications to the database

[Diagram: modified databases call the DSA library, which uses user-space VI]
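The polling design point can be illustrated with a toy model. All names here (CompletionQueue, fake_storage_server) are invented for illustration; the actual DSA API is not shown in the talk.

```python
import threading
import time

class CompletionQueue:
    """Toy stand-in for a user-space VI completion queue (hypothetical)."""
    def __init__(self):
        self._done = []
        self._lock = threading.Lock()

    def post(self, request_id):
        """Called by the I/O provider when a request completes."""
        with self._lock:
            self._done.append(request_id)

    def poll(self):
        """Non-blocking completion check, done entirely in user space.
        Polling avoids the kernel event/interrupt path that makes
        standard I/O-completion semantics expensive."""
        with self._lock:
            return self._done.pop(0) if self._done else None

def fake_storage_server(cq, request_id, delay=0.01):
    """Pretend remote side: 'performs' the I/O, then posts a completion."""
    time.sleep(delay)
    cq.post(request_id)

cq = CompletionQueue()
threading.Thread(target=fake_storage_server, args=(cq, 42)).start()

completed = None
while completed is None:       # user-space polling loop: no kernel crossing
    completed = cq.poll()
```

The database thread spins on `poll()` instead of blocking on a kernel event, trading a little CPU for the avoided system-call and interrupt overhead — the trade DSA makes.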

Page 44:

Limitations of User-level Communication

• Substantial enhancements were needed to address:
  – Lack of flow-control and reconnection mechanisms
  – High memory registration overhead
  – High locking overhead
  – High interrupt overhead

Page 45:

Evaluation

• Real systems
  – OS: Windows XP
  – Databases: MS SQL, TPC-C benchmark
  – VI network: Giganet
  – Tested by customers for 6 months

• Large-size configuration
  – Database: 32-way SMP, 32 GB memory
  – Storage: 8 PCs, each with 3 GB memory; 12 terabytes of data

• Mid-size configuration
  – Database: 4-way SMP, 4 GB memory
  – Storage: 4 PCs, each with 2 GB memory; 1 terabyte total

Page 46:

OLTP Performance (Large-size Config.)

[Chart: normalized execution time, split into computation and I/O, for Fibre Channel, Driver, DLL, and DSA]

• DSA improvement
  – I/O time: 40%
  – Overall: 18%

Page 47:

TPC-C I/O Overhead Breakdown

[Chart: CPU utilization (0–100%) for Fibre Channel, Driver, DLL, and DSA, broken down into kernel, OS, lock, client stub, VI, and other]

Page 48:

Summary

• User-level communication can effectively connect the database with storage, but may require substantial enhancements.

• A storage API that minimizes kernel involvement in the I/O path is necessary to fully exploit the benefits of user-level communication.

• Details can be found in
  – Yuanyuan Zhou, Angelos Bilas, Suresh Jagannathan, Cezary Dubnicki, James F. Philbin and Kai Li. Experience with VI-based Communication for Database Storage. To appear in ISCA ’02.

Page 49:

Conclusions

• Effective hierarchy-aware storage caching
  – MQ has a doubling-cache-size effect compared to LRU, and can reduce the I/O time in OLTP by 15~30%
  – Provides insights for other similar multi-level cache hierarchies (e.g. Web proxy caches)

• Using user-level communication for database storage
  – DSA can reduce the I/O time in OLTP by 40%
  – Provides guidelines for the design and implementation of new I/O interconnects (e.g. InfiniBand) and other applications (e.g. DAFS)

Page 50:

My Other Related Research

• Improving availability
  – Fast cluster fail-over using memory-mapped communication (ICS ’99)

• Memory management for networked servers
  – Consistency protocols for DSMs (OSDI ’96)
  – Coherence granularity vs. protocols (PPoPP ’97)
  – Performance limitations of software DSMs (HPCA ’99)
  – Thread scheduling for locality (IOPADS ’99)

• http://www.cs.princeton.edu/~yzhou/