
Enhancing recovery and ramp-up performance of DBMS when using an SSD buffer-pool extension

Wang Jiangtao

2013-10-18

Outline

• Introduction
• SSD-based extension buffer
• Enhancing recovery by SSD
• Two related works
  – Enhancing recovery using… [DaMoN 2011]
  – Fast peak-to-peak… [ICDE 2013]
• Summary

Evolution of HDD

• Hard disk drive (HDD)
  – Access rates have been flat for ~13 years
  – Disk density growth projection is bleak
  – Capacity growth is now about to flatten significantly
  – Power savings not realized

Solid State Drive

• Solid State Drive (SSD)
  – A semiconductor device
  – No mechanical components
  – 3D NAND flash memory
• Technical merits
  – High IOPS (>50,000)
  – High bandwidth (>500 MB/s)
  – Low power: 0.06 W (idle) to 2.4 W (active)
  – Shock resistance

Integrating SSD and HDD

• Background
  – Performance depends heavily on memory, I/O bandwidth, and access latency (e.g., web servers)
  – SSD at full disk capacity is not going to be a reality
  – Price ($/GB): RAM >> SSD > Disk
  – On SSD, reads are much faster than writes
  – Only a small amount of data is hot!
  – Cost-effectiveness is the primary factor for large data centers
  – …

Outline

• Introduction
• SSD-based extension buffer
• Enhancing recovery by SSD
• Two related works
  – Enhancing recovery using… [DaMoN 2011]
  – Fast peak-to-peak… [ICDE 2013]
• Summary

• Basic Framework: SSD as a cache buffer

1. B. Debnath et al. FlashStore: High Throughput Persistent Key-Value Store. VLDB 2010.
2. J. Do et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011.
3. W.H. Kang et al. Flash-based Extended Cache for Higher Throughput and Faster Recovery. VLDB 2012.
4. J. Do et al. Fast Peak-to-Peak Behavior with SSD Buffer Pool. ICDE 2013.
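As a rough illustration of the framework these systems share, the read path checks the RAM buffer pool first, then the SSD cache, and only then the disk. The sketch below is a minimal model; all class and method names are hypothetical and not taken from any of the cited systems:

```python
# Illustrative sketch of an SSD buffer-pool extension read path.
# All names are hypothetical, not from the cited systems.

class BufferPoolWithSSDCache:
    def __init__(self):
        self.ram = {}    # page_id -> page data (RAM buffer pool)
        self.ssd = {}    # page_id -> page data (SSD cache extension)
        self.disk = {}   # page_id -> page data (permanent home of data)

    def read_page(self, page_id):
        # 1. RAM hit: cheapest path.
        if page_id in self.ram:
            return self.ram[page_id], "ram"
        # 2. SSD hit: far cheaper than a random disk read.
        if page_id in self.ssd:
            page = self.ssd[page_id]
            self.ram[page_id] = page     # promote into the RAM buffer pool
            return page, "ssd"
        # 3. Miss: read from disk, cache in RAM and admit to the SSD cache.
        page = self.disk[page_id]
        self.ram[page_id] = page
        self.ssd[page_id] = page
        return page, "disk"
```

A page evicted from RAM can later be served from the SSD, which is exactly the "warm" state these papers try to preserve across restarts.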

Applications in Industry

• Intel (Differentiated Storage Services)
  – Intel's SSD caching solution temporarily mirrors needed files in the SSD cache.
• Apple (Fusion Drive)
  – Fusion Drive combines an SSD and a hard disk.
  – Frequently used apps, documents, photos, and other files are stored on flash.
  – All writes go to the SSD; infrequently used content is moved to the hard disk.
• Oracle Exadata (Database Machine)
  – Integrates scalable servers and storage, InfiniBand networking, smart storage, PCI flash, smart memory caching, hybrid columnar compression, and more into a unified hardware/software data-management platform.
  – The Smart Flash Cache transparently caches frequently accessed hot data in high-speed solid-state storage to address the random disk I/O bottleneck.

Outline

• Introduction
• SSD-based extension buffer
• Enhancing recovery by SSD
• Two related works
  – Enhancing recovery using… [DaMoN 2011]
  – Fast peak-to-peak… [ICDE 2013]
• Summary

Recovery for SSD-based cache system

• Problem definition
  – A small amount of SSD can absorb a large fraction of random I/O.
  – A long time is needed when restarting the DBMS after a shutdown or a crash.
  – There has not been much emphasis on exploiting the persistency of SSD.
• Challenge
  – How to improve recovery performance without negatively impacting peak performance.
  – How to ensure the correctness of the DBMS when executing the recovery algorithm.

Recovery for SSD-based cache system

Outline

• Introduction• SSD-based extension buffer• Enhancing recovery by SSD • Two related work– Enhancing recovery using…[DaMoN2011]– Fast peak-to-peak ….[ICDE2013]

• Summary

DaMoN 2011

Motivation

• Recovery is itself a random-I/O-intensive process.
• The pages that need to be read and written during recovery may be scattered over various parts of the disk.
• Preserve the state of the SSD buffer pool so that it can be used during crash recovery.
• Provide a warm buffer-pool restart.

TAC (VLDB 2010)

• TAC
  – Write-through
  – Temperature-based data prefetch

M. Canim et al. SSD Bufferpool Extensions for Database Systems. VLDB 2010.

Implementing Recovery

• Metadata persistence
  – Store some SSD buffer-pool metadata on the persistent SSD storage.
• Mapping information synchronization
  – When a new page is admitted to the SSD buffer pool and an old page is evicted, the slot table must be updated.
  – When a dirty page is evicted from the RAM-resident buffer pool, no modifications of the slot table are required.
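The synchronization rule above can be sketched as a small model with hypothetical names, where persisting a slot-table entry to the SSD is simulated by a write counter:

```python
# Sketch of slot-table maintenance for an SSD buffer pool (hypothetical names).
# The slot table maps SSD slots to page IDs and is persisted on the SSD
# so that it can be reused after a restart.

class SSDSlotTable:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # slot index -> page_id (None = free)
        self.persist_writes = 0           # metadata writes issued to the SSD

    def _persist_slot(self, slot):
        # In a real system this writes the slot entry to a reserved
        # metadata region on the SSD; here we only count the write.
        self.persist_writes += 1

    def admit(self, slot, new_page_id):
        # Admitting a new page (and evicting the old occupant, if any)
        # changes the SSD's contents, so the slot table must be persisted.
        evicted = self.slots[slot]
        self.slots[slot] = new_page_id
        self._persist_slot(slot)
        return evicted

    def on_ram_dirty_eviction(self, page_id):
        # Evicting a dirty page from the RAM buffer pool does not change
        # which pages occupy which SSD slots: no slot-table write needed.
        pass
```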

Recovery for TAC

• Correctness for TAC
  – Slots are initially invalidated.
  – A slot is updated only after the corresponding write has finished.
  – As a result, some valid data on the SSD may be missed after a crash.

Experiment Results

• Experiment setup
  – 500 warehouses (TPC-C)
  – The RAM was kept at 2.0% of the database size.

(Figures: impact of metadata writes; impact on logging)

Experiment Results

• Crash performance and restart performance (figures)

Summary for recovery in TAC

• The experiments are thorough and the analysis is insightful.
• However, the metadata file is small (23 MB), and the size ratio between SSD and RAM is only 3 (3.6 GB / 1.2 GB).
• The cost of synchronization is therefore relatively low.

ICDE 2013

Motivation

• With an SSD buffer pool, a DBMS still treats the disks as the permanent "home" of data.
• Such a scheme has a long "peak-to-peak interval" when restarting a DBMS.
• We need a fast mechanism to reduce the restart and ramp-up time.

Background: Two SSD buffer-pool extension designs

• DW (dual-write)
  – Write-through
• LC (lazy-cleaning)
  – Write-back

J. Do et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011.

Background: Two SSD buffer-pool extension designs

• Data structure
  – SSD buffer table

J. Do et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011.


Background: Recovery in SQL Server 2012

• Data structures
  – Transaction log: update log records (pageID, prepageLSN, …), BUF_WRITE log records, …
  – Dirty page table: stores information about dirty pages in main memory (pageID, recLSN, lastLSN, …)
  – Transaction table: stores information about active transactions (beginLSN, endLSN, …)
  – …
• Checkpoint
• Recovery
  – Analysis phase: build the dirty page table, transaction table, lock table, …
  – Redo phase
  – Undo phase
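The analysis phase described above can be sketched as one forward scan of the log that rebuilds the dirty page table and the transaction table. This is a simplified ARIES-style sketch with hypothetical record formats, not SQL Server's actual log layout:

```python
# Simplified sketch of the recovery analysis phase: scan the log once and
# rebuild the dirty page table and transaction table. The record formats
# here are hypothetical and much simpler than a real DBMS log.

def analysis_phase(log):
    dirty_pages = {}   # page_id -> recLSN (first LSN that dirtied the page)
    transactions = {}  # txn_id  -> lastLSN of a still-active transaction
    for lsn, record in enumerate(log):
        kind = record["type"]
        if kind == "update":
            txn, page = record["txn"], record["page"]
            transactions[txn] = lsn
            dirty_pages.setdefault(page, lsn)  # recLSN = first dirtying LSN
        elif kind == "commit":
            transactions.pop(record["txn"], None)   # no longer active
        elif kind == "buf_write":
            dirty_pages.pop(record["page"], None)   # page persisted to disk
    return dirty_pages, transactions
```

The redo phase would then start from the smallest recLSN in the dirty page table, and the undo phase would roll back the transactions that remain in the transaction table.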

Background: Recovery in SQL Server

Restart design

• Some pitfalls in using the SSD after a restart
  – Different versions of data on SSD and disk
    • In DW, delay modifying the FC until both the SSD write and the disk write have completed.
    • In LC, a BUF_WRITE log is generated after the lazy cleaner finishes copying a dirty SSD page to the disks.
    • In LC, oldestDirtyLSN is the oldest recLSN of the dirty pages in RAM and in the SSD buffer pool.

MMR design

• Main idea
  – Store the mapping table on the SSD.
  – Synchronously update the mapping table.
• Hardening the FC fields
  – state, pageID, lastUseTime, nextToLastUseTime
• When to harden
  – When a clean SSD frame is about to be replaced, flush the state change.
  – Minimize the number of flushes.
• Recovering the SSD buffer table
  – Recover the state of each FC (FREE, CLEAN, or DIRTY).
  – Rebuild the data structures.
  – Recover the recLSN of each FC after the analysis phase.
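The MMR idea can be sketched as write-through hardening of every FC change plus a simple rebuild at restart. This is a simplified model; the `persisted` dictionary stands in for the mapping table stored on the SSD, and all names beyond those in the slide are hypothetical:

```python
# Sketch of MMR-style synchronous hardening: every FC change is written
# through to a persisted copy of the mapping table, and restart rebuilds
# the in-memory SSD buffer table from that copy.

FREE, CLEAN, DIRTY = "FREE", "CLEAN", "DIRTY"

class MMRBufferTable:
    def __init__(self, num_frames, persisted=None):
        # `persisted` simulates the mapping table kept on the SSD itself.
        self.persisted = persisted if persisted is not None else {}
        self.fcs = {i: {"state": FREE, "page_id": None}
                    for i in range(num_frames)}

    def set_frame(self, frame, state, page_id):
        fc = {"state": state, "page_id": page_id}
        self.fcs[frame] = fc
        self.persisted[frame] = dict(fc)   # synchronous hardening

    @classmethod
    def recover(cls, persisted, num_frames):
        # Rebuild the in-memory SSD buffer table from the hardened copy.
        table = cls(num_frames, persisted)
        for frame, fc in persisted.items():
            table.fcs[frame] = dict(fc)
        return table
```

The recLSN fields are deliberately absent here: as the slide notes, they are recovered separately after the analysis phase.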

LBR design

• Main idea
  – Checkpoint the SSD buffer table during a DBMS checkpoint.
  – Log updates to the SSD buffer table through SSD log records.
  – Work out the protocol to checkpoint, log, and recover.
• Hardening the FC fields
  – state, pageID, lastUseTime, nextToLastUseTime
• SSD log records
  – SSD_CHKPT: hardens the states of every 64 FCs.
  – SSD_WRITE_INVALIDATE: overwrite a clean SSD page when there is no available free SSD frame.
  – SSD_POST_WRITE: after a page is written to the SSD.
  – SSD_LAZY_CLEANED: after a dirty SSD page is cleaned.

LBR design

• When to harden
  – Only the SSD_PRE_WRITE_INVALIDATE log record must be flushed to disk before the thread that generates the log record can continue.
  – Group writing optimization.
• Recovery
  – SSD_CHKPT: if an FC is DIRTY, recover its recLSN field and update the SSD hash table.
  – SSD_WRITE_INVALIDATE: invalidate the corresponding FC.
  – SSD_POST_WRITE: handled the same way as an SSD_CHKPT log record.
  – SSD_LAZY_CLEANED: the FC state is changed from LAZYCLEANING to CLEAN.
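The recovery rules above amount to replaying the SSD log records in LSN order. A minimal sketch with hypothetical record layouts (much simpler than the paper's actual records):

```python
# Sketch of LBR-style recovery: replay SSD log records in LSN order to
# rebuild the FC states. Record layouts here are hypothetical.

def replay_ssd_log(records, num_frames):
    fcs = {i: {"state": "FREE", "page_id": None} for i in range(num_frames)}
    for rec in records:
        frame = rec["frame"]
        if rec["type"] == "SSD_CHKPT":
            # Restore the checkpointed state of this FC.
            fcs[frame] = {"state": rec["state"], "page_id": rec["page_id"]}
        elif rec["type"] == "SSD_WRITE_INVALIDATE":
            # The SSD page was overwritten: invalidate the FC.
            fcs[frame] = {"state": "FREE", "page_id": None}
        elif rec["type"] == "SSD_POST_WRITE":
            # A page write to the SSD completed after the checkpoint.
            fcs[frame] = {"state": rec["state"], "page_id": rec["page_id"]}
        elif rec["type"] == "SSD_LAZY_CLEANED":
            # The lazy cleaner copied the dirty page to disk.
            fcs[frame]["state"] = "CLEAN"
    return fcs
```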

LVR design

• Main idea
  – Asynchronously harden the SSD buffer table.
  – Deal with invalid SSD buffer-table records recovered from the most recent flush.
• Ensure two properties
  – The database must remain consistent if the design chooses to reuse a page in the SSD buffer pool upon a restart: the pageID of an FC may differ from the actual SSD page.
  – The database must remain consistent if the design chooses to discard a page in the SSD buffer pool upon a restart, even if the SSD page is newer than the disk version (oldestDirtyLSN).

LVR design

• Hardening the FC fields
  – state, pageID, lastUseTime, nextToLastUseTime, blank, beforeHardeningLSN
• The FC flusher thread
  – Repeatedly scans the SSD buffer table in chunks and hardens the FCs.
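The flusher's chunked scan, and the requirement that a checkpoint wait for one complete pass over the table, can be sketched as follows (the chunk size and all names are hypothetical):

```python
# Sketch of an LVR-style FC flusher: repeatedly scan the SSD buffer table
# in fixed-size chunks and harden each chunk asynchronously.

class ChunkedFlusher:
    def __init__(self, table, chunk_size=64):
        self.table = table          # list of FC entries
        self.chunk_size = chunk_size
        self.cursor = 0             # start index of the next chunk
        self.hardened = []          # simulates hardened copies on the SSD

    def flush_next_chunk(self):
        # Harden one chunk, then advance (wrapping to start a new pass).
        chunk = self.table[self.cursor:self.cursor + self.chunk_size]
        self.hardened.append((self.cursor, list(chunk)))
        self.cursor += self.chunk_size
        if self.cursor >= len(self.table):
            self.cursor = 0         # completed one full pass

    def full_pass(self):
        # A checkpoint must wait until the flusher finishes a complete
        # pass of hardening the SSD buffer table.
        self.cursor = 0
        while True:
            self.flush_next_chunk()
            if self.cursor == 0:
                break
```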

LVR design

• Checkpoint
  – Make sure that the FC flusher thread finishes a complete pass of hardening the SSD buffer table during a checkpoint.
• Recovering from shutdown

LVR design

• Recovering from crash

Experiment results

• Experiment setup
  – 24 GB RAM, 140 GB SSD, 200 GB database size
  – SQL Server 2012
  – Dirty fraction: 20%
• Throughput after restart (figures: TPC-C, TPC-E)

Experiment results

• TPC-C evaluation
  – Peak-to-peak interval when restarting from a shutdown and from a crash (figures)

Experiment results

• TPC-E evaluation
  – Peak-to-peak interval when restarting from a shutdown and from a crash (figures)

Outline

• Introduction
• SSD-based extension buffer
• Enhancing recovery by SSD
• Two related works
  – Enhancing recovery using… [DaMoN 2011]
  – Fast peak-to-peak… [ICDE 2013]
• Summary

Summary

• Basic requirements
  – Ensure the consistency and correctness of the DBMS.
  – Minimize the cost of hardening the mapping information.
  – Design different recovery algorithms for different caching policies.
• Various pitfalls
  – Log vs. metadata files: log-based schemes require more space and bring higher complexity when designing the recovery algorithm.

Summary

• Emerging memory technology
  – Harden metadata to PCM synchronously.
  – Scan PCM and rebuild the mapping table for the SSD.
• Design principles
  – Finer-grained access granularity.
  – Minimize PCM writes.
  – Design an index to reduce the performance loss.

(Figure: memory hierarchy: CPU, L1/L2 cache, DRAM, PCM for metadata, SSD for data)

Summary

• Asynchronous hardening
  – The mapping file is created on the SSD.
  – Each flash page of the file is responsible for one SSD data area.
  – Only the updated SSD data areas are hardened.
  – This reduces the number of I/Os.
  – The destination FC can be found quickly during recovery.
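The area-based hardening idea above can be sketched as follows: the in-memory mapping table is divided into areas, each update marks its area dirty, and hardening rewrites only the dirty areas rather than the whole file (all names and the area size are hypothetical):

```python
# Sketch of region-based asynchronous hardening: the mapping file is split
# into areas, one flash page of the file per SSD data area, and only the
# areas with updated entries are rewritten.

class RegionMappingFile:
    def __init__(self, num_entries, area_size=128):
        self.area_size = area_size
        self.entries = [None] * num_entries   # in-memory mapping table
        self.dirty_areas = set()              # areas changed since last harden
        self.writes = 0                       # flash-page writes performed

    def update(self, index, page_id):
        self.entries[index] = page_id
        self.dirty_areas.add(index // self.area_size)

    def harden(self):
        # Rewrite only the updated areas, not the whole mapping file.
        self.writes += len(self.dirty_areas)
        self.dirty_areas.clear()
```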

Summary

• Lowering the cost of scanning the SSD buffer table
  – Add a checkpoint for mapping-information updates.
  – A log is used to record the most recent checkpoint.
  – During recovery, scan only the metadata updates since the related checkpoints.

Thank You!
