Benchmarking Performance: Benefits of PCIe NVMe SSDs for Client Workloads


White Paper: Benchmarking Performance Against Real-World Workloads

With the release of the 950 PRO, Samsung is taking client storage to a new level by switching from the Serial ATA (SATA) to the Peripheral Component Interconnect Express (PCIe) interface and utilizing the Non-Volatile Memory Express (NVMe) protocol, designed specifically for Solid State Drives (SSDs). The drive's faster interface and lower latency protocol make the V-NAND-equipped 950 PRO the biggest advancement in the client SSD space since the release of the first client-oriented SSDs more than five years ago.

This whitepaper discusses the benefits that PCIe NVMe SSDs, such as the 950 PRO, bring to client PC users. Client PC workloads are not always well understood in the industry, since common benchmarking utilities tend to focus on measuring maximum performance rather than performance under typical PC usage. More specifically, benchmarking utilities often use very high queue depths to produce high performance numbers, whereas in the real world most IO activity is low in queue depth. This whitepaper provides actual IO traces of PC workloads to better understand how client SSDs should be benchmarked, and also tests the 950 PRO against other Samsung SSDs to show how PCIe and NVMe improve IO performance in tests that represent real-world IO activity.

SATA and PCIe are both electrical interfaces used to transfer data between an SSD and the rest of the system. Traditionally, storage devices have used the SATA interface, which connects to the CPU through a Platform Controller Hub (PCH). However, due to the limits of the SATA interface, the SSD industry has shifted towards the PCIe interface. PCIe offers substantially more bandwidth than the SATA interface, and since PCIe SSDs can connect directly to the CPU, they provide lower latency than SATA SSDs.

In addition to the electrical interface, the operating system and applications also need a software interface to interact with a storage device. For the past decade, SSDs and HDDs have utilized the Advanced Host Controller Interface (AHCI), which became a bottleneck for SSDs since it was originally designed for SATA and HDDs. SSDs utilize NAND flash memory rather than rotating platters, so SSDs are inherently capable of much higher transfer speeds and lower latencies. Without an optimized software interface, though, SSDs cannot reach their full potential.

Introduction: Benefits of PCIe NVMe SSDs

What Are PCIe & NVMe?

[Figure: SSD connectivity. A PCIe 3.0 x4 (32Gbps) SSD connects directly to the CPU, while a SATA 3.0 (6Gbps) SSD connects through the PCH, which links to the CPU over DMI.]


The NVMe Interface: NVMe is a new software interface that replaces AHCI and was built from the ground up for SSDs and NAND flash. It utilizes a simplified, low latency stack between the application and the SSD, which reduces IO overhead by nearly 70%. With less overhead, NVMe SSDs are able to provide higher performance and better power efficiency than AHCI-based SSDs.

Furthermore, NVMe includes a vastly improved queueing system, with support for thousands of queues, each supporting up to 65,536 outstanding commands. In comparison, AHCI supports only one queue with up to 32 outstanding commands.
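To put that queueing difference in perspective, a quick back-of-the-envelope comparison of total command capacity (a sketch only; the NVMe specification permits up to 65,535 I/O queues, but real client drives expose far fewer):

```python
# AHCI: a single queue, 32 command slots.
ahci_slots = 1 * 32

# NVMe: up to 65,535 I/O queues, each up to 65,536 commands deep
# per the specification (actual drives implement a small subset).
nvme_slots = 65_535 * 65_536

print(ahci_slots)                 # 32
print(nvme_slots)                 # 4294901760
print(nvme_slots // ahci_slots)  # ~134 million times the capacity
```

The absolute numbers matter less than the architectural point: NVMe's parallel queues let multi-core hosts submit commands without contending for one shared queue.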

[Figure: Linux NVMe stack vs. Linux AHCI stack. Both paths run from user application through the file system and block driver to the device driver, but the AHCI stack adds SCSI/SATA translation plus extra OS scheduling and context switching, giving the NVMe stack roughly 3x less overhead.]

When benchmarking an SSD, one of the first and most critical steps is to understand the workload intended for the product. Without an understanding of the workload, tests may measure metrics that are irrelevant to the intended use case, resulting in inaccurate conclusions about the product.

Most of the commercially available, easy-to-use SSD benchmarking tools, such as CrystalDiskMark and AS SSD, are primarily focused on measuring maximum performance. Maximum transfer rates can be relevant in tasks like large file transfers, but they don't illustrate performance under typical PC usage.

The best way to investigate and understand PC workloads is to trace IO activity for a period of time and then perform statistical analysis on the collected data. In Windows, IO tracing can be done using Xperf, which is included in the free Windows Performance Toolkit. AnandTech has extensively studied client PC workloads and built three traces to illustrate different workloads, ranging from a power user to very basic light usage. The details of all three traces are publicly available and provide great insight into the IO activity of typical client PC workloads.

• The Destroyer is the most intensive workload and includes tasks such as virtualization and application development, along with more general gaming and photo editing usage. It best describes a power user workload.

• The Heavy workload is a more typical enthusiast workload consisting of gaming, photo editing and content creation in Dreamweaver. It also includes general productivity tasks such as web browsing, email management, application installing and virus scanning.

• The Light workload illustrates basic PC usage and is focused on general productivity tasks like web browsing, email management and application installing.

Defining a Client Workload


The first variable that needs to be understood before benchmarking an SSD is the IO size.

AnandTech's IO traces show a clear pattern: most of the IOs are concentrated at IO sizes of 4KB and 64-128KB, regardless of workload intensity. This is actually in line with what most benchmarking applications measure, since benchmarks usually consist of 4KB random read/write tests and sequential tests with a large (64KB or greater) IO size.
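An IO-size breakdown of this kind can be produced from a trace with simple bucketing. The sketch below uses a hypothetical list of IO sizes in bytes; real input would come from a tracing tool such as Xperf:

```python
from collections import Counter

def bucket_io_size(size_bytes):
    """Map an IO size in bytes to a named bucket matching the
    breakdown discussed in the text (<4KB up to 128KB)."""
    kb = size_bytes / 1024
    if kb < 4:
        return "<4KB"
    for label, limit in [("4KB", 4), ("8KB", 8), ("16KB", 16),
                         ("32KB", 32), ("64KB", 64)]:
        if kb <= limit:
            return label
    return "128KB"

# Toy trace of IO sizes in bytes (hypothetical values).
sizes = [4096, 4096, 512, 65536, 131072, 131072, 8192]
hist = Counter(bucket_io_size(s) for s in sizes)
print(hist)
```

Plotting the counts as percentages of total IOs reproduces the style of breakdown the traces are analyzed with.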

The second variable is queue depth, meaning the number of outstanding IOs. AnandTech's traces show that the majority of IOs happen at a queue depth of one, with 75-90% of IOs happening at a queue depth of three or below. There are some differences between the workloads. For instance, the more IO-intensive The Destroyer and Heavy workloads have a higher average queue depth, but even in the heavier PC workloads only a small portion of total IOs are high queue depth, and only a fraction are above a queue depth of 32.

Benchmarking applications tend to use high queue depths to produce better performance numbers, but as AnandTech's real-world IO trace data shows, high queue depths do not illustrate a typical PC workload. In enterprise workloads, queue depths are often high because dozens of people may access one drive at the same time, whereas in PC environments drive access is limited to a single user, thus lowering the queue depths.
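The queue-depth statistics described above can be sketched from a trace as follows. The record format here is hypothetical, a list of (issue_time, complete_time) pairs, since Xperf's actual output format differs:

```python
from collections import Counter

def queue_depth_histogram(ios):
    """Given hypothetical IO records as (issue_time, complete_time)
    pairs, return a Counter mapping observed queue depth to IO count."""
    hist = Counter()
    for issue, _ in ios:
        # Queue depth = IOs issued but not yet completed at the moment
        # this IO is issued (including the IO itself).
        depth = sum(1 for i, c in ios if i <= issue < c)
        hist[depth] += 1
    return hist

# Toy trace: three overlapping IOs and one isolated IO.
trace = [(0.0, 1.0), (0.5, 1.5), (0.6, 2.0), (3.0, 3.5)]
print(queue_depth_histogram(trace))
```

On this toy trace, two IOs are issued at depth one, one at depth two and one at depth three; the same counting over a real trace yields the percentage-by-depth breakdown discussed in the text.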

Understanding SSD Benchmarking Variables

[Figure: AnandTech Storage Bench - Queue Depth Breakdown. Percentage of total IOs by queue depth (1, 2, 3, 4-5, 6-10, 11-20, 21-32, >32) for The Destroyer, Heavy and Light workloads.]

[Figure: AnandTech Storage Bench - IO Size Breakdown. Percentage of total IOs by IO size (<4KB, 4KB, 8KB, 16KB, 32KB, 64KB, 128KB) for The Destroyer, Heavy and Light workloads.]


Test System

Hardware
• Motherboard: ASRock Z170 Extreme7+
• Chipset: Intel Z170
• Processor: Intel Core i5-6600K
• Graphics: Intel HD Graphics 530
• Memory: 16GB (2x8GB) DDR4-2400
• Boot Drive: Samsung 850 PRO 1TB

Software
• Operating System: Windows 10 Pro x64
• Test Tool: Iometer 1.1.0
• NVMe Driver (950 PRO): Samsung NVMe Driver 1.0
• AHCI Driver: Intel Rapid Storage Technology 14.6.0.1029
• Chipset Driver: 10.0.27
• Graphics Driver: 15.40.7.4279

Drive   | Electrical Interface  | Software Interface | NAND Configuration
950 PRO | PCIe 3.0 x4 (32Gbps)  | NVMe               | 128Gbit 32-layer MLC V-NAND
850 PRO | SATA 3.0 (6Gbps)      | AHCI               | 128Gbit 32-layer MLC V-NAND
840 PRO | SATA 3.0 (6Gbps)      | AHCI               | 64Gbit 21nm planar MLC NAND

Based on AnandTech's Storage Bench data, a basic test suite can be built to measure performance in PC workloads. With IO sizes mostly split between 4KB and 64-128KB, there are essentially four tests needed to determine performance: 4KB random read, 4KB random write, 128KB sequential read and 128KB sequential write. Small IO patterns, such as log file updates, are typically random by nature, whereas large IOs, such as application loading, tend to be sequential. Therefore, it is logical to test small IOs with random patterns and large IOs with sequential patterns.

Since queue depths in PC usage are typically very low, as shown by AnandTech's Storage Bench IO traces, running benchmarks at low queue depths is necessary to produce results that reflect actual usage.

A queue depth of one is the most relevant, but for more accurate results and conclusions, it is recommended to test some higher queue depths as well. For this whitepaper, we have chosen queue depths of 1, 2, 4 and 8 to show performance scaling with higher, but still relatively low, queue depths to ensure relevancy to real-world performance.
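The resulting test plan is a simple sweep of four access patterns across four queue depths. This is an illustrative sketch of the matrix only; the whitepaper's actual measurements were made with Iometer 1.1.0, not this script:

```python
from itertools import product

# Four access patterns derived from the IO-size breakdown:
# small IOs are tested randomly, large IOs sequentially.
patterns = [
    ("4KB random read",        4 * 1024,   "random",     "read"),
    ("4KB random write",       4 * 1024,   "random",     "write"),
    ("128KB sequential read",  128 * 1024, "sequential", "read"),
    ("128KB sequential write", 128 * 1024, "sequential", "write"),
]

# Low queue depths chosen to reflect real-world client usage.
queue_depths = [1, 2, 4, 8]

test_matrix = [
    {"name": name, "block_size": size, "access": access,
     "op": op, "queue_depth": qd}
    for (name, size, access, op), qd in product(patterns, queue_depths)
]

print(len(test_matrix))  # 16 test configurations
```

Each of the 16 configurations corresponds to one data point in the performance charts that follow.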

In these tests, the 950 PRO is compared against its predecessors, the 850 PRO and 840 PRO, to show the benefits of NVMe and PCIe over SATA 6Gbps. All drives are 256GB in capacity.

Benchmarking the 950 PRO and NVMe


Random Read: NVMe and the 950 PRO show substantial performance gains in random read performance. At a queue depth of one, the 950 PRO is more than 40% faster than the SATA- and AHCI-based 850 PRO. At higher queue depths, the performance differences are even greater, with the 950 PRO performing up to 60% faster than the 850 PRO.

Historically, there has been very little improvement in 4KB random performance at low queue depths due to SATA and AHCI latencies. While PCIe reduces electrical latency through a direct connection to the CPU, NVMe reduces latency overhead even further with its simplified storage stack, resulting in unprecedented SSD performance.

Random Write: In random write, the performance gains at a queue depth of one are even more significant, with the 950 PRO performing more than 70% faster than the 850 PRO. At a queue depth of two, the performance difference grows to 87% in favor of the 950 PRO, although at even higher queue depths the performance delta decreases. Given the rarity of queue depths over four, the 950 PRO provides substantially higher 4KB random write performance under real-world usage.

NVMe SSD Performance Gains

[Figure: 4KB Random Read - transfer rate in MB/s vs. queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]

[Figure: 4KB Random Read QD1 - transfer rate in MB/s for the 950 PRO, 850 PRO and 840 PRO.]

[Figure: 4KB Random Write - transfer rate in MB/s vs. queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]

[Figure: 4KB Random Write QD1 - transfer rate in MB/s for the 950 PRO, 850 PRO and 840 PRO.]


[Figure: 128KB Sequential Read - transfer rate in MB/s vs. queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]

[Figure: 128KB Sequential Write - transfer rate in MB/s vs. queue depth (1, 2, 4, 8) for the 950 PRO, 850 PRO and 840 PRO.]

Sequential Read and Write: In sequential read performance, the 950 PRO is more than three times faster than the 850 PRO at a queue depth of one. At higher queue depths the difference grows fourfold in favor of the 950 PRO. Part of the performance gain comes from the higher bandwidth of PCIe 3.0 x4 compared to the SATA 6Gbps interface. However, the lower latency of the NVMe stack is also a crucial contributor to sequential performance.

Similarly, the 950 PRO is more than twice as fast as the 850 PRO in sequential write at a queue depth of one, and the performance benefit is sustained at higher queue depths as well.


Thanks to its low latency stack, NVMe provides sizable performance improvements at low queue depths. The 950 PRO performs up to several times faster than its predecessor, the 850 PRO, and since most IOs in PC environments occur at low queue depths, the performance improvements translate directly to real-world performance. With a faster SSD, the system will be even more responsive, resulting in a better user experience and increased productivity.

Conclusion

Learn more: samsung.com/enterprisessd | insights.samsung.com | 1-866-SAM4BIZ

Follow us: youtube.com/samsungbizusa | @SamsungBizUSA

©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd. All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for informational purposes only. Samsung makes no warranties, express or implied, in this white paper. WHP-SSD-NVMe-Jan16J

Samsung Workstation SSD Portfolio

950 PRO Series Client PC SSDs
• 2-bit MLC V-NAND
• Designed for high-end PCs
• PCIe interface
• NVMe protocol
• Form factors: M.2

850 PRO Series Client PC SSDs
• 2-bit MLC V-NAND
• SATA 6Gb/s interface
• Form factors: 2.5”

About the Author: Kristian Vättö is a technical marketing specialist who started his career as a news editor at AnandTech.com in 2011. He later became the site's SSD editor and was responsible for producing highly detailed and professional SSD reviews. In addition to his work with Samsung, Kristian is currently studying economics at the University of Tampere in Finland.