
White Paper:

Big Data SSD Architecture

Digging Deep to Discover Where SSD Performance Pays Off

Big data is much more than just "lots of data." State-of-the-art applications gather data from many different sources in varying formats with complex relationships. Analytics yield insight into events, trends, and behaviors, and may adapt in real time depending on specific findings. These data sets can indeed grow extremely large, with captured sequences kept for future analysis.

Traditional database tools usually handle data with pre-determined structures and fixed relationships, typically on a scale-up server and storage cluster. Often, these tools link with some transactional process, providing front-end services for users while capturing back-end data. Larger deployments may use a data warehouse, which consolidates data from various applications into one managed, reliable repository of information for enterprise decision-making.

New distributed, flexible, and scalable tools for multi-sourced and poly-structured big data typically employ large numbers of less-expensive networked processing and storage systems. Rather than forcing all data into a massive warehouse before examination, distributed big data processing nodes can relieve architectural stress points with localized processing. For larger tasks, local nodes can join in global processing, increasing parallelism and supporting chained jobs.

Which storage technology, hard disk drives (HDDs) or solid-state drives (SSDs), excels in big data architecture? SSDs clearly win on speed, offering both higher sequential read/write speeds and higher input/output operations per second (IOPS). A naïve approach might replace all HDDs with SSDs everywhere in a big data system. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins, and integrates HDDs in less stressful roles to save on system costs.

This white paper looks at the basics of big data tools, reviews two performance wins with SSDs in a well-known framework, and presents some examples of emerging opportunities on the leading edge of big data technology.

Location: SSDs in Big Data Architecture

The 5 Vs of Big Data (adapted from "Defining the Big Data Architecture Framework," Demchenko, University of Amsterdam, July 2013):

• Volume: terabytes, records/arch, transactions, tables and files
• Variety: structured, unstructured, multi-factor, probabilistic
• Velocity: batch, real/near-time, processes, streams
• Veracity: trustworthiness, authenticity, origin and reputation, availability, accountability
• Value: statistical, events, correlations, hypothetical

Finding those precise locations where SSDs speed up big data operations is a hot topic. Researchers are digging deep inside architectures looking for spots where HDDs are overwhelmed and slowing things down. Before we introduce findings from some of these studies, a quick overview of big data tools and terminology is in order.

Basics of Big Data Tools

Apache Hadoop is one of the best-known tools associated with big data architecture. Its framework provides a straightforward way to distribute and manage data on multiple networked computers – nodes – and schedule processing resources for applications. First released at the end of 2011, Hadoop has gained massive popularity in search engine, social networking, and cloud computing infrastructure.1

Hadoop leverages parallelism for both storage and computational resources. Instead of storing a very large file in a single location, Hadoop spreads files across many nodes. This effectively multiplies available file system bandwidth. Nodes also contain processing elements, allowing scheduling of jobs and tasks across multiple heterogeneous machines simultaneously. The architecture provides fault resilience; data is replicated on multiple Hadoop nodes, and if a node is down, another replaces it. Adding nodes enhances scalability, with large instances achieving hundreds of petabytes of storage.

Inside Hadoop are two primary engines: HDFS (Hadoop Distributed File System) and MapReduce. Both run on the same nodes concurrently.

HDFS manages files across a cluster containing a NameNode tracking metadata and a number of DataNodes containing the actual data. HDFS maintains data and rack (physical location) awareness between jobs and tasks, minimizing network transfers. It rebalances data within a cluster automatically, and supports snapshots, upgrades, and rollback for maintenance. Data can be accessed via a Java API, web browsers and a REST API, or other languages via the Thrift API. Applications using HDFS include the Apache HBase non-relational distributed database modeled after Google Bigtable, and the Apache Hive data warehouse manager. Hadoop is open to other distributed file systems with varying functionality.2
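For illustration, a minimal sketch of writing a file through the HDFS Java API (the NameNode address and file path below are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // HDFS splits large files into blocks and replicates each block
        // across DataNodes; the client API hides those details.
        try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.writeUTF("hello, hdfs");
        }
    }
}
```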

MapReduce is an engine for parallel, distributed processing. A job separates a large input data set into smaller chunks, with data represented as (key, value) pairs. Map tasks process these chunks in parallel; an intermediate shuffle function then sorts the results and passes them to Reduce tasks for aggregation and other mathematical operations. MapReduce is by nature disk-based, batch-oriented, and non-real-time.3
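As an illustrative sketch of this (key, value) flow, the canonical word-count pattern in the standard Hadoop Java API (not taken from the studies cited here):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: emit an intermediate (word, 1) pair for each word in an input chunk.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce task: the shuffle has already grouped values by key; sum them per word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```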

[Figure: Hadoop Multi-Node Cluster, adapted from "Running Hadoop on Ubuntu Linux (Multi-Node Cluster)," Michael Noll. The MapReduce layer runs a job tracker on the master node and a task tracker on each slave; the HDFS layer runs a name node on the master and a data node on each slave.]

Locating Quick Wins with SSDs

Most computing jobs fall into one of two categories: compute-intensive or I/O-intensive. Studying the MapReduce workload, a team from Cloudera4 deployed one SSD versus several HDDs configured to approximate the same theoretical aggregate sequential read/write bandwidth, isolating performance differences.

Hadoop clusters with all SSDs outperformed HDD clusters by as much as 70% under some workloads in the Cloudera study. SSDs also achieved about twice the actual sequential I/O size, with lower latency supporting more task scheduling. For some workloads the difference is smaller: chopping up large files means sequential access is not the limiting factor – exactly what Hadoop intends to achieve. However, when file granularity exceeds what node memory can hold, an important change occurs:

“In practice, independent of I/O concurrency, there is negligible disk I/O for intermediate data that fits in memory, while a large amount of intermediate data leads to severe load on the disks.”

—"The Truth About MapReduce Performance on SSDs," 2014

MapReduce actually involves three phases: Map, shuffle, and Reduce. The intermediate shuffle operation stores its results in a single file on local disk. By targeting shuffle results to an SSD using the mapred.local.dir parameter, performance increases dramatically. Splitting the SSD into multiple data directories also improves sequential read/write task scheduling. Hybrid configurations with HDDs and a properly configured SSD may be the most cost-effective quick win for Hadoop performance.
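As a sketch, the change is a small edit to mapred-site.xml. The directory paths below are hypothetical, and mapred.local.dir is the Hadoop 1.x property name (MRv2 carries it forward as mapreduce.cluster.local.dir):

```xml
<!-- mapred-site.xml: point intermediate (shuffle) data at the SSD.
     Listing several SSD-backed directories spreads shuffle I/O across them. -->
<property>
  <name>mapred.local.dir</name>
  <value>/ssd/mapred/local1,/ssd/mapred/local2,/ssd/mapred/local3</value>
</property>
```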

Hadoop also presents compute-intensive opportunities. To quote Apache, “moving computation is cheaper than moving data.”

A Microsoft study suggests that most analytics jobs do not use huge data sets, citing a data point from Facebook that 90% of jobs have input sizes under 100 GB.5 Their suggestion is to "scale up," but what they describe is actually a converged modular server with 32 cores and paired SSD storage, bypassing HDFS for local access. One point Microsoft makes strongly: "... Without SSDs, many Hadoop jobs become disk-bound."

New PCIe SSDs using the latest NVMe protocol will be ideal to support compute-intensive workloads and keep the network free. This is vital for use cases such as streaming content and incoming IoT sensor data. Removing storage latency also allows more Hadoop jobs to be scheduled. Some applications, such as Bayesian classification used in machine learning, chain Hadoop jobs.
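Chaining is simply sequential job submission, with one job's output directory becoming the next job's input. A minimal sketch in the Hadoop Java API (class names and paths are hypothetical; mapper/reducer setup omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stage 1: its output directory feeds stage 2.
        Job first = Job.getInstance(conf, "stage-1");
        first.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(first, new Path("/data/input"));
        FileOutputFormat.setOutputPath(first, new Path("/data/stage1"));
        if (!first.waitForCompletion(true)) System.exit(1);

        // Stage 2: consumes stage-1 output. Lower storage latency shortens
        // each stage, and the savings compound across the chain.
        Job second = Job.getInstance(conf, "stage-2");
        second.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(second, new Path("/data/stage1"));
        FileOutputFormat.setOutputPath(second, new Path("/data/stage2"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```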

By concentrating processing on bigger multicore servers with all-SSD storage for some jobs, Microsoft finds that overall Hadoop cluster performance increases. Their suggested architecture is a mix of bigger all-SSD nodes and smaller, less expensive nodes (perhaps with hybrid drive configurations) combined in a scale-out approach, handling a range of Hadoop job sizes most efficiently.

[Figure: MapReduce Data Flow. Input splits from HDFS feed map tasks; intermediate results are sorted, copied, and merged in the shuffle; reduce tasks write output parts back to HDFS, where they are replicated.]

More Prime Spots on Faster Routes

Open source projects attract innovation, and big data is no exception. The basics of Hadoop, HDFS, and MapReduce are established, and researchers are turning to other sophisticated techniques. In many cases, SSDs still hold the key to performance improvement.

Virtualization can consolidate multiple nodes on scale-up servers, but inserts significant overhead. Samsung researchers found that Hadoop instances running over Xen virtualization suffer 50% to 83% degradation in sort-based benchmarks using HDDs. The same configurations with SSDs degraded only 10% to 21%. Further analysis reveals why – virtualization decreases I/O block size and increases the number of requests, an environment where SSDs perform better.6

At Facebook, using HBase over HDFS for the message stack simplifies development, but introduces performance issues. Messages are small files; 90% are less than 15 MB. Compounding the problem, I/O is random, and "hot" frequently used data is often too large for RAM. Part of their solution is a 60 GB SSD configured as HBase cache that more than triples performance through reduced latency.7
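The study does not give Facebook's exact caching mechanism; in stock HBase, a comparable setup is a file-backed BucketCache placed on the SSD (the path and size below are hypothetical):

```xml
<!-- hbase-site.xml: back the block cache with an SSD-resident file. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>file:/ssd/hbase-bucketcache.data</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>61440</value> <!-- MB: roughly the 60 GB cache cited above -->
</property>
```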

Several efforts pursue "in-memory" benefits. Aerospike has created a hybrid RAM/SSD architecture for a NoSQL database. Smaller files live in RAM, accessed with raw reads and writes that bypass the Linux file system, while larger files reside on low-latency SSDs. Indices also reside in RAM. Data on the SSDs is contiguous and optimized for "one read per record."8

Teams at Virginia Tech redesigned HDFS for tiered storage clusters.9 Their study on hatS (heterogeneity-aware tiered storage) shows the value of aiming more requests at PCIe SSDs. All 27 worker nodes have a SATA HDD (3 racks, 9 nodes each), nine nodes have a SATA SSD (3 per rack), and three nodes have a PCIe SSD (1 in each rack). Running a synthetic Facebook benchmark under hatS, the average I/O rate of HDFS increased 36% and execution time fell 26%.

Spark, a new in-memory architecture that could supplant MapReduce, allows programs to load resilient distributed datasets (RDDs) into cluster memory.10 Spark redefines shuffle algorithms, with each processor core (in earlier releases, each map task) creating a number of shuffle result files matching the number of reduce tasks. Contrary to popular belief, Spark RDDs can spill to disk for persistence. Benchmarks by Databricks deploying Spark on 206 Amazon EC2 i2.8xlarge nodes with local SSD storage recently broke sort records, running three times faster with one-tenth the number of machines of an equivalent Hadoop implementation.
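As an illustrative sketch in Spark's Java API (the input path is hypothetical), the MEMORY_AND_DISK storage level keeps partitions in RAM and spills the overflow to the executors' local directories, which is exactly where SSD-backed local storage pays off:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SpillExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spill-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Cache in memory, spilling partitions that do not fit to local disk
        // (the directories set by spark.local.dir -- point these at SSDs).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events")
                                  .persist(StorageLevel.MEMORY_AND_DISK());

        long total = lines.count();   // first action materializes the cache
        long errors = lines.filter(l -> l.contains("ERROR")).count();
        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}
```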

MIT researchers find that if in-memory servers go to disk as little as 5% of the time, the overhead can nullify the gains from RAM-based nodes.11 Their experiment uses Xilinx FPGAs with ARM processing cores to pre-process Samsung SSD accesses in an interconnect fabric. A small cluster of 20 servers achieves near-RAM speeds on image search, PageRank, and Memcached algorithms, demonstrating the value of tightly coupled nodes that combine processing with an SSD.

Faster Now, Even Faster Soon

These studies all show that choosing the right locations for SSDs in big data architecture produces significant cost-per-performance gains. Existing Hadoop installations speed up instantly with a simple change targeting MapReduce shuffle results to an SSD instead of an HDD. New Hadoop installations can take advantage of more SSDs within asymmetric clusters: a scale-up server takes care of intense jobs, augmenting a network of smaller machines, and tightly coupled nodes pre-process data locally on SSDs.

The big data community continues to press the boundaries, looking for architecture enhancements. Hybrid installations with enhanced file systems are showing promise. These approaches capitalize on the ability of SSDs to service more requests quickly, which translates to more job scheduling and improved cluster performance. Noteworthy is the misconception that "in-memory" means no disk: newer big data architectures leverage RAM performance, but when they do go to disk, SSDs are essential to keeping overall cluster performance high.

One powerful observation is that big data is not necessarily operating on large, sequential files. When a cluster does its job, input files have been parsed into smaller pieces, and disk operations become more random and multi-threaded – exactly the conditions where SSDs excel. As V-NAND flash and PCIe interface technology mature, the cost-per-performance metric shifts in favor of SSDs. High-performance SSDs will continue to move from just critical locations to broader use throughout big data clusters.

Learn more: samsung.com/enterprisessd | 1-866-SAM4BIZ

Follow us: youtube.com/samsungbizusa | @SamsungBizUSA | insights.samsung.com

©2016 Samsung Electronics America, Inc. All rights reserved. Samsung is a registered trademark of Samsung Electronics Co., Ltd. All products, logos and brand names are trademarks or registered trademarks of their respective companies. This white paper is for informational purposes only. Samsung makes no warranties, express or implied, in this white paper. WHP-SSD-BIGDATA-JAN16J

1. Apache Hadoop project
2. Apache Hadoop documentation, HDFS Architecture Guide
3. Apache Hadoop documentation, MapReduce Tutorial
4. "The Truth About MapReduce Performance on SSDs," Karthik Kambatla and Yanpei Chen, Cloudera Inc. and Purdue University, presented at LISA14 (USENIX), November 2014
5. "Scale-up vs Scale-out for Hadoop: Time to rethink?," Appuswamy et al., Microsoft Research, presented at SoCC '13, October 2013
6. "Performance Implications of SSDs in Virtualized Hadoop Clusters," Ahn et al., Samsung Electronics, presented at IEEE BigData Congress, June 2014
7. "Analysis of HDFS under HBase: A Facebook Messages Case Study," Harter et al., Facebook Inc. and University of Wisconsin–Madison, presented at FAST '14 (USENIX), February 2014
8. Aerospike web site
9. "hatS: A Heterogeneity-Aware Tiered Storage for Hadoop," Krish et al., Virginia Tech, presented at IEEE/ACM CCGrid 2014, May 2014
10. "Spark the fastest open source engine for sorting a petabyte," Databricks, November 2014
11. "Cutting cost and power consumption for big data," Larry Hardesty, Massachusetts Institute of Technology, July 10, 2015


Samsung Enterprise SSD Portfolio

PM863 Series Data Center SSDs
• 3-bit MLC NAND
• Designed for read-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s interface
• Form factors: 2.5"

SM863 Series Data Center SSDs
• 2-bit MLC NAND
• Designed for write-intensive applications
• Built-in Power Loss Protection
• SATA 6 Gb/s interface
• Form factors: 2.5"

Samsung Workstation SSD Portfolio

950 Pro Series Client PC SSDs
• 2-bit MLC NAND
• Designed for high-end PCs
• PCIe interface
• NVMe protocol
• Form factors: M.2

850 Pro Series Client PC SSDs
• 2-bit MLC NAND
• SATA 6 Gb/s interface
• Form factors: 2.5"