
Page 1:

Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Talk at Intel® HPC Developer Conference 2017 (SC '17)

by

Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi

Page 2:

Big Data Processing and Deep Learning on Modern Clusters

• Multiple tiers + workflow
  – Front-end data accessing and serving (online): Memcached + DB (e.g., MySQL), HBase, etc. (cache-aside sketch below)
  – Back-end data analytics and deep learning model training (offline): HDFS, MapReduce, Spark, TensorFlow, BigDL, Caffe, etc.
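The online tier described above is the classic cache-aside pattern: reads hit Memcached first and fall back to the database only on a miss. A minimal sketch of that pattern (not from the talk), using the pymemcache client against an assumed local Memcached server and a stand-in query_mysql() helper:

```python
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))          # assumed local Memcached server

def query_mysql(user_id: str) -> bytes:
    # Stand-in for the real back-end lookup (MySQL, HBase, ...).
    return f"profile-for-{user_id}".encode()

def get_user_profile(user_id: str) -> bytes:
    key = f"user:{user_id}"
    value = cache.get(key)                    # fast path: served from Memcached
    if value is None:
        value = query_mysql(user_id)          # miss: fall back to the database
        cache.set(key, value, expire=300)     # repopulate the cache for later reads
    return value

print(get_user_profile("42"))
```

Every miss in this loop costs a Memcached round trip plus a database access, which is exactly the path the RDMA and hybrid-memory Memcached designs later in the talk target.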

Page 3:

Drivers of Modern HPC Cluster Architectures

Tianhe-2, Titan, Stampede, Tianhe-1A

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure callouts: accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); high-performance interconnects (InfiniBand, <1 usec latency, 100 Gbps bandwidth); multi-core processors; SSD, NVMe-SSD, NVRAM.]

Page 4:

Interconnects and Protocols in OpenFabrics Stack for HPC (http://openfabrics.org)

[Figure: protocol options below the Application / Middleware layer; the rows of the figure are Application/Middleware Interface, Protocol, Adapter, and Switch.]

• Sockets interface
  – 1/10/40/100 GigE: kernel-space TCP/IP with Ethernet driver, Ethernet adapter, Ethernet switch
  – 10/40 GigE-TOE: kernel-space TCP/IP with hardware offload, Ethernet adapter, Ethernet switch
  – IPoIB: kernel-space IPoIB, InfiniBand adapter, InfiniBand switch
  – RSockets: user-space RSockets, InfiniBand adapter, InfiniBand switch
  – SDP: SDP (RDMA-based), InfiniBand adapter, InfiniBand switch
• Verbs interface
  – iWARP: user-space TCP/IP (offloaded), iWARP adapter, Ethernet switch
  – RoCE: user-space RDMA, RoCE adapter, Ethernet switch
  – IB Native: user-space RDMA, InfiniBand adapter, InfiniBand switch

Page 5:

Large-scale InfiniBand Installations

• 177 IB clusters (35%) in the Jun '17 Top500 list (http://www.top500.org)
• Installations in the Top 50 (18 systems):
  – 241,108 cores (Pleiades) at NASA/Ames (15th)
  – 220,800 cores (Pangea) in France (19th)
  – 522,080 cores (Stampede) at TACC (20th)
  – 144,900 cores (Cheyenne) at NCAR/USA (22nd)
  – 72,800 cores Cray CS-Storm in US (27th)
  – 72,800 cores Cray CS-Storm in US (28th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (30th)
  – 60,512 cores (DGX-1) at Facebook/USA (31st)
  – 60,512 cores (DGX SATURNV) at NVIDIA/USA (32nd)
  – 72,000 cores (HPC2) in Italy (33rd)
  – 152,692 cores (Thunder) at AFRL/USA (36th)
  – 99,072 cores (Mistral) at DKRZ/Germany (38th)
  – 147,456 cores (SuperMUC) in Germany (40th)
  – 86,016 cores (SuperMUC Phase 2) in Germany (41st)
  – 74,520 cores (Tsubame 2.5) at Japan/GSIC (44th)
  – 66,000 cores (HPC3) in Italy (47th)
  – 194,616 cores (Cascade) at PNNL (49th)
  – 85,824 cores (Occigen2) at GENCI/CINES in France (50th)
  – 73,902 cores (Centennial) at ARL/USA (52nd)
  – and many more!

Page 6:

Increasing Usage of HPC, Big Data and Deep Learning

• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)

Convergence of HPC, Big Data, and Deep Learning!

Page 7:

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?

Bring HPC, Big Data processing, and Deep Learning into a "convergent trajectory"!

• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• How do we design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?

Pages 8-11:

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Page 12:

Designing Communication and I/O Libraries for Big Data Systems: Challenges

• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark, and Memcached) – upper-level changes?
• Programming Models (Sockets) – other protocols?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, virtualization, QoS, fault tolerance, I/O and file systems, benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE, and intelligent NICs)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators)
• Storage Technologies (HDD, SSD, and NVMe-SSD)

Page 13:

The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 260 organizations from 31 countries
• More than 23,900 downloads from the project site

Available for InfiniBand and RoCE; also runs on Ethernet.

Page 14:

Acceleration Case Studies and Performance Evaluation

• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• Big Data + HPC Cloud

Page 15:

Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with a communication library written in native code
• Design features
  – RDMA-based HDFS write (client-side write sketch below)
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

[Figure: HDFS write operations go through the Java Native Interface (JNI) to the OSU design over Verbs on RDMA-capable networks (IB, iWARP, RoCE); other operations use the Java Socket Interface over 1/10/40/100 GigE or IPoIB networks.]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
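For context, the path being accelerated is the ordinary HDFS client write pipeline; application code does not change when the RDMA design is used underneath. A client-side sketch of such a write, assuming a NameNode reachable at namenode:8020 and a local Hadoop/libhdfs installation that pyarrow can bind to:

```python
import pyarrow.fs as pafs

# Hostname and port are assumptions; pyarrow needs a Hadoop client/libhdfs configured.
hdfs = pafs.HadoopFileSystem("namenode", 8020)

with hdfs.open_output_stream("/user/demo/block-test.bin") as out:
    out.write(b"x" * (4 * 1024 * 1024))   # 4 MB write flows through the HDFS write/replication pipeline
```

Whether those bytes move over sockets, IPoIB, or the RDMA-based write/replication path above is decided inside HDFS, not in the application.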

Page 16:

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

• Design features
  – Three modes
    • Default (HHH)
    • In-Memory (HHH-M)
    • Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
  – Eviction/promotion based on data usage pattern (see the toy sketch below)
  – Hybrid replication
  – Lustre-Integrated mode: Lustre-based fault tolerance

[Figure: applications on top of Triple-H, which applies hybrid replication, data placement policies, and eviction/promotion across RAM disk, SSD, HDD, and Lustre.]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015.
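As a toy illustration of the eviction/promotion idea noted above (this is not the Triple-H code, just a sketch of usage-based placement across RAM-disk/SSD/HDD/Lustre tiers with arbitrary thresholds):

```python
from collections import defaultdict

TIERS = ["ramdisk", "ssd", "hdd", "lustre"]            # fastest to slowest

class TieredPlacement:
    def __init__(self):
        self.tier = {}                                  # block_id -> tier name
        self.access_count = defaultdict(int)

    def record_access(self, block_id):
        self.access_count[block_id] += 1
        cur = self.tier.get(block_id, "hdd")
        if self.access_count[block_id] > 10 and cur != "ramdisk":
            # Promote a hot block one tier toward the RAM disk.
            self.tier[block_id] = TIERS[TIERS.index(cur) - 1]

    def evict_cold(self):
        # Demote blocks that saw little use since the last sweep, then reset counters.
        for block_id, cur in list(self.tier.items()):
            if self.access_count[block_id] < 2 and cur != "lustre":
                self.tier[block_id] = TIERS[TIERS.index(cur) + 1]
            self.access_count[block_id] = 0

placement = TieredPlacement()
for _ in range(12):
    placement.record_access("blk_001")
print(placement.tier)    # {'blk_001': 'ramdisk'} after repeated accesses
```

The real policies additionally account for replication and fault tolerance (e.g., keeping a Lustre copy in HHH-L mode).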

Page 17:

Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with a communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support

[Figure: MapReduce (JobTracker, TaskTracker, Map, Reduce) atop the Java Socket Interface (1/10/40/100 GigE, IPoIB network) and JNI to the OSU design over Verbs on RDMA-capable networks (IB, iWARP, RoCE).]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014.

Page 18:

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)

[Charts: execution time (s) vs. data size (GB) for RandomWriter (80-160 GB) and TeraGen (80-240 GB), comparing IPoIB (EDR) and OSU-IB (EDR).]

Cluster with 8 nodes and a total of 64 maps

• RandomWriter: 3x improvement over IPoIB for 80-160 GB file sizes (execution time reduced by 3x)
• TeraGen: 4x improvement over IPoIB for 80-240 GB file sizes (execution time reduced by 4x)

Page 19:

Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)

[Charts: execution time (s) vs. data size (GB) for Sort (80-160 GB) and TeraSort (80-240 GB), comparing IPoIB (EDR) and OSU-IB (EDR).]

Cluster configurations: 8 nodes with a total of 64 maps and 32 reduces, and 8 nodes with 64 maps and 14 reduces.

• Sort: 61% improvement over IPoIB for 80-160 GB data (execution time reduced by 61%)
• TeraSort: 18% improvement over IPoIB for 80-240 GB data (execution time reduced by 18%)
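For reference, TeraGen/TeraSort (like RandomWriter/Sort) are the stock Hadoop example jobs, so runs like these can be reproduced on any Hadoop installation, RDMA-enabled or not. A sketch of launching an 80 GB TeraGen/TeraSort pair from Python; the examples-jar path is installation specific and assumed here:

```python
import subprocess

# Path to the bundled examples jar varies by distribution; this one is an assumption.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"

rows = 80 * 10**9 // 100   # TeraGen rows are 100 bytes each, so ~80 GB of input

subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                str(rows), "/benchmarks/terasort-in"], check=True)
subprocess.run(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                "/benchmarks/terasort-in", "/benchmarks/terasort-out"], check=True)
```

The IPoIB vs. OSU-IB comparison above keeps the job unchanged and swaps only the Hadoop package (and its transport) underneath.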

Page 20:

Design Overview of Spark with RDMA

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with a communication library written in native code
• Design features
  – RDMA-based shuffle plugin (job-level sketch below)
  – SEDA-based architecture
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support

[Figure: Apache Spark benchmarks/applications/libraries/frameworks on Spark Core; the shuffle manager (Sort, Hash, Tungsten-Sort) uses the block transfer service (Netty, NIO, or the RDMA plugin) with Netty/NIO/RDMA server and client components, over the Java Socket Interface (1/10/40/100 GigE, IPoIB) or JNI to the native RDMA-based communication engine on RDMA-capable networks (IB, iWARP, RoCE).]

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.

X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016.
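Because the RDMA plugin slots in behind Spark's block transfer service, job code is untouched; any shuffle-heavy job exercises the path shown above. A minimal PySpark sketch of such a job (the RDMA plugin's packaging and configuration keys are distribution specific and not shown):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(1_000_000)).map(lambda i: (i % 1000, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)       # map-side combine + shuffle
top = counts.sortBy(lambda kv: -kv[1]).take(10)      # a second shuffle for the global sort
print(top)

spark.stop()
```

With the plugin installed, the shuffle blocks produced by reduceByKey and sortBy are fetched through the RDMA server/client instead of Netty or NIO.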

Page 21:

Performance Evaluation on SDSC Comet – HiBench PageRank

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536 maps, 768/1536 reduces)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

[Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 worker nodes/768 cores and 64 worker nodes/1536 cores.]

Page 22:

Memcached Performance (FDR Interconnect)

[Charts: Memcached GET latency (us, log scale) vs. message size (1 byte to 4 KB) for OSU-IB (FDR); Memcached throughput (thousands of transactions per second) vs. number of clients (16 to 4080).]

• Memcached GET latency
  – 4 bytes: OSU-IB 2.84 us vs. IPoIB 75.53 us; 2K bytes: OSU-IB 4.49 us vs. IPoIB 123.42 us (latency reduced by nearly 20x)
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec vs. IPoIB 233 Kops/sec (nearly 2x improvement in throughput)

Experiments on TACC Stampede (Intel Sandy Bridge cluster, IB: FDR)

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11.

J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid '12.
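A simple client-side probe in the spirit of the OHB Memcached GET-latency micro-benchmark (this is an illustrative sketch, not the OSU benchmark; the server address is an assumption):

```python
import time
from pymemcache.client.base import Client

client = Client(("memcached-server", 11211))
client.set("k", b"xxxx")                              # 4-byte value, as in the plot

samples = []
for _ in range(10_000):
    t0 = time.perf_counter()
    client.get("k")
    samples.append((time.perf_counter() - t0) * 1e6)  # microseconds per GET

samples.sort()
print(f"median GET latency: {samples[len(samples) // 2]:.2f} us")
```

Running such a probe over IPoIB vs. an RDMA-enabled Memcached is what produces the latency gap shown above.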

Page 23:

Acceleration Case Studies and Performance Evaluation

• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• Big Data + HPC Cloud

Page 24:

Performance Evaluation on IB FDR + SATA/NVMe SSDs (Hybrid Memory)

• Memcached latency test with Zipf distribution; server with 1 GB memory, 32 KB key-value pair size; total data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives a 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives a 2x-2.5x improvement over IPoIB/RDMA-Mem

[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting and not fitting in memory; latency broken down into slab allocation (SSD write), cache check + load (SSD read), cache update, server response, client wait, and miss penalty.]

Page 25:

Performance Evaluation with Non-Blocking Memcached API

• Data does not fit in memory: non-blocking Memcached Set/Get API extensions (usage-pattern sketch below) can achieve
  – >16x latency improvement vs. the blocking API over RDMA-Hybrid/RDMA-Mem with miss penalty
  – >2.5x throughput improvement vs. the blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similarly to RDMA-Mem/RDMA-Hybrid and give a >3.6x improvement over IPoIB-Mem

[Chart: average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory).]

Legend: H = hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API with buffer re-use guarantee.
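The iset/iget and bset/bget calls above are C-level extensions to libmemcached in the OSU design, so they are not reproduced here. As a rough, runnable analogue of the issue/overlap/wait pattern they enable, the sketch below overlaps ordinary blocking pymemcache GETs with other work using a thread pool (pool size and server address are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
from pymemcache.client.base import PooledClient

client = PooledClient(("127.0.0.1", 11211), max_pool_size=8)

def fetch_batch(keys):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(client.get, k) for k in keys]   # "issue" phase: requests in flight
        overlapped = sum(i * i for i in range(100_000))        # useful work done while GETs proceed
        values = [f.result() for f in futures]                 # "wait" phase: collect results
    return overlapped, values

print(fetch_batch([f"key-{i}" for i in range(32)]))
```

The actual non-blocking extensions achieve this kind of overlap within a single client connection rather than via extra threads.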

Page 26:

Performance Evaluation with Boldio for Lustre + Burst-Buffer

• InfiniBand QDR, 24 GB RAM + PCIe SSDs, 12 nodes, 32/48 map/reduce tasks, 4-node Memcached cluster
• Boldio can improve
  – throughput over Lustre by about 3x for writes and 7x for reads
  – execution time of Hadoop benchmarks over Lustre (e.g., WordCount, CloudBurst) by >21%
• Contrasting with Alluxio (formerly Tachyon)
  – Performance degrades by about 15x when Alluxio cannot leverage local storage (Alluxio-Local vs. Alluxio-Remote)
  – Boldio improves throughput over Alluxio with all-remote workers by about 3.5x-8.8x (Alluxio-Remote vs. Boldio)

[Charts: Hadoop/Spark workload latency (sec) for WordCount, InvIndx, CloudBurst, and Spark TeraGen with Lustre-Direct, Alluxio-Remote, and Boldio; DFSIO aggregate throughput (MBps) for 20 GB and 40 GB writes and reads with Lustre-Direct, Alluxio-Local, Alluxio-Remote, and Boldio.]

D. Shankar, X. Lu, and D. K. Panda, Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O, IEEE Big Data 2016.

Page 27:

Accelerating Indexing Techniques on HBase with RDMA

• Challenges
  – Operations on a Distributed Ordered Table (DOT) with indexing techniques are network intensive
  – Additional overhead of creating and maintaining secondary indices
  – Can RDMA benefit indexing techniques (Apache Phoenix and CCIndex) on HBase? (Phoenix query sketch below)
• Results
  – Evaluation with Apache Phoenix and CCIndex
  – Up to 2x improvement in query throughput
  – Up to 35% reduction in application workload execution time

Collaboration with the Institute of Computing Technology, Chinese Academy of Sciences.

[Charts: TPC-H query benchmark throughput (Query1, Query2) for HBase, RDMA-HBase, HBase-Phoenix, RDMA-HBase-Phoenix, HBase-CCIndex, and RDMA-HBase-CCIndex (increased by 2x); Ad Master application workload execution time (Workload1-4) for HBase-Phoenix, RDMA-HBase-Phoenix, HBase-CCIndex, and RDMA-HBase-CCIndex (reduced by 35%).]

S. Gugnani, X. Lu, L. Zha, and D. K. Panda, Characterizing and Accelerating Indexing Techniques on Distributed Ordered Tables, IEEE BigData, 2017.
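To make the workload concrete, the Phoenix side of this comparison is ordinary SQL over HBase, typically issued through the Phoenix Query Server. A small sketch using the phoenixdb Python client; the server URL, table, and column names are assumptions for illustration:

```python
import phoenixdb

conn = phoenixdb.connect("http://phoenix-qs:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS ORDERS ("
            "ORDER_ID VARCHAR PRIMARY KEY, CUSTOMER_ID VARCHAR, TOTAL DECIMAL)")

# Secondary index on a non-rowkey column: lookups by CUSTOMER_ID can be served
# from the index table instead of scanning the whole ordered table.
cur.execute("CREATE INDEX IF NOT EXISTS ORDERS_BY_CUSTOMER "
            "ON ORDERS (CUSTOMER_ID) INCLUDE (TOTAL)")

cur.execute("SELECT ORDER_ID, TOTAL FROM ORDERS WHERE CUSTOMER_ID = ?", ["C1001"])
print(cur.fetchall())
```

Both index maintenance on writes and index lookups on reads translate into extra HBase RPCs, which is the network-intensive traffic the RDMA-HBase design targets.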

Page 28:

Overview of RDMA-gRPC with TensorFlow

• Worker services communicate with each other using RDMA-gRPC (process-startup sketch below)

[Figure: a TensorFlow client and master coordinating two workers (/job:PS/task:0 and /job:Worker/task:0), each with CPU and GPU devices and an RDMA-gRPC server/client for tensor exchange.]
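In stock TensorFlow 1.x, the transport between these tasks is chosen when each process starts its server; the RDMA-gRPC design keeps the gRPC-based interface and replaces the channel underneath, so graph and session code are unchanged. A minimal process-startup sketch for the PS/worker layout above (hostnames are assumptions; "grpc" is the default protocol, "grpc+verbs" is the in-tree verbs channel, and the OSU RDMA-gRPC runtime is assumed to plug in behind the same interface):

```python
import tensorflow as tf   # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Start this process as worker 0; `protocol` selects the gRPC transport variant.
server = tf.train.Server(cluster, job_name="worker", task_index=0, protocol="grpc")
server.join()   # serve graph ops/tensors to the other tasks in the cluster
```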

Page 29:

Performance Benefit for TensorFlow

• TensorFlow performance evaluation on RI2
  – Up to 19% performance speedup over IPoIB for sigmoid net (20 epochs)
  – Up to 35% and 30% performance speedup over IPoIB for resnet50 and inception3 (batch size 8)

[Charts: sigmoid net training time (seconds) on 2 nodes (2 GPUs), and CNN throughput (images/second) for resnet50 and inception3 on 2 nodes, comparing default gRPC, gRPC + Verbs, gRPC + MPI, and OSU RDMA-gRPC.]

R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, under review.

Page 30:

High-Performance Deep Learning over Big Data (DLoBD) Stacks

• Challenges of Deep Learning over Big Data (DLoBD)
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve a 2.6x speedup over the IPoIB-based scheme while maintaining similar accuracy

[Chart: epoch time (secs) and accuracy (%) over epochs 1-18 for IPoIB vs. RDMA, showing the 2.6x speedup.]

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.

Page 31:

Acceleration Case Studies and Performance Evaluation

• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• Big Data + HPC Cloud

Page 32:

Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand

• Challenges
  – Existing designs in Hadoop are not virtualization-aware
  – No support for automatic topology detection
• Design
  – Automatic topology detection using a MapReduce-based utility
    • Requires no user input
    • Can detect topology changes at runtime without affecting running jobs
  – Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions

[Charts: Hadoop benchmark execution time (Sort, WordCount, PageRank at 40 GB and 60 GB) and Hadoop application execution time (CloudBurst, Self-join in default and distributed modes) for RDMA-Hadoop vs. Hadoop-Virt; execution time reduced by up to 55% and 34%.]

S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom '16, December 2016.

Page 33:

Concluding Remarks

• Discussed challenges in accelerating Big Data middleware with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase, Memcached, Spark, gRPC, and TensorFlow
• Results are promising
• Many other open issues still need to be solved
• These designs will enable the Big Data community to take advantage of modern HPC technologies to carry out analytics in a fast and scalable manner
• Looking forward to collaboration with the community

Page 34:

OSU Participating in Multiple Events on Big Data Acceleration

• Tutorial
  – Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management (Sunday, 1:30-5:00 pm, Room #201)
• BoFs
  – Big Data and Deep Learning (Tuesday, 5:15-6:45 pm, Room #702)
  – SIGHPC Big Data BoF (Wednesday, 12:15-1:15 pm, Room #603)
  – Clouds for HPC, Big Data, and Deep Learning (Wednesday, 5:15-7:00 pm, Room #701)
• Booth talks
  – OSU Booth (Tuesday, 10:00-11:00 am, Booth #1875)
  – Mellanox Theater (Wednesday, 3:00-3:30 pm, Booth #653)
  – OSU Booth (Thursday, 1:00-2:00 pm, Booth #1875)
• Student poster presentation
  – Accelerating Big Data Processing in Cloud (Tuesday, 5:15-7:00 pm, Four Seasons Ballroom)
• Details at http://hibd.cse.ohio-state.edu

Page 35:

Funding Acknowledgments

• Funding support by: [sponsor logos]
• Equipment support by: [vendor logos]

Page 36:

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Research Scientists: X. Lu, H. Subramoni

Past Research Scientists: K. Hamidouche, S. Sur

Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Current Post-Doc: A. Ruhela

Past Programmers: D. Bureddy, J. Perkins

Current Research Specialists: J. Smith, M. Arnold

Page 37:

Thank You!

{panda, luxi}@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/