System Software for Big Data Computing
Cho-Li Wang, The University of Hong Kong


Page 1: System Software for Big Data Computing

System Software for Big Data Computing
Cho-Li Wang, The University of Hong Kong

Page 2: System Software for Big Data Computing

CS Gideon-II & CC MDRP Clusters

HKU High-Performance Computing Lab.
Total # of cores: 3004 CPU + 5376 GPU cores
RAM size: 8.34 TB
Disk storage: 130 TB
Peak computing power: 27.05 TFlops

[Chart: aggregate computing power over time: 2.6 TFlops (2007.7), 3.1 TFlops (2009), 20 TFlops (2010), 31.45 TFlops (2011.1), a 12x increase in 3.5 years]

GPU-Cluster (Nvidia M2050, “Tianhe-1a”): 7.62 TFlops

Page 3: System Software for Big Data Computing

Big Data: The "3Vs" Model
• High Volume (amount of data)
• High Velocity (speed of data in and out)
• High Variety (range of data types and sources)

2.5 x 10^18

2010: 800,000 petabytes (would fill a stack of DVDs reaching from the Earth to the Moon and back)

By 2020, that pile of DVDs would stretch halfway to Mars.

Page 4: System Software for Big Data Computing

Our Research

Page 5: System Software for Big Data Computing

(1) Heterogeneous Manycore Computing (CPUs + GPUs)

JAPONICA : Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture

Page 6: System Software for Big Data Computing

Heterogeneous Manycore Architecture

[Diagram: CPUs and a GPU in a heterogeneous manycore node]

Page 7: System Software for Big Data Computing

New GPU & Coprocessors

(Columns: vendor, model, launch date, fab. (nm), # accelerator cores (max.), GPU clock (MHz), TDP (W), memory, bandwidth (GB/s), programming model, remarks)

Intel Sandy Bridge: 2011Q1, 32 nm, 12 HD Graphics 3000 EUs (8 threads/EU), 850-1350 MHz, 95 W, L3: 8 MB + system memory (DDR3), 21 GB/s, OpenCL. Bandwidth is the system DDR3 memory bandwidth.

Intel Ivy Bridge: 2012Q2, 22 nm, 16 HD Graphics 4000 EUs (8 threads/EU), 650-1150 MHz, 77 W, L3: 8 MB + system memory (DDR3), 25.6 GB/s, OpenCL. Bandwidth is the system DDR3 memory bandwidth.

Intel Xeon Phi: 2012H2, 22 nm, 60 x86 cores (each with a 512-bit vector unit), 600-1100 MHz, 300 W, 8 GB GDDR5, 320 GB/s, OpenMP, OpenCL, OpenACC. Less sensitive to branch-divergent workloads.

AMD Brazos 2.0: 2012Q2, 40 nm, 80 Evergreen shader cores, 488-680 MHz, 18 W, L2: 1 MB + system memory (DDR3), 21 GB/s, OpenCL, C++ AMP.

AMD Trinity: 2012Q2, 32 nm, 128-384 Northern Islands cores, 723-800 MHz, 17-100 W, L2: 4 MB + system memory (DDR3), 25 GB/s. APU.

Nvidia Fermi: 2010Q1, 40 nm, 512 CUDA cores (16 SMs), 1300 MHz, 238 W, L1: 48 KB, L2: 768 KB, 6 GB, 148 GB/s, CUDA, OpenCL, OpenACC.

Nvidia Kepler (GK110): 2012Q4, 28 nm, 2880 CUDA cores, 836/876 MHz, 300 W, 6 GB GDDR5, 288.5 GB/s. 3x Perf/Watt, Dynamic Parallelism, Hyper-Q.

Page 8: System Software for Big Data Computing

#1 in Top500 (11/2012): Titan @ Oak Ridge National Lab.
• 18,688 AMD Opteron 6274 16-core CPUs (32 GB DDR3)
• 18,688 Nvidia Tesla K20X GPUs
• Total RAM size: over 710 TB
• Total storage: 10 PB
• Peak performance: 27 Petaflop/s
  o GPU : CPU = 1.311 TF/s : 0.141 TF/s = 9.3 : 1
• Linpack: 17.59 Petaflop/s
• Power consumption: 8.2 MW

NVIDIA Tesla K20X (Kepler GK110) GPU: 2688 CUDA cores

Titan compute board: 4 AMD Opteron + 4 NVIDIA Tesla K20X GPUs

Page 9: System Software for Big Data Computing

Design Challenge: GPU Can’t Handle Dynamic Loops

Static loops:
for (i = 0; i < N; i++) { C[i] = A[i] + B[i]; }

Dynamic loops:
for (i = 0; i < N; i++) { A[ w[i] ] = 3 * A[ r[i] ]; }

GPU = SIMD/Vector. Data dependency issues (RAW, WAW). Solutions?

Non-deterministic data dependencies inhibit exploitation of inherent parallelism; only DO-ALL loops or embarrassingly parallel workloads are admitted to GPUs (see the dependence test sketched below).
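The admission test can be made concrete. The following is a minimal, hypothetical Java sketch (not JAPONICA's actual code; the class name DoAllTest is made up) that inspects the subscript arrays w[] and r[] of the dynamic loop A[w[i]] = 3 * A[r[i]] at run time: if no iteration writes a location that a different iteration reads or writes, the loop is effectively DO-ALL and safe to offload to the GPU.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical runtime test: decide whether the dynamic loop
//     for (i = 0; i < N; i++) A[w[i]] = 3 * A[r[i]];
// carries any cross-iteration dependence (RAW, WAR, or WAW).
public final class DoAllTest {

    // Returns true if no two iterations conflict, i.e. the loop is DO-ALL.
    static boolean isDoAll(int[] w, int[] r) {
        Map<Integer, Integer> writerOf = new HashMap<>(); // location -> iteration that writes it
        for (int i = 0; i < w.length; i++) {
            if (writerOf.put(w[i], i) != null) {
                return false;                 // WAW: two iterations write the same location
            }
        }
        for (int i = 0; i < r.length; i++) {
            Integer writer = writerOf.get(r[i]);
            if (writer != null && writer != i) {
                return false;                 // RAW or WAR: read conflicts with another iteration's write
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] w1 = {0, 1, 2, 3}, r1 = {4, 5, 6, 7};  // disjoint accesses: DO-ALL
        int[] w2 = {1, 2, 3, 4}, r2 = {0, 1, 2, 3};  // iteration i reads what i-1 writes: not DO-ALL
        System.out.println(isDoAll(w1, r1));         // true  -> admit to the GPU
        System.out.println(isDoAll(w2, r2));         // false -> needs speculation or sequential execution
    }
}
```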

Page 10: System Software for Big Data Computing

Dynamic loops are common in scientific and engineering applications

Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies"

Page 11: System Software for Big Data Computing

GPU-TLS: Thread-Level Speculation on GPU
• Incremental parallelization
  o sliding-window style execution
• Efficient dependency checking schemes
• Deferred update
  o Speculative updates are stored in the write buffer of each thread until commit time (illustrated in the sketch below).
• 3 phases of execution

[Diagram: dependence classes under the GPU's lock-step execution within a warp (32 threads per warp): intra-thread RAW, inter-thread RAW that is valid on the GPU, and true inter-thread RAW]
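GPU-TLS itself runs as CUDA kernels; as a language-neutral illustration of the deferred-update idea, the hypothetical Java sketch below executes all iterations of A[w[i]] = 3 * A[r[i]] speculatively against a read snapshot, buffers every write, then commits in iteration order, re-executing only the iterations invalidated by a true inter-thread RAW. It is a sketch of the concept, not the actual GPU-TLS code.

```java
import java.util.stream.IntStream;

// Illustrative simulation of speculative execution with deferred updates
// for the loop:  for (i) A[w[i]] = 3 * A[r[i]];
public final class TlsSketch {

    static void speculativeLoop(int[] A, int[] w, int[] r) {
        int n = w.length;
        int[] snapshot = A.clone();          // reads served from a snapshot (deferred update)
        int[] writeBuf = new int[n];         // per-"thread" write buffer

        // Phase 1: speculative parallel execution.
        IntStream.range(0, n).parallel()
                 .forEach(i -> writeBuf[i] = 3 * snapshot[r[i]]);

        // Phase 2: dependency check. Iteration i is misspeculated if an earlier
        // iteration j (< i) writes the location that i read (true inter-thread RAW).
        boolean[] violated = new boolean[n];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < i; j++) {
                if (w[j] == r[i]) { violated[i] = true; break; }
            }
        });

        // Phase 3: in-order commit; misspeculated iterations re-execute with
        // up-to-date values, which restores sequential semantics.
        for (int i = 0; i < n; i++) {
            A[w[i]] = violated[i] ? 3 * A[r[i]] : writeBuf[i];
        }
    }

    public static void main(String[] args) {
        int[] A = {1, 2, 3, 4, 5};
        int[] r = {0, 1, 2, 3, 4};
        int[] w = {1, 2, 3, 4, 0};           // iteration i reads what iteration i-1 wrote
        speculativeLoop(A, w, r);
        // Prints [243, 3, 9, 27, 81], identical to the sequential result.
        System.out.println(java.util.Arrays.toString(A));
    }
}
```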

Page 12: System Software for Big Data Computing

JAPONICA: Profile-Guided Work Dispatching

[Diagram: dynamic profiling measures dependency density and feeds a scheduler that dispatches work across the 8 high-speed x86 cores (multi-core CPU), the 64 x86 cores (many-core coprocessor, parallel), and the 2880 GPU cores (massively parallel), according to whether the dependency density is high, medium, or low/none]

Inter-iteration dependences:
-- Read-After-Write (RAW)
-- Write-After-Read (WAR)
-- Write-After-Write (WAW)
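The slides do not define dependency density formally. One plausible reading, used in the hypothetical sketch below (class name and sampling scheme are assumptions, not JAPONICA's), is the fraction of profiled iterations that participate in at least one inter-iteration RAW, WAR, or WAW dependence, estimated from the recorded subscript arrays.

```java
// Hypothetical dependency-density estimator for the loop A[w[i]] = 3 * A[r[i]]:
// DD = (# sampled iterations involved in at least one cross-iteration
//       RAW/WAR/WAW dependence) / (# sampled iterations).
public final class DependencyDensity {

    static double estimate(int[] w, int[] r, int sample) {
        int n = Math.min(sample, w.length);   // profile only a window of iterations
        boolean[] dep = new boolean[n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean raw = w[i] == r[j];   // j reads what i wrote
                boolean war = r[i] == w[j];   // j overwrites what i read
                boolean waw = w[i] == w[j];   // both write the same location
                if (raw || war || waw) { dep[i] = true; dep[j] = true; }
            }
        }
        int involved = 0;
        for (boolean d : dep) if (d) involved++;
        return (double) involved / n;
    }

    public static void main(String[] args) {
        int[] w = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] r = {0, 1, 9, 9, 9, 9, 9, 9};  // only iterations 0 and 1 conflict
        System.out.printf("DD = %.2f%n", estimate(w, r, w.length));  // DD = 0.25
    }
}
```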

Page 13: System Software for Big Data Computing

JAPONICA: System Architecture

[Architecture diagram: sequential Java code with user annotation goes through JavaR static dependence analysis and code translation. Loops with uncertain dependences are handed to a profiler (on GPU) that builds the Program Dependence Graph (PDG) of one loop and performs dependency density analysis with intra-warp and inter-warp dependence checks. Based on the profiling results (no dependence, RAW, WAW/WAR), a DO-ALL parallelizer or speculator emits CUDA kernels with GPU-TLS or privatization plus a CPU single thread, or CUDA kernels plus CPU multi-threads. A task scheduler performs CPU-GPU co-scheduling with task sharing and task stealing over the CPU-GPU communication layer, assigning the tasks among CPU and GPU according to their dependency density (DD):
  0 (no dependence): CPU multi-threads + GPU
  Low DD: CPU + GPU-TLS
  High DD: CPU single core
  CPU queue: low, high, 0; GPU queue: low, 0]
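As one reading of the co-scheduling policy above, the hypothetical dispatcher below maps a measured dependency density (DD) to an execution mode. The threshold value is an illustrative placeholder, not a figure from JAPONICA.

```java
// Hypothetical CPU-GPU co-scheduling decision based on dependency density (DD),
// following the policy on the slide: 0 -> CPU multi-threads + GPU (DO-ALL),
// low DD -> CPU + GPU-TLS, high DD -> CPU single core.
public final class CoScheduler {

    enum Mode { CPU_MULTITHREADS_PLUS_GPU, CPU_PLUS_GPU_TLS, CPU_SINGLE_CORE }

    // Illustrative placeholder threshold, not a value from the project.
    static final double LOW_DD_THRESHOLD = 0.3;

    static Mode dispatch(double dd) {
        if (dd == 0.0)             return Mode.CPU_MULTITHREADS_PLUS_GPU; // no dependence
        if (dd < LOW_DD_THRESHOLD) return Mode.CPU_PLUS_GPU_TLS;          // low DD: speculate on GPU
        return Mode.CPU_SINGLE_CORE;                                      // high DD: stay sequential
    }

    public static void main(String[] args) {
        System.out.println(dispatch(0.0));   // CPU_MULTITHREADS_PLUS_GPU
        System.out.println(dispatch(0.1));   // CPU_PLUS_GPU_TLS
        System.out.println(dispatch(0.8));   // CPU_SINGLE_CORE
    }
}
```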

Page 14: System Software for Big Data Computing

(2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES

Page 15: System Software for Big Data Computing

“General Purpose” Manycore

Tile-based architecture: cores are connected through a 2D network-on-chip

Page 16: System Software for Big Data Computing

Crocodiles (鳄鱼) @ HKU (01/2013-12/2015)
• Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES for future 1000-core tiled processors

[Diagram: a tiled manycore chip partitioned into ZONE 1-4, with memory/DRAM controllers, PCI-E, and GbE interfaces at the periphery and RAM attached to the memory controllers]

Page 17: System Software for Big Data Computing

• Dynamic Zoning
  o Multi-tenant cloud architecture: the partitioning varies over time, mimicking a "data center on a chip"
  o Performance isolation
  o On-demand scaling
  o Power efficiency (high flops/watt)

Page 18: System Software for Big Data Computing

Design Challenge: “Off-chip Memory Wall” Problem

– DRAM performance (latency) has improved only slowly over the past 40 years.

[Figures: (a) the gap between DRAM density and speed; (b) DRAM latency not improved]

Memory density has doubled nearly every two years, while performance has improved slowly (e.g. still 100+ core clock cycles per memory access).
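To see where the "100+ core clock cycles" figure comes from, assume (purely for illustration) a 3 GHz core and a 60 ns end-to-end DRAM access:

$\text{stall} \approx t_{\mathrm{DRAM}} \times f_{\mathrm{core}} = 60\,\mathrm{ns} \times 3\,\mathrm{GHz} = 180\ \text{cycles}$

so every cache miss that goes off-chip costs on the order of one to two hundred cycles of lost work.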

Page 19: System Software for Big Data Computing

Lock Contention in Multicore Systems

• Physical memory allocation performance, sorted by function: as more cores are added, more processing time is spent contending for locks.

[Figure: lock contention dominating kernel time as the core count grows]
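The effect is easy to reproduce at user level. The hypothetical Java microbenchmark below is not the kernel's memory allocator; it simply contrasts many threads incrementing one counter behind a single lock with the same work split across per-thread cells (LongAdder), which is the kind of serialization the kernel profile above shows.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

// Illustrative user-level analogue of kernel lock contention: one global lock
// vs. per-thread (sharded) state. With many cores, the single lock dominates.
public final class LockContentionDemo {

    static long timedRunMillis(int threads, Runnable perThreadWork) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> { perThreadWork.run(); done.countDown(); });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        final int threads = Runtime.getRuntime().availableProcessors();
        final int iters = 2_000_000;

        // (a) One shared counter guarded by one lock: every increment contends.
        final Object lock = new Object();
        final long[] shared = {0};
        long contended = timedRunMillis(threads, () -> {
            for (int i = 0; i < iters; i++) { synchronized (lock) { shared[0]++; } }
        });

        // (b) Sharded counters: little contention, scales with core count.
        final LongAdder adder = new LongAdder();
        long sharded = timedRunMillis(threads, () -> {
            for (int i = 0; i < iters; i++) { adder.increment(); }
        });

        System.out.println("global lock : " + contended + " ms, total = " + shared[0]);
        System.out.println("sharded     : " + sharded + " ms, total = " + adder.sum());
    }
}
```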

Page 20: System Software for Big Data Computing

[Figure: Exim on Linux, kernel CPU time (milliseconds/message) versus core count, showing the point of collapse]

Page 21: System Software for Big Data Computing

Challenges and Potential Solutions

• Cache-aware design
  o Data locality / working set becoming critical
  o Compiler or runtime techniques to improve data reuse
• Stop multitasking
  o Context switching breaks data locality
  o Time sharing -> space sharing

Macedonian Phalanx (马其顿方阵) manycore operating system: a next-generation operating system for 1000-core processors

Page 22: System Software for Big Data Computing

Thanks!

C.L. Wang’s webpage: http://www.cs.hku.hk/~clwang/

For more information:

Page 23: System Software for Big Data Computing

http://i.cs.hku.hk/~clwang/recruit2012.htm

Page 24: System Software for Big Data Computing

Multi-granularity Computation Migration

[Diagram: WAVNet Desktop Cloud, G-JavaMPI, JESSICA2, and SOD positioned along two axes: migration granularity (fine to coarse) and system scale / size of migrated state (small to large)]

Page 25: System Software for Big Data Computing

WAVNet: Live VM Migration over WAN

A P2P cloud with live VM migration over WAN: a “virtualized LAN” over the Internet
• High penetration via NAT hole punching: establishes direct host-to-host connections, free from proxies, able to traverse most NATs

Key members: Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, Cho-Li Wang

Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, "WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud," 2011 International Conference on Parallel Processing (ICPP2011)
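WAVNet's NAT traversal lives inside its virtualization layer; as a minimal, hypothetical illustration of UDP hole punching (not WAVNet code), the sketch below assumes each peer has already learned the other's public IP:port from a rendezvous server and then keeps sending datagrams directly so that both NATs open a mapping.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal UDP hole-punching sketch (illustrative only).
// Assumes both peers already exchanged their public endpoints via a rendezvous
// server. Run on each peer:  java HolePunch <localPort> <peerHost> <peerPort>
public final class HolePunch {
    public static void main(String[] args) throws Exception {
        int localPort = Integer.parseInt(args[0]);
        InetSocketAddress peer = new InetSocketAddress(args[1], Integer.parseInt(args[2]));

        try (DatagramSocket sock = new DatagramSocket(localPort)) {
            sock.setSoTimeout(1000);
            byte[] hello = "hello".getBytes(StandardCharsets.UTF_8);

            for (int attempt = 0; attempt < 10; attempt++) {
                // The outbound packet creates/refreshes the NAT mapping ("punches the hole").
                sock.send(new DatagramPacket(hello, hello.length, peer));
                try {
                    byte[] buf = new byte[1500];
                    DatagramPacket in = new DatagramPacket(buf, buf.length);
                    sock.receive(in);   // once both sides have sent, packets get through
                    System.out.println("direct connection established with " + in.getSocketAddress());
                    return;
                } catch (java.net.SocketTimeoutException retry) {
                    // Peer's packet not here yet; keep punching.
                }
            }
            System.out.println("failed to traverse NAT (may need a relay)");
        }
    }
}
```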

Page 26: System Software for Big Data Computing

WAVNet: Experiments at Pacific Rim Areas
• Institute of High Energy Physics (IHEP), Beijing
• Shenzhen Institutes of Advanced Technology (SIAT)
• The University of Hong Kong (HKU)
• Academia Sinica, Taiwan
• Providence University
• SDSC, San Diego
• AIST, Japan

Page 27: System Software for Big Data Computing

JESSICA2: Distributed Java Virtual Machine
(Java-Enabled Single-System-Image Computing Architecture)

[Diagram: a multithreaded Java program runs across a cluster of JESSICA2 JVMs (one master, several workers) in JIT compiler mode; threads migrate between nodes (thread migration) carrying portable Java frames]

Page 28: System Software for Big Data Computing

History and Roadmap of JESSICA Project

• JESSICA V1.0 (1996-1999)
  – Execution mode: interpreter mode
  – JVM kernel modification (Kaffe JVM)
  – Global heap: built on top of TreadMarks (lazy release consistency + homeless)
• JESSICA V2.0 (2000-2006)
  – Execution mode: JIT-compiler mode
  – JVM kernel modification
  – Lazy release consistency + migrating-home protocol
• JESSICA V3.0 (2008-2010)
  – Built above the JVM (via JVMTI)
  – Support for Large Object Space
• JESSICA V4 (2010-)
  – Japonica: automatic loop parallelization and speculative execution on GPU and multi-core CPU
  – TrC-DC: a software transactional memory system on clusters with distributed clocks (not discussed)

Past members: King Tin Lam, Ricky Ma, Kinson Chan, Chenggang Zhang

J1 and J2 received a total of 1107 source code downloads.

Page 29: System Software for Big Data Computing

Stack-on-Demand (SOD)

[Diagram: a mobile node ships its top stack frame (frame A, with local variables and program counter) to a cloud node; the cloud node rebuilds the stack frame, method area, and heap area, fetching (or pre-fetching) the required objects on demand, while the rest of the stack (frame B) remains on the mobile node]

Page 30: System Software for Big Data Computing

Elastic Execution Model via SOD

(a) “Remote method call”
(b) Mimic thread migration
(c) “Task roaming”: like a mobile agent roaming over the network or a workflow

With such flexible and composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage): a Big Data solution!

Lightweight, portable, adaptable
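As a language-level intuition for stack-on-demand (a hypothetical sketch, not the JESSICA/SOD implementation), the code below packages the top "stack frame" of a task as a serializable object (method id, resume point, local variables), ships its bytes to another node, and resumes the method there from the captured state; object (pre-)fetching and the real JVM frame layout are omitted.

```java
import java.io.*;
import java.util.Map;

// Hypothetical sketch of the stack-on-demand idea: capture the top frame's
// state, serialize it, and resume the method on another node from that state.
public final class SodSketch {

    // A portable stand-in for a JVM stack frame.
    static final class Frame implements Serializable {
        final String method;               // which method to resume
        final int resumePoint;             // "program counter" at a safe point
        final Map<String, Object> locals;  // local variables
        Frame(String method, int resumePoint, Map<String, Object> locals) {
            this.method = method; this.resumePoint = resumePoint; this.locals = locals;
        }
    }

    static byte[] migrateOut(Frame f) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) { out.writeObject(f); }
        return bos.toByteArray();          // in reality: sent over the network to the cloud node
    }

    static Object resumeOnCloudNode(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Frame f = (Frame) in.readObject();
            if (f.method.equals("sum") && f.resumePoint == 1) {
                // Rebuild the frame and continue from the captured state.
                int partial = (Integer) f.locals.get("partial");
                int upTo = (Integer) f.locals.get("upTo");
                int from = (Integer) f.locals.get("i");
                for (int i = from; i <= upTo; i++) partial += i;
                return partial;
            }
            throw new IllegalStateException("unknown resume point");
        }
    }

    public static void main(String[] args) throws Exception {
        // The mobile node has summed 1+2+3 locally, then offloads the rest of 1..100.
        Frame top = new Frame("sum", 1, Map.of("partial", 6, "upTo", 100, "i", 4));
        System.out.println(resumeOnCloudNode(migrateOut(top)));   // prints 5050
    }
}
```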

Page 31: System Software for Big Data Computing

eXCloud: Integrated Solution for Multi-granularity Migration

[Diagram: a mobile client (iOS) with a small-footprint JVM (stack segments, partial heap, method area, code) offloads work via stack-on-demand (SOD) to a desktop PC running a multi-thread Java process (stacks, heap, method area, code); when the desktop PC is overloaded, threads migrate (JESSICA2) over the Internet; at the cloud service provider, load balancers trigger live migration of Xen VMs (JVM + guest OS on a Xen-aware host OS) and duplicate VM instances for scaling]

Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011). (Best Paper Award)