TRANSCRIPT

Understanding Big Data Workloads on Modern Processors using BigDataBench

Jianfeng Zhan
Professor, ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences
Institute of Computing Technology
http://prof.ict.ac.cn/BigDataBench

HPBDC 2015, Ohio, USA
Outline

• BigDataBench Overview
• Workload characterization
• Multi-tenancy version
• Processors evaluation
BigDataBench HPBDC 2015
What is BigDataBench?

• An open-source big data benchmarking project
  - http://prof.ict.ac.cn/BigDataBench
  - Search Google using "BigDataBench"
BigDataBench Details

Methodology
• Five application domains
• Propose benchmark specifications for each domain

Implementation
• 14 real-world data sets & 3 kinds of big data generators
• 33 big data workloads with diverse implementations

Specific-purpose versions
• BigDataBench subset version
Five Application Domains

• Internet services: Search Engine, Social Network, E-commerce, Media Streaming
  - Search engine, social network, and e-commerce take up 80% of internet services, ranked by page views and daily visitors of the top 20 websites
• Multimedia
• Bioinformatics

[Figure: share of internet service domains among the top 20 websites (Search Engine, Social Network, E-commerce, Media Streaming, Others); data-growth examples such as new videos on YouTube, new photos on Flickr, hours of music streamed on Pandora, minutes of voice calls on Skype, and video feeds from surveillance cameras every minute; and growth of the DDBJ/EMBL/GenBank database (entries in millions, nucleotides in billions)]

Sources:
http://www.alexa.com/topsites/global
http://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-DKB-v2.pdf
http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#dbgrowth-graph
Benchmark Specification

• Guidelines for BigDataBench implementation: data model → workloads
  - Describe the data model
  - Model typical application scenarios
  - Extract important workloads
BigDataBench Details

Methodology
• Five application domains
• Benchmark specification for each domain

Implementation
• 14 real-world data sets & 3 kinds of big data generators
• 33 big data workloads with diverse implementations

Specific-purpose versions
• BigDataBench subset version
BigDataBench Summary

• BDGS (Big Data Generator Suite) for scalable data
• 14 real-world data sets: Wikipedia Entries, Google Web Graph, Facebook Social Network, E-commerce Transaction, Amazon Movie Reviews, ProfSearch Resumes, SoGou Data, ImageNet, Image scene, MNIST, English broadcasting audio, DVD Input Streams, Genome sequence data, Assembly of the human genome
• 33 workloads across five domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics
• Diverse software stacks: Hadoop, Shark, Impala, NoSQL, MPI, DataMPI, Hadoop-RDMA
Big Data Generator Tool

• 3 kinds of big data generators: Text/Graph/Table generators
• Preserving the original characteristics of real data
BigDataBench Details

Methodology
• Five application domains
• Benchmark specification for each domain

Implementation
• 14 real-world data sets & 3 kinds of big data generators
• 33 big data workloads with diverse implementations

Specific-purpose versions
• BigDataBench subset version
BigDataBench Subset

• Motivation
  - Expensive to run all the benchmarks for system and architecture research
  - Multiplied by different implementations, BigDataBench 3.0 provides about 77 workloads
• Method: identify workload characteristics from a specific perspective → eliminate the correlated data via dimension reduction (PCA) → clustering (K-means) → subset
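The subsetting pipeline above (characteristics → PCA → K-means → subset) can be sketched as follows. This is a minimal illustration, not BigDataBench's actual tooling: the 20×6 metric matrix, the number of principal components, and the cluster count k are invented for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project standardized metrics onto the top principal components."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize each metric
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues ascending
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X @ top

def kmeans_subset(X, k, iters=100, seed=0):
    """Naive k-means; the subset is the member closest to each centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    reps = [int(np.argmin(((X - centroids[j]) ** 2).sum(-1))) for j in range(k)]
    return sorted(set(reps))

# Toy matrix: rows = workloads, columns = microarchitectural metrics.
rng = np.random.default_rng(1)
metrics = rng.normal(size=(20, 6))
subset = kmeans_subset(pca(metrics, n_components=2), k=4)
print(subset)  # indices of the representative workloads, one per cluster
```

The representative of each cluster (the workload nearest its centroid) is what ends up in the subset, which is how a suite of ~77 runnable workloads can be reduced to a handful of proxies.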
Why BigDataBench?

Benchmark | Specification | Application domains | Workload types | Workloads | Scalable data sets (from real data) | Multiple implementations | Multi-tenancy version | Subsets | Simulator version
BigDataBench | Y | Five | Four[1] | 33 | 8 | Y | Y | Y | Y
BigBench | Y | One | Three | 10 | 3 | N | N | N | N
CloudSuite | N | N/A | Two | 8 | 3 | N | N | N | Y
HiBench | N | N/A | Two | 10 | 3 | N | N | N | N
CALDA | Y | N/A | One | 5 | N/A | Y | N | N | N
YCSB | Y | N/A | One | 6 | N/A | Y | N | N | N
LinkBench | Y | N/A | One | 10 | N/A | Y | N | N | N
AMP Benchmarks | Y | N/A | One | 4 | N/A | Y | N | N | N

[1] The four workload types are Offline Analytics, Cloud OLTP, Interactive Analytics, and Online Service.
BigDataBench Users

• http://prof.ict.ac.cn/BigDataBench/users/
• Industry users: Accenture, Broadcom, Samsung, Huawei, IBM
• China's first industry-standard big data benchmark suite
  - http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
• About 20 academic groups have published papers using BigDataBench
BigDataBench Publications

• BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium on High Performance Computer Architecture (HPCA 2014)
• Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013) (Best Paper Award)
• BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014)
• BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth Workshop on Big Data Benchmarking (WBDB 2014)
• Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint arXiv:1505.06872
• BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. arXiv preprint arXiv:1504.02205
Outline

• BigDataBench Overview
• Workload characterization
• Multi-tenancy version
• Processors evaluation
System Behaviors

• Diversified system-level behaviors:
  - High CPU utilization & less I/O time
  - Relatively low CPU utilization & lots of I/O time
  - Medium CPU utilization & I/O time

[Figure: CPU utilization and I/O wait ratio (percentage, 0-100%) per workload: H-Grep, S-Kmeans, S-PageRank, H-WordCount, H-Bayes, M-Bayes, M-Kmeans, M-PageRank, H-Read, H-Difference, I-SelectQuery, S-WordCount, S-Project, S-OrderBy, S-Grep, M-Grep, H-TPC-DS, I-OrderBy, S-TPC-DS, S-Sort, M-WordCount, M-Sort, and the AVG_S_BigData average]

[Figure: the average weighted disk I/O time ratio per workload (log scale, 0.01 to 100)]
Workloads Classification

• From the perspective of system behaviors:
  - System behaviors vary across different workloads
  - Workloads are divided into 3 categories:

Type | Workloads
CPU Intensive | H-Grep, S-Kmeans, S-PageRank, H-WordCount, H-Bayes, M-Bayes, M-Kmeans and M-PageRank
I/O Intensive | H-Read, H-Difference, I-SelectQuery, S-WordCount, S-Project, S-OrderBy, M-Grep and S-Grep
Hybrid | H-TPC-DS-query3, I-OrderBy, S-TPC-DS-query10, S-TPC-DS-query8, S-Sort, M-WordCount and M-Sort
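As a rough illustration of this three-way split, a classifier over the two measured signals (CPU utilization and weighted disk I/O time ratio) might look like the sketch below; the threshold values are assumptions chosen for the example, not the ones used in the study.

```python
# Illustrative classifier for the three workload categories above.
# cpu_high and io_high are hypothetical cut-offs, not the paper's values.
def classify(cpu_util, io_time_ratio, cpu_high=0.80, io_high=1.0):
    if cpu_util >= cpu_high and io_time_ratio < io_high:
        return "CPU Intensive"   # busy cores, little time waiting on disk
    if cpu_util < cpu_high and io_time_ratio >= io_high:
        return "I/O Intensive"   # idle cores, lots of time waiting on disk
    return "Hybrid"              # medium CPU utilization and I/O time

print(classify(0.90, 0.1))  # CPU Intensive
print(classify(0.40, 5.0))  # I/O Intensive
print(classify(0.60, 0.5))  # Hybrid
```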
Off-Chip Bandwidth

• Most CPU-intensive workloads have higher off-chip bandwidth (3 GB/s); the maximum is 6.2 GB/s. Other workloads have lower off-chip bandwidth (0.6 GB/s).
• MPI-based workloads need low memory bandwidth.
IPC of BigDataBench vs. Other Benchmarks

[Figure: IPC (0 to 2) per big data workload, plus TPC-C, CloudSuite, HPCC, PARSEC, SPECfp, and SPECint averages]

• The average IPC of the big data workloads is larger than that of CloudSuite, SPECfp and SPECint, similar to PARSEC, and slightly lower than HPCC.
• The average IPC of BigDataBench is 1.3 times that of CloudSuite.
• Some workloads have high IPC (M-Kmeans, S-TPC-DS-Query8).
Instruction Mix of BigDataBench vs. Other Benchmarks

• Big data workloads are data-movement-dominated computing with more branch operations
  - 92% in terms of instruction mix (Load + Store + Branch + data movements of INT)
Pipeline Stalls

• The service workloads have more RAT (Register Allocation Table) stalls
• The data analysis workloads have more RS (Reservation Station) and ROB (ReOrder Buffer) full stalls
• Notable front-end stalls (i.e., instruction fetch stalls)!

[Figure: pipeline stall breakdown for data service vs. data analysis workloads]
Cache Behaviors of BigDataBench

• L1I MPKI
  - Larger than traditional benchmarks, but lower than that of CloudSuite (12 vs. 31)
  - Different among big data workloads: CPU-intensive (8), I/O-intensive (22), and hybrid workloads (9)
  - One order of magnitude differences among diverse implementations: M-WordCount is 2, while H-WordCount is 17
Cache Behaviors

• L2 cache: the I/O-intensive workloads undergo more L2 MPKI
• L3 cache: the average L3 MPKI of the big data workloads is lower than all of the other workloads
• The underlying software stacks impact data locality: MPI workloads have better data locality and fewer cache misses
TLB Behaviors

[Figure: DTLB MPKI and ITLB MPKI (log scale, 0.01 to 10) for CPU-intensive, I/O-intensive, and hybrid workloads, plus TPC-C, CloudSuite, HPCC, PARSEC, SPECfp, and SPECint averages]

• ITLB: I/O-intensive workloads undergo more ITLB MPKI.
• DTLB: CPU-intensive workloads have more DTLB MPKI.
Our Observations from BigDataBench

• Unique characteristics
  - Data-movement-dominated computing with more branch operations (92% in terms of instruction mix)
  - Notable pipeline front-end stalls
• Different behaviors among big data workloads
  - Disparity of IPCs and memory access behaviors
  - CloudSuite is a subclass of Big Data
• Software stack impacts
  - The L1I cache miss rates have one order of magnitude differences among diverse implementations with different software stacks.
Correlation Analysis

• Compute the correlation coefficients of CPI with other micro-architecture-level metrics.
• Pearson's correlation coefficient: r = cov(X, Y) / (σ_X · σ_Y)
• The absolute value (from 0 to 1) shows the dependency: the bigger the absolute value, the stronger the correlation.
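A minimal sketch of the correlation computation described above; the CPI and L2 MPKI sample values are made up for illustration, not measurements from the study.

```python
import math

def pearson(xs, ys):
    """Pearson's r = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. correlate per-workload CPI with another metric such as L2 MPKI
cpi     = [1.2, 0.9, 1.8, 1.5, 0.7]   # hypothetical sample values
l2_mpki = [4.0, 2.5, 7.9, 6.1, 1.8]
r = pearson(cpi, l2_mpki)
print(round(r, 3))  # close to 1: strong positive correlation
```

Ranking the absolute values of r across all candidate metrics, per workload, is what produces the "top five coefficients" shown on the next slide.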
Top Five Coefficients

[Figure: the top five correlation coefficients (ranked ① to ⑤) for each workload: Naive Bayes, Grep, WordCount, Kmeans, FKmeans, PageRank, Sort, Hive, IBCF, HMM, SVM]
Insights

• Front-end stall does not have a high correlation coefficient value for most big data analytics workloads
  - Front-end stall is not the factor that affects CPI performance most.
• L2 cache misses and TLB misses have high correlation coefficient values.
  - The long-latency memory accesses (to the L3 cache or memory) affect CPI performance most and should be the optimization point with the highest priority.
Outline

• BigDataBench Overview
• Workload characterization
• Multi-tenancy version
• Processors evaluation
Cloud Data Centers

• Two classes of popular workloads
  - Long-running services: search engines, e-commerce sites
  - Short-term data analytic jobs: Hadoop MapReduce, Spark jobs
Problem

• Existing benchmarks focus on specific types of workload
• Scenarios are not realistic
  - They do not match the typical data center scenario that mixes different percentages of tenants and workloads sharing the same computing infrastructure
Purpose of BigDataBench-MT

• Developing realistic benchmarks to reflect such practical scenarios of mixed workloads
  - Both service and data analytic workloads
  - Dynamic scaling up and down
• The tool is publicly available from http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion
What Can You Do with It?

• We consider two dimensions of the benchmarking scenarios
  - From tenants' perspectives
  - From workloads' perspectives
You Can Specify the Tenants

• The number of tenants
  - Scalability benchmark: how many tenants are able to run in parallel?
• The priorities of tenants
  - Fairness benchmark: how fair is the system, i.e., are the available resources equally available to all tenants? What if tenants have different priorities?
• Time line
  - How do the number and priorities of tenants change over time?
You Can Specify the Workloads

• Data characteristics
  - Data type, source
  - Input/output data volumes, distributions
• Computation semantics
  - Source code
  - Big data software stacks
• Job arrival patterns
  - Arrival rate
  - Arrival sequence
Two Major Challenges

• Heterogeneity of real workloads
  - Different workload types, e.g. CPU- or I/O-intensive workloads
  - Different software stacks, e.g. Hadoop, Spark, MPI
• Workload dynamicity hidden in real-world traces
  - Arrival patterns: request/job submitting time and sequences
  - Job input sizes, e.g. ranging from KB to ZB
Existing Big Data Benchmarks

Benchmarks | Actual workloads | Real workload traces | Mixed workloads
AMPLab benchmark, LinkBench, BigBench, YCSB, CloudSuite | Yes | No | No
GridMix, SWIM | No | Yes | No

• How to generate real workloads on the basis of real workload traces is still an open question.
System Overview

• Three modules
  - Benchmark User Portal: a visual interface
  - Combiner of Workloads and Traces: a matcher of real workloads and traces
  - Multi-tenant Workload Generator: a multi-tenant workload generator
Key Technique: Combination of Real and Synthetic Data Analytic Jobs

• Goal: combining the arrival patterns extracted from real traces with real workloads.
• Problem: workload traces only contain anonymous jobs whose workload types and/or input data are unknown.
Solution: The First Step

• Deriving the workload characteristics of both real and anonymous jobs

TABLE. Metrics to represent workload characteristics
Metric | Description
Execution time | Measured in seconds
CPU usage | Total CPU time per second
Memory usage | Measured in GB
CPI | Cycles per instruction
MAI | The number of memory accesses per instruction
Solution: The Second Step

• Matching both types of jobs whose workload characteristics are sufficiently similar
An Example

• An example of matching Hadoop workloads
  - Mining the Facebook/Google workload trace (exact workload characteristics information)
  - Profiling Hadoop workloads from BigDataBench (collect workload characteristics information)
  - Workload matching using k-means clustering

Matching result: replaying basis
Job type | Input size (GB) | Starting time (minutes)
Bayes | 2 | 10
Sort | 1 | 20
K-means | 0.5 | 25
Bayes | 5 | 30
Sort | 1 | 40
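The matching step in this example can be sketched as a nearest-neighbor search over the five characteristics from the earlier table (execution time, CPU usage, memory usage, CPI, MAI). All metric values below are hypothetical, and this simple per-metric-normalized distance stands in for the k-means-based matching the tool actually uses.

```python
import math

# Hypothetical profiled BigDataBench workloads:
# (execution time s, CPU usage, memory GB, CPI, MAI)
profiled = {
    "Bayes":   (620.0, 0.85, 4.0, 1.4, 0.012),
    "Sort":    (300.0, 0.40, 2.0, 2.1, 0.030),
    "K-means": (450.0, 0.90, 3.0, 0.9, 0.008),
}

def match(anonymous_job):
    """Return the profiled workload whose characteristics are closest
    (Euclidean distance on per-metric range-normalized values)."""
    cols = list(zip(*profiled.values()))
    scale = [max(c) - min(c) or 1.0 for c in cols]
    def dist(a, b):
        return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scale)))
    return min(profiled, key=lambda name: dist(profiled[name], anonymous_job))

# An anonymous trace job with measured characteristics but unknown type:
print(match((310.0, 0.45, 2.1, 2.0, 0.028)))  # matches "Sort"
```

Once every anonymous trace job is matched to a runnable workload this way, the trace's arrival times and input sizes can be replayed with real programs, which is exactly the "replaying basis" table above.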
System Demonstration

• Three steps to generate a mix of search service and Hadoop MapReduce jobs
  - Traces: 24-hour Sogou user query logs and the Google cluster trace
  - Step 1: specification of tested machines and workloads
  - Step 2: selection of benchmarking period and scale
  - Step 3: generation of mixed workloads
Workloads and Traces in BigDataBench-MT

• Multi-tenancy V1.0 releases:

Workloads | Software stack | Workload trace
Nutch Web Search | Apache Tomcat 6.0.26, Search Server (Nutch) | Sogou (http://www.sogou.com/labs/dl/q-e.html)
Hadoop | Hadoop 1.0.2 | Facebook (https://github.com/SWIMProjectUCB/SWIM/wiki)
Shark | Shark 0.8.0 | Google data center (https://code.google.com/p/googleclusterdata/)
Outline

• BigDataBench Overview
• Workload characterization
• Multi-tenancy version
• Processors evaluation
Core Architecture

• Multi brawny-core (Xeon E5645, 2.4 GHz)
  - 6 out-of-order cores
  - Dynamic multiple issue (superscalar)
  - Dynamic overclocking
  - Simultaneous multithreading
• Many wimpy-core architecture (Tile-Gx36, 1.2 GHz)
  - 36 in-order cores
  - Static multiple issue (VLIW)
Experiment Methodology

• Use real hardware instead of simulation
• Real power consumption measurement instead of modeling
• Saturate CPU performance by:
  - Isolating the processor behavior: over-provision the disk I/O subsystem by using a RAM disk
  - Optimizing the benchmarks: tune the software stack parameters and JVM flags for performance
Execution Time

• For Hadoop-based Sort, the performance gap is about 1.08x.
• For the other workloads, gaps of more than 2x exist between Xeon and Tilera.

[Figure: normalized execution time, Xeon vs. Tilera, per workload]

• From the perspective of execution time, the Xeon processor is better than the Tilera processor all the time.
Cycle Counts

• There are huge cycle count gaps between Xeon and Tilera, ranging from 5.3 to 14.
• Tilera needs more cycles to complete the same amount of work.

[Figure: normalized cycles, Xeon vs. Tilera, per workload]
Pipeline Efficiency

• The theoretical IPC:
  - Xeon: 4 instructions per cycle
  - Tilera: 1 instruction bundle per cycle

[Figure: pipeline efficiency (0 to 0.45), Tilera vs. Xeon, per workload]

• Out-of-order pipelines are more efficient than in-order ones.
Power Consumption

• Tilera is power-optimized.
• Xeon consumes more power.

[Figure: normalized power, Tilera vs. Xeon, per workload]
Energy Consumption

• Hadoop-based Sort consumes less energy on Tilera than on Xeon
  - Hadoop Sort is an extremely I/O-intensive workload.
• Tilera consumes more energy than Xeon to complete the same amount of work for most big data workloads
  - The longer execution time offsets the lower-power design.

[Figure: normalized energy, Xeon vs. Tilera, per workload]
Total Cost of Ownership (TCO) Model [*]

• Three-year depreciation cycle
• Hardware costs associated with individual components: CPU, memory, disk, board, power, cooling

[*] K. Lim et al. Understanding and designing new server architectures for emerging warehouse-computing environments. ISCA 2008
Cost Model

• The cost data originate from diverse sources:
  - Different vendors
  - Corresponding official websites
• Power and cooling: an activity factor of 0.75
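Putting the two slides together, a toy version of the TCO comparison might look like this. Every cost, power, cooling, and throughput number below is invented for illustration; only the structure (three-year depreciation, per-component hardware costs, power and cooling scaled by a 0.75 activity factor) follows the slides and the Lim et al. model they cite.

```python
# Illustrative TCO sketch: hardware + 3 years of power (and cooling overhead).
YEARS, ACTIVITY, PRICE_PER_KWH = 3, 0.75, 0.10  # hypothetical electricity price

def tco(component_costs, avg_power_watts, cooling_overhead=0.5):
    """3-year TCO = hardware + electricity, with cooling as a power overhead."""
    hardware = sum(component_costs.values())
    energy_kwh = avg_power_watts * ACTIVITY * 24 * 365 * YEARS / 1000.0
    return hardware + energy_kwh * PRICE_PER_KWH * (1 + cooling_overhead)

# Made-up component costs ($) and average power draws (W) for the two systems:
xeon   = tco({"cpu": 600, "memory": 300, "disk": 150, "board": 250}, avg_power_watts=220)
tilera = tco({"cpu": 400, "memory": 200, "disk": 150, "board": 200}, avg_power_watts=60)

# Performance per TCO = throughput (hypothetical jobs/hour) per dollar.
perf_per_tco_xeon   = 40.0 / xeon
perf_per_tco_tilera = 15.0 / tilera
print(perf_per_tco_xeon > perf_per_tco_tilera)
```

With these made-up numbers the brawny core wins on performance per TCO, mirroring the next slide's finding for most workloads: a lower-power part only pays off when its throughput deficit is small, as in the I/O-bound Hadoop Sort case.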
Performance per TCO

• Hadoop-based Sort has higher performance per TCO on the Tilera.
• For the other workloads, Xeon outperforms Tilera.

[Figure: normalized performance per TCO, Tilera vs. Xeon (Turbo & HT enabled), per workload]
Key Takeaways

• Try the open-source big data benchmark suite from http://prof.ict.ac.cn/BigDataBench
• Big data: data-movement-dominated computing with more branch operations (92% in terms of instruction mix)
• Multi-tenancy version: replaying mixed workloads according to publicly available workload traces
• Wimpy-core processors suit only a part of big data workloads