Performance Analysis II
Marco Serafini
COMPSCI 590S, Lecture 15
Scalability
[Plot: speedup vs. parallelism, ideal (linear) curve vs. reality]
• Ideal world
  • Linear scalability
• Reality
  • Bottlenecks
  • For example: a central coordinator
• When do we stop scaling?
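The gap between the ideal and real curves is usually modeled with Amdahl's law, which is not named on the slide but captures exactly why a bottleneck such as a central coordinator caps speedup. A minimal sketch (function name and the 5% serial fraction are illustrative):

```python
def speedup(parallel_fraction, processors):
    """Amdahl's law: the serial fraction of the work bounds speedup,
    no matter how many processors we add."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# With a 5% serial bottleneck (e.g. a central coordinator),
# speedup approaches 1/0.05 = 20x and then flattens out.
for p in (1, 10, 100, 1000):
    print(p, speedup(0.95, p))
```

This is where "when do we stop scaling?" comes from: past a certain processor count, each added processor buys almost nothing.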
Scalability
• Capacity of a system to improve performance by increasing the amount of resources available
• Typically, resources = processors
• Strong scaling
  • Fixed total problem size, more processors
• Weak scaling
  • Fixed per-processor problem size, more processors
Strong and Weak Scaling
• Strong scaling
  • Fixed total problem size, more processors
• Weak scaling
  • Fixed per-processor problem size, more processors
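The two notions lead to two different efficiency measures. A hedged sketch (helper names and the example timings are made up, not from the slides):

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: ideal runtime on n processors is t1/n,
    so efficiency = t1 / (n * tn). 1.0 means perfectly linear scaling."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Fixed per-processor problem size: ideal runtime stays constant
    as processors are added, so efficiency = t1 / tn."""
    return t1 / tn

# Example: 1 processor takes 100 s on the whole input;
# 8 processors take 16 s on the same input.
print(strong_scaling_efficiency(100, 16, 8))  # 0.78125, i.e. ~78% of linear
```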
Scaling Up and Out
• Scaling up
  • More powerful server (more cores, memory, disk)
  • Single server (or fixed number of servers)
• Scaling out
  • Larger number of servers
  • Constant resources per server
What Does This Plot Tell You?
How About Now?
COST
• Configuration that Outperforms a Single Thread
• # cores after which we achieve speedup over 1 core
[Plots: single iteration vs. 10 iterations]
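The COST metric (from McSherry et al., "Scalability! But at what COST?") can be computed mechanically from a scaling curve. A sketch with hypothetical timings (the function and all numbers are illustrative):

```python
def cost(single_thread_time, scaling_curve):
    """COST: the smallest core count at which the scalable system
    beats a competent single-threaded baseline.
    scaling_curve maps core count -> runtime."""
    for cores in sorted(scaling_curve):
        if scaling_curve[cores] < single_thread_time:
            return cores
    return None  # never outperforms one thread: "unbounded COST"

# Hypothetical: a single-threaded laptop implementation takes 300 s;
# the distributed system needs 16 cores to get under that.
curve = {1: 1200, 4: 700, 16: 280, 64: 120}
print(cost(300, curve))  # 16
```

A system can scale beautifully (the curve keeps dropping) and still have a high, or even unbounded, COST.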
Possible Reasons for High COST
• Restricted API
  • Limits algorithmic choice
  • Makes assumptions
    • MapReduce: no memory-resident state
    • Pregel: program can be specified as “think like a vertex”
  • BUT also simplifies programming
• Lower-end nodes than a laptop
• Implementation adds overhead
  • Coordination
  • Cannot use application-specific optimizations
Why Not Just a Laptop?
• Capacity
  • Large datasets, complex computations don’t fit on a laptop
• Simplicity, convenience
  • Nobody ever got fired for using Hadoop on a cluster
• Integration with toolchain
  • Example: ETL → SQL → Graph computation on Spark
Disclaimers
• Graph computation is peculiar
  • Some algorithms are computationally complex…
    • Even for small datasets
  • Good use case for single-server implementations
• Machine learning is, too…
Logistic Regression
“While VW can immediately start to update the model as data is read, Spark spends considerable time reading and caching the data, before it can run the first L-BFGS iteration.”
Gradient Boosted Trees
Understanding Bottlenecks
Monotasks
• Decompose data analytics jobs into monotasks
  • A monotask is the basic unit of scheduling
  • Each monotask uses only one resource
• This is the opposite of pipelining
  • Parallelize use of CPU, network, disk
• MonoSpark
  • 9% slower than Spark
  • Performance predictability
Example: Spark
• Non-uniform resource consumption
• Concurrent access to the same resources
• Framework has no control over resource access
• Non-deterministic behavior, hard to debug/predict
Monotasks Principles
• Each monotask uses one resource
  • CPU or network or disk
• Monotasks execute in isolation
  • No interaction or blocking during execution
• Per-resource schedulers control contention
  • Enough concurrency to ensure full capacity, not more
  • For example, one CPU task per core
• Per-resource schedulers have complete control over their resource
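The admission rule ("enough concurrency to ensure full capacity, not more") can be sketched as a per-resource queue with a fixed number of in-flight slots. This is an illustrative toy, not MonoSpark's actual scheduler; all names are invented:

```python
from collections import deque

class ResourceScheduler:
    """Toy per-resource scheduler: one queue per resource, and at most
    `capacity` monotasks in flight (e.g. one CPU monotask per core)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.queue = deque()

    def submit(self, monotask):
        self.queue.append(monotask)
        return self._maybe_start()

    def on_complete(self):
        self.in_flight -= 1          # a slot frees up...
        return self._maybe_start()   # ...so queued work can start

    def _maybe_start(self):
        started = []
        while self.in_flight < self.capacity and self.queue:
            started.append(self.queue.popleft())
            self.in_flight += 1
        return started

# 2 cores -> at most 2 CPU monotasks running at once; t3 must wait.
cpu = ResourceScheduler(capacity=2)
print(cpu.submit("t1"), cpu.submit("t2"), cpu.submit("t3"))
print(cpu.on_complete())  # t1 (or t2) finishes, t3 starts
```

Because each scheduler owns its resource outright, contention is controlled by the queue rather than by the OS arbitrating among competing tasks.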
Multitask Execution
Monotask Execution
How to Break a Task into Monotasks: Example
Issues?
• Network scheduling is difficult: requires coordination
• Complex dependencies: CPU might wait for disk
• Memory cost
  • Cannot pipeline from disk, need to load all data
Reasoning About Performance
• Assume perfect parallelism/resource utilization
  • They argue that it is a good approximation in MonoSpark
• For each stage
  • Measure utilization per monotask, take the average
  • Estimate stage speedup with a different amount of resources
• Ignores
  • Skew
  • Dependencies and ramp-up time (network → CPU → disk)
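Under the perfect-parallelism assumption, a stage's runtime is bound by its most loaded resource, which makes "what if we had different hardware?" a one-line calculation. A sketch of this style of model (the function and all workload numbers are hypothetical, not taken from the paper):

```python
def estimate_stage_time(resource_seconds, capacity):
    """With perfect parallelism, estimated stage runtime is
    max over resources of (total work on resource / resource capacity)."""
    return max(resource_seconds[r] / capacity[r] for r in resource_seconds)

# Hypothetical stage: 400 CPU-seconds, 200 disk-seconds, 100 network-seconds.
work = {"cpu": 400, "disk": 200, "network": 100}

one_disk = estimate_stage_time(work, {"cpu": 8, "disk": 1, "network": 1})
two_disks = estimate_stage_time(work, {"cpu": 8, "disk": 2, "network": 1})
print(one_disk, two_disks)  # 200.0 100.0: a second disk halves the disk-bound stage
```

Note that this model ignores exactly what the slide says it ignores: skew and the ramp-up dependencies between resources.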
Different HW Configurations
• Sort with constant I/O cost and decreasing CPU cost
• Effect of adding a second disk
Other Use Cases
• Prioritizing optimizations
  • Not trivial at all in concurrent workloads
• Auto-configuration