Performance Analysis II
Marco Serafini
COMPSCI 590S, Lecture 15
Scalability
[Plot: speedup vs. parallelism, ideal (linear) curve vs. reality]
• Ideal world
  • Linear scalability
• Reality
  • Bottlenecks
  • For example: a central coordinator
• When do we stop scaling?
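The gap between the ideal and real curves is usually modeled with Amdahl's law, which is not named on the slide but captures exactly why a bottleneck such as a central coordinator caps speedup. A minimal sketch (function name and the 5% serial fraction are illustrative):

```python
def speedup(parallel_fraction, processors):
    """Amdahl's law: the serial fraction of the work bounds speedup,
    no matter how many processors we add."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# With a 5% serial bottleneck (e.g. a central coordinator),
# speedup approaches 1/0.05 = 20x and then flattens out.
for p in (1, 10, 100, 1000):
    print(p, speedup(0.95, p))
```

This is where "when do we stop scaling?" comes from: past a certain processor count, each added processor buys almost nothing.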
Scalability
• Capacity of a system to improve performance by increasing the amount of resources available
• Typically, resources = processors
• Strong scaling
  • Fixed total problem size, more processors
• Weak scaling
  • Fixed per-processor problem size, more processors
Strong and Weak Scaling
• Strong scaling
  • Fixed total problem size, more processors
• Weak scaling
  • Fixed per-processor problem size, more processors
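The two notions lead to two different efficiency measures. A hedged sketch (helper names and the example timings are made up, not from the slides):

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: ideal runtime on n processors is t1/n,
    so efficiency = t1 / (n * tn). 1.0 means perfectly linear scaling."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    """Fixed per-processor problem size: ideal runtime stays constant
    as processors are added, so efficiency = t1 / tn."""
    return t1 / tn

# Example: 1 processor takes 100 s on the whole input;
# 8 processors take 16 s on the same input.
print(strong_scaling_efficiency(100, 16, 8))  # 0.78125, i.e. ~78% of linear
```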
Scaling Up and Out
• Scaling up
  • More powerful server (more cores, memory, disk)
  • Single server (or fixed number of servers)
• Scaling out
  • Larger number of servers
  • Constant resources per server
What Does This Plot Tell You?
How About Now?
COST
• Configuration that Outperforms a Single Thread
• # cores after which we achieve speedup over 1 core
[Plots: single iteration vs. 10 iterations]
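The COST metric (from McSherry et al., "Scalability! But at what COST?") can be computed mechanically from a scaling curve. A sketch with hypothetical timings (the function and all numbers are illustrative):

```python
def cost(single_thread_time, scaling_curve):
    """COST: the smallest core count at which the scalable system
    beats a competent single-threaded baseline.
    scaling_curve maps core count -> runtime."""
    for cores in sorted(scaling_curve):
        if scaling_curve[cores] < single_thread_time:
            return cores
    return None  # never outperforms one thread: "unbounded COST"

# Hypothetical: a single-threaded laptop implementation takes 300 s;
# the distributed system needs 16 cores to get under that.
curve = {1: 1200, 4: 700, 16: 280, 64: 120}
print(cost(300, curve))  # 16
```

A system can scale beautifully (the curve keeps dropping) and still have a high, or even unbounded, COST.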
Possible Reasons for High COST
• Restricted API
  • Limits algorithmic choice
  • Makes assumptions
    • MapReduce: no memory-resident state
    • Pregel: program can be specified as “think like a vertex”
  • BUT also simplifies programming
• Lower-end nodes than a laptop
• Implementation adds overhead
  • Coordination
  • Cannot use application-specific optimizations
Why Not Just a Laptop?
• Capacity
  • Large datasets, complex computations don’t fit on a laptop
• Simplicity, convenience
  • Nobody ever got fired for using Hadoop on a cluster
• Integration with toolchain
  • Example: ETL → SQL → Graph computation on Spark
Disclaimers
• Graph computation is peculiar
  • Some algorithms are computationally complex…
    • Even for small datasets
  • Good use case for single-server implementations
• Machine learning is, too…
Logistic Regression
“While VW can immediately start to update the model as data is read, Spark spends considerable time reading and caching the data, before it can run the first L-BFGS iteration.”
Gradient Boosted Trees
Understanding Bottlenecks
Monotasks
• Decompose data analytics jobs into monotasks
  • A monotask is the basic unit of scheduling
  • Each monotask uses only one resource
• This is the opposite of pipelining
  • Parallelize use of CPU, network, disk
• MonoSpark
  • 9% slower than Spark
  • Performance predictability
Example: Spark
• Non-uniform resource consumption
• Concurrent access to the same resources
• Framework has no control over resource access
• Non-deterministic behavior, hard to debug/predict
Monotasks Principles
• Each monotask uses one resource
  • CPU or network or disk
• Monotasks execute in isolation
  • No interaction or blocking during execution
• Per-resource schedulers control contention
  • Enough concurrency to ensure full capacity, not more
  • For example, one CPU task per core
• Per-resource schedulers have complete control over their resource
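The admission rule ("enough concurrency to ensure full capacity, not more") can be sketched as a per-resource queue with a fixed number of in-flight slots. This is an illustrative toy, not MonoSpark's actual scheduler; all names are invented:

```python
from collections import deque

class ResourceScheduler:
    """Toy per-resource scheduler: one queue per resource, and at most
    `capacity` monotasks in flight (e.g. one CPU monotask per core)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.queue = deque()

    def submit(self, monotask):
        self.queue.append(monotask)
        return self._maybe_start()

    def on_complete(self):
        self.in_flight -= 1          # a slot frees up...
        return self._maybe_start()   # ...so queued work can start

    def _maybe_start(self):
        started = []
        while self.in_flight < self.capacity and self.queue:
            started.append(self.queue.popleft())
            self.in_flight += 1
        return started

# 2 cores -> at most 2 CPU monotasks running at once; t3 must wait.
cpu = ResourceScheduler(capacity=2)
print(cpu.submit("t1"), cpu.submit("t2"), cpu.submit("t3"))
print(cpu.on_complete())  # t1 (or t2) finishes, t3 starts
```

Because each scheduler owns its resource outright, contention is controlled by the queue rather than by the OS arbitrating among competing tasks.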
Multitask Execution
Monotask Execution
How to Break a Task into Monotasks: Example
Issues?
• Network scheduling is difficult: requires coordination
• Complex dependencies: CPU might wait for disk
• Memory cost
  • Cannot pipeline from disk, need to load all data
Reasoning About Performance
• Assume perfect parallelism/resource utilization
  • They argue that it is a good approximation in MonoSpark
• For each stage
  • Measure utilization per monotask, take the average
  • Estimate stage speedup with a different amount of resources
• Ignores
  • Skew
  • Dependencies and ramp-up time (network → CPU → disk)
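Under the perfect-parallelism assumption, a stage's runtime is bound by its most loaded resource, which makes "what if we had different hardware?" a one-line calculation. A sketch of this style of model (the function and all workload numbers are hypothetical, not taken from the paper):

```python
def estimate_stage_time(resource_seconds, capacity):
    """With perfect parallelism, estimated stage runtime is
    max over resources of (total work on resource / resource capacity)."""
    return max(resource_seconds[r] / capacity[r] for r in resource_seconds)

# Hypothetical stage: 400 CPU-seconds, 200 disk-seconds, 100 network-seconds.
work = {"cpu": 400, "disk": 200, "network": 100}

one_disk = estimate_stage_time(work, {"cpu": 8, "disk": 1, "network": 1})
two_disks = estimate_stage_time(work, {"cpu": 8, "disk": 2, "network": 1})
print(one_disk, two_disks)  # 200.0 100.0: a second disk halves the disk-bound stage
```

Note that this model ignores exactly what the slide says it ignores: skew and the ramp-up dependencies between resources.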
Different HW Configurations
• Sort with constant I/O cost and decreasing CPU cost
• Effect of adding a second disk
Other Use Cases
• Prioritizing optimizations
  • Not trivial at all in concurrent workloads
• Auto-configuration