A Comprehensive Presentation on 'An Analysis of Linux Scalability to Many Cores'

An Analysis of Linux Scalability to Many Cores
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich (MIT CSAIL)
Presented by Chandra Sekhar Sarma Akella and Rutvij Talavdekar


DESCRIPTION

This is a comprehensive summary (in the form of a presentation) of the technical paper 'An Analysis of Linux Scalability to Many Cores'. The original paper and presentation were authored by MIT CSAIL researchers and professors Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.

TRANSCRIPT

Page 1: A Comprehensive Presentation on 'An Analysis of Linux Scalability to Many Cores'

An Analysis of Linux Scalability to Many Cores

Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek,

Robert Morris, and Nickolai Zeldovich MIT CSAIL

Presented by

Chandra Sekhar Sarma Akella and Rutvij Talavdekar

Page 2

Introduction

• This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.

• Analyzes the scaling of a set of applications (MOSBENCH) on Linux running on a 48-core machine:

– Exim

– Memcached

– Apache

– PostgreSQL

– gmake

– the Psearchy file indexer

– MapReduce library

Page 3

What is Scalability?

• Ideally, an application does N times as much work on N cores as it does on 1 core

• In practice this is rarely the case, due to serial parts of the code

• Scalability is better understood through Amdahl's Law

Page 4

Amdahl's Law

• Identifies the performance gain from adding cores to an application that has both serial and parallel components

• The serial portion of an application has a disproportionate effect on the performance gained by adding cores

• As N approaches infinity, the speedup approaches 1 / S, where S is the serial fraction (e.g. S = 10% caps the speedup at 10)

• If 25% of a program is serial, adding any number of cores cannot provide a speedup of more than 4
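As a quick numeric check of the two examples above, here is a small, self-contained C sketch of Amdahl's formula speedup(N) = 1 / (S + (1 - S) / N); the function name and the printed cases are ours, not part of the original slides:

#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / (S + (1 - S) / N), where S is the serial fraction. */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    printf("S = 0.10, N = 48: %.2fx\n", amdahl_speedup(0.10, 48)); /* approaches 10x as N grows */
    printf("S = 0.25, N = 48: %.2fx\n", amdahl_speedup(0.25, 48)); /* can never exceed 4x */
    return 0;
}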

Page 5

Why look at the OS kernel?

Application | Time spent in kernel | Description of services
Exim        | 70%                  | Mail server; forks a new process for each incoming SMTP connection and message delivery
Memcached   | 80%                  | Distributed memory caching system; stresses the network stack
Apache      | 60%                  | Web server; one process per instance; stresses the network stack and file system
PostgreSQL  | 82%                  | Open-source SQL database; uses shared data structures, kernel locking interfaces, and TCP sockets

• Many applications spend a considerable amount of their CPU execution time in the kernel

• These applications should scale with more cores

• If the OS kernel doesn't scale, these applications won't scale.

Page 6

Premise of this Work

• Traditional kernel designs do not scale well on multi-core processors

• They are rooted in uniprocessor designs

• Speculation is that fixing them is hard

Points to consider:

• How serious are the scaling problems?

• Do they have alternatives?

• How hard is it to fix them?

These questions are answered by analysing Linux scalability on multicore hardware

Page 7

Analysing Scalability of Linux

• Use an off-the-shelf 48-core x86 machine

• Run a recent version of Linux, a traditional OS:

– Popular and widely used

– Scalability competitive with other operating systems

– Has a big community constantly working to improve it

• Applications chosen for benchmarking:

– Previously reported as not scaling well on Linux

– Have good existing parallel implementations

– Are system intensive

– 7 applications chosen, together known as MOSBENCH

Page 8

Contribution

• Analysis of Linux scalability for 7 real, system-intensive applications
• Stock Linux limits scalability
• Analysis of bottlenecks

Fixes:

• 3002 lines of code, 16 patches
• Most fixes improve the scalability of multiple applications
• Fixes made in the kernel, minor fixes in the applications, and certain changes in the way the applications use kernel services
• Remaining bottlenecks were either due to shared hardware resources or in the applications

Result:

• Patched kernel
• No kernel problems up to 48 cores, with the fixes applied
• Except for sloppy counters, most fixes were applications of standard parallel programming techniques

Page 9

Iterative Method to Test Scalability and Fix Bottlenecks

• Run the application

• Use the in-memory temporary file storage (tmpfs) file system to avoid disk I/O bottlenecks

• Focus on identifying kernel-related bottlenecks

• Find bottlenecks

• Fix the bottlenecks and re-run the application

• Stop when a non-trivial application fix is required, or when the bottleneck is caused by shared hardware (e.g. DRAM or network cards)

Page 10

MOSBENCH

Application               | Time spent in kernel | Bottleneck
Exim (mail server)        | 69%                  | Process creation; small-file creation and deletion
memcached (object cache)  | 80%                  | Packet processing in the network stack
Apache (web server)       | 60%                  | Network stack; file system (directory name lookup)
PostgreSQL (database)     | up to 82%            | Kernel locking interfaces; network interfaces; application's internal shared data structures
gmake (parallel build)    | up to 7.6%           | File system reads/writes to multiple files
Psearchy (file indexer)   | up to 23%            | CPU intensive; file system reads/writes
Metis (MapReduce library) | up to 16%            | Memory allocator; soft page-fault code

Page 11

48-Core Server

• Comprises 8 AMD Opteron chips with 6 cores on each chip

• Each core has a private 64 KB L1 cache (3-cycle access) and a private 512 KB L2 cache (14 cycles)

• The 6 cores on each chip share a 6 MB L3 cache (28 cycles)

Page 12

Poor scaling on Stock Linux kernel

Page 13

Exim collapses on Stock Linux

Page 14

Exim collapses on Stock Linux

Page 15

Bottleneck: Reading mount table

• The kernel calls lookup_mnt() on behalf of the Exim application to get metadata about a mount point

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);   /* global lock protecting the mount hash table */
    mnt = hash_get(mnts, path);  /* short critical section: a hash lookup */
    spin_unlock(&vfsmount_lock);
    return mnt;
}

Page 16

Bottleneck: Reading mount table

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

Critical section is short. Why does it cause a scalability bottleneck?

• On a multi-core system, spin_lock and spin_unlock consume many more cycles (roughly 400-5000) than the critical section itself (tens of cycles)

Page 17

Linux spin lock implementation

Page 18

Linux spin lock implementation

Page 19

Linux spin lock implementation

Page 20

Linux spin lock implementation
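The four slides above illustrate the lock hand-off with diagrams that are not preserved in this transcript. As a rough user-space sketch of a ticket-based spin lock (our own illustration using C11 atomics, not the kernel's actual code):

#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next_ticket;   /* next ticket number to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket: one atomic fetch-and-add gives each waiter a unique, ordered number. */
    unsigned int me = atomic_fetch_add(&l->next_ticket, 1);

    /* Spin until our number comes up.  Every waiter polls the same cache line,
     * so each release broadcasts an invalidation to all N waiting cores. */
    while (atomic_load(&l->now_serving) != me)
        ; /* busy-wait */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1); /* hand the lock to the next ticket holder */
}

The fetch-and-add makes the lock FIFO-fair, but the shared now_serving line is what leads to the N² hand-off cost discussed on the following slides.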

Page 21

Scalability collapse caused by Non-scalable Ticket based Spin Locks

Page 22

Scalability collapse caused by Non-scalable Ticket based Spin Locks

• With more cores, the cores spend more time contending for the lock, congesting the interconnect with lock-acquisition requests and invalidations

Page 23

Scalability collapse caused by Non-scalable Ticket based Spin Locks

Page 24

Scalability collapse caused by Non-scalable Ticket based Spin Locks

Page 25

Scalability collapse caused by Non-scalable Ticket based Spin Locks

• N = number of cores waiting to acquire the lock

• The previous lock holder responds to at least N / 2 cores before transferring the lock to the next lock holder

• So the next lock holder's wait time ∝ N

• As this pattern repeats for the waiting cores, the time required to acquire the lock ∝ N²

Page 26

Bottleneck: Reading mount table

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

Well-known problem, with many solutions:

• Use scalable locks

• Use high speed message passing

• Avoid locks

Page 27

Solution: per-core mount caches

• Observation: the global mount table is rarely modified
• Solution: a per-core data structure holding a per-core mount cache

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    /* Fast path: consult this core's private mount cache first */
    if ((mnt = hash_get(percore_mnts[cpu()], path)))
        return mnt;
    /* Slow path: fall back to the global table under the lock */
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    /* Populate this core's cache for future lookups */
    hash_put(percore_mnts[cpu()], path, mnt);
    return mnt;
}

• Common case: cores read mount-point metadata from their per-core tables
• Modifying the mount table invalidates the per-core tables

Page 28

Per-core Lookup: Scalability is better

Page 29

Bottleneck: Reference counting

• Ref count indicates if kernel can free object

– File name cache (dentry), physical pages,…

void dput(struct dentry *dentry)
{
    if (!atomic_dec_and_test(&dentry->ref))
        return;
    dentry_free(dentry);
}

A single atomic instruction limits scalability?!

Page 30

Bottleneck: Reference counting

• Ref count indicates if kernel can free object

– File name cache (dentry), physical pages,…

void dput(struct dentry *dentry)
{
    if (!atomic_dec_and_test(&dentry->ref))
        return;
    dentry_free(dentry);
}

A single atomic instruction limits scalability?!

• Reading the reference count is slow
• These counters can become bottlenecks if many cores update them
• Reading the reference count delays memory operations from other cores
• A central reference count means waiting on locks, lock contention, and cache-coherence serialization

Page 31

Reading reference count is slow

Page 32

Reading the reference count delays Memory Operations from other cores

Hardware Cache Line Lock

Page 33

Solution: Sloppy counters

• Observation: the kernel rarely needs the true value of a reference count

• Solution: Sloppy Counters

• Each core holds a few “spare” references to an object

Page 34

Solution: Sloppy counters

• Observation: the kernel rarely needs the true value of a reference count

• Solution: Sloppy Counters

Benefits:

• The shared central reference counter is used infrequently or not at all, because each core holds a few "spare" references to an object

• A core usually updates a sloppy counter by modifying its per-core counter, an operation that typically touches only data in the core's local cache

• Therefore, no waiting for locks, no lock contention, and no cache-coherence serialization

• Increments and decrements are sped up by the per-core counters, resulting in faster referencing of objects

• Sloppy counters were used to count references to dentry, vfsmount, and dst entry objects, and to track network-protocol-related parameters
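A minimal user-space sketch of the sloppy-counter idea follows; the names, the spare threshold, and the pthread spin lock are our own illustration rather than the paper's kernel code (the kernel version additionally pins each slot to its core and pads it to a cache line):

#include <pthread.h>

#define NCORES    48
#define MAX_SPARE 16   /* how many spare references a core may hoard */

/* Zero-initialize the structure and call pthread_spin_init(&c->lock, PTHREAD_PROCESS_PRIVATE) before use. */
struct sloppy_counter {
    long central;             /* shared central count; includes all spares */
    pthread_spinlock_t lock;  /* protects the central count */
    long spare[NCORES];       /* per-core spare references (cache-line padded in practice) */
};

/* Take a reference on this core: consume a local spare if one is available,
 * otherwise fall back to the shared central counter. */
static void sloppy_inc(struct sloppy_counter *c, int core)
{
    if (c->spare[core] > 0) {
        c->spare[core]--;     /* fast path: touches only this core's cache line */
        return;
    }
    pthread_spin_lock(&c->lock);
    c->central++;             /* slow path: touches shared state */
    pthread_spin_unlock(&c->lock);
}

/* Drop a reference: keep it locally as a spare; if too many spares pile up,
 * return them to the central counter in one batch. */
static void sloppy_dec(struct sloppy_counter *c, int core)
{
    c->spare[core]++;
    if (c->spare[core] > MAX_SPARE) {
        pthread_spin_lock(&c->lock);
        c->central -= c->spare[core];
        pthread_spin_unlock(&c->lock);
        c->spare[core] = 0;
    }
}

The true reference count is the central value minus the sum of all per-core spares; it only needs to be computed when the kernel actually wants to free the object.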

Page 35

Solution: Sloppy counters

• Observation: the kernel rarely needs the true value of a reference count

• Solution: Sloppy Counters

• Avoids frequent use of a shared central reference counter, as each core can hold a few "spare" references to an object

Page 36

Sloppy counters: More scalability

Page 37

Scalability Issues

5 scalability issues cause most of the bottlenecks:

• 1) A global lock used for a shared data structure: more cores, longer lock wait times
• 2) A shared memory location: more cores, more overhead from the cache-coherence protocol
• 3) Tasks compete for a limited-size shared hardware cache: more cores, higher cache miss rates
• 4) Tasks compete for shared hardware resources (interconnects, DRAM interfaces): more cores, more time wasted waiting
• 5) Too few available tasks: more cores, less efficiency

These issues can often be avoided (or limited) using popular parallel programming techniques

Page 38

Kernel Optimization Benchmarking Results: Apache

Page 39

Other Kernel Optimizations

• Fine-grained locking

– A modern kernel can contain thousands of locks, each protecting one small resource

– Allows each processor to work on its specific task without contending for locks used by other processors

• Lock-free algorithms

– Ensure that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion
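As a small user-space illustration of the lock-free idea (a Treiber-style stack push using a C11 compare-and-swap; our example, not kernel code):

#include <stdatomic.h>

struct node {
    struct node *next;
    int value;
};

/* Push without a lock: read the current head, link the new node in front of it,
 * and retry the compare-and-swap until no other thread has changed the head in
 * the meantime.  No thread ever blocks on a lock held by another thread.
 * (A real kernel version must also handle the ABA problem and memory reclamation.) */
static void lockfree_push(_Atomic(struct node *) *head, struct node *n)
{
    struct node *old = atomic_load(head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(head, &old, n));
}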

Page 40

Other Kernel Optimizations

• Per-core data structures

– Several kernel data structures caused scaling bottlenecks due to lock contention, cache-coherence serialization, and protocol delays on shared data

– The bottlenecks would remain even if the locks were replaced with finer-grained locks, because multiple cores still update the same data structures

– Solution: split the data structures into per-core data structures (e.g. the per-core mount cache for the central vfsmount table)

– Avoids lock contention and cache-coherence serialization, and slashes lock-acquisition wait times

– Cores now query their per-core structure rather than looking up the central data structure, avoiding lock contention and serialization on the shared structure

Page 41

Summary of Changes

• 3002 lines of changes to the kernel

• 60 lines of changes to the applications

• Per-core Data Structures and Sloppy Counters provide across the board improvements to 3 of the 7 applications

Page 42

Popular Parallel Programming Techniques

• Lock-free algorithms

• Per-core data structures

• Fine-grained locking

• Cache-alignment (see the padding sketch after this list)

New Technique:

• Sloppy Counters
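Cache-alignment in practice means padding per-core data so that two cores never write to the same cache line (false sharing). A minimal sketch, assuming a 64-byte cache line and GCC/Clang attributes:

/* One counter per core, each on its own 64-byte cache line, so an update by
 * one core does not invalidate the line another core is reading. */
struct percore_stat {
    long count;
    char pad[64 - sizeof(long)];
} __attribute__((aligned(64)));

static struct percore_stat stats[48]; /* one slot per core of the 48-core machine */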

Page 43

Better Scaling with Modifications

• Total throughput comparison of the patched kernel and the stock kernel

Page 44

Better Scaling with Modifications

• Per-core throughput comparison of the patched kernel and the stock kernel

Page 45

Limitations

• Results limited to 48 cores and a small set of applications

• Results may vary with different numbers of cores and different sets of applications

• Concurrent modifications to the address space

• In-memory temporary file storage (tmpfs) was used instead of disk I/O

• 48-core AMD machine ≠ single 48-core chip

Page 46

Current bottlenecks

• Kernel code is not the bottleneck

• Further kernel changes might help applications or hardware

Page 47

Conclusion

• Stock Linux has scalability problems

• They are easy to fix or avoid up to 48 cores

• Bottlenecks can be fixed to improve scalability

• The Linux community can provide better support in this regard

• In the context of 48 cores, there is no need to rethink operating systems and explore newer kernel designs

Page 48

References

• Original paper: pdos.csail.mit.edu/papers/linux:osdi10.pdf

• Original presentation: usenix.org

• vfsmount: http://lxr.free-electrons.com/ident?i=vfsmount

• MOSBENCH: pdos.csail.mit.edu/mosbench/

• USENIX slides: https://www.usenix.org/events/osdi10/tech/slides/boyd-wickizer.pdf

• ACM Digital Library: http://dl.acm.org/citation.cfm?id=1924944

• InformationWeek: http://www.informationweek.com

• Wikipedia: tmpfs, sloppy counters

• University of Illinois: sloppy counters

• University College London: per-core data structures

Page 49

THANK YOU !