TRANSCRIPT
My research: an overview
Jean-Pierre Lozi — New McF (Maître de Conférences) recruit
Jean-Pierre Lozi — Past activity
Who am I?
- From here! (born in Nice, grew up in Antibes)
- Before my PhD:
- DEUG MIAS, Licence d’Informatique + Licence de Mathématiques in Nice
- Télécom ParisTech (Paris)
- Master’s at Université Pierre et Marie Curie (Paris)
Who am I?
PhD in Computer Science (“allocataire moniteur”), until July 2014
“Towards more scalable mutual exclusion for multicore architectures”
Under the supervision of Gilles Muller and Gaël Thomas
REGAL/WHISPER team: “Well-Honed Infrastructure Software for Programming Environments and Runtimes”
Laboratoire d’Informatique de Paris 6, UPMC, Paris
Postdoctoral Research Fellow, then University Research Associate, until Sept. 2015
Under the supervision of Alexandra Fedorova
SYNAR team: “Systems Networking and Architecture Research”
Simon Fraser University (SFU), Vancouver, Canada
Who am I?
- Three main projects:
- Remote Core Locking: dedicating cores for the execution of critical sections
- Hector: automated fault detection in error-handling code
- A decade of idle cores: scheduling bugs in Linux
- Domain: systems!
- Multicore architectures, synchronization/lock algorithms
- Automated source code analysis, bug detection
- Schedulers (on multicore architectures, again)
- Probably not your domain…
- I will just give a quick overview of my previous work, no details
- Objective: work with you on some projects! Low-level systems stuff needed for performance…
- If interested, we can discuss things in more detail
Project 1
Remote Core Locking: dedicating cores for faster execution of critical sections
Project 1: Remote Core Locking
Context: multicore architectures
- Decades of increasing CPU clock speeds, now issues with power usage / heat
- Increasing numbers of cores to keep increasing processing power
- Possible because the number of transistors keeps increasing
[Figure: trends over time in number of transistors, clock speed, power consumption, and the power/speed ratio]
Project 1: Remote Core Locking
Problem:
- Many legacy applications don’t scale well on modern multicore architectures
- For instance, Memcached on an x86 48-core machine (Get/Set requests):
[Figure: Memcached throughput (Get/Set requests) vs. number of cores; higher is better]
Project 1: Remote Core Locking
Poor scalability on multicore architectures: why?
- Bottleneck = critical sections, protected by locks
- High contention => lock acquisition is costly (more cores => higher contention)
[Figure: % of time spent in critical sections vs. number of cores (1, 4, 8, 16, 22, 32, 48) for SPLASH-2/Radiosity, SPLASH-2/Raytrace, Phoenix 2/LG, Phoenix 2/SM, Phoenix 2/MM, Memcached/Get, Memcached/Set, Berkeley DB/OS, and Berkeley DB/SL]
Two possible solutions:
- Redesign applications (fine-grained locking): costly (millions of lines of legacy code)
- Design better locks!
Project 1: Remote Core Locking
Designing better locks
- No need to redesign the application, better resistance to contention
- Custom microbenchmark to compare locks:
[Figure: execution time from lower to higher contention for the CAS spinlock, MCS [Mellor-Crummey ASPLOS’91], Flat Combining, and blocking locks; critical sections access 5 cache lines each; lower is better]
Project 1: Remote Core Locking
Question: why are lock algorithms inefficient?
Because critical paths are too long!
- Overhead 1: costly lock handovers
- Overhead 2: poor locality of critical sections
Project 1: Remote Core Locking
Overhead 1: costly lock handovers
[Figure: timeline of threads T1, T2, T3 executing critical sections CS1, CS2, CS3; the lock handovers between critical sections lengthen the critical path]
Lock handovers:
- Spinlocks: busy-waiting
- POSIX locks: context switch
- MCS: sending a message from one thread to the next
- Flat combining: sometimes acquires a spinlock
Project 1: Remote Core Locking
Overhead 2: poor locality of critical sections
[Figure: timeline of threads T1, T2, T3 executing CS1, CS2, CS3; each critical section accesses shared variables 1 and 2, and the resulting cache misses lengthen the critical path]
Project 1: Remote Core Locking
Idea: RCL = shorten the critical path as much as possible by dedicating a server core!
[Figure: timeline where T1, T2, T3 ship CS1, CS2, CS3 to a dedicated server core; shared variables 1 and 2 stay in the server's cache: no cache misses, much shorter critical path]
Project 1: Remote Core Locking
Performance?
Project 1: Remote Core Locking
Microbenchmark:
[Figure: execution time from higher to lower contention for the CAS spinlock, MCS, blocking locks, Flat Combining, and RCL (the latter two being combining locks); lower is better]
Project 1: Remote Core Locking
That was the general idea… but RCL is much more than that!
Focus: legacy system (C) applications. RCL offers:
- A runtime designed to work with legacy applications: it works efficiently with multiple locks per server and/or multiple servers, and it handles critical sections that busy-wait/block, as well as condition variables, trylocks, and nested/recursive critical sections… lots of algorithmic/engineering problems to solve here!
- A reengineering tool that transforms applications to use RCL: RCL can't be used with only lock/unlock functions; critical sections must be encapsulated and shipped to server cores... We don't want to do this manually!
- A profiler to detect applications that can benefit from RCL: even with the reengineering tool, using RCL takes time… We want to make sure we can benefit from it! Use case: applications with highly contended locks or critical sections with poor locality...
Project 1: Remote Core Locking
Reengineering tool: a simple case

Before (POSIX locks):

void func(void) {
    int a, b, x;
    …
    a = …;
    …
    pthread_mutex_lock(&lock);
    a = f(a);
    f(b);
    pthread_mutex_unlock(&lock);
    …
}

After (transformed for RCL):

struct context { int a, b; };

void func(void) {
    struct context c;
    int x;
    …
    c.a = …;
    …
    execute_rcl(__cs, &c);
    …
}

void __cs(struct context *c) {
    c->a = f(c->a);
    f(c->b);
}
Project 1: Remote Core Locking
Performance in legacy applications:
[Figure: performance per application; % of time in CS: 44.7% (many DCMs), 63.9%, 65.7%, 79.0%, 81.6%, 87.7%, 90.2%, 92.2%; higher is better]
Project 1: Remote Core Locking
Improved scalability:
Project 1: Remote Core Locking
Very efficient when more threads than cores: server always makes progress
[Figure: performance vs. number of threads; the other locks show a quick collapse once threads outnumber cores; higher is better]
Project 1: Remote Core Locking
Publications (44 citations):
- In CFSE ‘8 (national conference, best paper)
- In USENIX ATC ’12 (international conference)
- Long version of the paper submitted to TOCS (international journal, awaiting reviews)
Several research works based on RCL already:
- [Petrovic et al., PPoPP ’14]: RCL for partially cache-coherent architectures
- [Pusukuri et al., PPoPP ’14]: migrating threads to improve locality of CS
- [Hassan et al., IPDPS ’14]: dedicated server cores for transactional memory
Project 2
Hector: automated fault detection in error-handling code
Project 2: Hector
Problem:
- System applications are complex => impossible to avoid bugs
- A large part of system applications is error-handling code (EHC)
- Errors are tested after most function calls
- A common mistake: forgetting to release a resource in an EHC
- Releasing memory or a lock, unloading a device…
- Can have major consequences!
- Memory leaks (exploits!), deadlocks, crashes...
Project 2: Hector
Existing solutions:
- Specification mining: automated search for “protocols”, i.e., ways to use an API: e.g., function Y (release) often called after function X
- Problem: in practice, the number of protocols found must be pruned to avoid false positives: filtering using high confidence and support
- These “macroscopic approaches” miss lots of acquire/release faults, because many acquire/release operations are used only a handful of times!
Project 2: Hector
Our solution: Hector
- Observation: when resources are released in a given way in an EHC, they are often released in the same way in nearby EHCs
- We first annotate functions globally, guessing whether they are acquire or release functions based on heuristics (whether they return a parameter, first/last access to a variable, …)
- Local (“microscopic”) analysis of the Control Flow Graph:
- Look for a release operation in an EHC; the flow graph before that EHC = exemplar
- Parts of the code with the same acquire/release operations as the exemplar but a missing release operation: fault candidate!
Project 2: Hector
Basic example: from Linux
[Figure: control-flow graph of a Linux function with blocks a–f: Allocation 1 (a), Allocation 2 (b), a GOTO into error-handling code (c), Release 1 (d), Release 2 (e), exit (f). A second exit path has the same acquire/release operations before the EHC but is missing the release operations: candidate fault!]
Project 2: Hector
Our solution: Hector
- That was the general idea: in practice, smart heuristics on the Control Flow Graph handle more elaborate cases…
- Found 484 fault candidates in Linux, Python, Apache, Wine, PHP, and PostgreSQL… 371 were real faults! => few false positives
- Most wouldn't have been detected by macroscopic approaches!
- One of my main roles in this project: I showed that a malicious user can exploit some of these faults to crash servers!
Project 2: Hector
Publications (11 citations):
- In CFSE ‘9 (national conference)
- In DSN ‘13 (international conference, best paper)
Project 3
A decade of idle cores: scheduling bugs in Linux
Project 3: scheduling bugs in Linux
Problem:
- OS schedulers keep evolving with hardware
- Completely new schedulers once in a while (the O(1) scheduler, the CFS scheduler…)
- Many heuristics added over time to overcome issues
- Especially true with recent machines: multicore, NUMA...
- Programs run and end, nobody seems to think there is a problem!
- Linus said:
- “And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy.”
- “I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably _trivial_. Patches already exist, and I don’t feel that people can screw up the few hundred lines too badly.”
Project 3: scheduling bugs in Linux
So wait, what is the problem again?
- Read the scheduler code: it is actually very complex, full of heuristics that sometimes seem contradictory, hard to really understand
- No effort to show that any of it is correct. Simply tested on a few microbenchmarks, plus feedback from users.
- We were curious, so we considered a set of implicit invariants.
- Simple stuff, like “no core should run several threads when there have been idle cores in the system for a long time”, or “two threads with the same load should run for a similar amount of time”
- We wrote “sanity checks” to verify these invariants at runtime.
Project 3: scheduling bugs in Linux
Our shocking news: the Linux scheduler is rife with “scheduling bugs”, i.e., the inability to maintain the most basic implicit invariants!
Project 3: scheduling bugs in Linux
Example 1: when running two processes in two terminals, one with 2 threads and the other with 64 threads, many cores are idle while other cores are overloaded!
- Reason: autogroups; the process with two threads has a higher load. The scheduler just looks at the average loads of nodes, which are balanced, and doesn't try to go further down in the hierarchy!
Project 3: scheduling bugs in Linux
Example 2: Oracle with TPC-H = many cores remain idle for no apparent reason.
- Reason: an initial inter-node load balancing event caused by a transient thread leaves one node with more threads than cores; after that, a thread keeps waking up on a busy core because only node-local cores are considered (faulty cache-optimization heuristic). All threads wait for each other: “holes” everywhere in the execution!
Project 3: scheduling bugs in Linux
- Other bugs: many other scheduling bugs found, including bugs where only one node in the system is used because the topology is not built properly, etc.
- We argue that such bugs will always be added to the scheduler, due to its constant evolution. Formal proofs would be extremely complex, and regression testing would not find the bugs we found.
- We propose possible ways of redesigning the scheduler to reduce bugs, but even with this, sanity checks are needed to ensure more bugs won't be added.
Project 3: scheduling bugs in Linux
Our proposed implementation of sanity checks:
- Differs from watchdogs or assertions: there is no simple test for scheduling issues
- Sanity checks are called periodically; they start low-overhead recording of thread movements if a candidate issue is detected, and report a bug if the issue persists long enough
- If the issue persists long enough, high-overhead profiling of the whole machine for 20 ms generates a bug report that helps understand why the scheduler didn't reach a satisfactory state again during that time
We argue that sanity checks are the only practical way to efficiently ensure implicit invariants, and should be added to various parts of the kernel!
Paper mostly written, will be submitted to EuroSys next month.
And now?
Position profile: heterogeneous data warehouses
- My profile: more “systems” than most people in the team
- But data warehouses use multicore processors, which need scheduling, etc.
- Objective: bring my expertise in systems to projects in the team
Open to all suggestions for collaborations!
Jean-Pierre Lozi