TRANSCRIPT
My research: an overview
Jean-Pierre Lozi — New McF (Maître de Conférences) recruit
Jean-Pierre Lozi — Past activity
Who am I?
- From here! (born in Nice, grew up in Antibes)
- Before my PhD:
- DEUG MIAS, Licence d’Informatique + Licence de Mathématiques in Nice
- Télécom ParisTech (Paris)
- Master’s at Université Pierre et Marie Curie (Paris)
Who am I?
PhD in Computer Science (“allocataire moniteur”), until July 2014
“Towards more scalable mutual exclusion for multicore architectures”
Under the supervision of Gilles Muller and Gaël Thomas
REGAL/WHISPER team: “Well-Honed Infrastructure Software for Programming Environments and Runtimes”
Laboratoire d’Informatique de Paris 6, UPMC, Paris
Postdoctoral Research Fellow, then University Research Associate, until Sept. 2015
Under the supervision of Alexandra Fedorova
SYNAR team: “Systems Networking and Architecture Research”
Simon Fraser University (SFU), Vancouver, Canada
Who am I?
- Three main projects:
- Remote Core Locking: dedicating cores for the execution of critical sections
- Hector: automated fault detection in error-handling code
- A decade of idle cores: scheduling bugs in Linux
- Domain: systems!
- Multicore architectures, synchronization/lock algorithms
- Automated source code analysis, bug detection
- Schedulers (on multicore architectures, again)
- Probably not your domain…
- I will just give a quick overview of my previous work, no details
- Objective: work with you on some projects! Low-level systems stuff needed for performance…
- If interested, we can discuss things in more detail
Project 1
Remote Core Locking: dedicating cores for faster execution of critical sections
Project 1: Remote Core Locking
Context: multicore architectures
- Decades of increasing CPU clock speeds, now issues with power usage / heat
- Increasing numbers of cores to keep increasing processing power
- Possible because the number of transistors keeps increasing
[Figure: trends over time in number of transistors, clock speed, power consumption, and the power/speed ratio]
Project 1: Remote Core Locking
Problem:
- Many legacy applications don’t scale well on modern multicore architectures
- For instance, Memcached on an x86 48-core machine (Get/Set requests):
[Figure: Memcached throughput (Get/Set requests) vs. number of cores; higher is better]
Project 1: Remote Core Locking
Poor scalability on multicore architectures: why?
- Bottleneck = critical sections, protected by locks
- High contention => lock acquisition is costly (more cores => higher contention)
[Figure: % of time spent in critical sections vs. number of cores (1, 4, 8, 16, 22, 32, 48) for SPLASH-2/Radiosity, SPLASH-2/Raytrace, Phoenix 2/LG, Phoenix 2/SM, Phoenix 2/MM, Memcached/Get, Memcached/Set, Berkeley DB/OS, and Berkeley DB/SL]
Two possible solutions:
- Redesign applications (fine-grained locking): costly (millions of lines of legacy code)
- Design better locks!
Project 1: Remote Core Locking
Designing better locks
- No need to redesign the application, better resistance to contention
- Custom microbenchmark to compare locks:
[Figure: execution time from lower to higher contention for the CAS spinlock, MCS [Mellor-Crummey ASPLOS’91], Flat Combining, and blocking locks; critical sections access 5 cache lines each; lower is better]
Project 1: Remote Core Locking
Question: why are lock algorithms inefficient?
Because critical paths are too long!
- Overhead 1: costly lock handovers
- Overhead 2: poor locality of critical sections
Project 1: Remote Core Locking
Overhead 1: costly lock handovers
[Figure: timeline of threads T1, T2, T3 executing critical sections CS1, CS2, CS3; the lock handovers between critical sections lengthen the critical path]
Lock handovers:
- Spinlocks: busy-waiting
- POSIX locks: context switch
- MCS: sending a message from one thread to the next
- Flat combining: sometimes acquires a spinlock
Project 1: Remote Core Locking
Overhead 2: poor locality of critical sections
[Figure: timeline of threads T1, T2, T3 executing CS1, CS2, CS3; each critical section accesses shared variables 1 and 2, and the resulting cache misses lengthen the critical path]
Project 1: Remote Core Locking
Idea: RCL = shorten the critical path as much as possible by dedicating a server core!
[Figure: timeline where T1, T2, T3 ship CS1, CS2, CS3 to a dedicated server core; shared variables 1 and 2 stay in the server's cache: no cache misses, much shorter critical path]
Project 1: Remote Core Locking
Performance?
Project 1: Remote Core Locking
Microbenchmark:
[Figure: execution time from higher to lower contention for the CAS spinlock, MCS, blocking locks, Flat Combining, and RCL (the latter two being combining locks); lower is better]
Project 1: Remote Core Locking
That was the general idea… but RCL is much more than that!
Focus: legacy system (C) applications. RCL offers:
- A runtime designed to work with legacy applications: it works efficiently with multiple locks per server and/or multiple servers, and it handles critical sections that busy-wait/block, as well as condition variables, trylocks, and nested/recursive critical sections… lots of algorithmic/engineering problems to solve here!
- A reengineering tool that transforms applications to use RCL: RCL can't be used with only lock/unlock functions; critical sections must be encapsulated and shipped to server cores... We don't want to do this manually!
- A profiler to detect applications that can benefit from RCL: even with the reengineering tool, using RCL takes time… We want to make sure we can benefit from it! Use case: applications with highly contended locks or critical sections with poor locality...
Project 1: Remote Core Locking
Reengineering tool: a simple case

Before (POSIX locks):

void func(void) {
    int a, b, x;
    …
    a = …;
    …
    pthread_mutex_lock(&lock);
    a = f(a);
    f(b);
    pthread_mutex_unlock(&lock);
    …
}

After (transformed for RCL):

struct context { int a, b; };

void func(void) {
    struct context c;
    int x;
    …
    c.a = …;
    …
    execute_rcl(__cs, &c);
    …
}

void __cs(struct context *c) {
    c->a = f(c->a);
    f(c->b);
}
Project 1: Remote Core Locking
Performance in legacy applications:
[Figure: performance per application; % of time in CS: 44.7% (many DCMs), 63.9%, 65.7%, 79.0%, 81.6%, 87.7%, 90.2%, 92.2%; higher is better]
Project 1: Remote Core Locking
Improved scalability:
Project 1: Remote Core Locking
Very efficient when more threads than cores: server always makes progress
[Figure: performance vs. number of threads; the other locks show a quick collapse once threads outnumber cores; higher is better]
Project 1: Remote Core Locking
Publications (44 citations):
- In CFSE ‘8 (national conference, best paper)
- In USENIX ATC ’12 (international conference)
- Long version of the paper submitted to TOCS (international journal, awaiting reviews)
Several research works based on RCL already:
- [Petrovic et al., PPoPP ’14]: RCL for partially cache-coherent architectures
- [Pusukuri et al., PPoPP ’14]: migrating threads to improve locality of CS
- [Hassan et al., IPDPS ’14]: dedicated server cores for transactional memory
Project 2
Hector: automated fault detection in error-handling code
Project 2: Hector
Problem:
- System applications are complex => impossible to avoid bugs
- A large part of system applications is error-handling code (EHC)
- Errors are tested after most function calls
- A common mistake: forgetting to release a resource in an EHC
- Releasing memory or a lock, unloading a device…
- Can have major consequences!
- Memory leaks (exploits!), deadlocks, crashes...
Project 2: Hector
Existing solutions:
- Specification mining: automated search for “protocols”, i.e., ways to use an API: e.g., function Y (release) often called after function X
- Problem: in practice, the number of protocols found must be pruned to avoid false positives: filtering using high confidence and support
- These “macroscopic approaches” miss lots of acquire/release faults, because many acquire/release operations are used only a handful of times!
Project 2: Hector
Our solution: Hector
- Observation: when resources are released in a given way in an EHC, they are often released in the same way in nearby EHCs
- We first annotate functions globally, guessing whether they are acquire or release functions based on heuristics (whether they return a parameter, first/last access to a variable, …)
- Local (“microscopic”) analysis of the Control Flow Graph:
- Look for a release operation in an EHC; the flow graph before that EHC = exemplar
- Parts of the code with the same acquire/release operations as the exemplar but a missing release operation: fault candidate!
Project 2: Hector
Basic example: from Linux
[Figure: control-flow graph of a Linux function with blocks a–f: Allocation 1 (a), Allocation 2 (b), a GOTO into error-handling code (c), Release 1 (d), Release 2 (e), exit (f). A second exit path has the same acquire/release operations before the EHC but is missing the release operations: candidate fault!]
Project 2: Hector
Our solution: Hector
- That was the general idea: in practice, smart heuristics on the Control Flow Graph handle more elaborate cases…
- Found 484 fault candidates in Linux, Python, Apache, Wine, PHP, and PostgreSQL… 371 were real faults! => few false positives
- Most wouldn't have been detected by macroscopic approaches!
- One of my main roles in this project: I showed that a malicious user can exploit some of these faults to crash servers!
Project 2: Hector
Publications (11 citations):
- In CFSE ‘9 (national conference)
- In DSN ‘13 (international conference, best paper)
Project 3
A decade of idle cores: scheduling bugs in Linux
Project 3: scheduling bugs in Linux
Problem:
- OS schedulers keep evolving with hardware
- Completely new schedulers once in a while (the O(1) scheduler, the CFS scheduler…)
- Many heuristics added over time to overcome issues
- Especially true with recent machines: multicore, NUMA...
- Programs run and end, nobody seems to think there is a problem!
- Linus said:
- “And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy.”
- “I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably _trivial_. Patches already exist, and I don’t feel that people can screw up the few hundred lines too badly.”
Project 3: scheduling bugs in Linux
So wait, what is the problem again?
- Read the scheduler code: it is actually very complex, full of heuristics that sometimes seem contradictory, hard to really understand
- No effort to show that any of it is correct. Simply tested on a few microbenchmarks, plus feedback from users.
- We were curious, so we considered a set of implicit invariants.
- Simple stuff, like “no core should run several threads when there have been idle cores in the system for a long time”, or “two threads with the same load should run for a similar amount of time”
- We wrote “sanity checks” to verify these invariants at runtime.
Project 3: scheduling bugs in Linux
Our shocking news: the Linux scheduler is rife with “scheduling bugs”, i.e., the inability to maintain the most basic implicit invariants!
Project 3: scheduling bugs in Linux
Example 1: when running two processes in two terminals, one with 2 threads and the other with 64 threads, many cores are idle while other cores are overloaded!
- Reason: autogroups; the process with two threads has a higher load. The scheduler just looks at the average loads of nodes, which are balanced, and doesn't try to go further down in the hierarchy!
Project 3: scheduling bugs in Linux
Example 2: Oracle with TPC-H = many cores remain idle for no apparent reason.
- Reason: an initial inter-node load balancing event caused by a transient thread leaves one node with more threads than cores; after that, a thread keeps waking up on a busy core because only node-local cores are considered (faulty cache-optimization heuristic). All threads wait for each other: “holes” everywhere in the execution!
Project 3: scheduling bugs in Linux
- Other bugs: many other scheduling bugs found, including bugs where only one node in the system is used because the topology is not built properly, etc.
- We argue that such bugs will always be added to the scheduler, due to its constant evolution. Formal proofs would be extremely complex, and regression testing would not find the bugs we found.
- We propose possible ways of redesigning the scheduler to reduce bugs, but even with this, sanity checks are needed to ensure more bugs won't be added.
Project 3: scheduling bugs in Linux
Our proposed implementation of sanity checks:
- Differs from watchdogs or assertions: there is no simple test for scheduling issues
- Sanity checks are called periodically; they start low-overhead recording of thread movements if a candidate issue is detected, and report a bug if the issue persists long enough
- If the issue persists long enough, high-overhead profiling of the whole machine for 20 ms generates a bug report that helps understand why the scheduler didn't reach a satisfactory state again during that time
We argue that sanity checks are the only practical way to efficiently ensure implicit invariants, and should be added to various parts of the kernel!
Paper mostly written, will be submitted to EuroSys next month.
And now?
Position profile: heterogeneous data warehouses
- My profile: more “systems” than most people in the team
- But data warehouses use multicore processors, which need scheduling, etc.
- Objective: bring my expertise in systems to projects in the team
Open to all suggestions for collaborations!
Jean-Pierre Lozi