My research: an overview

Jean-Pierre Lozi, New McF (Maître de Conférences) recruit

i3s.unice.fr/sites/default/files/seminaires/seminar.pdf

Jean-Pierre Lozi — Past activity

Who am I?

- From here! (born in Nice, grew up in Antibes)

- Before my PhD:

  - DEUG MIAS, then Licence (bachelor’s degree) in Computer Science and Licence in Mathematics, in Nice

  - Télécom ParisTech (Paris)

  - Master’s degree at Université Pierre et Marie Curie (Paris)

Who am I?

PhD in Computer Science (“allocataire moniteur”, i.e., a funded PhD position with teaching duties), until July 2014

- “Towards more scalable mutual exclusion for multicore architectures”

- Under the supervision of Gilles Muller and Gaël Thomas

- REGAL/WHISPER team: “Well-Honed Infrastructure Software for Programming Environments and Runtimes”

- Laboratoire d’Informatique de Paris 6, UPMC, Paris

Postdoctoral Research Fellow, then University Research Associate, until Sept. 2015

- Under the supervision of Alexandra Fedorova

- SYNAR team: “Systems Networking and Architecture Research”

- Simon Fraser University (SFU), Vancouver, Canada

Who am I?

- Three main projects:

  - Remote Core Locking: dedicating cores for the execution of critical sections

  - Hector: automated fault detection in error-handling codes

  - A decade of idle cores: scheduling bugs in Linux

- Domain: systems!

  - Multicore architectures, synchronization / lock algorithms

  - Automated source code analysis, bug detection

  - Schedulers (on multicore architectures, again)

- Probably not your domain…

  - I will just give a quick overview of my previous work, no details

  - Objective: work with you on some projects! Low-level systems stuff needed for performance…

  - If interested, we can discuss things in more detail

Project 1

Remote Core Locking: dedicating cores for faster execution of critical sections

Project 1: Remote Core Locking

Context: multicore architectures

- Decades of increasing CPU clock speeds, now issues with power usage / heat

- Increasing numbers of cores to keep increasing processing power

  - Possible because the number of transistors keeps increasing

[Figure: # transistors, clock speed, power consumption, and ratio power/speed]

Project 1: Remote Core Locking

Problem:

- Many legacy applications don’t scale well on modern multicore architectures

- For instance, Memcached on an x86 48-core machine (Get/Set requests):

[Figure: Memcached Get/Set results on the 48-core machine; higher is better]

Project 1: Remote Core Locking

Poor scalability on multicore architectures: why?

- Bottleneck = critical sections, protected by locks

- High contention => lock acquisition is costly (more cores => higher contention)

[Figure: % of time spent in critical sections* (0% to 100%) vs. number of cores (1, 4, 8, 16, 22, 32, 48) for SPLASH-2/Radiosity, SPLASH-2/Raytrace, Phoenix 2/LG, Phoenix 2/SM, Phoenix 2/MM, Memcached/Get, Memcached/Set, Berkeley DB/OS, Berkeley DB/SL]

Two possible solutions:

- Redesign applications (fine-grained locking)

  - Costly (millions of lines of legacy code)

- Design better locks!

Project 1: Remote Core Locking

Designing better locks

- No need to redesign the application, better resistance to contention

- Custom microbenchmark to compare locks:

[Figure: microbenchmark comparing CAS spinlock, MCS, Flat Combining, and blocking locks, from higher contention to lower contention; critical sections access 5 cache lines each; lower is better; annotation: “Improvement”]

MCS: [Mellor-Crummey ASPLOS’91]

Project 1: Remote Core Locking

Question: why are lock algorithms inefficient?

Because critical paths are too long!

- Overhead 1: costly lock handovers

- Overhead 2: poor locality of critical sections

Project 1: Remote Core Locking

Overhead 1: costly lock handovers

[Figure: timeline of threads T1, T2, T3 executing critical sections CS1, CS2, CS3; the lock handovers between them and the resulting critical path are marked]

Lock handovers:

- Spinlocks: busy-waiting

- POSIX locks: context switch

- MCS: sending a message from one thread to the next

- Flat combining: sometimes acquires a spinlock
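The handover costs listed above can be made concrete with two of the locks from the microbenchmark. What follows is a minimal C11 sketch, not the code measured in the talk: a CAS spinlock, where every waiter busy-waits on one shared word, and an MCS-style queue lock, where each waiter spins on its own node and the releasing thread "sends a message" by writing into its successor's node. All type and function names here are illustrative assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* --- CAS spinlock: all waiters busy-wait on the same shared word. --- */
typedef struct { atomic_int held; } spin_lock_t;   /* 0 = free, 1 = held */

static bool spin_trylock(spin_lock_t *l) {
    int expected = 0;
    /* Succeeds only if the lock was free (0) and we swapped in 1. */
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

static void spin_lock(spin_lock_t *l) {
    while (!spin_trylock(l)) { /* busy-wait */ }
}

static void spin_unlock(spin_lock_t *l) {
    atomic_store(&l->held, 0);
}

/* --- MCS-style queue lock: each waiter spins on its own node. --- */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the waiter queue */
    atomic_bool locked;                /* true while this waiter must wait */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

static void mcs_lock(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Enqueue ourselves at the tail of the waiter queue. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev == NULL) return;          /* queue was empty: lock acquired */
    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked)) { /* spin on our own node only */ }
}

static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No known successor: try to mark the queue as empty. */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) return;
        /* A successor is enqueueing; wait until it has linked itself. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, false);   /* the handover "message" */
}
```

In both cases the next critical section cannot start before the handover completes, so the handover sits on the critical path; the two locks differ only in where the waiting threads spin.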

Project 1: Remote Core Locking

Overhead 2: poor locality of critical sections

[Figure: timeline of threads T1, T2, T3 executing critical sections CS1, CS2, CS3 that access Shared variable 1 and Shared variable 2; the cache misses and the resulting critical path are marked]

Project 1: Remote Core Locking

Idea: RCL = shorten the critical path as much as possible by dedicating a server core!

[Figure: timeline of threads T1, T2, T3 and a server core that executes CS1, CS2, CS3; Shared variable 1 and Shared variable 2 are labeled “No cache misses!”; the critical path is marked]
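One way to picture the server core is as a loop that scans per-client request slots and runs whatever critical section it finds there. The sketch below is an assumption-laden illustration of that idea, not the RCL runtime: the slot layout, `MAX_CLIENTS`, and every function name are hypothetical, and the server loop is reduced to a single polling pass so that the example runs without threads. In a real setting the pass would repeat forever on the dedicated core while clients wait on their own slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CLIENTS 4          /* illustrative bound, not from the talk */

typedef void (*cs_fn)(void *);

/* One request slot per client thread; a non-NULL fn means "pending". */
struct request {
    _Atomic(cs_fn) fn;         /* critical section to run, or NULL */
    void *context;             /* variables the critical section needs */
};

static struct request requests[MAX_CLIENTS];

/* Client side: publish the critical section instead of taking a lock. */
static void post_request(int client, cs_fn fn, void *context) {
    assert(client >= 0 && client < MAX_CLIENTS);
    requests[client].context = context;
    atomic_store(&requests[client].fn, fn);   /* makes the request visible */
}

/* Client side: has the server finished our request? */
static int request_done(int client) {
    return atomic_load(&requests[client].fn) == NULL;
}

/* Server side: one pass over all slots; returns how many sections ran. */
static int server_poll_once(void) {
    int executed = 0;
    for (int i = 0; i < MAX_CLIENTS; i++) {
        cs_fn fn = atomic_load(&requests[i].fn);
        if (fn != NULL) {
            fn(requests[i].context);
            atomic_store(&requests[i].fn, (cs_fn)0);   /* signals completion */
            executed++;
        }
    }
    return executed;
}

/* Example critical section: adds the client's value to a shared counter. */
static long shared_counter = 0;
static void add_cs(void *context) {
    shared_counter += *(long *)context;
}
```

Because only the server ever touches `shared_counter`, the shared data can stay in that core's cache, and no lock is handed from thread to thread; a client would call `post_request` and then wait until `request_done` returns true.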

Project 1: Remote Core Locking

Performance?

Microbenchmark:

[Figure: CAS spinlock, MCS, RCL, combining locks (including Flat Combining), and blocking locks, from higher contention to lower contention; lower is better]

Project 1: Remote Core Locking

That was the general idea… but RCL is much more than that!

Focus: legacy system (C) applications. RCL offers:

- A runtime designed to work with legacy applications: it works efficiently with multiple locks per server and/or multiple servers; it handles critical sections that busy-wait or block; and it supports condition variables, trylocks, and nested or recursive critical sections. Lots of algorithmic and engineering problems to solve here!

- A reengineering tool that transforms applications to use RCL: RCL can’t be used with only lock/unlock functions; critical sections need to be encapsulated and shipped to server cores. We don’t want to do this manually!

- A profiler to detect applications that can benefit from RCL: even with the reengineering tool, using RCL takes time, so we want to make sure we can benefit from it! Use case: applications with highly contended locks or critical sections with poor locality.

Project 1: Remote Core Locking

Reengineering tool: a simple case

Before (critical section delimited by lock/unlock calls):

    void func(void) {
        int a, b, x;
        a = …;
        pthread_mutex_lock();
        a = f(a);
        f(b);
        pthread_mutex_unlock();
    }

After (critical section encapsulated in a function and shipped to the server core):

    /* Variables used by the critical section, packed into a context. */
    struct context { int a, b; };

    void func(void) {
        struct context c;
        int x;
        c.a = …;
        execute_rcl(__cs, &c);
    }

    /* Body of the former critical section. */
    void __cs(struct context *c) {
        c->a = f(c->a);
        f(c->b);
    }

Project 1: Remote Core Locking

Performance in legacy applications:

[Figure: performance in legacy applications; higher is better. % in CS labels: 44.7% (many DCMs), 63.9%, 65.7%, 79.0%, 81.6%, 87.7%, 90.2%, 92.2%]

Project 1: Remote Core Locking

Improved scalability:

Very efficient when there are more threads than cores: the server always makes progress

[Figure: higher is better; annotation: “Quick collapse”]

Project 1: Remote Core Locking

Publications (44 citations):

- In CFSE ’8 (national conference, best paper)

- In USENIX ATC ’12 (international conference)

- Long version of the paper submitted to TOCS (international journal, waiting for reviews)

Several research works based on RCL already:

- [Petrovic et al., PPoPP ’14]: RCL for partially cache-coherent architectures

- [Pusukuri et al., PPoPP ’14]: migrating threads to improve locality of CS

- [Hassan et al., IPDPS ’14]: dedicated server cores for transactional memory

Project 2

Hector: automated fault detection in error-handling codes

Project 2: Hector

Problem:

- System applications are complex => impossible to avoid bugs

- A large part of system applications consists of error-handling codes (EHCs)

  - Errors are tested after most function calls

- A common mistake: forgetting to release a resource in an EHC

  - Releasing memory or a lock, unloading a device…

- Can have major consequences!

  - Memory leaks (exploits!), deadlocks, crashes...

Project 2: Hector

Existing solutions:

- Specification mining: automated search for “protocols”, i.e., ways to use an API: e.g., function Y (release) is often called after function X

- Problem: in practice, the number of protocols found must be pruned to avoid false positives: filtering using high confidence and support

- “Macroscopic approaches”: they miss lots of acquire/release faults because many acquire/release operations are used only a handful of times!
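To see why pruning by confidence and support hides rarely used functions, consider a minimal sketch. It assumes the usual specification-mining definitions, which the slides do not spell out: support is the number of call sites where X is followed by Y, and confidence is that number divided by the total number of call sites of X. The struct, the function names, and the counts used in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Usage counts for a candidate protocol "X is followed by Y". */
struct protocol_stats {
    int followed;   /* call sites of X that are followed by Y */
    int total;      /* all call sites of X */
};

/* Support (assumed definition): how many times the protocol is followed. */
static int support(struct protocol_stats s) {
    return s.followed;
}

/* Confidence (assumed definition): fraction of X call sites followed by Y. */
static double confidence(struct protocol_stats s) {
    assert(s.total > 0);
    return (double)s.followed / s.total;
}

/* A miner keeps a protocol only if both thresholds are met. */
static bool keep_protocol(struct protocol_stats s,
                          int min_support, double min_confidence) {
    return support(s) >= min_support && confidence(s) >= min_confidence;
}
```

With a minimum support of 10 and a minimum confidence of 0.9, a pair followed at 95 of 100 call sites is kept, while a pair followed at 2 of 3 call sites is discarded, even though its single violation may well be a real missing release.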

Project 2: Hector

Our solution: Hector

- Observation: when resources are released in a given way in an EHC, they are often released in the same way in nearby EHCs

- We first annotate functions globally, guessing whether they are acquire or release functions based on heuristics (return a parameter or not, first/last access to a variable, …)

- Local (“microscopic”) analysis of the Control Flow Graph:

  - Look for a release operation in an EHC; the flow graph before the EHC = exemplar

  - Parts of the code with the same acquire/release operations as in the exemplar but missing the release operation: fault candidate!
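The kind of fault this local analysis targets can be illustrated with a small hand-written example. It is not code from Linux and not Hector's output: `res_acquire`, `res_release`, and the two setup functions are hypothetical, and a counter stands in for real resources so that the leak is observable.

```c
#include <assert.h>
#include <stdbool.h>

static int live_resources = 0;   /* resources acquired and not yet released */

/* Hypothetical acquire function; 'fail' simulates an acquisition error. */
static bool res_acquire(bool fail) {
    if (fail) return false;
    live_resources++;
    return true;
}

/* Hypothetical release function. */
static void res_release(void) {
    live_resources--;
}

/* Exemplar: the EHC of the second acquisition releases the first resource. */
static int setup_ok(bool second_fails) {
    if (!res_acquire(false)) return -1;
    if (!res_acquire(second_fails)) {
        res_release();           /* release operation found in this EHC */
        return -1;
    }
    res_release();
    res_release();
    return 0;
}

/* Fault candidate: same acquire before the EHC, but the release is missing. */
static int setup_leaky(bool second_fails) {
    if (!res_acquire(false)) return -1;
    if (!res_acquire(second_fails)) {
        return -1;               /* the first resource is never released */
    }
    res_release();
    res_release();
    return 0;
}
```

Both functions perform the same acquire operation before their error-handling code; only `setup_ok` releases it there, which is the pattern that would make `setup_leaky` a fault candidate under the approach described above.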

Research experience (8/14)

Basic example: from Linux

[Figure: code excerpt from Linux with lines labeled a to f: a = Allocation 1, b = Allocation 2, c = GOTO, d = Release 1, e = Release 2, f = Exit. A second code path (GOTO c, Exit f’’) is highlighted: same acquire/release operations before the EHC, missing release operations: candidate fault!]

Page 62: Jean-Pierre Lozi New McF recruiti3s.unice.fr/sites/default/files/seminaires/seminar.pdf · Under the supervision of Alexandra Fedorova SYNAR team: “Systems Networking and Architecture

Project 2: Hector

Our solution: Hector

- That was the general idea; in practice, smart heuristics on the control flow graph handle more elaborate cases…

- Found 484 fault candidates in Linux, Python, Apache, Wine, PHP, and PostgreSQL… 371 were real faults (113 false positives, about 23%) => few false positives

- Most wouldn’t have been detected by macroscopic approaches!

- One of my main roles in this project: I showed that a malicious user can exploit some of these faults to crash servers!

Project 2: Hector

Publications (11 citations):

- In CFSE ’9 (national conference)

- In DSN ’13 (international conference, best paper)

Project 3

A decade of idle cores: scheduling bugs in Linux

Project 3: scheduling bugs in Linux

Problem:

- OS schedulers keep evolving with hardware
  - Completely new schedulers once in a while (the O(1) scheduler, CFS…)

- Many heuristics added over time to overcome issues

- Especially true with recent machines: multicore, NUMA...

- Programs run and end, nobody seems to think there is a problem!

- Linus said:
  - “And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy.”

  - “I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably _trivial_. Patches already exist, and I don’t feel that people can screw up the few hundred lines too badly.”

Project 3: scheduling bugs in Linux

So wait, what is the problem again?

- Read the scheduler code: it is actually very complex, full of heuristics that sometimes seem contradictory, hard to really understand

- No effort to show that any of it is correct. It is simply tested on a few microbenchmarks, plus feedback from users.

- We were curious and we considered a set of implicit invariants.

- Simple stuff, like “no core should run several threads when there have been idle cores in the system for a long time”, or “two threads with the same load should run for a similar amount of time”

- We wrote “sanity checks” to check these invariants at runtime.
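
A check for the first invariant could look like the sketch below. It is written in Python for readability and is only an illustration: the `runqueue_lengths` snapshot and the function name are assumptions, and a real check would live in the kernel and read the per-core runqueues. It looks at a single snapshot only; the “for a long time” part requires repeating the check over time.

```python
def idle_while_overloaded(runqueue_lengths):
    """Return (idle_cores, overloaded_cores) if the invariant is violated, else None.

    runqueue_lengths[i] is the number of runnable threads on core i
    (hypothetical snapshot of the scheduler state).
    """
    idle = [core for core, n in enumerate(runqueue_lengths) if n == 0]
    overloaded = [core for core, n in enumerate(runqueue_lengths) if n > 1]
    if idle and overloaded:
        return idle, overloaded
    return None
```

For example, `idle_while_overloaded([0, 3, 1, 0])` flags cores 0 and 3 as idle while core 1 runs three threads, whereas a snapshot with one thread per core passes.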

Project 3: scheduling bugs in Linux

Our shocking news: the Linux scheduler is rife with “scheduling bugs”, i.e., the inability to maintain the most basic implicit invariants!

Project 3: scheduling bugs in Linux

Example 1: when running two processes in two terminals, one with 2 threads and the other with 64 threads, many cores stay idle while other cores are overloaded!

- Reason: autogroups. The threads of the process with two threads have a higher load. The scheduler just looks at the average loads of nodes, which are balanced, and doesn’t try to go further down in the hierarchy!
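
A small numeric sketch of why the averages hide the problem. Everything here is an illustrative assumption rather than a measurement: a machine with two nodes of eight cores each, a total weight of 1024 per autogroup, and an even split of that weight among the threads of a group.

```python
GROUP_WEIGHT = 1024  # assumed total weight of one autogroup (illustrative value)


def thread_load(threads_in_group):
    """Load of one always-runnable thread, split evenly within its autogroup."""
    return GROUP_WEIGHT / threads_in_group


def node_summary(cores):
    """cores: one list of thread loads per core. Returns (average load, idle cores)."""
    core_loads = [sum(threads) for threads in cores]
    average = sum(core_loads) / len(cores)
    idle = sum(1 for load in core_loads if load == 0)
    return average, idle


# Node 0: the two threads of the small process on two cores, six cores idle.
node0 = [[thread_load(2)], [thread_load(2)]] + [[] for _ in range(6)]
# Node 1: the 64 threads of the large process, eight per core.
node1 = [[thread_load(64)] * 8 for _ in range(8)]

avg0, idle0 = node_summary(node0)
avg1, idle1 = node_summary(node1)
```

Under these assumptions both nodes have the same average load, so a balancer that compares only node averages sees nothing to fix, even though node 0 has six idle cores and every core of node 1 runs eight threads.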

Project 3: scheduling bugs in Linux

Example 2: Oracle with TPC-H: many cores remain idle for no apparent reason.

- Reason: an initial inter-node load balancing event is caused by a transient thread; after that, one node has more threads than cores. A thread keeps waking up on a busy core because only node-local cores are considered (faulty cache-optimization heuristic). All threads wait for each other: “holes” everywhere in the execution!
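
The effect of restricting wake-ups to the local node can be sketched as below. This is a toy model, not the kernel's wake-up path: the function, its parameters and the “least loaded core” rule are assumptions made for illustration.

```python
def pick_wakeup_core(runqueue_lengths, node_cores, prev_core, local_only=True):
    """Pick a core for a waking thread (simplified, hypothetical model).

    runqueue_lengths[c] is the number of runnable threads on core c,
    node_cores lists the core ids of each node, and prev_core is the core
    the thread last ran on.
    """
    local = next(cores for cores in node_cores if prev_core in cores)
    if local_only:
        candidates = local
    else:
        candidates = [core for cores in node_cores for core in cores]
    # Least loaded candidate; ties go to the lowest core id.
    return min(candidates, key=lambda core: (runqueue_lengths[core], core))


node_cores = [[0, 1], [2, 3]]
runqueue_lengths = [2, 1, 0, 0]  # node 0 is overloaded, node 1 is idle

local_choice = pick_wakeup_core(runqueue_lengths, node_cores, prev_core=0)
global_choice = pick_wakeup_core(runqueue_lengths, node_cores, prev_core=0,
                                 local_only=False)
```

With the local-only rule the thread lands on core 1, which is already busy; considering all cores would have picked the idle core 2.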

Project 3: scheduling bugs in Linux

- Other bugs: many other scheduling bugs found, including bugs where only one node in the system is used because the topology is not built properly, etc.

- We argue that such bugs will always be added to the scheduler, due to its constant evolution. Using formal proofs would be extremely complex, and regression testing would not find the bugs we found.

- We discuss possible ways of redesigning the scheduler to reduce bugs, but even with this, sanity checks are needed to ensure more bugs won’t be added.

Project 3: scheduling bugs in Linux

Our proposed implementation of sanity checks:

- Differs from watchdogs or assertions: there is no simple test for scheduling issues

- Sanity checks are called periodically; if a candidate issue is detected, they start a low-overhead recording of thread movements

- If the issue persists long enough, a bug is reported: high-overhead profiling of the whole machine runs for 20ms to generate a report that helps understand why the scheduler didn’t reach a satisfactory state again during that time
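
The escalation logic described above can be sketched as a small state machine. This is a hypothetical illustration in Python, not the proposed kernel code: the class name, the `persist_ticks` threshold and the returned strings are assumptions, and the actual recording and profiling steps are left as comments.

```python
class SanityChecker:
    """Periodic sanity check with escalation (hypothetical sketch).

    check() is meant to be called at a fixed period. `violation` tells whether
    the invariant is currently broken; `persist_ticks` is an assumed threshold.
    """

    def __init__(self, persist_ticks=3):
        self.persist_ticks = persist_ticks
        self.ticks_violated = 0
        self.recording = False  # low-overhead recording of thread movements

    def check(self, violation):
        """Return 'ok', 'recording' or 'report' for this period."""
        if not violation:
            # The scheduler recovered: stop recording and reset the counter.
            self.ticks_violated = 0
            self.recording = False
            return "ok"
        self.ticks_violated += 1
        self.recording = True
        if self.ticks_violated >= self.persist_ticks:
            # The issue persisted: this is where the short, high-overhead
            # profiling pass and the bug report would be triggered.
            self.ticks_violated = 0
            self.recording = False
            return "report"
        return "recording"
```

A transient violation only costs a few periods of cheap recording; the expensive profiling runs only when the violation survives `persist_ticks` consecutive checks.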

We argue that sanity checks are the only practical way to efficiently ensure implicit invariants, and that they should be added to various parts of the kernel!

Paper mostly written, will be submitted to EuroSys next month.

And now?

Position profile: heterogeneous data warehouses

- My profile: more “systems” than most people in the team

- But, data warehouses use multicore processors, which need scheduling, etc.

- Objective: bringing my expertise in systems to projects in the team

Open to all suggestions for collaborations!

Jean-Pierre Lozi