
My research: an overview

Jean-Pierre Lozi — New McF recruit


Who am I?


- From here! (born in Nice, grew up in Antibes)

- Before my PhD:

- DEUG MIAS, Licence d’Informatique + Licence de Mathématiques in Nice

- Télécom ParisTech (Paris)

- Master's degree at Université Pierre et Marie Curie (Paris)

Who am I?

PhD in Computer Science ("allocataire moniteur"), until July 2014

"Towards more scalable mutual exclusion for multicore architectures"

Under the supervision of Gilles Muller and Gaël Thomas

REGAL/WHISPER team: "Well-Honed Infrastructure Software for Programming Environments and Runtimes"

Laboratoire d'Informatique de Paris 6, UPMC, Paris

Postdoctoral Research Fellow, then University Research Associate, until Sept. 2015

Under the supervision of Alexandra Fedorova

SYNAR team: "Systems Networking and Architecture Research"

Simon Fraser University (SFU), Vancouver, Canada


Who am I?


- Three main projects:
  - Remote Core Locking: dedicating cores for the execution of critical sections

- Hector: automated fault detection in error-handling codes

- A decade of idle cores: scheduling bugs in Linux

- Domain: systems!
  - Multicore architectures, synchronization / lock algorithms

- Automated source code analysis, bug detection

- Schedulers (on multicore architectures, again)

- Probably not your domain…
  - I will just give a quick overview of my previous work, no details

- Objective: work with you on some projects! Low-level systems stuff needed for performance…

- If interested, we can discuss things in more detail

Project 1

Remote Core Locking: dedicating cores for faster execution of critical sections


Project 1: Remote Core Locking

Context: multicore architectures

- Decades of increasing CPU clock speeds, now issues with power usage / heat

- Increasing numbers of cores to keep increasing processing power
  - Possible because the number of transistors keeps increasing


[Figure: trends over time in transistor count, clock speed, power consumption, and the power/speed ratio]

Project 1: Remote Core Locking

Problem:

- Many legacy applications don’t scale well on modern multicore architectures

- For instance, Memcached on an x86 48-core machine (Get/Set requests):

[Figure: Memcached Get/Set throughput vs. number of cores; higher is better]


Project 1: Remote Core Locking

Poor scalability on multicore architectures: why?

- Bottleneck = critical sections, protected by locks

- High contention => lock acquisition is costly (more cores => higher contention)


[Figure: % of time spent in critical sections (0% to 100%) vs. number of cores (1, 4, 8, 16, 22, 32, 48) for SPLASH-2/Radiosity, SPLASH-2/Raytrace, Phoenix 2/LG, Phoenix 2/SM, Phoenix 2/MM, Memcached/Get, Memcached/Set, Berkeley DB/OS, and Berkeley DB/SL]

Project 1: Remote Core Locking

Poor scalability on multicore architectures: why?

- Bottleneck = critical sections, protected by locks

- High contention => lock acquisition is costly (more cores => higher contention)

Two possible solutions:

- Redesign applications (fine-grained locking)

- Costly (millions of lines of legacy code)

- Design better locks!


Project 1: Remote Core Locking

Designing better locks

- No need to redesign the application, better resistance to contention

- Custom microbenchmark to compare locks:

[Figure: microbenchmark, execution time per critical section from higher to lower contention, for the CAS spinlock, MCS [Mellor-Crummey ASPLOS '91], Flat Combining, and blocking locks; critical sections access 5 cache lines each; lower is better]
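For context, MCS [Mellor-Crummey ASPLOS '91] is the queue lock in the figure above. Here is a minimal C11 sketch (an illustration only, not the implementation that was benchmarked): each waiter spins on its own queue node rather than on a shared word, and unlocking hands the lock to the successor by writing its flag.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Minimal MCS queue lock sketch (illustrative). Each waiter spins on
     * ITS OWN node, so there is no global cache-line ping-pong; the
     * handover is a single write to the successor's flag. */
    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
    };

    typedef _Atomic(struct mcs_node *) mcs_lock_t;  /* tail of the queue */

    void mcs_lock(mcs_lock_t *lock, struct mcs_node *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* enqueue ourselves at the tail of the waiter queue */
        struct mcs_node *pred = atomic_exchange(lock, me);
        if (pred != NULL) {
            atomic_store(&pred->next, me);   /* link behind predecessor */
            while (atomic_load(&me->locked))
                ;                            /* spin on our own flag only */
        }
    }

    void mcs_unlock(mcs_lock_t *lock, struct mcs_node *me) {
        struct mcs_node *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no known successor: try to swing the tail back to empty */
            struct mcs_node *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;
            while ((succ = atomic_load(&me->next)) == NULL)
                ;  /* a successor is enqueueing; wait for the link */
        }
        atomic_store(&succ->locked, false);  /* hand the lock over */
    }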

Project 1: Remote Core Locking

Question: why are lock algorithms inefficient?


Because critical paths are too long!

- Overhead 1: costly lock handovers

- Overhead 2: poor locality of critical sections


Project 1: Remote Core Locking

Overhead 1: costly lock handovers

[Figure: timeline of threads T1, T2, T3 executing critical sections CS1, CS2, CS3 in turn; the lock handovers between them sit on the critical path]


Lock handovers:

- Spinlocks: busy-waiting (see the sketch below)
- POSIX locks: context switch
- MCS: sending a message from one thread to the next
- Flat Combining: sometimes acquires a spinlock
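To make the busy-waiting case concrete, here is a minimal compare-and-swap (CAS) spinlock in C11 (an illustration only, not the exact lock from the evaluation): every waiter spins on the same lock word, so each release invalidates all waiters' cached copies before the next owner can win.

    #include <stdatomic.h>

    /* Minimal CAS spinlock (illustrative). All waiters spin on the same
     * cache line, so every unlock triggers an invalidation storm: this
     * is the costly handover. */
    typedef struct { atomic_int locked; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        int expected = 0;
        /* try to flip 0 -> 1; on failure, re-read until the lock looks free */
        while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
            expected = 0;
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;  /* busy-wait: the handover burns cycles on every waiter */
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store(&l->locked, 0);  /* all waiters' cached copies invalidated */
    }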

Project 1: Remote Core Locking

Overhead 2: poor locality of critical sections

[Figure: timeline of threads T1, T2, T3; critical sections CS1, CS2, CS3 access shared variables 1 and 2, causing cache misses along the critical path]


Project 1: Remote Core Locking

Idea: RCL = shorten the critical path as much as possible by dedicating a server core!

[Figure: timeline with a dedicated server core executing CS1, CS2, CS3 back to back for threads T1, T2, T3; shared variables 1 and 2 stay in the server's cache (no cache misses!) and the critical path shrinks]
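To make the mechanism concrete, here is a minimal single-lock, single-client sketch in C11 (my simplification for illustration only; the real RCL runtime, described below, handles multiple locks, multiple clients, and critical sections that block): the client publishes the critical section and its context in a request slot and spins locally, while the server core executes requests back to back.

    #include <stdatomic.h>
    #include <stddef.h>

    /* One request slot, padded to a cache line to avoid false sharing.
     * The client fills in the critical section and its context, then
     * spins locally until the server resets the function pointer. */
    struct rcl_request {
        void (*_Atomic cs)(void *);   /* critical section to run; NULL = done */
        void *ctx;                    /* shared variables, packed by the client */
    } __attribute__((aligned(64)));

    static struct rcl_request slot;

    /* Client side: replaces lock(); CS; unlock(). */
    void execute_rcl(void (*cs)(void *), void *ctx) {
        slot.ctx = ctx;                        /* written before cs is published */
        atomic_store(&slot.cs, cs);            /* publish the request */
        while (atomic_load(&slot.cs) != NULL)
            ;                                  /* local spin until completion */
    }

    /* Server side: the dedicated core loops over requests, so the
     * shared data stays hot in its cache. */
    void rcl_server_loop(void) {
        for (;;) {
            void (*cs)(void *) = atomic_load(&slot.cs);
            if (cs != NULL) {
                cs(slot.ctx);                  /* run the critical section */
                atomic_store(&slot.cs, NULL);  /* signal completion */
            }
        }
    }

Both overheads disappear at once: there is no lock handover (the server simply picks up the next request), and the shared variables never leave the server core's cache.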

Project 1: Remote Core Locking

Performance?

Microbenchmark:

[Figure: execution time per critical section, from higher to lower contention, for the CAS spinlock, MCS, the combining locks (Flat Combining and RCL), and blocking locks; lower is better]

Project 1: Remote Core Locking

That was the general idea… but RCL is much more than that!

Focus: legacy system (C) applications. RCL offers:

- A runtime designed to work with legacy applications: it works efficiently with multiple locks per server and/or multiple servers, and it handles critical sections that busy-wait or block, supporting condition variables, trylocks, and nested/recursive critical sections… lots of algorithmic/engineering problems to solve here!

- A reengineering tool that transforms applications to use RCL: RCL can't be used through lock/unlock functions alone; critical sections need to be encapsulated and shipped to server cores... We don't want to do this manually!

- A profiler to detect applications that can benefit from RCL: even with the reengineering tool, adopting RCL takes time… We want to make sure we can benefit from it! Use case: applications with highly contended locks or critical sections with poor locality...


Project 1: Remote Core Locking

Reengineering tool: a simple case

    /* Before: the critical section is protected by a POSIX lock. */

    pthread_mutex_t lock;

    void func(void) {
        int a, b, x;
        a = …;
        pthread_mutex_lock(&lock);
        a = f(a);
        f(b);
        pthread_mutex_unlock(&lock);
    }

    /* After: the shared variables used by the critical section move into
     * a context structure; the critical section is encapsulated in __cs()
     * and shipped to the server core by execute_rcl(). */

    struct context { int a, b; };

    void func(void) {
        struct context c;
        int x;
        c.a = …;
        execute_rcl(__cs, &c);
    }

    void __cs(struct context *c) {
        c->a = f(c->a);
        f(c->b);
    }


Project 1: Remote Core Locking

Performance in legacy applications:

[Figure: performance in legacy applications, ordered by % of time spent in critical sections: 44.7% (many data cache misses), 63.9%, 65.7%, 79.0%, 81.6%, 87.7%, 90.2%, 92.2%; higher is better]

Project 1: Remote Core Locking

Improved scalability:


Project 1: Remote Core Locking

Very efficient when more threads than cores: server always makes progress

[Figure: performance as the number of threads exceeds the number of cores; other locks collapse quickly, while RCL keeps making progress; higher is better]

Project 1: Remote Core Locking

Publications (44 citations):

- In CFSE '8 (national conference, best paper)

- In USENIX ATC '12 (international conference)

- A long version of the paper has been submitted to TOCS (international journal, awaiting reviews)

Several research works are already based on RCL:

- [Petrovic et al., PPoPP '14]: RCL for partially cache-coherent architectures

- [Pusukuri et al., PPoPP '14]: migrating threads to improve the locality of critical sections

- [Hassan et al., IPDPS '14]: dedicated server cores for transactional memory

Project 2

Hector: automated fault detection in error-handling code

Project 2: Hector

Problem:

- System applications are complex => impossible to avoid bugs

- A large part of system applications is error-handling code
  - Errors are tested after most function calls

- A common mistake: forgetting to release a resource in an error-handling code (EHC)
  - Releasing memory or a lock, unloading a device…

- Can have major consequences!
  - Memory leaks (exploits!), deadlocks, crashes...

Project 2: Hector

Existing solutions:

- Specification mining: automated search for "protocols", i.e., ways to use an API: e.g., function Y (release) often follows function X

- Problem: in practice, the set of protocols found must be pruned to avoid false positives, by filtering on high confidence and support (e.g., if Y follows X on 95 of the 100 paths where X appears, the protocol (X, Y) has support 100 and confidence 95%)

- These "macroscopic approaches" miss lots of acquire/release faults, because many acquire/release operations are used only a handful of times!

Project 2: Hector

Our solution: Hector

- Observation: when resources are released in a given way in an EHC, they are often released in the same way in nearby EHCs

- We first annotate functions globally, guessing whether they are acquire or release functions based on heuristics (whether they return a parameter, first/last access to a variable, …)

- Local ("microscopic") analysis of the Control Flow Graph:
  - Look for a release operation in an EHC; the flow graph before that EHC is the "exemplar"
  - Parts of the code with the same acquire/release operations as the exemplar but a missing release operation: fault candidate!

Basic example: from Linux

[Figure: control-flow graph with nodes a to f; Allocation 1 and Allocation 2 on the entry path (a, b), a goto at c leading to the error-handling path where Release 1 and Release 2 (d, e) precede the exit (f). A second fragment with the same acquire/release operations before its EHC but missing the release operations: candidate fault!]
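In C, the pattern looks like this (a hypothetical illustration with invented function bodies, not code taken from Linux or actual Hector output): two nearby functions acquire the same resources, one releases them correctly in its error-handling code, and the other forgets a release.

    #include <stdlib.h>

    /* Correct case: both allocations are released on the error path. */
    int ok_case(void) {
        char *buf1 = malloc(64);        /* Allocation 1 */
        if (!buf1)
            return -1;
        char *buf2 = malloc(64);        /* Allocation 2 */
        if (!buf2)
            goto err;                   /* error-handling code */
        /* ... use buf1 and buf2 ... */
        free(buf2);
        free(buf1);
        return 0;
    err:
        free(buf1);                     /* Release: present */
        return -1;
    }

    /* Buggy case: same allocations before the EHC as in ok_case(),
     * but the release is absent -> exactly the kind of nearby
     * inconsistency flagged as a fault candidate. */
    int buggy_case(void) {
        char *buf1 = malloc(64);        /* Allocation 1 */
        if (!buf1)
            return -1;
        char *buf2 = malloc(64);        /* Allocation 2 */
        if (!buf2)
            goto err;
        /* ... use buf1 and buf2 ... */
        free(buf2);
        free(buf1);
        return 0;
    err:
        /* Missing free(buf1): memory leak on the error path. */
        return -1;
    }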

Project 2: Hector

Our solution: Hector

- That was the general idea: in practice, smart heuristics on the Control Flow Graph handle more elaborate cases…

- Found 484 fault candidates in Linux, Python, Apache, Wine, PHP, and PostgreSQL… 371 were real faults! => few false positives

- Most wouldn't have been detected by macroscopic approaches!

- One of my main roles in this project: I showed that a malicious user can exploit some of these faults to crash servers!

Project 2: Hector

Publications (11 citations):

- In CFSE '9 (national conference)

- In DSN '13 (international conference, best paper)

Project 3

A decade of idle cores: scheduling bugs in Linux

Project 3: scheduling bugs in Linux

Problem:

- OS schedulers keep evolving with hardware
  - Completely new schedulers once in a while (the O(1) scheduler, the CFS scheduler…)

- Many heuristics added over time to overcome issues

- Especially true with recent machines: multicore, NUMA...

- Programs run and end, nobody seems to think there is a problem!

- Linus said:
  - "And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy."
  - "I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably _trivial_. Patches already exist, and I don't feel that people can screw up the few hundred lines too badly."


Project 3: scheduling bugs in Linux

So wait, what is the problem again?

- Read the scheduler code: it is actually very complex, full of heuristics that sometimes seem contradictory, and hard to really understand

- No effort to show that any of it is correct: it is simply tested on a few microbenchmarks, plus feedback from users

- We were curious, so we considered a set of implicit invariants

- Simple stuff, like "no core should run several threads when there have been idle cores in the system for a long time", or "two threads with the same load should run for a similar amount of time"

- We wrote "sanity checks" to verify these invariants at runtime

Project 3: scheduling bugs in Linux

Our shocking news: the Linux scheduler is rife with "scheduling bugs", i.e., failures to maintain the most basic implicit invariants!


Project 3: scheduling bugs in Linux

Example 1: when running two processes in two terminals, one with 2 threads and the other with 64 threads, many cores sit idle while other cores are overloaded!

- Reason: autogroups; the threads of the 2-thread process have a higher load (each autogroup carries a similar total weight, so roughly 1/2 of it per thread versus 1/64). The scheduler just looks at the average load of each node, which is balanced, and doesn't try to go further down in the hierarchy!


Project 3: scheduling bugs in Linux

Example 2: Oracle running TPC-H = many cores remain idle for no apparent reason.

- Reason: an initial inter-node load-balancing event caused by a transient thread leaves one node with more threads than cores; after that, threads keep waking up on busy cores because only node-local cores are considered at wake-up time (a faulty cache-optimization heuristic). All threads wait for each other: "holes" everywhere in the execution!


Project 3: scheduling bugs in Linux

- Other bugs: many other scheduling bugs were found, including bugs where only one node in the system is used because the topology is not built properly, etc.

- We argue that such bugs will always keep being added to the scheduler, due to its constant evolution. Formal proofs would be extremely complex, and regression testing would not find the bugs we found.

- We propose possible ways of redesigning the scheduler to reduce bugs, but even with this, sanity checks are needed to ensure more bugs won't be added.

Project 3: scheduling bugs in Linux

Our proposed implementation of sanity checks:

- Differs from watchdogs or assertions: there is no simple test for scheduling issues

- Sanity checks are called periodically; if a candidate issue is detected, they start a low-overhead recording of thread movements, and report a bug if the issue persists long enough

- If the issue does persist long enough, the whole machine is profiled with high overhead for 20 ms to generate a bug report that helps understand why the scheduler didn't reach a satisfactory state again during that time

We argue that sanity checks are the only practical way to efficiently ensure implicit invariants; they should be added to various parts of the kernel! A minimal sketch of such a check is shown below.

Paper mostly written, will be submitted to EuroSys next month.

Jean-Pierre Lozi — Past activity
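As an illustration of what such a check could look like, here is a minimal user-level sketch in C (assumptions: a fixed core count, a per-core runqueue-length snapshot supplied by some monitoring hook, and a simple persistence counter; the in-kernel implementation described in the paper records thread movements and triggers profiling rather than printing):

    #include <stdbool.h>
    #include <stdio.h>

    /* Checks the invariant "no core should run several threads while
     * another core has been idle for a long time", and only reports
     * once the violation has persisted: transient states are not bugs. */
    #define NCORES            48
    #define PERSIST_THRESHOLD 5   /* consecutive periodic checks */

    static int violation_count;   /* how long the violation has persisted */

    /* One periodic sanity check over a snapshot of runqueue lengths. */
    void sanity_check(const int nr_running[NCORES]) {
        bool has_idle = false, has_overloaded = false;
        for (int c = 0; c < NCORES; c++) {
            if (nr_running[c] == 0) has_idle = true;
            if (nr_running[c] > 1)  has_overloaded = true;
        }
        if (has_idle && has_overloaded) {
            if (++violation_count == PERSIST_THRESHOLD)
                printf("invariant violated for %d checks: start bug report\n",
                       violation_count);
            /* here the real system would record thread movements, then
             * trigger heavyweight profiling if the issue persists */
        } else {
            violation_count = 0;  /* reset on any healthy snapshot */
        }
    }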

And now?


Position profile: heterogeneous data warehouses

- My profile: more "systems" than most people in the team

- But data warehouses run on multicore processors, which need scheduling, etc.

- Objective: bringing my expertise in systems to projects in the team

Open to all suggestions for collaborations!

Jean-Pierre Lozi
