advanced operating systems #1 - pflab...general information a bit of linux scheduler history from...

54
Advanced Operating Systems #5 Shinpei Kato Associate Professor Department of Computer Science Graduate School of Information Science and Technology The University of Tokyo <Slides download> http://www.pf.is.s.u-tokyo.ac.jp/class.html

Upload: others

Post on 23-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Advanced Operating Systems

#5

Shinpei Kato

Associate Professor

Department of Computer Science

Graduate School of Information Science and Technology

The University of Tokyo

<Slides download>http://www.pf.is.s.u-tokyo.ac.jp/class.html

Page 2: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Course Plan

• Multi-core Resource Management

• Many-core Resource Management

• GPU Resource Management

• Virtual Machines

• Distributed File Systems

• High-performance Networking

• Memory Management

• Network on a Chip

• Embedded Real-time OS

• Device Drivers

• Linux Kernel

Page 3: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Schedule1. 2018.9.28 Introduction + Linux Kernel (Kato)

2. 2018.10.5 Linux Kernel (Chishiro)

3. 2018.10.12 Linux Kernel (Kato)

4. 2018.10.19 Linux Kernel (Kato)

5. 2018.10.26 Linux Kernel (Kato)

6. 2018.11.2 Advanced Research (Chishiro)

7. 2018.11.9 Advanced Research (Chishiro)

8. 2018.11.16 (No Class)

9. 2018.11.23 (Holiday)

10. 2018.11.30 Advanced Research (Kato)

11. 2018.12.7 Advanced Research (Kato)

12. 2018.12.14 Advanced Research (Chishiro)

13. 2018.12.21 Linux Kernel

14. 2019.1.11 Linux Kernel

15. 2019.1.18 (No Class)

16. 2019.1.25 (No Class)

Page 4: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Process Scheduling

Abstracting Computing Resources

/* The case for Linux */

Acknowledgement:Prof. Pierre Olivier, ECE 4984, Linux Kernel Programming, Virginia Tech

Page 5: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 6: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 7: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling

� Scheduler: OS entity that decide which process should run,

when, and for how long

� Multiplex processes in time on the processor: enables

multitasking

.., Gives the user the illusion that processes are executing at the same

time

� Scheduler is responsible for making the best use of the resource

that is the CPU time

� Basic principle:.., When in the system there are more ready-to-run processes than

the number of cores.., The scheduler decides which process should run

Page 8: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationMultitasking

� Single core: gives the illusion that multiple processes are running

concurrently

� Multi-cores: enable true parallelism

� 2 types of multitasking OS:.., Cooperative multitasking

.., A process does not stop running until it decides to do so (yield the

CPU).., The operating system cannot enforce fair scheduling

- For example in the case of a process that never yields

.., Preemptive multitasking

.., The OS can interrupt the execution of a process: preemption

.., Generally after the process expires its timeslice

.., And/or based on tasks priorities

Page 9: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationA bit of Linux scheduler history

� From v1.0 to v2.4: simple implementation

.., But it did not scale to numerous processes and processors

� V2.5 introduced the O(1) scheduler.., Constant time scheduling decisions

.., Scalability and execution time determinism

.., More info in [5]

.., Issues with latency-sensitive applications (Desktop computers)

� O(1) scheduler was replaced in 2.6.23 by what is still now the standard Linux scheduler:

.., Completely Fair Scheduler (CFS)

.., Evolution of the Rotating Staircase Deadline scheduler [2, 3]

Page 10: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - I/O vs compute-bound tasks

� Scheduling policy are the set of rules determining the choices

made by a given model of scheduler

� I/O-bound processes:.., Spend most of their time waiting for I/O: disk, network, but also

keyboard, mouse, etc..., Filesystem, network intensive, GUI applications, etc..., Response time is important

.., Should run often and for a small time frame

� Compute-bound processes:.., Heavy use of the CPU

.., SSH key generation, scientific computations, etc.

.., Caches stay hot when they run for a long time

.., Should not run often, but for a long time

Page 11: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Priority

� Priority

.., Order process according to their ”importance” from the scheduler

standpoint.., A process with a higher priority will execute before a process with a

lower one

� Linux has 2 priority ranges:.., Nice value: ranges from -20 to +19, default is 0

.., High values of nice means lower priority

.., List process and their nice values with p s ax - o p i d , n i , c m d

.., Real-time priority: range configurable (default 0 to 99).., Higher values mean higher priority.., For processes labeled real-time.., Real-time processes always executes before standard (nice)

processes.., List processes and their real-time priority using

p s ax - o p i d , r t p r i o , c m d

Page 12: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Priority (2)

� User space to kernel priorities mapping:

Page 13: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Timeslice

� Timeslice (quantum):.., How much time a process should execute before being preempted.., Defining the default timeslice in an absolute way is tricky:

.., Too long → bad interactive performance

.., Too short → high context switching overhead

� Linux CFS does not use an absolute timeslice.., The timeslice a process receives is function of the load of the

system.., it is a proportion of the CPU

.., In addition, that timeslice is weighted by the process priority

� When a process P becomes runnable:

.., P will preempt the currently running process C is P consumed a

smaller proportion of the CPU than C

Page 14: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Policy application example

� 2 tasks in the system:.., Text editor: I/O-bound, latency sensitive (interactive).., Video encoder: CPU-bound, background job

� Text editor:.., A. Needs a large amount of CPU time

.., Does not need to run for long, but needs to have CPU time available

whenever it needs to run

.., B. When ready to run, needs to preempt the video encoder

.., A + B = good interactive performance

� On a classical UNIX system, needs to set a correct combination of

priority and timeslice

� Different with Linux: the OS guarantee the text editor a specific

proportion of the CPU time

Page 15: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Policy application example (2)

� Imagine only the two processes are present in the system and run at the same priority

.., Linux gives 50% of CPU time to each

� Considering an absolute timeframe:.., Text editor does not use fully its 50% as it often blocks waiting

for I/O.., Keyboard key pressed

.., CFS keeps track of the actual CPU time used by each program

.., When the text editor wakes up:.., CFS sees that it actually used less CPU time than the video

encoder.., Text editor preempts the video encoder

Page 16: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

General informationScheduling policy - Policy application example (3)

� Good interactive performance

� Good background, CPU-bound performance

Page 17: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 18: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Linux Completely Fair SchedulerScheduling classes

� CPU classes: coexisting CPU algorithms

.., Each task belongs to a class

� CFS: SCHED OTHER, implemented in k e r n e l / s c h e d / f a i r . c

� Real-time classes: SCHED RR, SCHED FIFO, SCHED DEADLINE

.., For predictable schedule

� s c hed c l a s s datastructure:

1

23

4

56

7

89

10

s t r u c t s c h e d _ c l a s s {

v o i d ( * e n q u e u e _ t a s k ) ( / * . . . * / ) ;

v o i d ( * d e q u e u e _ t a s k ) ( / * . . . * / ) ;

v o i d ( * y i e l d _ t a s k ) ( / * . . . * / ) ;v o i d ( * c h e c k _ p r e e m p t _ c u r r ) ( / * . . . * / ) ;

s t r u c t t a s k _ s t r u c t * ( * p i c k _ n e x t _ t a s k ) ( / * . . . * / ) ;

vo id ( * s e t _ c u r _ t a s k ) ( / * . . . * / ) ;v o i d ( * t a s k _ t i c k ) ( / * . . . * / ) ;/ * . . . * /

}

Page 19: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Linux Completely Fair Schedulers c h e d c l a s s hooks

� Functions descriptions:.., enqueue t a s k ( . . . )

.., Called when a task enters a runnable state.., dequeue t a s k ( . . . )

.., Called when a task becomes unrunnable.., y i e l d t a s k ( . . . )

.., Yield the processor (dequeue then enqueue back immediatly).., ch eck p r eemp t c u r r ( . . . )

.., Checks if a task that entered the runnable state should preempt the

currently running task.., p i c k n e x t t a s k ( . . . )

.., Chooses the next task to run.., s e t c u r r t a s k ( . . . )

.., Called when the currentluy running task changes its scheduling class

or task group to the related scheduler.., t a s k t i c k ( . . . )

.., Called regularly (default: 10 ms) from the system timer tickhandler, might lead to context switch

Page 20: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Linux Completely Fair SchedulerUnix scheduling

� Classical UNIX systems map priorities (nice values) to absolute

timeslices

� Leads to several issues:.., What is the absolute timeslice that should be mapped to a

given nice value?.., Sub-optimal switching behavior for low priority processes (small

timeslices)

.., Relative nice values and their mapping to timeslices.., Nicing down a process by one can have very different effects

according to the tasks priorities

.., Timeslice must be some integer multiple of the timer tick.., Minimum timeslice and difference between two consecutive

timeslices are bounded by the timer tick frequency

Page 21: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Linux Completely Fair SchedulerFair scheduling

� Perfect multitasking:.., From a single core standpoint

.., At each moment, each process of the same priority has received an

exact amount of the CPU time.., What we would get if we could run n tasks in parallel on the CPU

while giving them 1/n of the CPU processing power → not possible

in reality.., Or if we could schedule tasks for infinitely small amounts of time

→ context switch overhead issue

� 3 main (high-level) CFS concepts:

1

2

3

CFS runs a process for some times, then swaps it for the runnable

process that has run the least

No default timeslice, CFS calculates how long a process should run

according to the number of runnable processes

That dynamic timeslice is weighted by the process priority(nice)

Page 22: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Linux Completely Fair SchedulerFair scheduling (2)

� Targeted latency : period during which all runnable processes

should be scheduled at least once

� Example: processes with the same priority

� Example: processes with different priorities

� Minimum granularity : floor at 1 ms (default)

Page 23: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 24: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementation

� 4 main components:

1

2

3

4

Time accounting

Process selection

Scheduler entry point (calling the scheduler)

Sleeping & waking up

Page 25: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationTime accounting

� s c h e d e n t i t y structure in the t a s k s t r u c t (se field)

e x e c _ s t a r t ;

sum_exec_ run t i me ;

v r u n t i m e ;

p r ev_sum_exec _ run t i me ;

/ * a d d i t i o n a l s t a t i s t i c s no t shown here * /

1 s t r u c t s c h e d _ e n t i t y

2 {

3 s t r u c t l o a d _ we i gh t l o a d ;

4 s t r u c t rb_node r u n _ n o d e ;

5 s t r u c t l i s t _ h e a d gro u p _ n o d e ;

6 uns igned i n t o n _ r q ;

78 u64

9 u64

10 u64

11 u64

12

1314 }

� Virtual runtime

.., How much time a process has been executed (ns)

Page 26: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationTime accounting (2)

1

2

3

45

6

78

9

1011

12

1314

15

1617

18

s t a t i c vo id u p d a t e _ c u r r ( s t r u c t c f s _ r q * c f s _ r q )

{

s t r u c t s c h e d _ e n t i t y * c u r r =

c f s _ r q - > c u r r ;u64 now = r q _ c l o c k _ t a s k ( r q _ o f ( c f s _ r q ) ) ;

u64 d e l t a _ e x e c ;

i f ( u n l i k e l y ( ! c u r r ) )

r e t u r n ;

d e l t a _ e x e c = now - c u r r - > e x e c _ s t a r t ;

i f ( u n l i k e l y ( ( s 6 4 ) d e l t a _ e x e c <= 0 ) )

r e t u r n ;

c u r r - > e x e c _ s t a r t = now;

s c h e d s t a t _ s e t ( c u r r - > s t a t i s t i c s . e x e c _ m a x ,

m a x ( d e l t a _ e x e c , c u r r - > s t a t i s t i c s

. e x e c _ m a x ) ) ;

18

19

2021

22

23

2425

26

2728

29

30

31

3233

34

cu r r ->sum_ex ec_ ru n t ime += d e l t a _ e x e c ;

s c h e d s t a t _ a d d ( c f s _ r q - > e x e c _ c l o c k ,d e l t a _ e x e c ) ;

c u r r - > v r u n t i m e += c a l c _ d e l t a _ f a i r (

d e l t a _ e x e c , c u r r ) ;u p d a t e _ m i n _ v r u n t i m e ( c f s _ r q ) ;

i f ( e n t i t y _ i s _ t a s k ( c u r r ) ) {

s t r u c t t a s k _ s t r u c t * c u r t a s k

= t a s k _ o f ( c u r r ) ;

t r a c e _ s c h e d _ s t a t _ r u n t i m e ( c u r t a s k ,

d e l t a _ e x e c , c u r r - > v r u n t i m e ) ;c p u a c c t _ c h a r g e ( c u r t a s k , d e l t a _ e x e c ) ;

a c c o u n t _ g r o u p _ e x e c _ r u n t i m e ( c u r t a s k ,

d e l t a _ e x e c ) ;}

a c c o u n t _ c f s _ r q _ r u n t i m e ( c f s _ r q ,

d e l t a _ e x e c ) ;}

� Invoked regularly by the system timer, and when a process

becomes runnable/unrunnable

Page 27: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection

� Adapted from [1]

Page 28: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection (2)

� When CFS needs to choose which runnable process to run next:.., The process with the smallest vruntime is selected.., It is the leftmost node in the tree

1

23

4

56

7

89

s t r u c t s c h e d _ e n t i t y * p i c k _ f i r s t _ e n t i t y ( s t r u c t c f s _ r q * c f s _ r q )

{

s t r u c t rb _ n o d e * l e f t = c f s _ r q - > r b _ l e f t m o s t ;

i f ( ! l e f t )

re turn NULL;

re turn r b _ e n t r y ( l e f t , s t r u c t s c h e d _ e n t i t y , r u n _ n o d e ) ;

}

Page 29: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection: adding a process to the tree

� A process is added through enqueue e n t i t y :

1

2

3

4

56

7

89

10

1112

13

1415

16

1718

s t a t i c v o i d

e n q u e u e _ e n t i t y ( s t r u c t c f s _ r q * c f s _ r q ,

s t r u c t s c h e d _ e n t i t y * s e , i n t f l a g s )

{

b o o l renorm = ! ( f l a g s & ENQUEUE_WAKEUP)

| | ( f l a g s & ENQUEUE_MIGRATED);

b o o l c u r r = c f s _ r q - > c u r r == s e ;

i f ( r eno rm && c u r r )

s e - > v r u n t i m e += c f s _ r q - > m i n _ v r u n t i m e ;

u p d a t e _ c u r r ( c f s _ r q ) ;

i f ( r eno rm && ! c u r r )

s e - > v r u n t i m e += c f s _ r q - > m i n _ v r u n t i m e ;

u p d a t e _ l o a d _ a v g ( s e , UPDATE_TG);

e n q u e u e _ e n t i t y _ l o a d _ a v g ( c f s _ r q , s e ) ;

a c c o u n t _ e n t i t y _ e n q u e u e ( c f s _ r q , s e ) ;

u p d a t e _ c f s _ s h a r e s ( c f s _ r q ) ;

19

2021

22

2324

25

2627

28

2930

31

3233

i f ( f l a g s & ENQUEUE_WAKEUP)

p l a c e _ e n t i t y ( c f s _ r q , s e , 0 ) ;

c h e c k _ s c h e d s t a t _ r e q u i r e d ( ) ;

u p d a t e _ s t a t s _ e n q u e u e ( c f s _ r q , s e , f l a g s ) ;

c h e c k _ s p r e a d ( c f s _ r q , s e ) ;i f ( ! c u r r )

e n q u e u e _ e n t i t y ( c f s _ r q , s e ) ;

s e - > o n _ r q = 1 ;

i f ( c f s _ r q - > n r _ r u n n i n g == 1 ) {

l i s t _ a d d _ l e a f _ c f s _ r q ( c f s _ r q ) ;

c h e c k _ e n q u e u e _ t h r o t t l e ( c f s _ r q ) ;}

}

Page 30: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection: adding a process to the tree (2)

� enqueue e n t i t y :

1

2

3

4

56

7

89

10

1112

13

14

15

16

17

s t a t i c vo id e n q u e u e _ e n t i t y ( s t r u c t c f s _ r q

* c f s _ r q , s t r u c t s c h e d _ e n t i t y * s e )

{

s t r u c t rb_node * * l i n k = &cfs_ rq - >

t a s k s _ t i m e l i n e . r b _ n o d e ;s t r u c t rb_node * p a r e n t = NULL;

s t r u c t s c h e d _ e n t i t y * e n t r y ;

i n t l e f t m o s t = 1 ;

/ *

* Find t h e r i g h t p l a c e i n t h e r b t r e e :

* /

w h i l e ( * l i n k ) {

p a r e n t = * l i n k ;

e n t r y = r b _ e n t r y ( p a r e n t , s t r u c t

s c h e d _ e n t i t y , r u n _ n o d e ) ;/ *

*We dont care about c o l l i s i o n s .

Nodes wi th

*t h e same key s t a y t o g e t h e r.

* /

18

1920

21

2223

24

2526

27

28

2930

31

32

33

34

i f ( e n t i t y _ b e f o r e ( s e , e n t r y ) ) {

l i n k = & p a r e n t - > r b _ l e f t ;} e l s e {

l i n k = & p a r e n t - > r b _ r i g h t ;

l e f t m o s t = 0 ;}

}/ *

* Maintain a cache o f l e f t m o s t t r e e

e n t r i e s ( i t i s f r e q u e n t l y

* u s e d ) :

* /

i f ( l e f t m o s t )

c f s _ r q - > r b _ l e f t m o s t = &se -> run_no de ;

r b _ l i n k _ n o d e ( &s e - > r u n _ n o d e , p a r e n t , l i n k

) ;

r b _ i n s e r t _ c o l o r ( & s e - > r u n _ n o d e , &cfs_ rq - >

t a s k s _ t i m e l i n e ) ;}

Page 31: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection: removing a process from the tree

� dequeue e n t i t y :

1

2

34

5

6

7

8

9

1011

12

1314

s t a t i c v o i d

d e q u e u e _ e n t i t y ( s t r u c t c f s _ r q * c f s _ r q ,

s t r u c t s c h e d _ e n t i t y * s e , i n t f l a g s )

{

u p d a t e _ c u r r ( c f s _ r q ) ;

d e q u e u e _ e n t i t y _ l o a d _ a v g ( c f s _ r q , s e ) ;

u p d a t e _ s t a t s _ d e q u e u e ( c f s _ r q , s e , f l a g s

) ;

c l e a r _ b u d d i e s ( c f s _ r q , s e ) ;

i f ( s e ! = c f s _ r q - > c u r r )

d e q u e u e _ e n t i t y ( c f s _ r q , s e ) ;

s e - > o n _ r q = 0 ;

a c c o u n t _ e n t i t y _ d e q u e u e ( c f s _ r q , s e ) ;

15

16

1718

19

2021

22

23

24

i f ( ! ( f l a g s & DEQUEUE_SLEEP))

s e - > v r u n t i m e - = c f s _ r q - >min_v run t ime ;

r e t u r n _ c f s _ r q _ r u n t i m e ( c f s _ r q ) ;

u p d a t e _ c f s _ s h a r e s ( c f s _ r q ) ;

i f ( ( f l a g s & (DEQUEUE_SAVE |

DEQUEUE_MOVE)) == DEQUEUE_SAVE)u p d a t e _ m i n _ v r u n t i m e ( c f s _ r q ) ;

}

Page 32: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationProcess selection: removing a process from the tree (2)

� dequeue e n t i t y :

1

23

4

56

7

89

10

11

s t a t i c vo id d e q u e u e _ e n t i t y ( s t r u c t c f s _ r q * c f s _ r q , s t r u c t s c h e d _ e n t i t y * s e )

{

i f ( c f s _ r q - > r b _ l e f t m o s t == &se -> ru n _ n o d e ) {

s t r u c t rb_node * n e x t _ n o d e ;

nex t_node = r b _ n e x t ( & s e - > r u n _ n o d e ) ;

c f s _ r q - > r b _ l e f t m o s t = n e x t _ n o d e ;}

r b _ e r a s e ( & s e - > r u n _ n o d e , & c f s _ r q - > t a s k s _ t i m e l i n e ) ;

}

Page 33: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationEntry point: s c h e d u l e ( )

� The kernel calls s c h e d u l e ( ) anytime it wants to invoke the scheduler

.., Calls p i c k n e x t t a s k ( )1

2

3

4

5

67

8

9

10

11

1213

14

15

16

s t a t i c i n l i n e s t r u c t t a s k _ s t r u c t * p i c k _ n e x t _ t a s k ( s t r u c t r q * r q , s t r u c t

t a s k _ s t r u c t * p r e v , s t r u c t p i n _ c o o k i e

c o o k i e )

{

c o n s t s t r u c t s c h e d _ c l a s s * c l a s s = &

f a i r _ s c h e d _ c l a s s ;s t r u c t t a s k _ s t r u c t * p ;

i f ( l i k e l y ( p r e v - > s c h e d _ c l a s s == c l a s s &&

r q - > n r _ r u n n i n g == r q - > c f s .h _ n r _ r u n n i n g ) ) {

p = f a i r _ s c h e d _ c l a s s . p i c k _ n e x t _ t a s k ( r q

, p r e v , c o o k i e ) ;

i f ( u n l i k e l y ( p == RETRY_TASK))

go to a g a i n ;

i f ( u n l i k e l y ( ! p ) )

p = i d l e _ s c h e d _ c l a s s . p i c k _ n e x t _ t a s k (

r q , p r e v , c o o k i e ) ;re turn p ;

}

17

1819

20

2122

23

2425

26

27

28

a g a i n :

f o r _ e a c h _ c l a s s ( c l a s s ) {

p = c l a s s - > p i c k _ n e x t _ t a s k ( r q , p r e v ,

c o o k i e ) ;i f ( p ) {

i f ( u n l i k e l y ( p == RETRY_TASK))

go to a g a i n ;

re turn p ;

}

}

BUG(); / * t h e i d l e c l a s s w i l l a lways

have a runnable t a s k * /

}

Page 34: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationSleeping and waking up

� Multiple reasons for a task to sleep:

.., Specified amount of time, waiting for I/O, blocking on a mutex, etc.

� Going to sleep - steps:

1

2

3

4

Task marks itself as sleeping

Task enters a waitqueueTask leaves the rbtree of runnable processes

Task calls s c h e d u l e ( ) to select a new process to run

� Inverse steps for waking up

� Two states associated with sleeping:.., TASK INTERRUPTIBLE

.., Will be awaken on signal reception

.., TASK UNINTERRUPTIBLE

.., Ignore signals

Page 35: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationSleeping and waking up: wait queues

� Wait queue:

.., List of processes waiting

for an event to occur

1

2

3

4

t y p e d e f s t r u c t wai t_queue_head

wai t_queue_head_ ts t r u c t wai t_queue_head {

s p i n l o c k _ t l o c k ;

s t r u c t l i s t _ h e a d t a s k _ l i s t ; }

� Some simple interfaces used to go to sleep have races:.., It is possible to go to sleep after the event we are waiting for has

occurred.., Recommended way:

1

23

4

56

7

89

10

1112

/ * We assume t h e wa i t queue we want t o wa i t on i s a c c e s s i b l e through a v a r i a b l e q * /

DEFINE_WAIT(wait); / * i n i t i a l i z e a wa i t queue e n t r y * /

a d d _ wa i t _ q u e u e ( q , & w a i t ) ;

w h i l e ( ! c o n d i t i o n ) { / * e v e n t we a re w a i t i n g f o r * /

p r e p a r e _ t o _ w a i t ( & q , &wai t , TASK_INTERRUPTIBLE); i f ( s i g n a l _ p e n d i n g ( c u r r e n t ) )

/ * handle s i g n a l * /

s c h e d u l e ( ) ;

}

f i n i s h _ w a i t ( & q , & w a i t ) ;

Page 36: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationSleeping and waking up: wait queues (2)

� Steps for waiting on a waitqueue:

1

2

3

4

Create a waitqueue entry (DEFINE WAIT())

Add the calling process to a wait queue (add w a i t q u e u e ( ) )

Call p r e p a r e t o w a i t ( ) to change the process state

If the state is TASK INTERRUPTIBLE, a signal can wake the task

up → need to check5

6

7

Executes another process with s c h e d u l e ( )

When the task awakens, check the condition

When the condition is true, get out of the wait queue and set the

state accordingly using f i n i s h w a i t ( )

Page 37: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationSleeping and waking up: wake u p ( )

� Waking up is taken care of by wake u p ( ).., Awakes all the processes on a waitqueue by default

1

2

# d e f i n e wake_up(x) wake_up (x , TASK_NORMAL, 1 , NULL)

/ * type o f x i s wait_queue_head_t * /

� wake u p ( ) calls wake up common():

1

23

4

56

7

89

10

1112

13

s t a t i c vo id wake_up_common(wait_queue_head_t * q , uns igned i n t mode,

i n t n r _ e x c l u s i v e , i n t w a k e _ f l a g s , v o i d * k e y)

{

wa i t _ q u e u e _ t * c u r r , * n e x t ;

l i s t _ f o r _ e a c h _ e n t r y _ s a f e ( c u r r , n e x t , & q - > t a s k _ l i s t , t a s k _ l i s t ) {

uns igned f l a g s = c u r r - > f l a g s ;

i f ( c u r r - > f u n c ( c u r r , mode, w a k e _ f l a g s , k e y) && ( f l a g s

& WQ_FLAG_EXCLUSIVE) && ! - - n r _ e x c l u s i v e )break ; / * wakes up on ly a s u b s e t o f ’ e x c l u s i v e ’ t a s k s * /

}

}

� Exclusive tasks are added through

p r e p a r e t o w a i t e x c l u s i v e ( )

Page 38: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationSleeping and waking up: wake up() (2)

e A wait queue entry contains a pointer to a wake-up function) include/linux/wait.h:

1

23

4

56

7

89

10

1112

typedefstruct wait_queue wait_queue_t;

typedefint (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);

int default_wake_function(*wait_queue_func_t)(wait_queue_t *wait, unsigned mode,

int flags, void *key);

/* ... */

struct wait_queue {

/* ... */

wait_queue_func_t func;

/* ... */

}

e default wake function() calls try to wake up() ...) ... which calls ttwu queue() ...) ... which calls ttwu do activate() (put the task back on

runqueue) ...) ... which calls ttwu do wakeup ...) ... which sets the task state to TASK RUNNING

Page 39: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

CFS implementationCFS on multicores: brief, high-level overview

e Per-CPU runqueues (rbtrees)) To avoid costly accesses to shared data structures

e Runqueues must be kept balanced) Ex: dual-core with one large runqueue of high-priority processes,

and a small one with low-priority processes) High-priority processes get less CPU time than low-priority ones

) A load balancing algorithm is run periodically) Balances the queues based on processes priorities and their actual

CPU usage

e More info: [6]

Page 40: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 41: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Preemption and context switchingContext switch

e A context switch is the action of swapping the process currently running on the CPU to another one

) Performed by the context switch()function

) Called by schedule()

1

2

Switch the address space through switch mm()

Switch the CPU state (registers) through switch to()

e A task can voluntarily relinquish the CPU by calling schedule()) But when does the kernel check if there is a need of

preemption?) need resched flag (per-process, in the thread info of current)

) need resched is set by:

1

2

scheduler tick()when the currently running task needs to be

preempted

try to wake up() when a process with higher priority wakes

up

Page 42: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Preemption and context switchingneed resched, user preemption

e The need resched flag is checked:

1

2

Upon returning to user space (from a syscall or an interrupt)

Upon returning from an interrupt

e If the flag is set, schedule() is called

e User preemption happens:1

2

When returning to user space from a syscall

When returning to user space from an interrupt

e With Linux, the kernel is also subject topreemption

Page 43: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Preemption and context switchingKernel preemption

e In most of Unix-like, kernel code is non-preemptive:) It runs until it finishes

e Linux kernel code is preemptive) A task can be preempted in the kernel as long as execution is in a

safe state) Not holding any lock (kernel is SMP safe)

e preempt count in the thread info structure

) Indicates the current lock depth

e If need resched && !preempt count → safe to preempt) Checked when returning to the kernel from interrupt) need resched is also checked when releasing a lock and

preempt count is 0

e Kernel code can also call directlyschedule()

Page 44: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Preemption and context switchingKernel preemption (2)

e Kernel preemption canoccur:1

2

3

4

On return from interrupt to kernel space

When kernel code becomes preemptible again

If a task explicitly calls schedule() from the kernel

If a task in the kernel blocks (ex: mutex, result in a call to

schedule())

Page 45: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 46: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Real-time scheduling policiesSCHED FIFO and SCHED RR

e Soft real-time scheduling classes:) Best effort, no guarantees

e Real-time task of any scheduling class will always run before non-real time ones (CFS, SCHED OTHER)

) schedule() → pick next task() → for each class()

e 2 ”classical” RT scheduling policies (kernel/sched/rt.c):) SCHED FIFO

) Tasks run until it blocks/yield, only a higher priority RT task can

preempt it) Round-robin for tasks of same priority

) SCHED RR

) Same as SCHED FIFO, but with a fixedtimeslice

Page 47: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Real-time scheduling policiesOther scheduling policies

e SCHED DEADLINE:

) Real-time policies mainlined in v3.14 enabling predictable RT

scheduling) EDF implementation based on a period of activation and a worst

case execution time (WCET) for each task) More info: Documentation/sched-deadline.txt, [4], etc.

e SCHED BATCH: non-real-time, low priority background jobs

e SCHED IDLE: non-real-time, very low priority background jobs

Page 48: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Outline

1 General information

2 Linux Completely Fair Scheduler

3 CFS implementation

4 Preemption and context switching

5 Real-time scheduling policies

6 Scheduling-related syscalls

Page 49: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Scheduling-related syscallsScheduling syscalls list

e sched getscheduler, sched setscheduler

e nice

e sched getparam, sched setparam

e sched get priority max, sched get priority min

e sched getaffinity, sched setaffinity

e sched yield

Page 50: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Scheduling-related syscallsUsage example

1

23

4

56

7

89

10

1112

13

1415

16

1718

19

2021

22

23

#define _GNU_SOURCE

#include <stdio.h>

#include <stdlib.h>

#include <sys/types.h>

#include <unistd.h>

#include <sched.h>

#include <assert.h>

void handle_err(int ret, char *func)

{

perror(func);

exit(EXIT_FAILURE);

}

int main(void)

{

pid_t pid = -1;

int ret = -1;

struct sched_param sp;

int max_rr_prio, min_rr_prio = -42;

size_t cpu_set_size = 0;

cpu_set_t cs;

24 /* GetthePIDofthecallingprocess */

/* Gettheschedulingclass */

handle_err(ret,"sched_getscheduler");

*/

25 pid = getpid();

26 printf("Mypidis:%d¥n", pid);

27

2829 ret = sched_getscheduler(pid);

30 if(ret == -1)

3132 printf("sched_getschedulerreturns:"

33 "%d¥n", ret);

34 assert(ret == SCHED_OTHER);

3536 /* Getthepriority(nice/RT) */

37 sp.sched_priority = -1;

38 ret = sched_getparam(pid, &sp);

39 if(ret == -1)

40 handle_err(ret,"sched_getparam");

41 printf("Mypriorityis:%d¥n",

42 sp.sched_priority);

4344 /* Setthepriority(nicevalue)

45 ret = nice(1);

46 if(ret == -1)

47 handle_err(ret,"nice");

Page 51: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Scheduling-related syscallsUsage example (2)

46 /* Getthepriority */ 65 /* Getthepriority */

47 sp.sched_priority = -1; 66 sp.sched_priority = -1;

48 ret = sched_getparam(pid, &sp); 67 ret = sched_getparam(pid, &sp);

49 if(ret == -1) 68 if(ret == -1)

50 handle_err(ret,"sched_getparam"); 69 handle_err(ret,"sched_getparam");

51 printf("Mypriorityis:%d¥n", 70 printf("Mypriorityis:%d¥n",

52 sp.sched_priority); 71 sp.sched_priority);

5354

/* SwitchschedulingclasstoFIFOand7273

/* SettheRTpriority */

55 * thepriorityto99 */ 74 sp.sched_priority = 42;

56 sp.sched_priority = 99; 75 ret = sched_setparam(pid, &sp);

57 ret = sched_setscheduler(pid, 76 if(ret == -1)

58 SCHED_FIFO, &sp); 77 handle_err(ret,"sched_setparam");

59 if(ret == -1) 78 printf("Prioritychangedto%d¥n",

60 handle_err(ret,"sched_setscheduler"); 79 sp.sched_priority);

61 8062 /* Gettheschedulingclass */ 81 /* Getthepriority */

63 ret = sched_getscheduler(pid); 82 sp.sched_priority = -1;

64 if(ret == -1) 83 ret = sched_getparam(pid, &sp);

65 handle_err(ret,"sched_getscheduler"); 84 if(ret == -1)

66 printf("sched_getschedulerreturns:" 85 handle_err(ret,"sched_getparam");

67 "%d¥n", ret); 86 printf("Mypriorityis:%d¥n",

68 assert(ret == SCHED_FIFO); 87 sp.sched_priority);

Page 52: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Scheduling-related syscallsUsage example (2)

/* clearthemask */

85 /* GetthemaxpriorityvalueforSCHED_RR */

86 max_rr_prio = sched_get_priority_max(SCHED_RR);

87 if(max_rr_prio == -1)

88 handle_err(max_rr_prio,"sched_get_priority_max");

89 printf("MaxRRprio:%d¥n", max_rr_prio);

9091 /* GettheminpriorityvalueforSCHED_RR */

92 min_rr_prio = sched_get_priority_min(SCHED_RR);

93 if(min_rr_prio == -1)

94 handle_err(min_rr_prio,"sched_get_priority_min");

95 printf("MinRRprio:%d¥n", min_rr_prio);

9697 cpu_set_size = sizeof(cpu_set_t);

98 CPU_ZERO(&cs);

99 CPU_SET(0, &cs);

100 CPU_SET(1, &cs);

101 /* SettheaffinitytoCPUs0and1only */

102 ret = sched_setaffinity(pid, cpu_set_size, &cs);

103 if(ret == -1)

104 handle_err(ret,"sched_setaffinity");

105 /* GettheCPUaffinity */

106 CPU_ZERO(&cs);

107 ret = sched_getaffinity(pid,

108 cpu_set_size, &cs);

109 if(ret == -1)

110 handle_err(ret,

111 "sched_getaffinity");

112 assert(CPU_ISSET(0, &cs));

113 assert(CPU_ISSET(1, &cs));

114 printf("AffinitytestsOK¥n");

115 /* YieldtheCPU */

116 ret = sched_yield();

118 if(ret == -1)

119 handle_err(ret,

120 "sched_yield");

121 return EXIT_SUCCESS;

123 }

Page 53: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

Additional documentation

e CFS:) http://www.ibm.com/developerworks/library/

l-completely-fair-scheduler/

) http://elinux.org/images/d/dc/Elc2013_Na.pdf

e Linux scheduling:) https://www.cs.columbia.edu/smb/classes/

s06-4118/l13.pdf

Page 54: Advanced Operating Systems #1 - PFLab...General information A bit of Linux scheduler history From v1.0 to v2.4: simple implementation.., But it did not scale to numerous processes

BibliographyI

1 Inside the linux 2.6 completely fair scheduler. https://www.ibm.com/developerworks/library/l-completely-fair-scheduler/.

Accessed: 2017-02-01.

2 The rotating staircase deadline scheduler. https://lwn.net/Articles/224865/.

Accessed: 2017-01-29.

3 Rsdl completely fair starvation free interactive cpu scheduler.

https://lwn.net/Articles/224654/.

Accessed: 2017-01-29.

4 Sched deadline: a status update. http://events.linuxfoundation.org/sites/events/files/slides/SCHED_DEADLINE-20160404.pdf.

Accessed: 2017-02-06.

5 Understanding the linux 2.6.8.1 cpu scheduler.https://web.archive.org/web/20131231085709/http://joshaas.net/linux/linux_cpu_scheduler.pdf.

Accessed: 2017-01-28.

[6]L OZI, J.-P., LEPERS, B., FUNSTON, J., GAUD, F., QUE MA, V., AND FEDOROVA, A.

The linux scheduler: A decade of wasted cores.

In Proceedings of the Eleventh European Conference on Computer Systems (2016), EuroSys ’16, ACM, pp. 1:1–1:16.