linux, locking and ots of - hs-rm.dekaiser/1515_aos/08-linux.pdf · linux, locking and lots of...

Post on 07-Jun-2020

9 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

LINUX, LOCKING AND LOTS OF

PROCESSORS

Peter Chubb

peter.chubb@nicta.com.au

A LITTLE BIT OF HISTORY

• Multix in the ’60s

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Multix in the ’60s

• Ken Thompson and Dennis Ritchie in 1967–70

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Multix in the ’60s

• Ken Thompson and Dennis Ritchie in 1967–70

• USG and BSD

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Multix in the ’60s

• Ken Thompson and Dennis Ritchie in 1967–70

• USG and BSD

• John Lions 1976–95

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Multix in the ’60s

• Ken Thompson and Dennis Ritchie in 1967–70

• USG and BSD

• John Lions 1976–95

• Andrew Tanenbaum 1987

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Multix in the ’60s

• Ken Thompson and Dennis Ritchie in 1967–70

• USG and BSD

• John Lions 1976–95

• Andrew Tanenbaum 1987

• Linux Torvalds 1991

NICTA Copyright c© 2014 From Imagination to Impact 2

A LITTLE BIT OF HISTORY

• Basic concepts well established

– Process model

– File system model

– IPC

NICTA Copyright c© 2014 From Imagination to Impact 3

A LITTLE BIT OF HISTORY

• Basic concepts well established

– Process model

– File system model

– IPC

• Additions:

– Paged virtual memory (3BSD, 1979)

NICTA Copyright c© 2014 From Imagination to Impact 3

A LITTLE BIT OF HISTORY

• Basic concepts well established

– Process model

– File system model

– IPC

• Additions:

– Paged virtual memory (3BSD, 1979)

– TCP/IP Networking (BSD 4.1, 1983)

NICTA Copyright c© 2014 From Imagination to Impact 3

A LITTLE BIT OF HISTORY

• Basic concepts well established

– Process model

– File system model

– IPC

• Additions:

– Paged virtual memory (3BSD, 1979)

– TCP/IP Networking (BSD 4.1, 1983)

– Multiprocessing (Vendor Unices such as

Sequent’s ‘Balance’, 1984)

NICTA Copyright c© 2014 From Imagination to Impact 3

ABSTRACTIONS

Linux Kernel

File

s

Th

read

of

Co

ntr

ol

Mem

ory

Sp

ace

NICTA Copyright c© 2014 From Imagination to Impact 4

PROCESS MODEL

• Root process (init)

• fork() creates (almost) exact copy

– Much is shared with parent — Copy-On-Write

avoids overmuch copying

• exec() overwrites memory image from a file

NICTA Copyright c© 2014 From Imagination to Impact 5

PROCESS MODEL

• Root process (init)

• fork() creates (almost) exact copy

– Much is shared with parent — Copy-On-Write

avoids overmuch copying

• exec() overwrites memory image from a file

• Allows a process to control what is shared

NICTA Copyright c© 2014 From Imagination to Impact 5

FORK() AND EXEC()

➜ A process can clone itself by calling fork().

➜ Most attributes copied :

➜ Address space (actually shared, marked copy-on-write)

➜ current directory, current root

➜ File descriptors

➜ permissions, etc.

➜ Some attributes shared :

➜ Memory segments marked MAP SHARED

➜ Open files

NICTA Copyright c© 2014 From Imagination to Impact 6

FORK() AND EXEC()

Files and Processes:

.

.

0

1

2

3

4

5

6

7

File descriptor table

Process A

NICTA Copyright c© 2014 From Imagination to Impact 7

FORK() AND EXEC()

Files and Processes:

Open file descriptor

Offset

In−kernel inode

.

.

0

1

2

3

4

5

6

7

File descriptor table

Process A

NICTA Copyright c© 2014 From Imagination to Impact 7

FORK() AND EXEC()

Files and Processes:

dup()

Open file descriptor

Offset

In−kernel inode

.

.

0

1

2

3

4

5

6

7

File descriptor table

Process A

NICTA Copyright c© 2014 From Imagination to Impact 7

FORK() AND EXEC()

Files and Processes:

.

.

0

1

2

3

4

5

6

7

File descriptor table

Process B

fork()

dup()

Open file descriptor

Offset

In−kernel inode

.

.

0

1

2

3

4

5

6

7

File descriptor table

Process A

NICTA Copyright c© 2014 From Imagination to Impact 7

FORK() AND EXEC()

switch (kidpid = fork()) {

case 0: /* child */

close(0); close(1); close(2);

dup(infd); dup(outfd); dup(outfd);

execve("path/to/prog", argv, envp);

_exit(EXIT_FAILURE);

case -1:

/* handle error */

default:

waitpid(kidpid, &status, 0);

}

NICTA Copyright c© 2014 From Imagination to Impact 8

STANDARD FILE DESCRIPTORS

0 Standard Input

1 Standard Output

2 Standard Error

➜ Inherited from parent

➜ On login, all are set to controlling tty

NICTA Copyright c© 2014 From Imagination to Impact 9

FILE MODEL

• Separation of names from content.

• ‘regular’ files ‘just bytes’ → structure/meaning

supplied by userspace

• Devices represented by files.

• Directories map names to index node indices

(inums)

• Simple permissions model

NICTA Copyright c© 2014 From Imagination to Impact 10

FILE MODEL

.

..

bash

shls

whichrnano

busybox

setserial

bzcmp

367

368

402401

265

/ bin / ls

.

..

boot

sbin

bin

dev

var

vmlinux

etc

usr

inode 324

2300300

301

32434

5

76

2

2324

8

125

NICTA Copyright c© 2014 From Imagination to Impact 11

NAMEI

➜ translate name → inode

➜ abstracted per filesystem in VFS layer

➜ Can be slow: extensive use of caches to speed it up

dentry cache

➜ hide filesystem and device boundaries

➜ walks pathname, translating symbolic links

NICTA Copyright c© 2014 From Imagination to Impact 12

NAMEI

➜ translate name → inode

➜ abstracted per filesystem in VFS layer

➜ Can be slow: extensive use of caches to speed it up

dentry cache — becomes SMP bottleneck

➜ hide filesystem and device boundaries

➜ walks pathname, translating symbolic links

NICTA Copyright c© 2014 From Imagination to Impact 12

EVOLUTION

KISS:

➜ Simplest possible algorithm used at first

NICTA Copyright c© 2014 From Imagination to Impact 13

EVOLUTION

KISS:

➜ Simplest possible algorithm used at first

➜ Easy to show correctness

➜ Fast to implement

NICTA Copyright c© 2014 From Imagination to Impact 13

EVOLUTION

KISS:

➜ Simplest possible algorithm used at first

➜ Easy to show correctness

➜ Fast to implement

➜ As drawbacks and bottlenecks are found, replace with

faster/more scalable alternatives

NICTA Copyright c© 2014 From Imagination to Impact 13

C DIALECT

• Extra keywords:

– Section IDs: init, exit, percpu etc

– Info Taint annotation user, rcu, kernel,

iomem

– Locking annotations acquires(X),

releases(x)

– extra typechecking (endian portability) bitwise

NICTA Copyright c© 2014 From Imagination to Impact 14

C DIALECT

• Extra iterators

– type name foreach()

• Extra accessors

– container of()

NICTA Copyright c© 2014 From Imagination to Impact 15

C DIALECT

• Massive use of inline functions

• Some use of CPP macros

• Little #ifdef use in code: rely on optimizer to elide

dead code.

NICTA Copyright c© 2014 From Imagination to Impact 16

SCHEDULING

Goals:

• O(1) in number of runnable processes, number of

processors

– good uniprocessor performance

• ‘fair’

• Good interactive response

• topology-aware

NICTA Copyright c© 2014 From Imagination to Impact 17

SCHEDULING

Implementation:

• Changes from time to time.

• Currently ‘CFS’ by Ingo Molnar.

NICTA Copyright c© 2014 From Imagination to Impact 18

SCHEDULING

Dual Entitlement Scheduler

0.5 0.7 0.1

0 0

Expired

Running

NICTA Copyright c© 2014 From Imagination to Impact 19

SCHEDULING

1. Keep tasks ordered by effective CPU runtime

weighted by nice in red-black tree

2. Always run left-most task.

NICTA Copyright c© 2014 From Imagination to Impact 20

SCHEDULING

1. Keep tasks ordered by effective CPU runtime

weighted by nice in red-black tree

2. Always run left-most task.

Devil’s in the details:

• Avoiding overflow

• Keeping recent history

• multiprocessor locality

• handling too-many threads

• Sleeping tasks

• Group hierarchyNICTA Copyright c© 2014 From Imagination to Impact 20

SCHEDULING

(hyper)Thread

NICTA Copyright c© 2014 From Imagination to Impact 21

SCHEDULING

Core

NICTA Copyright c© 2014 From Imagination to Impact 21

SCHEDULING

(hyper)Threads

Packages

Cores

NICTA Copyright c© 2014 From Imagination to Impact 21

SCHEDULING

(hyper)Threads

Packages

Cores

(hyper)Threads

Packages

Cores

(hyper)Threads

Packages

Cores

RAM

RAM

RAM

NUMA Node

NICTA Copyright c© 2014 From Imagination to Impact 21

SCHEDULING

Locality Issues:

• Best to reschedule on same processor (don’t move

cache footprint, keep memory close)

NICTA Copyright c© 2014 From Imagination to Impact 22

SCHEDULING

Locality Issues:

• Best to reschedule on same processor (don’t move

cache footprint, keep memory close)

– Otherwise schedule on a ‘nearby’ processor

NICTA Copyright c© 2014 From Imagination to Impact 22

SCHEDULING

Locality Issues:

• Best to reschedule on same processor (don’t move

cache footprint, keep memory close)

– Otherwise schedule on a ‘nearby’ processor

• Try to keep whole sockets idle

NICTA Copyright c© 2014 From Imagination to Impact 22

SCHEDULING

Locality Issues:

• Best to reschedule on same processor (don’t move

cache footprint, keep memory close)

– Otherwise schedule on a ‘nearby’ processor

• Try to keep whole sockets idle

• Somehow identify cooperating threads, co-schedule

on same package?

NICTA Copyright c© 2014 From Imagination to Impact 22

SCHEDULING

• One queue per processor (or hyperthread)

• Processors in hierarchical ‘domains’

• Load balancing per-domain, bottom up

• Aims to keep whole domains idle if possible (power

savings)

NICTA Copyright c© 2014 From Imagination to Impact 23

MEMORY MANAGEMENT

Memory in

zones

Highmem

Normal

DMA

Normal

Physical address 0

16M

900M

DMA

3GLinux kernel

User VM

VirtualPhysical

Iden

tity

Map

ped

wit

h o

ffse

t

NICTA Copyright c© 2014 From Imagination to Impact 24

MEMORY MANAGEMENT

• Direct mapped pages become logical addresses

– pa() and va() convert physical to virtual for

these

NICTA Copyright c© 2014 From Imagination to Impact 25

MEMORY MANAGEMENT

• Direct mapped pages become logical addresses

– pa() and va() convert physical to virtual for

these

• small memory systems have all memory as logical

NICTA Copyright c© 2014 From Imagination to Impact 25

MEMORY MANAGEMENT

• Direct mapped pages become logical addresses

– pa() and va() convert physical to virtual for

these

• small memory systems have all memory as logical

• More memory → ∆ kernel refer to memory by

struct page

NICTA Copyright c© 2014 From Imagination to Impact 25

MEMORY MANAGEMENT

struct page:

• Every frame has a struct page (up to 10 words)

• Track:

– flags

– backing address space

– offset within mapping or freelist pointer

– Reference counts

– Kernel virtual address (if mapped)

NICTA Copyright c© 2014 From Imagination to Impact 26

MEMORY MANAGEMENT

File(or swap)

struct

address_space

struct

vm_area_structstruct

vm_area_structstruct

vm_area_struct

struct mm_struct

In virtual address order....

struct task_struct

Pag

e T

able

(har

dw

are

def

ined

)

owner

NICTA Copyright c© 2014 From Imagination to Impact 27

MEMORY MANAGEMENT

Address Space:

• Misnamed: means collection of pages mapped from

the same object

• Tracks inode mapped from, radix tree of pages in

mapping

• Has ops (from file system or swap manager) to:

dirty mark a page as dirty

readpages populate frames from backing store

writepages Clean pages — make backing store the

same as in-memory copyNICTA Copyright c© 2014 From Imagination to Impact 28

MEMORY MANAGEMENT

migratepage Move pages between NUMA nodes

Others. . . And other housekeeping

NICTA Copyright c© 2014 From Imagination to Impact 29

PAGE FAULT TIME

• Special case in-kernel faults

• Find the VMA for the address

– segfault if not found (unmapped area)

• If it’s a stack, extend it.

• Otherwise:

1. Check permissions, SIG SEGV if bad

2. Call handle mm fault():

– walk page table to find entry (populate higher

levels if nec. until leaf found)

– call handle pte fault()

NICTA Copyright c© 2014 From Imagination to Impact 30

PAGE FAULT TIME

handle pte fault(): Depending on PTE status, can

• provide an anonymous page

• do copy-on-write processing

• reinstantiate PTE from page cache

• initiate a read from backing store.

and if necessary flushes the TLB.

NICTA Copyright c© 2014 From Imagination to Impact 31

DRIVER INTERFACE

Three kinds of device:

1. Platform device

2. enumerable-bus device

3. Non-enumerable-bus device

NICTA Copyright c© 2014 From Imagination to Impact 32

DRIVER INTERFACE

Enumerable buses:

static DEFINE_PCI_DEVICE_TABLE(cp_pci_tbl) = {

{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,PCI_DEVICE_ID_REALTEK_8139)

{ PCI_DEVICE(PCI_VENDOR_ID_TTTECH,PCI_DEVICE_ID_TTTECH_MC322),

{ },

};

MODULE_DEVICE_TABLE(pci, cp_pci_tbl);

NICTA Copyright c© 2014 From Imagination to Impact 33

DRIVER INTERFACE

Driver interface:

init called to register driver

exit called to deregister driver, at module unload time

probe() called when bus-id matches; returns 0 if driver

claims device

open, close, etc as necessary for driver class

NICTA Copyright c© 2014 From Imagination to Impact 34

DRIVER INTERFACE

Platform Devices:

static struct platform_device nslu2_uart = {

.name = "serial8250",

.id = PLAT8250_DEV_PLATFORM,

.dev.platform_data = nslu2_uart_data,

.num_resources = 2,

.resource = nslu2_uart_resources,

};

NICTA Copyright c© 2014 From Imagination to Impact 35

DRIVER INTERFACE

non-enumerable buses: Treat like platform devices

NICTA Copyright c© 2014 From Imagination to Impact 36

SUMMARY

• I’ve told you status today

NICTA Copyright c© 2014 From Imagination to Impact 37

SUMMARY

• I’ve told you status today

– Next week it may be different

NICTA Copyright c© 2014 From Imagination to Impact 37

SUMMARY

• I’ve told you status today

– Next week it may be different

• I’ve simplified a lot. There are many hairy details

NICTA Copyright c© 2014 From Imagination to Impact 37

FILE SYSTEMS

I’m assuming:

• You’ve already looked at ext[234]-like filesystems

• You’ve some awareness of issues around on-disk

locality and I/O performance

• You understand issues around avoiding on-disk

corruption by carefully ordering events, and/or by the

use of a Journal.

NICTA Copyright c© 2014 From Imagination to Impact 38

NORMAL FILE SYSTEMS

• Optimised for use on spinning disk

• RAID optimised (especially XFS)

• Journals, snapshots, transactions...

NICTA Copyright c© 2014 From Imagination to Impact 39

FLASH MEMORY

• NOR Flash

• NAND Flash

NICTA Copyright c© 2014 From Imagination to Impact 40

FLASH MEMORY

• NOR Flash

• NAND Flash

– MTD

– eMMC, SDHC etc

– SSD, USB

NICTA Copyright c© 2014 From Imagination to Impact 40

FLASH MEMORY

• NOR Flash

• NAND Flash

– MTD — Memory Technology Device

– eMMC, SDHC etc

– SSD, USB

NICTA Copyright c© 2014 From Imagination to Impact 40

FLASH MEMORY

• NOR Flash

• NAND Flash

– MTD — Memory Technology Device

– eMMC, SDHC etc — A JEDEC standard

– SSD, USB

NICTA Copyright c© 2014 From Imagination to Impact 40

FLASH MEMORY

• NOR Flash

• NAND Flash

– MTD — Memory Technology Device

– eMMC, SDHC etc — A JEDEC standard

– SSD, USB — and other disk-like interfaces

NICTA Copyright c© 2014 From Imagination to Impact 40

NAND CHARACTERISTICS

Erase Block

Page

Interface Circuitry

NAND Flash Chip

NICTA Copyright c© 2014 From Imagination to Impact 41

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 42

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 42

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 43

FLASH UPDATE

Erase Block

Page

Interface Circuitry

NAND Flash Chip

NOR flashorEEPROM

RAM

Processor

NICTA Copyright c© 2014 From Imagination to Impact 44

FLASH UPDATE

NICTA Copyright c© 2014 From Imagination to Impact 45

THE CONTROLLER:

• Presents illusion of ‘standard’ block device

• Manages writes to prevent wearing out

• Manages reads to prevent read-disturb

• Performs garbage collection

• Performs bad-block management

NICTA Copyright c© 2014 From Imagination to Impact 46

THE CONTROLLER:

• Presents illusion of ‘standard’ block device

• Manages writes to prevent wearing out

• Manages reads to prevent read-disturb

• Performs garbage collection

• Performs bad-block management

Mostly documented in Korean patents referred to by US

patents!

NICTA Copyright c© 2014 From Imagination to Impact 46

WEAR MANAGEMENT

Two ways:

• Remap blocks when they begin to fail (bad block

remapping)

NICTA Copyright c© 2014 From Imagination to Impact 47

WEAR MANAGEMENT

Two ways:

• Remap blocks when they begin to fail (bad block

remapping)

• Spread writes over all erase blocks (wear levelling)

NICTA Copyright c© 2014 From Imagination to Impact 47

WEAR MANAGEMENT

Two ways:

• Remap blocks when they begin to fail (bad block

remapping)

• Spread writes over all erase blocks (wear levelling)

In practice both are used.

NICTA Copyright c© 2014 From Imagination to Impact 47

WEAR MANAGEMENT

Two ways:

• Remap blocks when they begin to fail (bad block

remapping)

• Spread writes over all erase blocks (wear levelling)

In practice both are used.

Also:

• Count reads and schedule garbage collection after

some threshhold

NICTA Copyright c© 2014 From Imagination to Impact 47

PREFORMAT

• Typically use FAT32 (or exFAT for sdxc cards)

• Always do cluster-size I/O (64k)

• First partition segment-aligned

NICTA Copyright c© 2014 From Imagination to Impact 48

PREFORMAT

• Typically use FAT32 (or exFAT for sdxc cards)

• Always do cluster-size I/O (64k)

• First partition segment-aligned

Conjecture Flash controller optimises for the

preformatted FAT fs

NICTA Copyright c© 2014 From Imagination to Impact 48

FAT FILE SYSTEMS

Clu

ster

Info

Blo

ck

FAT

Data Area

Root Directory

Bo

ot

Par

amR

eser

ved

NICTA Copyright c© 2014 From Imagination to Impact 49

FAT FILE SYSTEMS

Conjecture The controller has some number of

buffers it treats specially, to allow more than one write

locus.

NICTA Copyright c© 2014 From Imagination to Impact 50

TESTING SDHC CARDS

NICTA Copyright c© 2014 From Imagination to Impact 51

SD CARD CHARACTERISTICS

Card Price/G #AU Page size Erase Size

Kingston

Class 10

$0.80 2 128k 4M

Toshiba

Class 10

$1.20 2 64k 8M

SanDisk

Extreme

UHS-1

$5.00 9 64k 8M

SanDisk

Ex-

treme Pro

$6.50 9 16k 4M

NICTA Copyright c© 2014 From Imagination to Impact 52

WRITE PATTERNS: FILE CREATE

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0 0.5 1 1.5 2 2.5 3 3.5 4

Blo

ckN

umbe

r

Time

Write 40M File

0

2e+06

4e+06

6e+06

8e+06

1e+07

1.2e+07

1.4e+07

0 2 4 6 8 10 12

Blo

ckN

umbe

r

Time

Write 40M File

(On Toshiba Exceria card)

NICTA Copyright c© 2014 From Imagination to Impact 53

WRITE PATTERNS: FILE CREATE

8000

10000

12000

14000

16000

18000

20000

22000

0 1 2 3 4 5 6 7

"~/iozone.dat" using 1:2

NICTA Copyright c© 2014 From Imagination to Impact 54

F2FS

• By Samsung

NICTA Copyright c© 2014 From Imagination to Impact 55

F2FS

• By Samsung

• ‘Use on-card FTL, rather than work against it’

NICTA Copyright c© 2014 From Imagination to Impact 55

F2FS

• By Samsung

• ‘Use on-card FTL, rather than work against it’

• Cooperate with garbage collection

NICTA Copyright c© 2014 From Imagination to Impact 55

F2FS

• By Samsung

• ‘Use on-card FTL, rather than work against it’

• Cooperate with garbage collection

• Use FAT32 optimisations

NICTA Copyright c© 2014 From Imagination to Impact 55

F2FS

• 2M Segments written as whole chunks

NICTA Copyright c© 2014 From Imagination to Impact 56

F2FS

• 2M Segments written as whole chunks — always

writes at log head

NICTA Copyright c© 2014 From Imagination to Impact 56

F2FS

• 2M Segments written as whole chunks — always

writes at log head

— aligned with FLASH allocation units

NICTA Copyright c© 2014 From Imagination to Impact 56

F2FS

• 2M Segments written as whole chunks — always

writes at log head

— aligned with FLASH allocation units

• Log is the only data structure on-disk

NICTA Copyright c© 2014 From Imagination to Impact 56

F2FS

• 2M Segments written as whole chunks — always

writes at log head

— aligned with FLASH allocation units

• Log is the only data structure on-disk

• Metadata (e.g., head of log) written to FAT area in

single-block writes

NICTA Copyright c© 2014 From Imagination to Impact 56

F2FS

• 2M Segments written as whole chunks — always

writes at log head

— aligned with FLASH allocation units

• Log is the only data structure on-disk

• Metadata (e.g., head of log) written to FAT area in

single-block writes

• Splits Hot and Cold data and Inodes.

NICTA Copyright c© 2014 From Imagination to Impact 56

BENCHMARKS: POSTMARK 32K READ

Kingston Toshiba

Sandisk Extreme SanDisk Extreme Pro

Filesystem

EX

T4

FA

T32

1

0

2

3

5

4

F2F

S

MB

/s

NICTA Copyright c© 2014 From Imagination to Impact 57

USING NON-F2FS

• Observation: XFS and ext4 already understand RAID

NICTA Copyright c© 2014 From Imagination to Impact 58

USING NON-F2FS

• Observation: XFS and ext4 already understand RAID

• RAID has multiple chunks, and a fixed stride, so...

NICTA Copyright c© 2014 From Imagination to Impact 58

USING NON-F2FS

• Observation: XFS and ext4 already understand RAID

• RAID has multiple chunks, and a fixed stride, so...

• Configure FS as if for RAID

NICTA Copyright c© 2014 From Imagination to Impact 58

USING NON-F2FS

Still running benchmarks, see LCA talk next January for

results!

NICTA Copyright c© 2014 From Imagination to Impact 59

SCALABILITY

The Multiprocessor Effect:

• Some fraction of the system’s cycles are not available

for application work:

– Operating System Code Paths

– Inter-Cache Coherency traffic

– Memory Bus contention

– Lock synchronisation

– I/O serialisation

NICTA Copyright c© 2014 From Imagination to Impact 60

SCALABILITY

Amdahl’s law:

If a process can be split

such that σ of the running

time cannot be sped up, but

the rest is sped up by

running on p processors,

then overall speedup is

p

1 + σ(p− 1)

T(1−σ ) Tσ

T(1−σ )

T(1−σ )

T(1−σ )

NICTA Copyright c© 2014 From Imagination to Impact 61

SCALABILITY

1 processor

Throughput

Applied load

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

2 processors

3 processors

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

2 processors

3 processors

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

1 processor

Throughput

Applied load

2 processors

3 processors

NICTA Copyright c© 2014 From Imagination to Impact 62

SCALABILITY

3 processors

2 processors

Applied load

Throughput

Latency

Throughput

NICTA Copyright c© 2014 From Imagination to Impact 63

SCALABILITY

Gunther’s law:

C(N) =N

1 + α(N − 1) + βN(N − 1)

where:

N is demand

α is the amount of serialisation: represents Amdahl’s law

β is the coherency delay in the system.

C is Capacity or Throughput

NICTA Copyright c© 2014 From Imagination to Impact 64

SCALABILITY

0

2000

4000

6000

8000

10000

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0,beta=0

α = 0, β = 0

NICTA Copyright c© 2014 From Imagination to Impact 65

SCALABILITY

0

2000

4000

6000

8000

10000

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0,beta=0

α = 0, β = 0

0

10

20

30

40

50

60

70

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0.015,beta=0

α > 0, β = 0

NICTA Copyright c© 2014 From Imagination to Impact 65

SCALABILITY

0

2000

4000

6000

8000

10000

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0,beta=0

α = 0, β = 0

0

10

20

30

40

50

60

70

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0.015,beta=0

α > 0, β = 0

0

100

200

300

400

500

600

700

0 2000 4000 6000 8000 10000

Thr

ough

put

Load

USL with alpha=0.001,beta=0.0000001

α > 0, β > 0

NICTA Copyright c© 2014 From Imagination to Impact 65

SCALABILITY

Queueing Models:

ServerQueue

Poissonarrivals

Poissonservice times

NICTA Copyright c© 2014 From Imagination to Impact 66

SCALABILITY

Queueing Models:

ServerQueue

Poissonarrivals

Poissonservice times

ServerQueue

Poissonservice times

Same Server

High Priority

Normal Priority

Sink

NICTA Copyright c© 2014 From Imagination to Impact 66

SCALABILITY

Real examples:

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 10 20 30 40 50 60 70 80

Thr

ough

put

Load

Postgres TPC throughput

NICTA Copyright c© 2014 From Imagination to Impact 67

SCALABILITY

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 10 20 30 40 50 60 70 80

Thr

ough

put

Load

USL with alpha=0.342101,beta=0.017430Postgres TPC throughput

NICTA Copyright c© 2014 From Imagination to Impact 68

SCALABILITY

0

1000

2000

3000

4000

5000

6000

7000

8000

0 10 20 30 40 50 60 70 80

Thr

ough

put

Load

Postgres TPC throughput, separate log disc

NICTA Copyright c© 2014 From Imagination to Impact 69

SCALABILITY

Another example:

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

0 10 20 30 40 50

Jobs

per

Min

ute

Number of Clients

01-way02-way04-way08-way12-way

NICTA Copyright c© 2014 From Imagination to Impact 70

SCALABILITY

SPINLOCKS HOLD WAIT

UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME

72.3% 13.1% 0.5us(9.5us) 29us( 20ms)(42.5%) 50542055 86.9% 13.1% 0%

find lock page+0x30

0.01% 85.3% 1.7us(6.2us) 46us(4016us)(0.01%) 1113 14.7% 85.3% 0%

find lock page+0x130

NICTA Copyright c© 2014 From Imagination to Impact 71

SCALABILITY

struct page *find lock page(struct address space *mapping,

unsigned long offset)

{

struct page *page;

spin lock irq(&mapping->tree lock);

repeat:

page = radix tree lookup(&mapping>page tree, offset);

if (page) {

page cache get(page);

if (TestSetPageLocked(page)) {

spin unlock irq(&mapping->tree lock);

lock page(page);

spin lock irq(&mapping->tree lock);

. . .NICTA Copyright c© 2014 From Imagination to Impact 72

SCALABILITY

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 10 20 30 40 50

Jobs

per

Min

ute

Number of Clients

01-way02-way04-way08-way12-way16-way

NICTA Copyright c© 2014 From Imagination to Impact 73

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

– not always easy

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

• fix or work around it

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

• fix or work around it

– not always easy

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

• fix or work around it

• check performance doesn’t suffer too much on the

low end.

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Find the bottleneck

• fix or work around it

• check performance doesn’t suffer too much on the

low end.

• Experiment with different algorithms, parameters

NICTA Copyright c© 2014 From Imagination to Impact 74

TACKLING SCALABILITY PROBLEMS

• Each solved problem

uncovers another

• Fixing performance for

one workload can

worsen another

NICTA Copyright c© 2014 From Imagination to Impact 75

TACKLING SCALABILITY PROBLEMS

• Each solved problem

uncovers another

• Fixing performance for

one workload can

worsen another

• Performance problems

can make you cry

NICTA Copyright c© 2014 From Imagination to Impact 75

DOING WITHOUT LOCKS

Avoiding Serialisation:

• Lock-free algorithms

• Allow safe concurrent access without excessive

serialisation

NICTA Copyright c© 2014 From Imagination to Impact 76

DOING WITHOUT LOCKS

Avoiding Serialisation:

• Lock-free algorithms

• Allow safe concurrent access without excessive

serialisation

• Many techniques. We cover:

– Sequence locks

– Read-Copy-Update (RCU)

NICTA Copyright c© 2014 From Imagination to Impact 76

DOING WITHOUT LOCKS

Sequence locks:

• Readers don’t lock

• Writers serialised.

NICTA Copyright c© 2014 From Imagination to Impact 77

DOING WITHOUT LOCKS

Reader:

volatile seq;

do {

do {

lastseq = seq;

} while (lastseq & 1);

rmb();

....

} while (lastseq != seq);

NICTA Copyright c© 2014 From Imagination to Impact 78

DOING WITHOUT LOCKS

Writer:

spinlock(&lck);

seq++; wmb()

...

wmb(); seq++;

spinunlock(&lck);

NICTA Copyright c© 2014 From Imagination to Impact 79

DOING WITHOUT LOCKS

RCU: ??

1.

NICTA Copyright c© 2014 From Imagination to Impact 80

DOING WITHOUT LOCKS

RCU: ??

1. 2.

NICTA Copyright c© 2014 From Imagination to Impact 80

DOING WITHOUT LOCKS

RCU: ??

1. 2.

3.

NICTA Copyright c© 2014 From Imagination to Impact 80

DOING WITHOUT LOCKS

RCU: ??

1. 2.

3. 4.

NICTA Copyright c© 2014 From Imagination to Impact 80

BACKGROUND READING

References

McKenney, P. E. (2004), Exploiting Deferred Destruction:

An Analysis of Read-Copy-Update Techniques in

Operating System Kernels, PhD thesis, OGI School of

Science and Engineering at Oregon Health and

Sciences University.

URL:

http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf

McKenney, P. E., Sarma, D., Arcangelli, A., Kleen, A.,

Krieger, O. & Russell, R. (2002), Read copy update, inNICTA Copyright c© 2014 From Imagination to Impact 81

BACKGROUND READING

‘Ottawa Linux Symp.’.

URL:

http://www.rdrop.com/users/paulmck/rclock/rcu.2002.07.08.pdf

NICTA Copyright c© 2014 From Imagination to Impact 82

top related