TRANSCRIPT
CS/ECE 552: I/O
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/6
• HW5 Due Friday 4/10
– H&P 5.9, 5.12 may be helpful
– 5.12 posted on Canvas
– Goal: get feedback back ASAP for Phase 2.3
• Project Phase 2.3 due 4/17
• Optionally work on Phases 2.1 and 2.2
– Will make Phase 3 easier
2
Memory Hierarchy
3
L1-I cache L1-D cache
unified I/D L2 cache
main memory
Input/Output
4
L1-I cache L1-D cache
unified I/D L2 cache
main memory I/O I/O I/O
Reliability: RAID
• Error correction: more important for disk than for memory
– Error correction/detection per block (handled by disk hardware)
– Mechanical disk failures (entire disk lost) most common failure mode
• Many disks means high failure rates
• Entire file system can be lost if files striped across multiple disks
• RAID (redundant array of inexpensive disks)
– Add redundancy
– Similar to DRAM error correction, but…
– Major difference: which disk failed is known
• Even parity can be used to recover from single failures
• Parity disk can be used to reconstruct the data on a faulty disk
– RAID design balances bandwidth and fault-tolerance
– Implemented in hardware (fast, expensive) or software
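As a concrete illustration of parity-based recovery, here is a minimal Python sketch (the function names and the tiny two-byte "disks" are hypothetical, not from any real RAID implementation):

```python
from functools import reduce

def parity_block(data_blocks):
    # Even parity: bytewise XOR of all data blocks, stored on the parity disk.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_blocks))

def reconstruct(surviving_blocks, parity):
    # Because we know WHICH disk failed, XORing the parity with all surviving
    # blocks yields exactly the lost block (unlike DRAM, no ECC code needed).
    return parity_block(surviving_blocks + [parity])

disks = [b"\x0f\x00", b"\xf0\x01", b"\x33\x02"]  # toy 2-byte "disks"
p = parity_block(disks)
assert reconstruct([disks[0], disks[2]], p) == disks[1]  # disk 1 rebuilt
```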
5
Levels of RAID - Summary
• RAID-0 - no redundancy
– Multiplies read and write bandwidth
• RAID-1 - mirroring
– Pair disks together (write both, read one)
– 2x storage overhead
– Multiplies only read bandwidth (not write bandwidth)
• RAID-3 - bit-level parity (dedicated parity disk)
– N+1 disks, calculate parity (write all, read all)
– Good sequential read/write bandwidth, poor random accesses
– If N=8, only 13% overhead
• RAID-4/5 - block-level parity
– Reads only data you need
– Writes require read, calculate parity, write data & parity
• RAID-6 – diagonal parity
6
RAID-3: Bit-level parity
• RAID-3 - bit-level parity
– dedicated parity disk
– N+1 disks, calculate parity (write all, read all)
– Good sequential read/write bandwidth, poor random accesses
– If N=8, only 13% overhead
© 2003 Elsevier Science
7
RAID 4/5 - Block-level Parity
© 2003 Elsevier Science
• RAID-4/5
– Reads only data you need
– Writes require read, calculate parity, write data & parity
• Naïve approach
1. Read all disks
2. Calculate parity
3. Write data & parity
• Better approach
– Read data & parity
– Calculate parity
– Write data & parity
• Still worse for small writes than RAID-3
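The difference between the naïve and better write paths can be sketched in Python (a hedged toy model with one-byte blocks; the helper names are mine):

```python
def full_parity(blocks):
    # Naïve path: recompute parity from every data block (read all disks).
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

def small_write_parity(old_parity, old_data, new_data):
    # Better path: new parity = old parity XOR old data XOR new data, so a
    # small write touches only the data disk and the parity disk.
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

blocks = [b"\x10", b"\x22", b"\x33"]
parity = full_parity(blocks)
new_block = b"\x7f"
updated = small_write_parity(parity, blocks[1], new_block)
# Both paths must agree on the parity after updating block 1:
assert updated == full_parity([blocks[0], new_block, blocks[2]])
```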
8
From the original paper:
9
RAID-4 vs RAID-5
• RAID-5 rotates the parity disk, avoiding a single-disk bottleneck
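A minimal sketch of the rotation idea (one illustrative round-robin layout; real RAID-5 implementations use several different layouts):

```python
def parity_disk(stripe, n_disks):
    # Round-robin placement: each successive stripe puts its parity block on
    # the next disk, spreading parity writes across all disks.
    return stripe % n_disks

# With 5 disks, 5 consecutive stripes place parity on 5 different disks,
# instead of all hitting one dedicated parity disk as in RAID-4.
assert [parity_disk(s, 5) for s in range(5)] == [0, 1, 2, 3, 4]
```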
© 2003 Elsevier Science
10
In color: RAID 4 vs. RAID 5
11
Source: Wikipedia
In color: RAID 6
12
Source: Wikipedia
Input/Output
• How to communicate from host processor to I/O device?
– Memory-mapped I/O vs. ISA commands for I/O
13
virtual memory space
memory-mapped I/O register
instruction set
read/write I/O register
Input/Output
• How to communicate from I/O device to host processor?
– Polling vs. interrupts
14
L1-I cache L1-D cache
unified I/D L2 cache
main memory I/O I/O I/O
poll (read) I/O register
interrupt processor (similar to exceptions)
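To make the polling option concrete, here is a toy Python model (the Device class, its register methods, and the ready countdown are all hypothetical; it only shows that polling burns CPU reads while waiting):

```python
class Device:
    """Toy I/O device exposing a status register and a data register."""
    def __init__(self, data):
        self._data = data
        self._polls_until_ready = 3   # device becomes ready after a delay

    def status(self):                 # read the (memory-mapped) status register
        self._polls_until_ready -= 1
        return self._polls_until_ready <= 0

    def data_reg(self):               # read the (memory-mapped) data register
        return self._data

def poll_read(dev):
    polls = 0
    while not dev.status():           # busy-wait: CPU does no useful work here
        polls += 1
    return dev.data_reg(), polls      # interrupts would avoid the wasted reads

value, polls = poll_read(Device(0x2A))
assert value == 0x2A
```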
Designing an I/O System for Bandwidth
• Approach
– Find bandwidths of individual components
– Configure components you can change…
– To match bandwidth of bottleneck component you can’t
• Example
– Parameters
• 300 MIPS CPU, 100 MB/s I/O bus
• 50K OS insns + 100K user insns per I/O operation
• SCSI-2 controllers (20 MB/s): each accommodates up to 7 disks
• 5 MB/s disks with t_seek + t_rotation = 10 ms, 64 KB reads
– Determine
• What is the maximum sustainable I/O rate?
• How many SCSI-2 controllers and disks does it require?
– Assuming random reads
15
• First: determine I/O rates of components we can’t change
• Second: configure remaining components to match rate
16
Designing an I/O System for Bandwidth
• First: determine I/O rates of components we can’t change
– CPU: (300M insn/s) / (150K Insns/IO) = 2000 IO/s
– I/O bus: (100M B/s) / (64K B/IO) = 1562 IO/s
– Peak I/O rate determined by bus: 1562 IO/s
• Second: configure remaining components to match rate
– Disk: 1 / [10 ms/IO + (64K B/IO) / (5M B/s)] = 43.9 IO/s
– How many disks?
• (1562 IO/s) / (43.9 IO/s) = 36 disks
– How many controllers?
• For 100MB/s we need five 20MB/s controllers
• But each can only hold 7 disks, 7*5 = 35
• So, we need six SCSI-2 controllers
• Caveat: real I/O systems modeled with simulation
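The arithmetic above can be reproduced in a few lines of Python (mirroring the slide's numbers; 64 KB is treated as 64,000 B, as the slide's 1562 IO/s figure implies):

```python
import math

cpu_rate  = 300e6 / 150e3              # 2000 IO/s (50K OS + 100K user insns)
bus_rate  = 100e6 / 64e3               # 1562.5 IO/s -> the bottleneck
peak      = min(cpu_rate, bus_rate)
disk_rate = 1 / (10e-3 + 64e3 / 5e6)   # ~43.9 IO/s per disk (seek+rot+transfer)
n_disks   = math.ceil(peak / disk_rate)      # 36 disks to sustain the peak rate
n_ctrl    = max(math.ceil(100 / 20),         # bandwidth alone: 5 controllers
                math.ceil(n_disks / 7))      # but 7 disks each forces 6
assert (n_disks, n_ctrl) == (36, 6)
```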
17
Designing an I/O System for Bandwidth
CS/ECE 552: Parallel Processors (Part 1)
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/16
• Project Phase 2.3 due tomorrow
– Posted trace of randbench
• Phase 3 next
– Integrate Phase 2.3 into Phase 2
– Add forwarding, etc. if you haven’t already
• Posted additional register renaming practice
• AEFIS Final Evals Released 4/17
– Will post requests for feedback shortly
19
Quiz Week 13 Renaming Question
20
Quiz Week 13 Renaming Question
• Key idea:
– Every instruction takes 1 cycle in X
– Start with 4 free physical registers
– By the time the 5th instruction (lw $t0) reaches Di, the 1st instruction is in R in the same cycle
– Since we do R first in a cycle, its freed physical register can be reused right away by the 5th instruction
– This process repeats for subsequent instructions
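A toy Python sketch of the free-list reuse described above (the structure and register names are illustrative, not the project's actual design):

```python
from collections import deque

free_list = deque(["P5", "P6", "P7", "P8"])   # start with 4 free physical regs

def retire(freed_preg):
    free_list.append(freed_preg)              # R happens first in the cycle...

def dispatch():
    return free_list.popleft()                # ...so Di can reuse it right away

# The first four instructions consume all four free physical registers.
allocated = [dispatch() for _ in range(4)]
# In the cycle where the 5th instruction reaches Di, the 1st retires:
retire(allocated[0])                          # R frees P5
assert dispatch() == "P5"                     # the 5th instruction reuses it
```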
21
Quiz Week 13 Renaming Question
22
insn \ cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lw $t0, 0($s2) F De Di S X C R
and $s2, $t2, $t1 F De Di S X C R
or $s1, $s1, $t2 F De Di S X C R
sub $t2, $s0, $s2 F De Di S X C R
lw $t0, 4($t0) F De Di S X C R
lw $s2, 0($s1) F De Di S X C R
sub $t0, $t1, $s1 F De Di S X C R
add $t1, $t2, $s1 F De Di S X C R
or $s1, $t2, $t0 F De Di S X C
lw $t2, 0($t0) F De Di
Types of Parallelism
23
Compute-Level Parallelism:
➢ Executing multiple computation streams simultaneously
Data-Level Parallelism:
➢ Processing multiple data streams simultaneously
*-Level Parallelism
24
Request-Level Parallelism:
➢ Users issue simultaneous requests (e.g., databases, web servers, transactional systems)
➢ Compute-level parallelism in multi-chip processors
Task-Level Parallelism:
➢ Programs invoke simultaneous tasks (e.g., dataflow, message passing, speculative processors)
➢ Compute-level parallelism in multi-chip processors and multiprocessors
Thread-Level Parallelism:
➢ Programs execute simultaneous threads (e.g., pthreads, OpenMP, GPUs)
➢ Compute-level parallelism in multiprocessors
Memory-Level Parallelism:
➢ Threads issue simultaneous memory accesses (e.g., cache misses, prefetchers)
➢ Data-level parallelism in multiprocessors
*-Level Parallelism
25
Instruction-Level Parallelism:
➢ Threads issue simultaneous instructions (e.g., out-of-order, superscalar)
➢ Compute-level parallelism in uniprocessors
Superword-Level Parallelism:
➢ Instructions process simultaneous data words (e.g., vector instructions, Intel SSE/AVX)
➢ Data-level parallelism in uniprocessors
Bit-Level Parallelism:
➢ Operations manipulate simultaneous bits (e.g., functional units, bitwise arithmetic)
➢ Data-level parallelism in uniprocessors
Amdahl’s Law, Redux
• Speedup = 1 / (1 – f + f/s)
– f is the fraction of the program done in parallel
– s is the speedup of the parallel fraction (e.g., the number of cores)
26
Amdahl’s Law Example
• Applying Amdahl's Law, you estimate that when executing on two cores, the speedup of your entire program is 1.5x. What is the fraction of your program that can be parallelized?
Speedup = 1.5, s = 2
Speedup = 1.5 = 1 / ((1 – x) + x/2)
(1 – x) + x/2 = 1/1.5 = 2/3, so 1 – x/2 = 2/3
x = 2/3
27
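Both the formula and this example can be checked numerically (a small sketch; `amdahl_speedup` is my name for it):

```python
def amdahl_speedup(f, s):
    # Speedup = 1 / ((1 - f) + f/s): f = parallel fraction, s = parallel speedup
    return 1 / ((1 - f) + f / s)

# The example above: f = 2/3 on s = 2 cores gives exactly 1.5x.
assert abs(amdahl_speedup(2/3, 2) - 1.5) < 1e-12
# Sanity check: a fully parallel program (f = 1) scales by s.
assert amdahl_speedup(1.0, 4) == 4
```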
Example Vector ISA Extensions (SIMD)
• Extend ISA with floating point (FP) vector storage …
– Vector register: fixed-size array of 32- or 64-bit FP elements
– Vector length: for example 4, 8, 16, 64, …
• … and example operations for vector length of 4
– Load vector: lv $v1, X($r1) // [X+r1] -> v1
lv: [X+r1+0] -> v1[0]
[X+r1+1] -> v1[1]
[X+r1+2] -> v1[2]
[X+r1+3] -> v1[3]
– Add two vectors: addv.f v3, v1, v2 // v3 = v1 + v2
addv.f: v3[i] = v1[i] + v2[i] (where i is 0,1,2,3)
– Add vector to scalar: addv.f v3, v1, f2
addv.f: v3[i] = v1[i] + f2 (where i is 0,1,2,3)
• Today’s vectors: short (128 or 256 bits), but fully parallel
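The per-element semantics above can be sketched in Python (a simplified model; the helper names and dict-based memory are assumptions, not real ISA behavior):

```python
VLEN = 4  # vector length from the slide's example

def lv(mem, base):                 # lv v1, X(r1): load VLEN consecutive elements
    return [mem[base + i] for i in range(VLEN)]

def addv(v1, v2):                  # addv.f v3, v1, v2: elementwise add
    return [a + b for a, b in zip(v1, v2)]

def addvs(v1, f2):                 # addv.f v3, v1, f2: add scalar to each lane
    return [a + f2 for a in v1]

mem = {10: 1.0, 11: 2.0, 12: 3.0, 13: 4.0}
v1 = lv(mem, 10)
assert addv(v1, v1) == [2.0, 4.0, 6.0, 8.0]
assert addvs(v1, 0.5) == [1.5, 2.5, 3.5, 4.5]
```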
28
Example Use of Vectors – 4-wide
• Operations
– Load vector: lv v1, X(r1) // [X+r1] -> v1
– Multiply vector by scalar: mulv.f v3, v1, f2 // v1 * f2 -> v3
– Add two vectors: addv.f v3, v1, v2 // v1 + v2 -> v3
– Store vector: sv v1, X(r1) // v1 -> [X+r1]
• Performance?
– Best case: 4x speedup
– Tradeoff: execution width (implementation) vs vector width (ISA)
L1: lwc1 f1, X(r1)
mul.s f2, f0, f1
lwc1 f3, Y(r1)
add.s f4, f2, f3
swc1 f4, Z(r1)
addi r1, r1, 4
bne r1, 4096, L1
L1: lv v1, X(r1)
mulv.f v2, v1, f0
lv v3, Y(r1)
addv.f v4, v2, v3
sv v4, Z(r1)
addi r1, r1, 16
bne r1, 4096, L1
7x1024 instructions vs. 7x256 instructions
(4x fewer instructions)
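The instruction counts above follow directly from the loop bounds (a quick check; the 4096 comes from each loop's bne limit):

```python
N_BYTES = 4096
scalar_iters = N_BYTES // 4    # one 4-byte element per iteration -> 1024
vector_iters = N_BYTES // 16   # four elements (16 B) per iteration -> 256
insns_per_iter = 7             # load, mul, load, add, store, addi, bne

assert insns_per_iter * scalar_iters == 7 * 1024
assert insns_per_iter * vector_iters == 7 * 256
assert scalar_iters // vector_iters == 4   # 4x fewer instructions
```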
29
Parallelism and Concurrency
30
Concurrency: coexistence of multiple tasks/computations in progress
➢ Logically doing multiple things at once
➢ Typically algorithm/system-defined
Parallelism: execution of multiple tasks/computations simultaneously
➢ Physically doing multiple things at once
➢ Typically architecture-defined
Parallelism and Concurrency
31
e.g., Concurrent but not parallel:
Parallelism and Concurrency
32
e.g., Concurrent but not parallel:
➢ Context-switching threads on a single core
thread 1 thread 2
context switches
Parallelism and Concurrency
33
e.g., Parallel but not concurrent:
Parallelism and Concurrency
34
e.g., Parallel but not concurrent:
➢ Pipelining
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
Parallelism and Concurrency
35
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3 task 4
task 5
Parallelism and Concurrency
36
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3 task 4
task 5
predict input, start early
Parallelism and Concurrency
37
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3
task 4
task 5
run speculatively
Parallelism and Concurrency
38
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3
task 4
task 5
run speculatively
Parallelism and Concurrency
39
e.g., Parallel but not concurrent:
➢ Parallelism via surrogates
task 1 sort task 2
Parallelism and Concurrency
40
e.g., Parallel but not concurrent:
➢ Parallelism via surrogates
task 1 radix sort
task 2
quick sort
bubble sort
CS/ECE 552: Parallel Processors (Part 3)
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/20
• Phase 2.3 grading on-going
• Phase 3 due 5/1
• AEFIS final evals out
– Specific questions posted
42
Multithreading
43
context-switching
Multithreading
44
context-switching
simultaneous multithreading
Multithreading
45
context-switching
simultaneous multithreading
chip multiprocessor
Multithreading
46
context-switching
simultaneous multithreading
chip multiprocessor
parallelism/throughput
area/complexity
Multithreading
47
convoying
Multithreading
48
convoying self-navigating
Multithreading
49
convoying self-navigating separate roads
Chip Multiprocessor
50
main memory
L1-I cache
L1-D cache
unified I/D L2 cache
L1-I cache
L1-D cache
unified I/D L2 cache
L1-I cache
L1-D cache
unified I/D L2 cache
Simultaneous Multithreading
51
• Fetch and execute instructions from different threads in parallel on the same superscalar processor
• OoO engine is naturally well-suited to SMT
Simultaneous Multithreading
52
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers
Both threads 1 and 2
Thread 1 Thread 2
Thread 1 Thread 2
Simultaneous Multithreading
53
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers
thread 1 map table
$t0 P15 rdy
$t1 P9 rdy
$s0 P6 !rdy
$s1 P4 rdy
$s2 P0 rdy
thread 2 map table
$t0 P2 !rdy
$t1 P10 rdy
$s0 P3 !rdy
$s1 P11 !rdy
$s2 P7 rdy
partitioned
free list
P8, P1, P14, P12, P5, P13
pooled
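The partitioned-map-table / pooled-free-list split can be sketched in Python (values taken from the figure above; the rename helper is illustrative, not a full rename stage):

```python
from collections import deque

# Each thread has its OWN (partitioned) map table...
map_tables = {1: {"$t0": "P15", "$s0": "P6"},
              2: {"$t0": "P2",  "$s0": "P3"}}
# ...but both threads draw physical registers from ONE shared (pooled) free list.
free_list = deque(["P8", "P1", "P14", "P12", "P5", "P13"])

def rename(thread, arch_reg):
    preg = free_list.popleft()          # either thread may take the next free reg
    map_tables[thread][arch_reg] = preg
    return preg

assert rename(1, "$s1") == "P8"         # thread 1 allocates from the shared pool
assert rename(2, "$s1") == "P1"         # thread 2 allocates from the same pool
```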
Simultaneous Multithreading
54
[J. E. Smith]
partitioned / pooled
e.g., Intel Hyperthreading:
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers