TRANSCRIPT
CS/ECE 552: I/O
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/6
• HW5 Due Friday 4/10
– H&P 5.9, 5.12 may be helpful
– 5.12 posted on Canvas
– Goal: get feedback back ASAP for Phase 2.3
• Project Phase 2.3 due 4/17
• Optionally work on Phases 2.1 and 2.2
– Will make Phase 3 easier
2
Memory Hierarchy
3
L1-I cache L1-D cache
unified I/D L2 cache
main memory
Input/Output
4
L1-I cache L1-D cache
unified I/D L2 cache
main memory I/O I/O I/O
Reliability: RAID
• Error correction: more important for disk than for memory
– Error correction/detection per block (handled by disk hardware)
– Mechanical disk failures (entire disk lost) most common failure mode
• Many disks means high failure rates
• Entire file system can be lost if files striped across multiple disks
• RAID (redundant array of inexpensive disks)
– Add redundancy
– Similar to DRAM error correction, but…
– Major difference: which disk failed is known
• Even parity can be used to recover from single failures
• Parity disk can be used to reconstruct the data on a faulty disk
– RAID design balances bandwidth and fault-tolerance
– Implemented in hardware (fast, expensive) or software
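As a concrete illustration of parity-based recovery, here is a minimal Python sketch (the function names and the tiny two-byte "disks" are hypothetical, not from any real RAID implementation):

```python
from functools import reduce

def parity_block(data_blocks):
    # Even parity: bytewise XOR of all data blocks, stored on the parity disk.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_blocks))

def reconstruct(surviving_blocks, parity):
    # Because we know WHICH disk failed, XORing the parity with all surviving
    # blocks yields exactly the lost block (unlike DRAM, no ECC code needed).
    return parity_block(surviving_blocks + [parity])

disks = [b"\x0f\x00", b"\xf0\x01", b"\x33\x02"]  # toy 2-byte "disks"
p = parity_block(disks)
assert reconstruct([disks[0], disks[2]], p) == disks[1]  # disk 1 rebuilt
```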
5
Levels of RAID - Summary
• RAID-0 - no redundancy
– Multiplies read and write bandwidth
• RAID-1 - mirroring
– Pair disks together (write both, read one)
– 2x storage overhead
– Multiplies only read bandwidth (not write bandwidth)
• RAID-3 - bit-level parity (dedicated parity disk)
– N+1 disks, calculate parity (write all, read all)
– Good sequential read/write bandwidth, poor random accesses
– If N=8, only 13% overhead
• RAID-4/5 - block-level parity
– Reads only data you need
– Writes require read, calculate parity, write data & parity
• RAID-6 – diagonal parity
6
RAID-3: Bit-level parity
• RAID-3 - bit-level parity
– dedicated parity disk
– N+1 disks, calculate parity (write all, read all)
– Good sequential read/write bandwidth, poor random accesses
– If N=8, only 13% overhead
© 2003 Elsevier Science
7
RAID 4/5 - Block-level Parity
© 2003 Elsevier Science
• RAID-4/5
– Reads only data you need
– Writes require read, calculate parity, write data & parity
• Naïve approach
1. Read all disks
2. Calculate parity
3. Write data & parity
• Better approach
– Read data & parity
– Calculate parity
– Write data & parity
• Still worse for small writes than RAID-3
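The difference between the naïve and better write paths can be sketched in Python (a hedged toy model with one-byte blocks; the helper names are mine):

```python
def full_parity(blocks):
    # Naïve path: recompute parity from every data block (read all disks).
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

def small_write_parity(old_parity, old_data, new_data):
    # Better path: new parity = old parity XOR old data XOR new data, so a
    # small write touches only the data disk and the parity disk.
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

blocks = [b"\x10", b"\x22", b"\x33"]
parity = full_parity(blocks)
new_block = b"\x7f"
updated = small_write_parity(parity, blocks[1], new_block)
# Both paths must agree on the parity after updating block 1:
assert updated == full_parity([blocks[0], new_block, blocks[2]])
```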
8
From the original paper:
9
RAID-4 vs RAID-5
• RAID-5 rotates the parity disk, avoiding a single-disk bottleneck
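A minimal sketch of the rotation idea (one illustrative round-robin layout; real RAID-5 implementations use several different layouts):

```python
def parity_disk(stripe, n_disks):
    # Round-robin placement: each successive stripe puts its parity block on
    # the next disk, spreading parity writes across all disks.
    return stripe % n_disks

# With 5 disks, 5 consecutive stripes place parity on 5 different disks,
# instead of all hitting one dedicated parity disk as in RAID-4.
assert [parity_disk(s, 5) for s in range(5)] == [0, 1, 2, 3, 4]
```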
© 2003 Elsevier Science
10
In color: RAID 4 vs. RAID 5
11
Source: Wikipedia
In color: RAID 6
12
Source: Wikipedia
Input/Output
• How to communicate from host processor to I/O device?
– Memory-mapped I/O vs. ISA commands for I/O
13
virtual memory space
memory-mapped I/O register
instruction set
read/write I/O register
Input/Output
• How to communicate from I/O device to host processor?
– Polling vs. interrupts
14
L1-I cache L1-D cache
unified I/D L2 cache
main memory I/O I/O I/O
poll (read) I/O register
interrupt processor (similar to exceptions)
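To make the polling option concrete, here is a toy Python model (the Device class, its register methods, and the ready countdown are all hypothetical; it only shows that polling burns CPU reads while waiting):

```python
class Device:
    """Toy I/O device exposing a status register and a data register."""
    def __init__(self, data):
        self._data = data
        self._polls_until_ready = 3   # device becomes ready after a delay

    def status(self):                 # read the (memory-mapped) status register
        self._polls_until_ready -= 1
        return self._polls_until_ready <= 0

    def data_reg(self):               # read the (memory-mapped) data register
        return self._data

def poll_read(dev):
    polls = 0
    while not dev.status():           # busy-wait: CPU does no useful work here
        polls += 1
    return dev.data_reg(), polls      # interrupts would avoid the wasted reads

value, polls = poll_read(Device(0x2A))
assert value == 0x2A
```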
Designing an I/O System for Bandwidth
• Approach
– Find bandwidths of individual components
– Configure components you can change…
– To match bandwidth of bottleneck component you can’t
• Example
– Parameters
• 300 MIPS CPU, 100 MB/s I/O bus
• 50K OS insns + 100K user insns per I/O operation
• SCSI-2 controllers (20 MB/s): each accommodates up to 7 disks
• 5 MB/s disks with t_seek + t_rotation = 10 ms, 64 KB reads
– Determine
• What is the maximum sustainable I/O rate?
• How many SCSI-2 controllers and disks does it require?
– Assuming random reads
15
• First: determine I/O rates of components we can’t change
• Second: configure remaining components to match rate
16
Designing an I/O System for Bandwidth
• First: determine I/O rates of components we can’t change
– CPU: (300M insn/s) / (150K Insns/IO) = 2000 IO/s
– I/O bus: (100M B/s) / (64K B/IO) = 1562 IO/s
– Peak I/O rate determined by bus: 1562 IO/s
• Second: configure remaining components to match rate
– Disk: 1 / [10 ms/IO + (64K B/IO) / (5M B/s)] = 43.9 IO/s
– How many disks?
• (1562 IO/s) / (43.9 IO/s) = 36 disks
– How many controllers?
• For 100MB/s we need five 20MB/s controllers
• But each can only hold 7 disks, 7*5 = 35
• So, we need six SCSI-2 controllers
• Caveat: real I/O systems modeled with simulation
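The arithmetic above can be reproduced in a few lines of Python (mirroring the slide's numbers; 64 KB is treated as 64,000 B, as the slide's 1562 IO/s figure implies):

```python
import math

cpu_rate  = 300e6 / 150e3              # 2000 IO/s (50K OS + 100K user insns)
bus_rate  = 100e6 / 64e3               # 1562.5 IO/s -> the bottleneck
peak      = min(cpu_rate, bus_rate)
disk_rate = 1 / (10e-3 + 64e3 / 5e6)   # ~43.9 IO/s per disk (seek+rot+transfer)
n_disks   = math.ceil(peak / disk_rate)      # 36 disks to sustain the peak rate
n_ctrl    = max(math.ceil(100 / 20),         # bandwidth alone: 5 controllers
                math.ceil(n_disks / 7))      # but 7 disks each forces 6
assert (n_disks, n_ctrl) == (36, 6)
```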
17
Designing an I/O System for Bandwidth
CS/ECE 552: Parallel Processors (Part 1)
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/16
• Project Phase 2.3 due tomorrow
– Posted trace of randbench
• Phase 3 next
– Integrate Phase 2.3 into Phase 2
– Add forwarding, etc. if you haven’t already
• Posted additional register renaming practice
• AEFIS Final Evals Released 4/17
– Will post requests for feedback shortly
19
Quiz Week 13 Renaming Question
20
Quiz Week 13 Renaming Question
• Key idea:
– Every instruction takes 1 cycle in X
– Start with 4 free physical registers
– By the time the 5th instruction (lw $t0) reaches Di, the 1st instruction is in R in the same cycle
– Since we do R first in a cycle, its freed physical register can be reused right away by the 5th instruction
– This process repeats for subsequent instructions
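A toy Python sketch of the free-list reuse described above (the structure and register names are illustrative, not the project's actual design):

```python
from collections import deque

free_list = deque(["P5", "P6", "P7", "P8"])   # start with 4 free physical regs

def retire(freed_preg):
    free_list.append(freed_preg)              # R happens first in the cycle...

def dispatch():
    return free_list.popleft()                # ...so Di can reuse it right away

# The first four instructions consume all four free physical registers.
allocated = [dispatch() for _ in range(4)]
# In the cycle where the 5th instruction reaches Di, the 1st retires:
retire(allocated[0])                          # R frees P5
assert dispatch() == "P5"                     # the 5th instruction reuses it
```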
21
Quiz Week 13 Renaming Question
22
insn \ cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14
lw $t0, 0($s2) F De Di S X C R
and $s2, $t2, $t1 F De Di S X C R
or $s1, $s1, $t2 F De Di S X C R
sub $t2, $s0, $s2 F De Di S X C R
lw $t0, 4($t0) F De Di S X C R
lw $s2, 0($s1) F De Di S X C R
sub $t0, $t1, $s1 F De Di S X C R
add $t1, $t2, $s1 F De Di S X C R
or $s1, $t2, $t0 F De Di S X C
lw $t2, 0($t0) F De Di
Types of Parallelism
23
Compute-Level Parallelism:
➢ Executing multiple computation streams simultaneously
Data-Level Parallelism:
➢ Processing multiple data streams simultaneously
*-Level Parallelism
24
Request-Level Parallelism:
➢ Users issue simultaneous requests (e.g., databases, web servers, transactional systems)
➢ Compute-level parallelism in multi-chip processors
Task-Level Parallelism:
➢ Programs invoke simultaneous tasks (e.g., dataflow, message passing, speculative processors)
➢ Compute-level parallelism in multi-chip processors and multiprocessors
Thread-Level Parallelism:
➢ Programs execute simultaneous threads (e.g., pthreads, OpenMP, GPUs)
➢ Compute-level parallelism in multiprocessors
Memory-Level Parallelism:
➢ Threads issue simultaneous memory accesses (e.g., cache misses, prefetchers)
➢ Data-level parallelism in multiprocessors
*-Level Parallelism
25
Instruction-Level Parallelism:
➢ Threads issue simultaneous instructions (e.g., out-of-order, superscalar)
➢ Compute-level parallelism in uniprocessors
Superword-Level Parallelism:
➢ Instructions process simultaneous data words (e.g., vector instructions, Intel SSE/AVX)
➢ Data-level parallelism in uniprocessors
Bit-Level Parallelism:
➢ Operations manipulate simultaneous bits (e.g., functional units, bitwise arithmetic)
➢ Data-level parallelism in uniprocessors
Amdahl’s Law, Redux
• Speedup = 1 / (1 – f + f/s)
– f is the fraction of the program done in parallel
– s is the speedup of the parallel fraction (e.g., the number of cores)
26
Amdahl’s Law Example
• Applying Amdahl's Law, you estimate that when executing on two cores, the speedup of your entire program is 1.5x. What is the fraction of your program that can be parallelized?
Speedup = 1.5, s = 2
Speedup = 1.5 = 1 / ((1 – x) + x/2)
(1 – x) + x/2 = 1/1.5 = 2/3, so 1 – x/2 = 2/3
x = 2/3
27
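Both the formula and this example can be checked numerically (a small sketch; `amdahl_speedup` is my name for it):

```python
def amdahl_speedup(f, s):
    # Speedup = 1 / ((1 - f) + f/s): f = parallel fraction, s = parallel speedup
    return 1 / ((1 - f) + f / s)

# The example above: f = 2/3 on s = 2 cores gives exactly 1.5x.
assert abs(amdahl_speedup(2/3, 2) - 1.5) < 1e-12
# Sanity check: a fully parallel program (f = 1) scales by s.
assert amdahl_speedup(1.0, 4) == 4
```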
Example Vector ISA Extensions (SIMD)
• Extend ISA with floating point (FP) vector storage …
– Vector register: fixed-size array of 32- or 64-bit FP elements
– Vector length: for example 4, 8, 16, 64, …
• … and example operations for vector length of 4
– Load vector: lv $v1, X($r1) // [X+r1] -> v1
lv: [X+r1+0] -> v1[0]
[X+r1+1] -> v1[1]
[X+r1+2] -> v1[2]
[X+r1+3] -> v1[3]
– Add two vectors: addv.f v3, v1, v2 // v3 = v1 + v2
addv.f: v3[i] = v1[i] + v2[i] (where i is 0,1,2,3)
– Add vector to scalar: addv.f v3, v1, f2
addv.f: v3[i] = v1[i] + f2 (where i is 0,1,2,3)
• Today’s vectors: short (128 or 256 bits), but fully parallel
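The per-element semantics above can be sketched in Python (a simplified model; the helper names and dict-based memory are assumptions, not real ISA behavior):

```python
VLEN = 4  # vector length from the slide's example

def lv(mem, base):                 # lv v1, X(r1): load VLEN consecutive elements
    return [mem[base + i] for i in range(VLEN)]

def addv(v1, v2):                  # addv.f v3, v1, v2: elementwise add
    return [a + b for a, b in zip(v1, v2)]

def addvs(v1, f2):                 # addv.f v3, v1, f2: add scalar to each lane
    return [a + f2 for a in v1]

mem = {10: 1.0, 11: 2.0, 12: 3.0, 13: 4.0}
v1 = lv(mem, 10)
assert addv(v1, v1) == [2.0, 4.0, 6.0, 8.0]
assert addvs(v1, 0.5) == [1.5, 2.5, 3.5, 4.5]
```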
28
Example Use of Vectors – 4-wide
• Operations
– Load vector: lv v1, X(r1) // [X+r1] -> v1
– Multiply vector by scalar: mulv.f v3, v1, f2 // v1 * f2 -> v3
– Add two vectors: addv.f v3, v1, v2 // v1 + v2 -> v3
– Store vector: sv v1, X(r1) // v1 -> [X+r1]
• Performance?
– Best case: 4x speedup
– Tradeoff: execution width (implementation) vs vector width (ISA)
L1: lwc1 f1, X(r1)
mul.s f2, f0, f1
lwc1 f3, Y(r1)
add.s f4, f2, f3
swc1 f4, Z(r1)
addi r1, r1, 4
bne r1, 4096, L1
L1: lv v1, X(r1)
mulv.f v2, v1, f0
lv v3, Y(r1)
addv.f v4, v2, v3
sv v4, Z(r1)
addi r1, r1, 16
bne r1, 4096, L1
7x1024 instructions vs. 7x256 instructions
(4x fewer instructions)
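The instruction counts above follow directly from the loop bounds (a quick check; the 4096 comes from each loop's bne limit):

```python
N_BYTES = 4096
scalar_iters = N_BYTES // 4    # one 4-byte element per iteration -> 1024
vector_iters = N_BYTES // 16   # four elements (16 B) per iteration -> 256
insns_per_iter = 7             # load, mul, load, add, store, addi, bne

assert insns_per_iter * scalar_iters == 7 * 1024
assert insns_per_iter * vector_iters == 7 * 256
assert scalar_iters // vector_iters == 4   # 4x fewer instructions
```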
29
Parallelism and Concurrency
30
Concurrency: coexistence of multiple tasks/computations in progress
➢ Logically doing multiple things at once
➢ Typically algorithm/system-defined
Parallelism: execution of multiple tasks/computations simultaneously
➢ Physically doing multiple things at once
➢ Typically architecture-defined
Parallelism and Concurrency
31
e.g., Concurrent but not parallel:
Parallelism and Concurrency
32
e.g., Concurrent but not parallel:
➢ Context-switching threads on a single core
thread 1 thread 2
context switches
Parallelism and Concurrency
33
e.g., Parallel but not concurrent:
Parallelism and Concurrency
34
e.g., Parallel but not concurrent:
➢ Pipelining
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
F D X M W
Parallelism and Concurrency
35
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3 task 4
task 5
Parallelism and Concurrency
36
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3 task 4
task 5
predict input, start early
Parallelism and Concurrency
37
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3
task 4
task 5
run speculatively
Parallelism and Concurrency
38
e.g., Parallel but not concurrent:
➢ Parallelism via speculation
task 1
task 2
task 3
task 4
task 5
run speculatively
Parallelism and Concurrency
39
e.g., Parallel but not concurrent:
➢ Parallelism via surrogates
task 1 sort task 2
Parallelism and Concurrency
40
e.g., Parallel but not concurrent:
➢ Parallelism via surrogates
task 1 radix sort
task 2
quick sort
bubble sort
CS/ECE 552: Parallel Processors (Part 3)
Prof. Matthew D. Sinclair
Lecture notes based in part on slides created by Mikko Lipasti, Mark Hill, David Wood, Guri Sohi, John Shen,
Joshua San Miguel, and Jim Smith
Announcements 4/20
• Phase 2.3 grading on-going
• Phase 3 due 5/1
• AEFIS final evals out
– Specific questions posted
42
Multithreading
43
context-switching
Multithreading
44
context-switching
simultaneous multithreading
Multithreading
45
context-switching
simultaneous multithreading
chip multiprocessor
Multithreading
46
context-switching
simultaneous multithreading
chip multiprocessor
parallelism/throughput
area/complexity
Multithreading
47
convoying
Multithreading
48
convoying self-navigating
Multithreading
49
convoying self-navigating separate roads
Chip Multiprocessor
50
main memory
L1-I cache
L1-D cache
unified I/D L2 cache
L1-I cache
L1-D cache
unified I/D L2 cache
L1-I cache
L1-D cache
unified I/D L2 cache
Simultaneous Multithreading
51
• Fetch and execute instructions from different threads in parallel on the same superscalar processor
• OoO engine is naturally well-suited to SMT
Simultaneous Multithreading
52
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers
Both threads 1 and 2
Thread 1 Thread 2
Thread 1 Thread 2
Simultaneous Multithreading
53
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers
thread 1 map table
$t0 P15 rdy
$t1 P9 rdy
$s0 P6 !rdy
$s1 P4 rdy
$s2 P0 rdy
thread 2 map table
$t0 P2 !rdy
$t1 P10 rdy
$s0 P3 !rdy
$s1 P11 !rdy
$s2 P7 rdy
partitioned
free list
P8, P1, P14, P12, P5, P13
pooled
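The partitioned-map-table / pooled-free-list split can be sketched in Python (values taken from the figure above; the rename helper is illustrative, not a full rename stage):

```python
from collections import deque

# Each thread has its OWN (partitioned) map table...
map_tables = {1: {"$t0": "P15", "$s0": "P6"},
              2: {"$t0": "P2",  "$s0": "P3"}}
# ...but both threads draw physical registers from ONE shared (pooled) free list.
free_list = deque(["P8", "P1", "P14", "P12", "P5", "P13"])

def rename(thread, arch_reg):
    preg = free_list.popleft()          # either thread may take the next free reg
    map_tables[thread][arch_reg] = preg
    return preg

assert rename(1, "$s1") == "P8"         # thread 1 allocates from the shared pool
assert rename(2, "$s1") == "P1"         # thread 2 allocates from the same pool
```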
Simultaneous Multithreading
54
[J. E. Smith]
partitioned / pooled
e.g., Intel Hyperthreading:
• Partition the instruction/decode buffer, dispatch buffer, reorder buffer and map table
• Pool the reservation stations, functional units and physical registers