Storage, Network and Other Peripherals (b97032/lecture_10.pdf)
TRANSCRIPT
2
I/O Systems
[Figure: the processor (with cache) and main memory sit on a memory–I/O bus; I/O controllers on that bus drive a pair of disks, graphics, and a network, and signal the processor with interrupts]
Processor & I/O Communication
3
I/O
• I/O system evaluation metrics:
– Dependability
– Expandability
– Cost
– Performance
• I/O device characteristics:
– Behavior
» Input, output, or storage
– Partner (human or machine)
– Data rate
Device Behavior Partner Data Rate (KB/sec)
Keyboard Input Human 0.01
Mouse Input Human 0.02
Floppy disk Storage Machine 50.00
Laser Printer Output Human 100.00
Optical Disk Storage Machine 500.00
Magnetic Disk Storage Machine 5,000.00
Network-LAN Input or Output Machine 20 – 1,000.00
Graphics Display Output Human 30,000.00
4
• I/O System performance depends on many aspects of the system (“limited by weakest link in the chain”):
– The CPU
– The memory system:
» Internal and external caches
» Main Memory
– The underlying interconnection (buses)
– The I/O controller
– The I/O device
– The speed of the I/O software (Operating System)
– The efficiency of the software’s use of the I/O devices
• Two common performance metrics:
– Throughput:
» How much data can we move through the system in a certain time? (data rate)
» How many I/O operations can we do per unit of time? (I/O rate)
– Response time (latency)
I/O System Performance
5
• Purpose:
– Long-term, nonvolatile storage
– Large, inexpensive, slow level in the storage hierarchy
[Figure: a stack of platters; each surface is divided into concentric tracks, and each track into sectors]
Organization of a Hard Magnetic Disk
6
Devices: Magnetic Disks
• Reading/writing data is a three-stage process:
– Seek time (~20 ms average ≈ 1M cycles at 50 MHz)
» Move the arm over the proper track
– Rotational latency
» Wait for the desired sector to rotate under the head
» Average = 0.5 rotation / 3600 RPM = 8.3 ms
– Transfer time
» About a sector per ms (1–10 MB/sec)
• Controller time
– The overhead the controller imposes in performing an I/O access
[Figure: sector, track, cylinder, head, platter]
Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Time
7
Disk Time Example
• Disk Parameters:
– 512-byte sector
– Advertised average seek time is 6 ms
– Transfer rate is 50 MB/sec
– Disk spins at 10,000 RPM.
– Controller overhead is 0.2ms
– Assume that the disk is idle so there is no waiting time
• What is the average time to read/write a sector?
– Average seek time + average rotational delay + transfer time + controller overhead
– 6 ms + 0.5/(10,000 RPM) + 0.5 KB/(50 MB/sec) + 0.2 ms = 6 + 3.0 + 0.01 + 0.2 = 9.2 ms
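The arithmetic above can be checked in a few lines (a sketch; the parameters are the ones given in the example):

```python
# Average time to read/write a 512-byte sector with the stated parameters.
# All times are in milliseconds unless noted.

SECTOR_BYTES = 512
SEEK_MS = 6.0                  # advertised average seek time
RPM = 10_000
TRANSFER_BYTES_PER_SEC = 50e6  # 50 MB/sec
CONTROLLER_MS = 0.2

# Average rotational latency: half a revolution.
rotation_ms = 0.5 * 60_000 / RPM                            # 3.0 ms
transfer_ms = SECTOR_BYTES / TRANSFER_BYTES_PER_SEC * 1000  # ~0.01 ms

total_ms = SEEK_MS + rotation_ms + transfer_ms + CONTROLLER_MS
print(round(total_ms, 1))  # 9.2
```

Note that the seek dominates; the actual data transfer is two orders of magnitude cheaper than positioning the head.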
8
Reliability, Availability and Dependability
• Dependability
– Computer System dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other systems interacting with the system’s users. Each module also has an ideal specified behavior, where a service is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior.
• Two system states:
– Service accomplishment
– Service interruption
[Figure: failures move the system from service accomplishment to service interruption; restorations move it back]
9
Reliability, Availability and Dependability (cont.)
• Reliability
– A measure of the continuous service accomplished from a reference initial instant.
– MTTF - Mean Time to Failure
• Availability
– A measure of the service accomplished with respect to the alternation between the two states of accomplishment & interruption
– MTTR - Mean Time to Repair
– MTBF - Mean Time Between Failures
» MTBF = MTTF + MTTR
– Availability = MTTF / (MTTF + MTTR)
[Figure: timeline — a failure occurs, the system is unavailable for MTTR, the failure is repaired, and the system is available for MTTF until the next failure]
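A minimal sketch of the availability formula; the MTTF/MTTR figures below are made up for illustration:

```python
# Availability = MTTF / (MTTF + MTTR): the fraction of time spent in the
# service-accomplishment state rather than the service-interruption state.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is delivered as specified."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical disk: MTTF = 1,000,000 hours, MTTR = 24 hours.
a = availability(1_000_000, 24)
print(f"{a:.6f}")  # 0.999976
```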
10
How to Improve MTTF
• Fault avoidance
– Prevent fault occurrence by construction
• Fault tolerance
– Using redundancy to allow the service to comply with the service specification despite failures
• Fault forecasting
– Predict the presence and creation of faults
11
Improving Availability with Redundancy
• RAID
– Redundant arrays of inexpensive disks
• Redundancy offers two advantages:
– Data not lost: reconstruct data onto new disk
– Continuous operation in the presence of failures
[Figure: an array of small 3.5″ disks vs. a single large disk, compared on performance, reliability, and availability (with redundancy)]
12
Raid 0 : Striping & Non-Redundancy
Disk 0    Disk 1    Disk 2    Disk 3    Disk 4
Block 1   Block 2   Block 3   Block 4   Block 5
Block 6   Block 7   Block 8   Block 9   Block 10
Block 11  Block 12  Block 13  Block 14  Block 15
Block 16  Block 17  Block 18  Block 19  Block 20
Block 21  Block 22  Block 23  Block 24  Block 25
13
RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its "shadow"
– Very high availability can be achieved
• Most expensive solution: 100% capacity overhead
Disk 0   Disk 1
Block 1  Block 1
Block 2  Block 2
Block 3  Block 3
Block 4  Block 4
Block 5  Block 5
14
RAID 2: Error Detecting and Correcting Code (Unused)
Disk 0   Disk 1   ...  ECC Disk 6   ECC Disk 7
Bit 1    Bit 2         ECC 1-32     ECC 1-32
Bit 33   Bit 34        ECC 33-64    ECC 33-64
Bit 65   Bit 66        ECC 65-96    ECC 65-96
Bit 97   Bit 98        ECC 97-128   ECC 97-128
Bit 129  Bit 130       ECC 129-160  ECC 129-160
15
Raid 3: Bit-Interleaved Parity
Disk 0   Disk 1   Disk 2   ...  Parity Disk
Bit 1    Bit 2    Bit 3         Parity 1-32
Bit 33   Bit 34   Bit 35        Parity 33-64
Bit 65   Bit 66   Bit 67        Parity 65-96
Bit 97   Bit 98   Bit 99        Parity 97-128
Bit 129  Bit 130  Bit 131       Parity 129-160
Parity = sum mod 2
16
RAID 3: Bit-Interleaved Parity (cont.)
• Parity computed across recovery group to protect against hard disk failures
– 33% capacity cost for parity in this configuration
Protection group (the parity word P = sum mod 2 of each bit position):
1 0 0 1 0 0 1 1
1 1 0 0 1 1 0 1
1 0 0 1 0 0 1 1
1 1 0 0 1 1 0 1
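Sum mod 2 is just XOR, which is what makes single-disk recovery work: XOR-ing the surviving words with the parity word rebuilds the lost one. A small sketch using the four data words shown above:

```python
# Bit-interleaved parity (RAID 3): parity = XOR of the data words in the
# protection group; any single lost word is rebuilt from the survivors + P.
from functools import reduce

group = [0b10010011, 0b11001101, 0b10010011, 0b11001101]  # 4 data words
parity = reduce(lambda a, b: a ^ b, group)                # sum mod 2

# Simulate losing disk 1 and reconstructing its word.
survivors = group[:1] + group[2:]
rebuilt = reduce(lambda a, b: a ^ b, survivors, parity)
assert rebuilt == group[1]
```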
17
Raid 4: Block-Interleaved Parity
• Small read does not need to access all disks
– Allow independent accesses to occur in parallel
• Do writes need to access all disks?
Disk 0    Disk 1    Disk 2    Disk 3    Parity Disk
Block 1   Block 2   Block 3   Block 4   Parity 1-4
Block 5   Block 6   Block 7   Block 8   Parity 5-8
Block 9   Block 10  Block 11  Block 12  Parity 9-12
Block 13  Block 14  Block 15  Block 16  Parity 13-16
Block 17  Block 18  Block 19  Block 20  Parity 17-20
18
Raid 4: Block-Interleaved Parity (cont.)
[Figure: updating block D0 to D0'. Naive way: read D1, D2, D3 and recompute P' = D0' + D1 + D2 + D3 (3 reads, 2 writes). Small-write shortcut: read only the old D0 and the old P, then P' = (D0 XOR D0') XOR P (2 reads, 2 writes).]
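The small-write shortcut can be sketched as follows; the one-byte "blocks" and their values are made up for illustration:

```python
# RAID 4 small-write: instead of reading D1..D3 to recompute parity,
# read only the old data block and the old parity, then
#   P' = (D0 XOR D0') XOR P
# This turns a 3-read update into a 2-read update.

D0, D1, D2, D3 = 0x11, 0x22, 0x33, 0x44
P = D0 ^ D1 ^ D2 ^ D3           # initial parity

D0_new = 0x55                   # the small write
P_new = (D0 ^ D0_new) ^ P       # 2 reads (old D0, old P), 2 writes (D0', P')

# Same result as recomputing parity from all the data blocks:
assert P_new == D0_new ^ D1 ^ D2 ^ D3
```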
19
RAID 4&5
• Block-interleaved parity (RAID 4): only one write per group can occur at a time
• Block-interleaved distributed parity (RAID 5):
– allows multiple writes to occur simultaneously as long as the stripe units are not located on the same disks
RAID 4 (dedicated parity disk):
D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
RAID 5 (parity rotated across the disks):
D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
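Assuming the rotation pattern shown above (the parity block shifts one disk to the left on each stripe), the RAID 5 placement can be regenerated programmatically; `parity_disk` and `stripe_row` are illustrative names, not a standard API:

```python
# RAID 5 layout sketch: for stripe s on n disks, parity sits on disk
# (n - 1 - s) mod n, and data blocks fill the remaining disks in order.

def parity_disk(stripe: int, n_disks: int = 5) -> int:
    return (n_disks - 1 - stripe) % n_disks

def stripe_row(stripe: int, n_disks: int = 5) -> list[str]:
    """Labels for one stripe, e.g. ['D4', 'D5', 'D6', 'P', 'D7']."""
    row, d = [], stripe * (n_disks - 1)  # first data block number in stripe
    for disk in range(n_disks):
        if disk == parity_disk(stripe, n_disks):
            row.append("P")
        else:
            row.append(f"D{d}")
            d += 1
    return row

for s in range(6):
    print(" ".join(f"{c:>3}" for c in stripe_row(s)))
```

Because parity rotates, writes to different stripes usually hit different parity disks and can proceed in parallel, which is exactly the advantage over RAID 4.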
20
Connection between Processors, Memory, and I/O Device: Buses
• Advantage
– Versatility:
» New devices can be added easily
» Peripherals can be moved between computer systems that use the same bus standard
– Low Cost:
» A single set of wires is shared in multiple ways
[Figure: a single shared bus connecting the processor, memory, and several I/O devices]
21
Buses: Disadvantage
• It creates a communication bottleneck
– The bandwidth of that bus can limit the maximum I/O throughput
• The maximum bus speed is largely limited by:
– The length of the bus
– The number of devices on the bus
– The need to support a range of devices with:
» Widely varying latencies
» Widely varying data transfer rates
22
• Control lines:
– Signal requests and acknowledgments
– Indicate what type of information is on the data lines
• Data lines carry information between the source and the destination:
– Data and Addresses
– Complex commands
[Figure: a bus is a set of data lines plus control lines]
The General Organization of a Bus
23
• A bus transaction includes two parts:
– Issuing the command (and address) – request
– Transferring the data – action
• Master is the one who starts the bus transaction by:
– issuing the command (and address)
• Slave is the one who responds to the address by:
– Sending data to the master if the master asks for data
– Receiving data from the master if the master wants to send data
[Figure: the bus master issues the command; data can go either way between master and slave]
Master versus Slave
24
• Processor-Memory Bus (design specific)
– Short and high speed
– Only need to match the memory system
» Maximize memory-to-processor bandwidth
– Connects directly to the processor
– Optimized for cache block transfers
• I/O Bus (industry standard)
– Usually longer and slower
– Need to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane Bus (standard or proprietary)
– Backplane: an interconnection structure within the chassis
– Allow processors, memory, and I/O devices to coexist
– Cost advantage: one bus for all components
Types of Buses
25
• A small number of backplane buses tap into the processor-memory bus
– Processor-memory bus is used for processor memory traffic
– I/O buses are connected to the backplane bus
[Figure: the processor and memory sit on the processor-memory bus; bus adaptors hang off it, connecting a backplane bus and two I/O buses]
A Three-Bus System
26
• Synchronous Bus:
– Includes a clock in the control lines
– A fixed protocol for communication that is relative to the clock
– Advantage: involves very little logic and can run very fast
– Disadvantages:
» Every device on the bus must run at the same clock rate
» To avoid clock skew, they cannot be long if they are fast
• Asynchronous Bus:
– It is not clocked
– It can accommodate a wide range of devices
– It can be lengthened without worrying about clock skew
– It requires a handshaking protocol
Synchronous and Asynchronous Bus
27
1. When memory sees the ReadReq line, it reads the address from the data bus and raises Ack to indicate that the request has been seen.
2. The I/O device sees the Ack line high and releases the ReadReq and data lines.
3. Memory sees that ReadReq is low and drops the Ack line to acknowledge the ReadReq signal.
4. This step starts when the memory has the data ready. It places the data from the read request on the data lines and raises DataRdy.
5. The I/O device sees DataRdy, reads the data from the bus, and signals that it has the data by raising Ack.
6. The memory sees the Ack signal, drops DataRdy, and releases the data lines.
7. Finally, the I/O device, seeing DataRdy go low, drops the Ack line, which indicates that the transmission is complete.
Asynchronous handshaking: Read Transaction
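The seven steps can be replayed as a toy event trace (a sketch, not a hardware model): each side only reacts to a level change on the other side's lines, which is why no shared clock is needed.

```python
# Toy replay of the asynchronous read handshake. `sig` models the bus
# signal levels; each step() applies one side's reaction.

sig = {"ReadReq": 0, "DataRdy": 0, "Ack": 0, "data": None}
steps_done = []

def step(n, **changes):
    sig.update(changes)
    steps_done.append(n)

sig.update(data="addr", ReadReq=1)  # device drives the address, raises ReadReq
step(1, Ack=1)                      # memory latches the address, raises Ack
step(2, ReadReq=0, data=None)       # device releases ReadReq and data lines
step(3, Ack=0)                      # memory sees ReadReq low, drops Ack
step(4, data="word", DataRdy=1)     # memory drives the data, raises DataRdy
step(5, Ack=1)                      # device reads the data, raises Ack
step(6, DataRdy=0, data=None)       # memory drops DataRdy, releases data lines
step(7, Ack=0)                      # device drops Ack: transaction complete

assert steps_done == [1, 2, 3, 4, 5, 6, 7]
assert sig == {"ReadReq": 0, "DataRdy": 0, "Ack": 0, "data": None}
```

Note that after step 7 every line is back at its idle level, so the next transaction can begin immediately.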
28
• One of the most important issues in bus design:
– How is the bus reserved by a device that wishes to use it?
• Chaos is avoided by a master-slave arrangement:
– Only the bus master can control access to the bus: it initiates and controls all bus requests
– A slave responds to read and write requests
• The simplest system:
– Processor is the only bus master
– All bus requests must be controlled by the processor
– Major drawback: the processor is involved in every transaction
Arbitration: Obtaining Access to the Bus
29
• Bus arbitration scheme:
– A bus master wanting to use the bus asserts the bus request
– A bus master cannot use the bus until its request is granted
– A bus master must signal the arbiter when it is finished using the bus
• Bus arbitration schemes usually try to balance two factors:
– Bus priority: the highest priority device should be serviced first
– Fairness: Even the lowest priority device should never be completely locked out from the bus
• Bus arbitration schemes
– Daisy-chain arbitration: the devices share the request line, and the grant signal is chained through the devices (see next slide)
– Centralized, parallel arbitration: see the slide after next
Multiple Potential Bus Masters: the Need for Arbitration
30
• Advantage: simple
• Disadvantages:
– Cannot assure fairness: A low-priority device may be locked out indefinitely
– The use of the daisy chain grant signal also limits the bus speed
[Figure: the bus arbiter's Grant line is daisy-chained from Device 1 (highest priority) through Device 2 to Device N (lowest priority); the Request and Release lines are shared by all devices]
The Daisy Chain Bus Arbitration Scheme
31
• Used in essentially all processor-memory buses and in high-speed I/O buses
[Figure: each of Device 1 ... Device N has its own Request and Grant lines to a central bus arbiter]
Centralized Parallel Arbitration
32
The Buses and Networks of Pentium 4
[Figure: the Pentium 4 processor connects over the system bus (800 MHz, 6.4 GB/sec) to the 82875P memory controller hub (north bridge), which also connects: main memory DIMMs over two DDR 400 channels (3.2 GB/sec each), AGP 8X (2.1 GB/sec), graphics output (266 MB/sec), 1 Gbit Ethernet over CSA (0.266 GB/sec), and the 82801EB I/O controller hub (south bridge). The south bridge provides: two Serial ATA disk ports (150 MB/sec each), two Parallel ATA ports for disk and CD/DVD/tape (100 MB/sec each), AC/97 surround-sound stereo (1 MB/sec), USB 2.0 (60 MB/sec), 10/100 Mbit Ethernet (20 MB/sec), and the PCI bus (132 MB/sec)]
33
• Separate versus multiplexed address and data lines:
– Address and data can be transmitted in one bus cycle if separate address and data lines are available
– Cost: (a) more bus lines, (b) increased complexity
• Data bus width:
– By increasing the width of the data bus, transfers of multiple words require fewer bus cycles
– Example: the SPARCstation 20's memory bus is 128 bits wide
– Cost: more bus lines
• Block transfers:
– Allow the bus to transfer multiple words in back-to-back bus cycles
– Only one address needs to be sent at the beginning
– The bus is not released until the last word is transferred
– Cost: (a) increased complexity, (b) increased response time for other pending requests
Increasing the Bus Bandwidth
34
Example
• Suppose we have a system with the following characteristics:
– A memory and bus system supporting block access of 4 to 16 32-bit words
– A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle required to send an address to memory
– Two clock cycles needed between each bus operation. (Assume the bus is idle before an access)
– A memory access time for the first four words of 200 ns; each additional set of four words can be read in 20 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
• Find
– (1) the sustained bandwidth and the latency
– (2) the effective number of bus transactions per second
– for a read of 256 words, for transactions that use 4-word and 16-word blocks, respectively.
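One way to work this example in code, following the standard cycle accounting for this classic exercise; the overlap assumption from the last bullet is spelled out in the comments:

```python
# Sustained bandwidth, latency, and transactions/sec for a 256-word read
# using 4-word vs. 16-word blocks on the bus described above.

CLOCK_NS = 5                 # 200 MHz -> 5 ns per cycle
WORDS = 256                  # total 32-bit words to read

def transaction_cycles(block_words: int) -> int:
    cycles = 1                        # send the address
    cycles += 200 // CLOCK_NS         # 40 cycles: read the first 4 words
    groups = block_words // 4
    # Each remaining 4-word group: its 20 ns (4-cycle) read overlaps the
    # previous group's 2-cycle transfer + 2 idle cycles -> 4 cycles each.
    cycles += (groups - 1) * 4
    cycles += 2 + 2                   # transfer the last group + 2 idle cycles
    return cycles

for block in (4, 16):
    n_tx = WORDS // block
    total_ns = n_tx * transaction_cycles(block) * CLOCK_NS
    bw_mb_s = WORDS * 4 / total_ns * 1000    # bytes/ns -> MB/sec
    tx_per_s = n_tx / total_ns * 1e9
    print(block, total_ns, round(bw_mb_s, 2), round(tx_per_s))
```

This gives 45 cycles per 4-word transaction (14,400 ns total, ~71.1 MB/sec, ~4.44M transactions/sec) versus 57 cycles per 16-word transaction (4,560 ns total, ~224.6 MB/sec, ~3.51M transactions/sec): larger blocks amortize the address and overlap the memory accesses.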
35
Communicating with I/O
• Memory-mapped I/O
– Portion of address space is assigned to I/O devices;
– Read and writes to those addresses causes data to be transferred
• Dedicated I/O instruction
– Specify both the device # and command word
[Figure: a portion of the address space (addresses 0..n) is mapped to I/O; the CPU reaches each peripheral through an interface that sits beside memory on the bus]
37
Polling
• Polling overhead:
1. Transferring to the polling routine
2. Accessing the device
3. Restarting the user program
[Flowchart: the CPU polls the I/O controller — is the data ready? If no, loop (busy wait); if yes, read the data, store it to memory, and check whether the transfer is done; if not, poll again]
A busy-wait loop is not an efficient way to use the CPU unless the device is very fast, but checks for I/O completion can be dispersed among computationally intensive code.
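A minimal polling sketch; `MockDevice` is a stand-in for a real status/data register pair, not an actual device interface:

```python
# Polling: the CPU busy-waits on a device "status register" until the
# ready bit is set, then reads the "data register".

class MockDevice:
    def __init__(self, words):
        self._words = list(words)
        self._ticks = 0

    @property
    def ready(self):          # status register: is data ready?
        self._ticks += 1
        # Pretend data becomes ready on every 3rd poll.
        return self._ticks % 3 == 0 and bool(self._words)

    def read(self):           # data register
        return self._words.pop(0)

def polled_read(dev, count):
    data = []
    while len(data) < count:
        while not dev.ready:  # busy-wait loop: these cycles are wasted
            pass
        data.append(dev.read())
    return data

print(polled_read(MockDevice([1, 2, 3]), 3))  # [1, 2, 3]
```

Every iteration of the inner `while` is a CPU cycle spent doing nothing useful, which is the overhead the slide is describing.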
38
Interrupt Driven Data Transfer
• I/O interrupt is asynchronous with respect to the instruction execution
• Prioritized I/O interrupt (IPL- Interrupt Priority Level)
– I/O has lower priority than internal exception
– high-speed devices are associated with higher priority
[Figure: the user program (add, sub, and, or, nop, ...) runs until (1) an I/O interrupt arrives; (2) the PC is saved; (3) control transfers to the interrupt service address; (4) the interrupt service routine (read, store, ..., rti) handles the device and returns]
39
Direct Memory Access
[Figure: the CPU, DMA controller (DMAC), memory, and I/O controller share the bus]
• CPU sets up the DMA transfer:
1. Identity of the device
2. Operation (read or write)
3. Memory address of the source or destination
4. Number of bytes to transfer
• DMA controller: (1) issues requests to the device; (2) interrupts the processor when the transfer completes
40
DMA and Memory System
• I/O and consistency of data between cache and memory
[Figure: two placements — (a) I/O traffic passes through the CPU's cache; (b) the DMA engine connects directly to main memory behind the I/O bridge]
• Through the cache:
– I/O always sees the latest data
– but it interferes with the CPU
• Direct to memory:
– does not interfere with the CPU
– but I/O might see stale data
• Keeping cache and memory consistent:
– Output: use a write-through cache
– Input:
» make I/O buffers non-cacheable
» SW: flush the cache
» HW: check I/O addresses on input
41
Input/Output Processors
[Figure: the CPU, memory, and an I/O processor (IOP) share the main memory bus; devices D1 ... Dn hang off the IOP's I/O bus.
(1) The CPU issues an instruction to the IOP — OP, Device, Address — naming the target device and where its commands are.
(2) The IOP looks in memory for commands — OP, Addr, Cnt, Other — what to do, where to put the data, how much, and any special requests.
(3) Device-to/from-memory transfers are controlled by the IOP directly.
(4) The IOP interrupts the CPU when done.]
42
Responsibility of OS in I/O
• The OS guarantees that a user’s program accesses only the portions of an I/O device to which the user has rights
• The OS supplies routines to handle low-level device operations
• The OS handles the I/O interrupts as other exceptions
• The OS provides equitable access to the shared I/O resources and schedules accesses
43
• Transaction processing:
– Examples: airline reservation systems and bank ATMs
– Small changes to large shared databases
– Response time & throughput
– I/O rate: number of disk accesses per second, given an upper limit for latency
– Benchmarks: TPC-C, TPC-R, TPC-W
• File system & Web I/O benchmarks:
– Synthetic benchmarks
» Simulate a typical file access pattern: 90% of accesses are to files less than 10 KB; 67% reads, 27% writes, 6% read-modify-writes; 90% of all file accesses are to data with sequential addresses on the disk
– SPECSFS - measures NFS (Network File System) performance
– SPECWeb - Web server benchmark
– I/O rate & latency
44
• Suppose we have a benchmark that executes in 100 seconds of elapsed time, where 90 seconds is CPU time and the rest is I/O time. If CPU time improves by 50% per year for the next five years but I/O time doesn’t improve, how much faster will our program run at the end of five years?
1. Elapsed time = CPU time + I/O time
100 = 90 + I/O time
I/O time = 10 seconds
2. The improvement in CPU performance over five years is
90 / 12 = 7.5
The improvement in elapsed time is only
100 / 22 = 4.5
After n years  CPU time               I/O time    Elapsed time  % I/O time
0              90 seconds             10 seconds  100 seconds   10%
1              90 / 1.5 = 60 seconds  10 seconds  70 seconds    14%
2              60 / 1.5 = 40 seconds  10 seconds  50 seconds    20%
3              40 / 1.5 = 27 seconds  10 seconds  37 seconds    27%
4              27 / 1.5 = 18 seconds  10 seconds  28 seconds    36%
5              18 / 1.5 = 12 seconds  10 seconds  22 seconds    45%
Impact of I/O on System Performance
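The table can be reproduced in a few lines (rounding to whole seconds each year, as the table does):

```python
# CPU time improves 1.5x per year while I/O time stays fixed, so the
# unimproved I/O becomes a growing fraction of elapsed time (Amdahl's law).

cpu, io = 90.0, 10
rows = []
for year in range(6):
    elapsed = cpu + io
    rows.append((year, round(cpu), io, round(elapsed), round(100 * io / elapsed)))
    cpu = round(cpu / 1.5)    # the table rounds to whole seconds each year

for r in rows:
    print(r)
# After 5 years the CPU is 90/12 = 7.5x faster, but elapsed time improves
# only 100/22 ~ 4.5x: the fixed I/O time limits the overall speedup.
```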
45
Designing an I/O System
• I/O system often needs to meet two constraints
– Latency constraint
– Bandwidth constraint
• Methodology
– Find the weakest link in the I/O system
» Processor, memory, bus, I/O controller or device
– Configure this component to sustain the required bandwidth
– Determine the requirements for the rest of the system and configure them to support the bandwidth
46
I/O System Design
• Consider the following computer system:
– A CPU that sustains 3 billion instructions per second and averages 100,000 instructions in the operating system per I/O operation
– A memory backplane bus capable of sustaining a transfer rate of 1000 MB/sec
– SCSI Ultra 320 controllers with a transfer rate of 320 MB/sec and accommodating up to 7 disks
– Disk drives with a read/write bandwidth of 75 MB/sec and an average seek plus rotational latency of 6 ms
• If the workload consists of 64 KB reads (where the block is sequential on a track) and the user program needs 200,000 instructions per I/O operation, find
1. The maximum sustainable I/O rate
2. The number of disks and SCSI controllers required. Assume that the reads can always be done on an idle disk if one exists (i.e., ignore disk conflicts).
[Figure: CPU — memory backplane bus — how many SCSI controllers and disks?]
47
Answer
• The two fixed components of the system are the memory bus and the CPU. Let’s first find the I/O rate that these two components can sustain and determine which of these is the bottleneck.
– Each I/O takes 200,000 user instructions and 100,000 OS instructions, so
– Each I/O transfers 64 KB, and bus bandwidth is 1000 MB/sec
Maximum I/O rate of CPU = Instruction execution rate / Instructions per I/O
= (3 × 10^9) / ((200 + 100) × 10^3) = 10,000 I/Os per second
Maximum I/O rate of bus = Bus bandwidth / Bytes per I/O
= (1000 × 10^6) / (64 × 10^3) = 15,625 I/Os per second
The CPU is the bottleneck!
48
Answer (cont.)
• The CPU is the bottleneck, so we can now configure the rest of the system to perform at the level dictated by the CPU, 10,000 I/Os per second.
• Let’s determine how many disks we need to be able to accommodate 10,000 I/Os per second.
• Can the SCSI bus sustain the I/O transfer rate?
• So how many SCSI controllers do we need?
Time per I/O at disk = Seek + rotational time + Transfer time
= 6 ms + 64 KB / (75 MB/sec) ≈ 6.9 ms
I/Os per second per disk = 1000 ms / 6.9 ms ≈ 146 I/Os per second
Therefore, to saturate the CPU's 10,000 I/Os per second we need 10,000 / 146 ≈ 69 disks
Transfer rate per disk = Transfer size / Transfer time = 64 KB / 6.9 ms ≈ 9.56 MB/sec
With 7 disks per bus: 9.56 × 7 ≈ 70 MB/sec << 320 MB/sec, so one SCSI bus can sustain its disks
Controllers needed: 69 / 7 ≈ 10 SCSI controllers
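Re-deriving both answers with the stated parameters (note: the slides mix KB = 1000 and KB = 1024; using 1024 bytes per KB throughout gives a slightly lower bus figure, ~15,259 instead of 15,625 I/Os per second, but the same conclusions):

```python
import math

# I/O system design: find the bottleneck, then size disks and controllers.

ips = 3e9                     # CPU: instructions per second
instr_per_io = (200 + 100) * 1e3   # user + OS instructions per I/O
bus_bw = 1000e6               # memory backplane bus, bytes/sec
io_bytes = 64 * 1024          # 64 KB per I/O
disk_bw = 75e6                # disk read/write bandwidth, bytes/sec
seek_rot = 6e-3               # average seek + rotational latency, seconds

cpu_rate = ips / instr_per_io           # 10,000 I/Os per second
bus_rate = bus_bw / io_bytes            # ~15,259 I/Os per second
target = min(cpu_rate, bus_rate)        # the CPU is the bottleneck

io_time = seek_rot + io_bytes / disk_bw  # ~6.9 ms per disk I/O
disk_rate = 1 / io_time                  # ~146 I/Os per second per disk
disks = math.ceil(target / disk_rate)    # 69 disks
controllers = math.ceil(disks / 7)       # 10 SCSI controllers (7 disks each)
print(disks, controllers)
```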
49
• I/O performance is limited by weakest link in chain between OS and device
• Disk I/O Benchmarks: I/O rate vs. Data rate vs. latency
• Bus
– Synchronous vs. asynchronous
– Bus arbitration
• I/O device notifying the operating system:
– Polling: it can waste a lot of processor time
– I/O interrupt: similar to an exception, except that it is asynchronous
• Delegating I/O responsibility from the CPU: DMA
Summary