
1

Lecture 13

Storage, Network and Other Peripherals

2

I/O Systems

[Figure: a typical I/O system. The processor and its cache sit on a memory-I/O bus together with main memory. I/O controllers attach to the bus and connect the peripherals: disks, a graphics display, and a network. Devices signal the processor with interrupts for processor-I/O communication.]

3

I/O

• I/O system evaluation metrics
– Dependability
– Expandability
– Cost
– Performance
• I/O device characteristics
– Behavior
» Input, output, or storage
– Partner (human or machine)
– Data rate

Device             Behavior          Partner   Data Rate (KB/sec)
Keyboard           Input             Human     0.01
Mouse              Input             Human     0.02
Floppy disk        Storage           Machine   50.00
Laser printer      Output            Human     100.00
Optical disk       Storage           Machine   500.00
Magnetic disk      Storage           Machine   5,000.00
Network (LAN)      Input or output   Machine   20–1,000.00
Graphics display   Output            Human     30,000.00

4

I/O System Performance

• I/O system performance depends on many aspects of the system (“limited by the weakest link in the chain”):
– The CPU
– The memory system:
» Internal and external caches
» Main memory
– The underlying interconnection (buses)
– The I/O controller
– The I/O device
– The speed of the I/O software (operating system)
– The efficiency of the software’s use of the I/O devices
• Two common performance metrics:
– Throughput:
» How much data can we move through the system in a given time? (data rate)
» How many I/O operations can we do per unit of time? (I/O rate)
– Response time (latency)

5

Organization of a Hard Magnetic Disk

• Purpose:
– Long-term, nonvolatile storage
– Large, inexpensive, slow level in the storage hierarchy

[Figure: a stack of platters; each platter surface is divided into concentric tracks, and each track into sectors.]

6

Devices: Magnetic Disks

• Reading/writing data is a three-stage process:
– Seek time (~20 ms avg; 1M cycles at 50 MHz)
» Move the arm over the proper track
– Rotational latency
» Wait for the desired sector to rotate under the head
» Average = 0.5 rotation / 3600 RPM = 8.3 ms
– Transfer time
» About a sector per ms (1–10 MB/sec)
• Controller time
– The overhead the controller imposes in performing an I/O access

[Figure: disk geometry, showing sectors, tracks, cylinders, heads, and platters.]

Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Time

7

Disk Time Example

• Disk Parameters:

– 512-byte sector

– Advertised average seek time is 6 ms

– Transfer rate is 50 MB/sec

– Disk spins at 10,000 RPM.

– Controller overhead is 0.2ms

– Assume that the disk is idle so there is no waiting time

• What is the average time to read/write a sector?

– Average seek time + average rotational latency + transfer time + controller overhead

– 6 ms + 0.5 rotation / (10,000 RPM) + 0.5 KB / (50 MB/sec) + 0.2 ms = 6 + 3 + 0.01 + 0.2 ≈ 9.2 ms
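The same arithmetic as a tiny Python helper (the function and parameter names are ours, not from the slide):

```python
def disk_access_ms(seek_ms, rpm, sector_kb, rate_mb_s, ctrl_ms):
    """Average time to read or write one sector, in milliseconds."""
    rotation_ms = 0.5 * 60_000 / rpm      # half a revolution, on average
    transfer_ms = sector_kb / rate_mb_s   # KB / (MB/sec) comes out in ms
    return seek_ms + rotation_ms + transfer_ms + ctrl_ms

# Parameters from the example above
print(disk_access_ms(6, 10_000, 0.5, 50, 0.2))   # 9.21 ms
```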

8

Reliability, Availability and Dependability

• Dependability

– Computer System dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other systems interacting with the system’s users. Each module also has an ideal specified behavior, where a service is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior.

• Two system states
– Service accomplishment
– Service interruption

[Figure: the system alternates between the two states: failures move it from service accomplishment to service interruption, and restorations move it back.]

9

Reliability, Availability and Dependability (cont.)

• Reliability

– A measure of the continuous service accomplished from a reference initial instant.

– MTTF - Mean Time to Failure

• Availability

– A measure of the service accomplished with respect to the alternation between the two states of accomplishment & interruption

– MTTR - Mean Time To Repair
– MTBF - Mean Time Between Failures
» MTBF = MTTF + MTTR
– Availability = MTTF / (MTTF + MTTR)

[Figure: timeline. A failure occurs, the system is unavailable for MTTR while it is repaired, then available for MTTF until the next failure occurs.]
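A one-line sanity check in Python (the MTTF/MTTR numbers below are hypothetical, for illustration only):

```python
def availability(mttf, mttr):
    """Fraction of time spent in the service-accomplishment state."""
    return mttf / (mttf + mttr)

# Hypothetical disk: MTTF = 1,000,000 hours, MTTR = 24 hours
print(availability(1_000_000, 24))   # ~0.999976 (MTBF = 1,000,024 hours)
```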

10

How to Improve MTTF

• Fault avoidance

– Prevent fault occurrence by construction

• Fault tolerance

– Using redundancy so that the service can comply with the service specification despite failures

• Fault forecasting

– Predict the presence and creation of faults

11

Improving Availability with Redundancy

• RAID

– Redundant arrays of inexpensive disks

• Redundancy offers two advantages:
– Data are not lost: they can be reconstructed onto a new disk
– Continuous operation in the presence of failures

• A 3.5” disk array vs. a single large disk:
– Performance
– Reliability
– Availability (with redundancy)

12

RAID 0: Striping & Non-Redundancy

• Data blocks are striped round-robin across the disks; there is no redundancy.

Disk 0     Disk 1     Disk 2     Disk 3     Disk 4
Block 1    Block 2    Block 3    Block 4    Block 5
Block 6    Block 7    Block 8    Block 9    Block 10
Block 11   Block 12   Block 13   Block 14   Block 15
Block 16   Block 17   Block 18   Block 19   Block 20
Block 21   Block 22   Block 23   Block 24   Block 25
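The block-to-disk mapping is simple modular arithmetic; a quick sketch (the function name is ours):

```python
def raid0_location(block, num_disks):
    """Map a 1-based logical block number to (disk, stripe row)."""
    return (block - 1) % num_disks, (block - 1) // num_disks

print(raid0_location(13, 5))   # (2, 2): block 13 sits on Disk 2, as in the table
```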

13

RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its “shadow”
– Very high availability can be achieved
• Most expensive solution: 100% capacity overhead

Disk 0     Disk 1 (shadow)
Block 1    Block 1
Block 2    Block 2
Block 3    Block 3
Block 4    Block 4
Block 5    Block 5

14

RAID 2: Error Detecting and Correcting Code (Unused)

Disk 0    Disk 1    ...   ECC Disk 6    ECC Disk 7
Bit 1     Bit 2           ECC 1-32      ECC 1-32
Bit 33    Bit 34          ECC 33-64     ECC 33-64
Bit 65    Bit 66          ECC 65-96     ECC 65-96
Bit 97    Bit 98          ECC 97-128    ECC 97-128
Bit 129   Bit 130         ECC 129-160   ECC 129-160

15

RAID 3: Bit-Interleaved Parity

Disk 0    Disk 1    Disk 2    ...   Parity Disk
Bit 1     Bit 2     Bit 3           Parity 1-32
Bit 33    Bit 34    Bit 35          Parity 33-64
Bit 65    Bit 66    Bit 67          Parity 65-96
Bit 97    Bit 98    Bit 99          Parity 97-128
Bit 129   Bit 130   Bit 131         Parity 129-160

Parity = sum mod 2

16

RAID 3: Bit-Interleaved Parity (cont.)

• Parity computed across recovery group to protect against hard disk failures

– 33% capacity cost for parity in this configuration

[Figure: a protection group. Each row is one disk’s bits, e.g. 1 0 0 1 0 0 1 1 and 1 1 0 0 1 1 0 1; the parity disk P holds the bitwise sum mod 2 (XOR) of the data rows.]
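Parity here is just bitwise XOR, and a lost member is rebuilt the same way; a minimal sketch using the example rows above:

```python
from functools import reduce

def parity(members):
    """Bitwise sum mod 2 (XOR) across the protection group."""
    return reduce(lambda a, b: a ^ b, members)

data = [0b10010011, 0b11001101, 0b10010011, 0b11001101]  # rows from the figure
p = parity(data)

# If disk 2 fails, XORing the survivors with the parity recovers its contents:
assert parity([data[0], data[1], data[3], p]) == data[2]
```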

17

RAID 4: Block-Interleaved Parity

• A small read does not need to access all disks
– Allows independent accesses to occur in parallel
• Do writes need to access all disks?

Disk 0     Disk 1     Disk 2     Disk 3     Parity Disk
Block 1    Block 2    Block 3    Block 4    Parity 1-4
Block 5    Block 6    Block 7    Block 8    Parity 5-8
Block 9    Block 10   Block 11   Block 12   Parity 9-12
Block 13   Block 14   Block 15   Block 16   Parity 13-16
Block 17   Block 18   Block 19   Block 20   Parity 17-20

18

RAID 4: Block-Interleaved Parity (cont.)

• A small write of a new value D0’ over block D0 must also update the parity P. Two options:
– Read the other data blocks D1, D2, D3 and compute the new parity P’ = D0’ + D1 + D2 + D3 (sum mod 2): 3 reads and 2 writes
– Read only the old D0 and the old P and compute P’ = (D0 XOR D0’) XOR P: 2 reads and 2 writes, and the other disks stay free
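The read-modify-write shortcut in the second option is worth seeing in code; a minimal sketch (the names are ours):

```python
def raid4_small_write(old_data, new_data, old_parity):
    """P' = (D0 XOR D0') XOR P: 2 reads, 2 writes, other disks untouched."""
    return old_data ^ new_data ^ old_parity

# Agrees with recomputing the parity from all four data blocks:
d = [5, 7, 9, 11]
p = d[0] ^ d[1] ^ d[2] ^ d[3]
assert raid4_small_write(d[0], 6, p) == 6 ^ d[1] ^ d[2] ^ d[3]
```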

19

RAID 4&5

• Block-interleaved parity (RAID 4): only one write per protection group can occur at a time, since every write must update the single parity disk
• Block-interleaved distributed parity (RAID 5):
– Allows multiple writes to occur simultaneously, as long as the stripe units involved are not located on the same disks

RAID 4 (all parity on one disk)    RAID 5 (parity distributed)
D0  D1  D2  D3  P                  D0  D1  D2  D3  P
D4  D5  D6  D7  P                  D4  D5  D6  P   D7
D8  D9  D10 D11 P                  D8  D9  P   D10 D11
D12 D13 D14 D15 P                  D12 P   D13 D14 D15
D16 D17 D18 D19 P                  P   D16 D17 D18 D19
D20 D21 D22 D23 P                  D20 D21 D22 D23 P

20

Connection between Processors, Memory, and I/O Devices: Buses

• Advantages
– Versatility:
» New devices can be added easily
» Peripherals can be moved between computer systems that use the same bus standard
– Low cost:
» A single set of wires is shared in multiple ways

[Figure: processor, memory, and several I/O devices all attached to one shared bus.]

21

Buses: Disadvantage

• It creates a communication bottleneck

– The bandwidth of that bus can limit the maximum I/O throughput

• The maximum bus speed is largely limited by:

– The length of the bus

– The number of devices on the bus

– The need to support a range of devices with:

» Widely varying latencies

» Widely varying data transfer rates


22

The General Organization of a Bus

• Control lines:
– Signal requests and acknowledgments
– Indicate what type of information is on the data lines
• Data lines carry information between the source and the destination:
– Data and addresses
– Complex commands

[Figure: a bus as a bundle of data lines plus control lines.]

23

Master versus Slave

• A bus transaction includes two parts:
– Issuing the command (and address) – request
– Transferring the data – action
• The master is the one who starts the bus transaction by:
– Issuing the command (and address)
• The slave is the one who responds to the address by:
– Sending data to the master if the master asks for data
– Receiving data from the master if the master wants to send data

[Figure: the bus master issues the command to the bus slave; the data can go either way.]

24

Types of Buses

• Processor-memory bus (design specific)
– Short and high speed
– Only needs to match the memory system
» Maximize memory-to-processor bandwidth
– Connects directly to the processor
– Optimized for cache block transfers
• I/O bus (industry standard)
– Usually lengthy and slower
– Needs to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane bus (standard or proprietary)
– Backplane: an interconnection structure within the chassis
– Allows processors, memory, and I/O devices to coexist
– Cost advantage: one bus for all components

25

A Three-Bus System

• A small number of backplane buses tap into the processor-memory bus
– The processor-memory bus is used for processor-memory traffic
– The I/O buses are connected to the backplane bus

[Figure: processor and memory on the processor-memory bus; a bus adaptor connects it to a backplane bus, and further bus adaptors connect the I/O buses to the backplane bus.]

26

Synchronous and Asynchronous Buses

• Synchronous bus:
– Includes a clock in the control lines
– A fixed protocol for communication, defined relative to the clock
– Advantage: involves very little logic and can run very fast
– Disadvantages:
» Every device on the bus must run at the same clock rate
» To avoid clock skew, such buses cannot be long if they are fast
• Asynchronous bus:
– Not clocked
– Can accommodate a wide range of devices
– Can be lengthened without worrying about clock skew
– Requires a handshaking protocol

27

Asynchronous Handshaking: Read Transaction

1. When memory sees the ReadReq line, it reads the address from the data bus and raises Ack to indicate that the request has been seen.
2. The I/O device sees the Ack line high and releases the ReadReq and data lines.
3. Memory sees that ReadReq is low and drops the Ack line to acknowledge the ReadReq signal.
4. This step starts when the memory has the data ready. It places the data from the read request on the data lines and raises DataRdy.
5. The I/O device sees DataRdy, reads the data from the bus, and signals that it has the data by raising Ack.
6. The memory sees the Ack signal, drops DataRdy, and releases the data lines.
7. Finally, the I/O device, seeing DataRdy go low, drops the Ack line, which indicates that the transmission is complete.
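A toy simulation of this handshake, with threading.Event objects standing in for the ReadReq, Ack, and DataRdy wires; this sketches the protocol's ordering only, not how real bus hardware is built:

```python
import threading

ReadReq, Ack, DataRdy = threading.Event(), threading.Event(), threading.Event()
lines = {}  # stands in for the shared address/data lines

def io_device(addr):
    lines["addr"] = addr
    ReadReq.set()              # put the address on the lines, raise ReadReq
    Ack.wait()                 # (1) memory has latched the address
    ReadReq.clear()            # (2) release ReadReq and the lines
    DataRdy.wait()             # (4) memory has put the data on the lines
    data = lines["data"]
    Ack.set()                  # (5) signal that the data has been read
    while DataRdy.is_set():    # (7) wait for DataRdy to go low...
        pass
    Ack.clear()                # ...then drop Ack: transmission complete
    return data

def memory():
    ReadReq.wait()             # (1) see ReadReq, read the address, raise Ack
    addr = lines["addr"]
    Ack.set()
    while ReadReq.is_set():    # (3) see ReadReq low, drop Ack
        pass
    Ack.clear()
    lines["data"] = 100 + addr # (4) fake data "read" from memory
    DataRdy.set()
    Ack.wait()                 # (6) see Ack, drop DataRdy
    DataRdy.clear()

t = threading.Thread(target=memory)
t.start()
print(io_device(7))            # prints 107
t.join()
```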

28

Arbitration: Obtaining Access to the Bus

• One of the most important issues in bus design:
– How is the bus reserved by a device that wishes to use it?
• Chaos is avoided by a master-slave arrangement:
– Only the bus master can control access to the bus:
» It initiates and controls all bus requests
– A slave responds to read and write requests
• The simplest system:
– The processor is the only bus master
– All bus requests must be controlled by the processor
– Major drawback: the processor is involved in every transaction

[Figure: the master initiates requests to the slave over the control lines; the data can go either way.]

29

Multiple Potential Bus Masters: the Need for Arbitration

• Bus arbitration scheme:
– A bus master wanting to use the bus asserts a bus request
– A bus master cannot use the bus until its request is granted
– A bus master must signal the arbiter when it is finished using the bus
• Bus arbitration schemes usually try to balance two factors:
– Bus priority: the highest-priority device should be serviced first
– Fairness: even the lowest-priority device should never be completely locked out of the bus
• Bus arbitration schemes:
– Daisy-chain arbitration: a single arbiter, with all devices sharing the request line (see the next slide)
– Centralized, parallel arbitration: see the slide after next

30

The Daisy Chain Bus Arbitration Scheme

• Advantage: simple
• Disadvantages:
– Cannot assure fairness: a low-priority device may be locked out indefinitely
– The use of the daisy-chained grant signal also limits the bus speed

[Figure: the bus arbiter drives a Grant line that is daisy-chained from Device 1 (highest priority) through Device 2 to Device N (lowest priority); all devices share the Request and Release lines.]

31

Centralized Parallel Arbitration

• Used in essentially all processor-memory buses and in high-speed I/O buses

[Figure: each of Device 1 through Device N has its own Request and Grant lines to a central bus arbiter.]

32

The Buses and Networks of the Pentium 4

• Pentium 4 processor
– System bus (800 MHz, 6.4 GB/sec) to the memory controller hub
• Memory controller hub (north bridge, 82875P)
– Main memory DIMMs: two DDR 400 channels (3.2 GB/sec each)
– AGP 8X (2.1 GB/sec) to the graphics output (266 MB/sec)
– CSA (0.266 GB/sec) to 1 Gbit Ethernet
• I/O controller hub (south bridge, 82801EB)
– Serial ATA (150 MB/sec) to disks
– Parallel ATA (100 MB/sec) to CD/DVD
– Parallel ATA (100 MB/sec) to tape
– 10/100 Mbit Ethernet (20 MB/sec)
– AC/97 (1 MB/sec) to stereo (surround sound)
– USB 2.0 (60 MB/sec)
– PCI bus (132 MB/sec)

33

Increasing the Bus Bandwidth

• Separate versus multiplexed address and data lines (split-transaction protocol):
– Address and data can be transmitted in one bus cycle if separate address and data lines are available
– Cost: (a) more bus lines, (b) increased complexity
• Data bus width:
– Increasing the width of the data bus lets transfers of multiple words take fewer bus cycles
– Example: the SPARCstation 20’s memory bus is 128 bits wide
– Cost: more bus lines
• Block transfers:
– Allow the bus to transfer multiple words in back-to-back bus cycles
– Only one address needs to be sent, at the beginning
– The bus is not released until the last word is transferred
– Cost: (a) increased complexity, (b) increased response time for other requests while the bus is held

34

Example

• Suppose we have a system with the following characteristics:
– A memory and bus system supporting block access of 4 to 16 32-bit words
– A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle required to send an address to memory
– Two clock cycles needed between each bus operation (assume the bus is idle before an access)
– A memory access time for the first four words of 200 ns; each additional set of four words can be read in 20 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
• Find
– (1) the sustained bandwidth and the latency
– (2) the effective number of bus transactions per second
– for a read of 256 words, for transactions that use 4-word and 16-word blocks, respectively.
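One way to work the numbers in Python (a 200 MHz clock is 5 ns per cycle; the per-transaction cycle counts follow our reading of the overlap assumption above):

```python
CYCLE_NS = 5  # 200 MHz bus

def read_256_words(block_words):
    """Return (total cycles, latency in us, MB/sec, transactions/sec)."""
    if block_words == 4:
        # 1 address cycle + 200 ns access (40 cycles) + 2 transfer cycles
        # (4 words over a 64-bit bus) + 2 idle cycles between operations
        cycles_per_txn = 1 + 40 + 2 + 2
    else:
        # After the 200 ns access, each further 4-word group's 20 ns (4-cycle)
        # read overlaps the previous group's 2 transfer + 2 idle cycles, so
        # every group costs max(4, 2 + 2) = 4 cycles.
        cycles_per_txn = 1 + 40 + (block_words // 4) * 4
    txns = 256 // block_words
    latency_us = txns * cycles_per_txn * CYCLE_NS / 1000
    return (txns * cycles_per_txn, latency_us,
            256 * 4 / latency_us,        # bytes per microsecond = MB/sec
            txns / latency_us * 1e6)

print(read_256_words(4))    # (2880, 14.4 us, ~71.1 MB/sec, ~4.44M txns/sec)
print(read_256_words(16))   # (912, 4.56 us, ~224.6 MB/sec, ~3.51M txns/sec)
```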

35

Communicating with I/O

• Memory-mapped I/O
– A portion of the address space is assigned to I/O devices
– Reads and writes to those addresses cause data to be transferred
• Dedicated I/O instructions
– Specify both the device number and the command word

[Figure: the address space from 0 to n, with part mapped to memory and part to I/O; the CPU reaches each peripheral through its interface.]

36

Communicating with the Processor

• Polling

• Interrupt

• DMA

• Intelligent I/O Controller

37

Polling

• Polling overhead:

1. Transferring to the polling routine

2. Accessing the device

3. Restarting the user program

[Figure: the CPU sits in a loop asking the I/O controller “is the data ready?”; when it is, the CPU reads the data, stores it to memory, and repeats until done.]

• The busy-wait loop is not an efficient way to use the CPU unless the device is very fast!
• But checks for I/O completion can be dispersed among computationally intensive code.
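A minimal sketch of the busy-wait loop; the FakeDevice class and its methods are hypothetical stand-ins for reading a device's status and data registers:

```python
class FakeDevice:
    """Hypothetical device with a status 'register' and a data 'register'."""
    def __init__(self, delay):
        self.delay = delay        # polls until the next datum is ready
    def ready(self):              # real code would read a status register
        self.delay -= 1
        return self.delay <= 0
    def read_data(self):          # real code would read the data register
        self.delay = 1000         # device needs time before the next datum
        return 42

def poll_read(dev, n_items):
    buf = []
    for _ in range(n_items):
        while not dev.ready():    # busy wait: the CPU does no useful work here
            pass
        buf.append(dev.read_data())
    return buf

print(poll_read(FakeDevice(1000), 3))   # [42, 42, 42]
```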

38

Interrupt Driven Data Transfer

• An I/O interrupt is asynchronous with respect to instruction execution
• Prioritized I/O interrupts (IPL - interrupt priority level)
– I/O has lower priority than internal exceptions
– High-speed devices are associated with higher priority

[Figure: while the user program runs (add, sub, and, or, nop), (1) an I/O interrupt arrives, (2) the PC is saved, (3) control transfers to the interrupt service address, and (4) the interrupt service routine (read, store, …, rti) moves the data before returning to the user program.]

39

Direct Memory Access

[Figure: a DMA controller (DMAC) sits on the bus between memory and the I/O controller, moving data without the CPU.]

• The CPU sets up a DMA transfer by supplying:
1. The identity of the device
2. The operation (read or write)
3. The memory address of the source or destination
4. The number of bytes to transfer
• The DMA controller then (1) issues requests to the device and (2) interrupts the processor when the transfer completes
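A sketch of that setup as a descriptor handed to a toy DMA engine; the field names are illustrative, not from any real controller:

```python
from dataclasses import dataclass

@dataclass
class DmaDescriptor:
    device_id: int     # 1. identity of the device
    op: str            # 2. operation: "read" (device -> memory) or "write"
    mem_addr: int      # 3. memory address of the source or destination
    byte_count: int    # 4. number of bytes to transfer

def dma_transfer(desc, memory, device, on_complete):
    """Toy DMA engine: moves the bytes, then 'interrupts' via a callback."""
    for i in range(desc.byte_count):
        if desc.op == "read":                        # device -> memory
            memory[desc.mem_addr + i] = device.pop(0)
        else:                                        # memory -> device
            device.append(memory[desc.mem_addr + i])
    on_complete(desc)                                # completion "interrupt"

mem = bytearray(16)
dev = [0x11, 0x22, 0x33]
dma_transfer(DmaDescriptor(device_id=3, op="read", mem_addr=4, byte_count=3),
             mem, dev, lambda d: print("IRQ:", d.byte_count, "bytes done"))
print(mem.hex())   # the three bytes landed at offset 4, without CPU copying
```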

40

DMA and Memory System

• I/O and consistency of data between the cache and memory

[Figure: two organizations: I/O routed through the CPU's cache, versus a DMA engine transferring between the I/O bridge and main memory directly.]

• If I/O goes through the cache:
– I/O always sees the latest data
– But it interferes with the CPU
• If DMA goes directly to main memory:
– It does not interfere with the CPU
– But it might see stale data
• Keeping the two consistent:
– Output: use a write-through cache
– Input:
» Make the input buffer non-cacheable
» SW: flush the cache
» HW: check I/O addresses on input

41

Input/Output Processors

[Figure: the CPU, the IOP, and memory share the main memory bus; devices D1 … Dn hang off the IOP's I/O bus. Device to/from memory transfers are controlled by the IOP directly.]

1. The CPU issues an instruction to the IOP: OP, Device, Address (the target device, and where the commands are)
2. The IOP looks in memory for its commands: OP, Addr, Cnt, Other (what to do, where to put the data, how much, and any special requests)
3. The IOP transfers the data between the device and memory directly
4. The IOP interrupts the CPU when done

42

Responsibility of OS in I/O

• The OS guarantees that a user’s program accesses only the portions of an I/O device to which the user has rights

• The OS supplies routines to handle low-level device operations

• The OS handles I/O interrupts, as it does other exceptions

• The OS provides equitable access to the shared I/O resources and schedules accesses

43

I/O Benchmarks

• Transaction processing:
– Examples: airline reservation systems and bank ATMs
– Small changes to large shared databases
– Response time & throughput
– I/O rate: number of disk accesses per second, given an upper limit for latency
– Benchmarks: TPC-C, TPC-R, TPC-W
• File system & Web I/O benchmarks
– Synthetic benchmarks
» Simulate typical file access patterns: 90% of accesses are to files less than 10 KB; 67% reads, 27% writes, 6% read-modify-write; 90% of all file accesses are to data with sequential addresses on the disk
– SPECSFS – measures NFS (Network File System) performance
– SPECWeb – Web server benchmark
– Metrics: I/O rate & latency

44

Impact of I/O on System Performance

• Suppose we have a benchmark that executes in 100 seconds of elapsed time, where 90 seconds is CPU time and the rest is I/O time. If CPU time improves by 50% per year for the next five years but I/O time doesn’t improve, how much faster will our program run at the end of five years?

1. Elapsed time = CPU time + I/O time
   100 = 90 + I/O time
   I/O time = 10 seconds

2. The improvement in CPU performance over five years is
   90 / 12 = 7.5
   but the improvement in elapsed time is only
   100 / 22 = 4.5

After n years   CPU time                CPU I/O time   Elapsed time   % I/O time
0               90 seconds              10 seconds     100 seconds    10%
1               90 / 1.5 = 60 seconds   10 seconds     70 seconds     14%
2               60 / 1.5 = 40 seconds   10 seconds     50 seconds     20%
3               40 / 1.5 = 27 seconds   10 seconds     37 seconds     27%
4               27 / 1.5 = 18 seconds   10 seconds     28 seconds     36%
5               18 / 1.5 = 12 seconds   10 seconds     22 seconds     45%
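The table is easy to regenerate with a small Python loop (same numbers as above; the unimproved I/O time comes to dominate, an Amdahl's-law style observation):

```python
cpu, io = 90.0, 10.0
for year in range(6):
    elapsed = cpu + io
    print(f"year {year}: CPU {cpu:5.1f} s, I/O {io:.0f} s, "
          f"elapsed {elapsed:5.1f} s, I/O share {io / elapsed:4.0%}")
    cpu /= 1.5   # CPU gets 50% faster each year; I/O does not improve
```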

45

Designing an I/O System

• An I/O system often needs to meet two constraints:
– A latency constraint
– A bandwidth constraint
• Methodology:
– Find the weakest link in the I/O system
» Processor, memory, bus, I/O controller, or device
– Configure this component to sustain the required bandwidth
– Determine the requirements for the rest of the system and configure them to support this bandwidth

46

I/O System Design

• Consider the following computer system:

– A CPU that sustains 3 billion instructions per second and averages 100,000 instructions in the operating system per I/O operation

– A memory backplane bus capable of sustaining a transfer rate of 1000 MB/sec

– SCSI Ultra 320 controllers with a transfer rate of 320 MB/sec and accommodating up to 7 disks

– Disk drives with a read/write bandwidth of 75 MB/sec and an average seek plus rotational latency of 6 ms

• If the workload consists of 64 KB reads (where the block is sequential on a track) and the user program needs 200,000 instructions per I/O operation, find

1. The maximum sustainable I/O rate

2. The number of disks and SCSI controllers required. Assume that the reads can always be done on an idle disk if one exists (i.e., ignore disk conflicts).

[Figure: the configuration to determine: the CPU and memory bus on one side, and an unknown number (???) of SCSI controllers and disks on the other.]

47

Answer

• The two fixed components of the system are the memory bus and the CPU. Let’s first find the I/O rate that these two components can sustain and determine which of these is the bottleneck.

– Each I/O takes 200,000 user instructions and 100,000 OS instructions, so:

  Maximum I/O rate of CPU = Instruction execution rate / Instructions per I/O
                          = (3 x 10^9) / ((200 + 100) x 10^3) = 10,000 I/Os per second

– Each I/O transfers 64 KB, and the bus bandwidth is 1000 MB/sec, so:

  Maximum I/O rate of bus = Bus bandwidth / Bytes per I/O
                          = (1000 x 10^6) / (64 x 10^3) = 15,625 I/Os per second

The CPU is the bottleneck!

48

Answer (cont.)

• The CPU is the bottleneck, so we can now configure the rest of the system to perform at the level dictated by the CPU, 10,000 I/Os per second.

• Let’s determine how many disks we need to accommodate 10,000 I/Os per second:

  Time per I/O at disk = seek + rotational time + transfer time
                       = 6 ms + 64 KB / (75 MB/sec) ≈ 6.9 ms
  I/Os per second per disk = 1000 ms / 6.9 ms ≈ 146 I/Os per second

  Therefore, to saturate the CPU’s 10,000 I/Os per second we need 10,000 / 146 ≈ 69 disks.

• Can the SCSI bus sustain the I/O transfer rate?

  Transfer rate = transfer size / transfer time = 64 KB / 6.9 ms ≈ 9.56 MB/sec
  9.56 x 7 ≈ 70 MB/sec << 320 MB/sec, so a fully populated controller is well within budget.

• So how many SCSI controllers do we need?

  69 disks / 7 disks per controller ≈ 10 SCSI controllers
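The whole design calculation in a few lines of Python (treating KB as 1000 bytes throughout, as the 15,625 figure above does; the conclusions match the slides):

```python
import math

ips = 3e9                       # instructions per second
instr_per_io = 200e3 + 100e3    # user + OS instructions per I/O
bus_bw, io_bytes = 1000e6, 64e3
seek_rot_ms, disk_bw = 6.0, 75e6

cpu_rate = ips / instr_per_io                     # 10,000 I/Os per second
bus_rate = bus_bw / io_bytes                      # 15,625 I/Os per second
target = min(cpu_rate, bus_rate)                  # the CPU is the bottleneck

io_ms = seek_rot_ms + io_bytes / disk_bw * 1e3    # ~6.9 ms per disk I/O
per_disk = 1000 / io_ms                           # ~146 I/Os per second
disks = math.ceil(target / per_disk)              # 69 disks
controllers = math.ceil(disks / 7)                # 10 SCSI controllers
print(disks, controllers)
```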

49

Summary

• I/O performance is limited by the weakest link in the chain between the OS and the device
• Disk I/O benchmarks: I/O rate vs. data rate vs. latency
• Buses
– Synchronous vs. asynchronous
– Bus arbitration
• I/O devices notifying the operating system:
– Polling: can waste a lot of processor time
– I/O interrupt: similar to an exception, except that it is asynchronous
• Delegating I/O responsibility from the CPU: DMA