Computer Organization AND Design
The Hardware/Software Interface
Xin Li (李新), Shandong University
Chapter 1: Computer Abstractions and Technology
Contents of Chapter 1
1.1 Introduction
1.2 Below Your Program
1.3 Under the Covers
1.4 Performance
1.5 The Power Wall
1.6 The Sea Change
1.7 Real Stuff: Manufacturing Chips
Computers have led to a third revolution for civilization
The following applications used to be "computer science fiction":
Automatic teller machines
Computers in automobiles
Laptop computers
The Human Genome Project
The World Wide Web
1.1 Introduction
Tomorrow's science fiction computer applications:
Cashless society: digital cash, predicted from 2004, failed
Automated intelligent highways: ITS, predicted from 2003, failed
Genuinely ubiquitous computing: embedded systems, predicted from 1999, still open
Will mobile phones go kilo-core? GPUs already have 1600 cores.
Cloud computing: ?
The influence of hardware on software:
In the past, memory size was very small, so programmers had to minimize memory use to make programs fast.
Nowadays, the hierarchical nature of memories and the parallel nature of processors mean programmers must understand computer organization more deeply.
The computer major:
Theory/Software: algorithms, principles of languages, ...
Hardware/System: organization, architecture, ...
Application: databases, Web, embedded systems, graphics, ...
SCI categories: HARDWARE & ARCHITECTURE; ARTIFICIAL INTELLIGENCE; CYBERNETICS; INFORMATION SYSTEMS; INTERDISCIPLINARY APPLICATIONS; SOFTWARE ENGINEERING; THEORY & METHODS
Hardware vs. software: which drives development?
Round 1: hardware wins (machine code/assembly)
Round 2: software wins (C/C++/Java)
Round 3: hardware wins (multicore/manycore)
Round 4: ?
Why do we need to learn hardware? CS vs. EE: what distinguishes a professionally trained computer scientist from other majors? Programming skill? Tools?
What is the threshold when students from non-computer majors work in IT?
Classes of computers and computer applications:
Personal computer: e.g. desktop, laptop
Server: high performance; e.g. mainframes, minicomputers, supercomputers, data centers; applications: WWW, search engines, weather forecasting
Embedded computer: a computer system with a dedicated function within a larger mechanical or electrical system; e.g. cell phones, microprocessors in cars/televisions
Embedded computers in a car
Growth of Sales of Embedded Computers
1.2 Below your programs
Hardware
Systems software
Applications software
A simplified view of hardware and software as hierarchical layers.
Systems software is aimed at programmers: e.g. operating systems, databases, compilers.
Applications software is aimed at users: e.g. Word, IE, QQ, WeChat.
Computer language:
Computers only understand electrical signals; the easiest signals to use are on and off. Binary numbers express machine instructions, e.g. 1000110010100000 means add two numbers. Machine code is very tedious to write.
Assembly language uses symbolic notation, e.g. add a, b, c  # a = b + c. The assembler translates it into machine instructions. Programmers still have to think like the machine.
Computer Language and Software System
The Instruction Set Architecture (ISA)
instruction set architecture
software
hardware
The interface description separating the software and hardware
High-level programming languages: notation closer to natural language. The compiler translates them into assembly language statements.
Advantages over assembly language: programmers can think in a more natural language; improved programming productivity; programs can be independent of hardware; subroutine libraries allow reusing programs.
Which is faster: assembly, C, C++, or Java? Generally, the lower the level, the faster.
Mouse:
The mechanical version: moving the mouse rolls the large ball inside; the ball makes contact with an x-wheel and a y-wheel; the rotation of the wheels determines the distance and direction the mouse moves.
The photoelectric (optical) version: better orientation and better precision.
1.3 Under the covers
The father of the mouse: Doug Engelbart (1925-2013)
The mouse, 1968
Display:
CRT (raster cathode ray tube) display: scans an image one line at a time, 30 to 75 times per second. Pixels and the bit map, from 512×340 up to 1560×1280. The more bits per pixel, the more colors can be displayed.
LCD (liquid crystal display): thin and low-power. The LCD pixel is not the source of light; rod-shaped molecules in a liquid form a twisting helix that bends light entering the display.
Hardware support for graphics: a raster refresh buffer (frame buffer) stores the bit map. The goal of the bit map is to faithfully represent what is on the screen.
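The frame buffer size follows directly from the resolution and the bits per pixel. A minimal Python sketch (the function names are my own; Python is used only for illustration):

```python
def frame_buffer_bytes(width, height, bits_per_pixel):
    # Total bits needed for the bit map, converted to bytes.
    return width * height * bits_per_pixel // 8

def displayable_colors(bits_per_pixel):
    # Each extra bit per pixel doubles the number of representable colors.
    return 2 ** bits_per_pixel
```

For the smallest resolution mentioned above, a 1-bit-deep 512×340 bit map needs 21,760 bytes; at 8 bits per pixel, 256 colors can be displayed.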
[Figure: a bit map in the frame buffer, with pixels at coordinates (X0, Y0) and (X1, Y1), driving the corresponding pixels on a raster scan CRT display]
The System Unit: what are the common components inside the system unit?
Processor
Memory modules
Expansion cards: sound card, modem card, video card, network interface card
Ports and connectors
Motherboard and the hardware on it:
Motherboard: thin, green, plastic, covered with dozens of small rectangles that contain integrated circuits (chips). Three pieces: the piece connecting to the I/O devices, the memory, and the processor.
Memory: the place to keep running programs and the data they need. Each memory board contains 8 integrated circuits. DRAM and cache.
Processor: adds numbers, tests numbers, signals I/O devices to activate, and so on. The CPU (central processor unit).
Program platform of the motherboard: UEFI, ASM/C/C++.
What is the motherboard?
Close-up of PC motherboard
Four ISA card slots
Four PCI card slots
Four SIMM slots
Two IDE connectors
Processor
Parallel/serial ports
Audio/MIDI
What is the difference between Slot 0 ... Slot 3? Speed: 0 > 1 > 2 > 3. Primary > slave.
CPU, CPU cooling fan
Memory modules, power supply, case
Floppy drive, hard disk
Graphics card, sound card
Optical drive
The five classic components of a computer
Abstractions: lower-level details are hidden from higher levels. The instruction set architecture is the interface between hardware and the lowest-level software; many implementations of varying cost and performance can run identical software.
A safe place for data: secondary memory. Main memory is volatile; secondary memory is nonvolatile.
Below the Program
C compiler
assembler
High-level language program (in C):
swap(int v[], int k) ...
Assembly language program (for MIPS):
swap: sll $2, $5, 2
add $2, $4, $2
lw $15, 0($2)
lw $16, 4($2)
sw $16, 0($2)
sw $15, 4($2)
jr $31
Machine (object) code (for MIPS):
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 0000000000001000
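The fields of an R-type machine word like the ones above can be pulled apart with a few shifts and masks. A Python sketch (the helper name decode_rtype is mine; the bit layout is the standard MIPS R-type format):

```python
def decode_rtype(word):
    # Field layout of a 32-bit MIPS R-type instruction:
    # opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6)  & 0x1F,
        "funct":  word & 0x3F,
    }

# First instruction of the swap example: sll $2, $5, 2
sll_word = int("00000000000001010001000010000000", 2)
fields = decode_rtype(sll_word)
```

Decoding yields rt = 5, rd = 2, and shamt = 2: shift register $5 left by 2 and put the result in register $2, exactly the sll $2, $5, 2 of the assembly listing.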
Input Device Inputs Object Code
Processor
Control
Datapath
Memory
Devices
Input
Output
Network
Object Code Stored in Memory
Processor
Control
Datapath
Memory (holding the object code)
Devices
Input
Output
Network
Processor Fetches an Instruction
Processor
Control
Datapath
Memory (holding the object code)
Processor fetches an instruction from memory
Devices
Input
Output
Network
Control Decodes the Instruction
Processor
Control
Datapath
Memory; current instruction: 000000 00100 00010 0001000000100000
Control decodes the instruction to determine what to execute
Devices
Input
Output
Network
Datapath Executes the Instruction
Processor
Control
Datapath
Memory
contents of Reg #4 ADD contents of Reg #2; result put in Reg #2
Datapath executes the instruction as directed by control
000000 00100 00010 0001000000100000
Devices
Input
Output
Network
Integrated Circuits
Year | Technology used in computers | Relative performance / unit cost
1951 | Vacuum tube | 1
1965 | Transistor | 35
1975 | Integrated circuit | 900
1995 | Very large-scale integrated circuit | 2,400,000
Relative performance / unit cost of technologies used in computers
1962: SSI (Small-Scale Integration), 12 transistors
1966: MSI (Medium-Scale Integration), 100 to 1k transistors
1967-1973: LSI (Large-Scale Integration), 1k to 100k transistors
1977: VLSI (Very Large-Scale Integration), 30 mm², 150k transistors
1993: ULSI (Ultra Large-Scale Integration), 16M flash and 256M DRAM integrating 10M transistors
1994: GSI (Giga-Scale Integration), 1G DRAM integrating 100M transistors
2007: a 2-TFLOPS 80-core CPU
Growth of capacity per DRAM chip over time
[Figure: Kbit capacity per DRAM chip versus year of introduction, 1976 to 1996, rising from 16K through 64K, 256K, 1M, 4M, and 16M to 64M]
History of Computer Development
1946 ENIAC (Electronic Numerical Integrator and Calculator)
The first electronic computers:
ENIAC (Electronic Numerical Integrator and Calculator): J. Presper Eckert and John Mauchly; publicly known in 1946; 30 tons, 80 feet long, 8.5 feet high, several feet wide; 18,000 vacuum tubes.
EDVAC (Electronic Discrete Variable Automatic Computer): John von Neumann's memo about the stored-program computer; hence the von Neumann computer.
EDSAC (Electronic Delay Storage Automatic Calculator): operational in 1949; the first full-scale, operational, stored-program computer in the world.
Also: John Atanasoff's small-scale electronic computer in the early 1940s; a special-purpose machine by Konrad Zuse in Germany; Colossus, built in 1943; the Harvard architecture; the Whirlwind project.
Commercial developments:
Eckert-Mauchly Computer Corporation: formed in 1947; $1 million for each of the 48 computers.
IBM computers: the first one, the IBM 701, shipped in 1952; IBM invested $5 billion in System/360 in 1964.
Digital Equipment Corporation (DEC): the first commercial minicomputer, the PDP-8, in 1965; a low-cost design, under $20,000.
CDC 6600: the first supercomputer, built in 1963.
Cray Research, Inc.: the Cray-1 in 1976; the fastest, the most expensive, and the best cost/performance for scientific programs.
Personal computers:
Apple II in 1977: low cost, high volume, high reliability.
IBM Personal Computer: announced in 1981; the best-selling computer of any kind; Intel microprocessors and Microsoft operating systems became popular.
Computer generations:
First generation: 1950-1959, vacuum tubes, commercial electronic computers
Second generation: 1960-1968, transistors, cheaper computers
Third generation: 1969-1977, integrated circuits, minicomputers
Fourth generation: 1978-1997, LSI and VLSI, PCs and workstations
Fifth generation: 1998-?, both miniaturization and massive scale
Self-study course: new departures in computer hardware
Multicore: from 2006; IBM/Sun/AMD/Intel; core counts of 2, 4, 8, 16, 48. Has software made adequate provision for this?
Embedded systems: embedding computers into electronic products; pervasive/ubiquitous computing.
I/O device innovation: Wii, PS3, Xbox 360.
Communication: 3G/4G/WiMAX.
Computer performance tools: CPU, memory, disk; Task Manager, services, system info.
Performance metrics:
Response time (wall-clock time, or elapsed time): the time between the start and the completion of an event.
Execution time: the time the CPU spends computing, not including time spent waiting.
Throughput: the total amount of work done in a given time.
1.4 Performance
"X is faster than Y": the execution time on Y is longer than that on X.
"X is n times faster than Y": the execution time on Y divided by the execution time on X equals n.
"The throughput of X is 1.3 times higher than Y": the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y.
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
Example
The performance ratio is 15/10=1.5
A is therefore 1.5 times faster than B
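The ratio in this example is a one-line computation. A Python sketch (the helper name is mine, used only for illustration):

```python
def n_times_faster(time_y, time_x):
    # "X is n times faster than Y" means execution_time_Y / execution_time_X = n.
    return time_y / time_x

# Computer B takes 15 s, computer A takes 10 s.
ratio = n_times_faster(15, 10)
```

ratio comes out to 1.5: A is 1.5 times faster than B.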
Measuring Performance
Wall-clock time (response time, or elapsed time): includes disk accesses, memory accesses, input/output activities, operating system overhead, everything.
CPU time = user CPU time + system CPU time.
User CPU time: the CPU time spent in the program itself.
System CPU time: the CPU time spent in the operating system on the program's behalf.
Clock cycle (also called a tick): the time for one clock period, e.g. 250 ps (picoseconds).
Clock rate: the number of clock cycles in one second, e.g. 4 GHz (gigahertz).
Are the two reciprocals of each other?
Clock
Clock cycle time and clock rate are inverses: clock rate = 1 / clock cycle time.
Example
One program runs in 10 seconds on computer A, which has a 2 GHz clock. Computer B requires 1.2 times as many clock cycles as computer A for this program. To run this program in 6 seconds, what clock rate should the computer B supply?
CPU clock cycles_A = 10 s × 2×10^9 Hz = 2×10^10 cycles
CPU clock cycles_B = 1.2 × 2×10^10 = 2.4×10^10 cycles
Clock rate_B = 2.4×10^10 cycles / 6 s = 4×10^9 Hz = 4 GHz
To run the program in 6 seconds, B must have twice the clock rate of A.
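The same arithmetic in a short Python sketch (variable names are mine):

```python
clock_rate_a = 2e9               # 2 GHz
cycles_a = 10 * clock_rate_a     # 10 s at 2 GHz = 2e10 cycles
cycles_b = 1.2 * cycles_a        # B needs 1.2x as many cycles
clock_rate_b = cycles_b / 6      # rate needed to finish in 6 s
```

clock_rate_b comes out to 4×10^9 Hz, i.e. 4 GHz, twice the clock rate of A.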
Clock cycles per instruction (CPI): the average number of clock cycles per instruction for a program. Different instructions may take different amounts of time depending on what they do. CPI provides one way of comparing two different implementations of the same instruction set architecture.
Instruction set architecture (ISA): the number of instructions executed for a program will be the same if the program runs on two different implementations of the same ISA.
Instruction Performance
CPU time = Instruction count × CPI × Clock cycle time
CPU time = Instruction count × CPI / Clock rate
CPU Performance Equation
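The two equivalent forms of the equation can be written as a small Python sketch (function names are my own):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = instruction count x CPI / clock rate
    return instruction_count * cpi / clock_rate_hz

def cpu_time_from_cycle(instruction_count, cpi, cycle_time_s):
    # Equivalent form: CPU time = instruction count x CPI x clock cycle time
    return instruction_count * cpi * cycle_time_s
```

For example, 10^9 instructions at CPI 2.0 on a 2 GHz clock (cycle time 0.5 ns) take 1 second by either form.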
Example: Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
Let I be the number of instructions executed for the program:
CPU time_A = I × 2.0 × 250 ps = 500 × I ps
CPU time_B = I × 1.2 × 500 ps = 600 × I ps
Computer A is 1.2 times as fast as computer B for this program.
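A Python sketch of the comparison (the instruction count I is a hypothetical value; only the ratio matters):

```python
I = 10**6                  # hypothetical instruction count
time_a = I * 2.0 * 250     # CPU time of A in picoseconds
time_b = I * 1.2 * 500     # CPU time of B in picoseconds
ratio = time_b / time_a    # how much slower B is than A
```

ratio is 1.2 regardless of the value chosen for I, since I cancels out.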
Choosing Programs to Evaluate Performance
Five levels of programs: real applications; modified (or scripted) applications; kernels; toy benchmarks; synthetic benchmarks.
Desktop benchmarks:
CPU-intensive benchmarks: SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
Graphics-intensive benchmarks (SPEC2000):
SPECviewperf: used for benchmarking systems supporting the OpenGL graphics library
SPECapc: consists of applications that make extensive use of graphics
Server benchmarks:
SPECrate: processing rate of a multiprocessor
SPECSFS: file server benchmark
SPECWeb: Web server benchmark
Transaction-processing (TP) benchmarks from the TPC (Transaction Processing Performance Council): TPC-A (1985), TPC-C (1992), TPC-H, TPC-R, TPC-W
Embedded benchmarks: the EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced "embassy").
Quantitative Principles
Make the common case fast: perhaps the most important and pervasive principle of computer design. A fundamental law, called Amdahl's Law, can be used to quantify this principle.
Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Fraction_enhanced: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
Speedup_enhanced: the improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program.
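The law can be stated as a one-line function. A Python sketch (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup is limited by the fraction of time the fast mode is used.
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)
```

Note that even an infinitely fast enhancement cannot beat 1 / (1 - Fraction_enhanced): if only 40% of the time benefits, the overall speedup can never exceed 1/0.6 ≈ 1.67.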
Example 1.2: Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
Answer: Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.5625.
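A Python sketch of the computation, plugging the stated 40% computation fraction and 10x speedup into Amdahl's Law:

```python
fraction_enhanced = 0.40   # CPU busy with computation 40% of the time
speedup_enhanced = 10      # new CPU is 10x faster on that fraction
overall = 1 / ((1 - fraction_enhanced)
               + fraction_enhanced / speedup_enhanced)
# overall = 1 / (0.6 + 0.04) = 1 / 0.64 = 1.5625
```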
Example 1.3: A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark.
One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is simply to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that it can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
Answer: Speedup_FPSQR = 1 / (0.8 + 0.2/10) ≈ 1.22; Speedup_FP = 1 / (0.5 + 0.5/1.6) ≈ 1.23. Improving all FP instructions is slightly better.
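A Python sketch of the two alternatives, plugging the stated fractions and speedups into Amdahl's Law:

```python
# Alternative 1: speed up only FPSQR (20% of time) by 10x.
speedup_fpsqr = 1 / ((1 - 0.2) + 0.2 / 10)
# Alternative 2: speed up all FP instructions (50% of time) by 1.6x.
speedup_fp = 1 / ((1 - 0.5) + 0.5 / 1.6)
```

speedup_fpsqr ≈ 1.22 and speedup_fp ≈ 1.23: the broader but smaller improvement wins slightly, because it applies to a larger fraction of the execution time.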
The CPU Performance Equation
CPU performance depends on three characteristics: clock cycle time (or clock rate), clock cycles per instruction, and instruction count. It is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:
Clock cycle time—Hardware technology and organization
CPI—Organization and instruction set architecture
Instruction count—Instruction set architecture and compiler technology
Example 1.4: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.
Answer
Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better.
This is the same speedup we obtained using Amdahl’s Law:
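The CPI arithmetic for Example 1.4 can be reproduced in a short Python sketch (variable names are mine; the "other" frequency of 73% is 100% minus the two stated FP frequencies):

```python
f_fp, cpi_fp       = 0.25, 4.0    # FP ops other than FPSQR
f_sqr, cpi_sqr     = 0.02, 20.0   # FPSQR
f_other, cpi_other = 0.73, 1.33   # everything else

# Original average CPI: frequency-weighted sum over all instruction classes.
cpi_original = f_fp * cpi_fp + f_sqr * cpi_sqr + f_other * cpi_other
# Alternative 1: drop the CPI of FPSQR from 20 to 2.
cpi_new_fpsqr = cpi_original - f_sqr * (cpi_sqr - 2.0)
# Alternative 2: drop the average CPI of all FP ops (27% of instructions) to 2.5.
cpi_new_fp = (f_fp + f_sqr) * 2.5 + f_other * cpi_other
```

cpi_original ≈ 2.37, cpi_new_fpsqr ≈ 2.01, and cpi_new_fp ≈ 1.65, so the overall-FP alternative has the lower CPI and hence the better performance, matching the conclusion above.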
Principle of locality: programs tend to reuse data and instructions they have used recently; a program spends 90% of its execution time in only 10% of the code.
Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
Power Consumption Trends
Power = dynamic power + leakage power
Dynamic power ∝ activity × capacitance × voltage² × frequency
Capacitance per transistor and voltage are decreasing, but the number of transistors and the frequency are increasing at a faster rate.
Leakage power is also rising and will soon match dynamic power. Power consumption is already around 100 W in some high-performance processors today.
The power wall:
Power = K × (Capacitive load) × (Voltage)² × (Frequency switched)
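A Python sketch of the dynamic-power relation, using normalized, unitless values (the function name is mine):

```python
def dynamic_power(capacitive_load, voltage, frequency_hz, k=1.0):
    # P = K x (capacitive load) x (voltage)^2 x (frequency switched)
    return k * capacitive_load * voltage ** 2 * frequency_hz

p_full = dynamic_power(1.0, 1.0, 1.0)   # baseline, normalized units
p_low  = dynamic_power(1.0, 0.5, 1.0)   # same chip at half the voltage
```

Because voltage enters squared, halving the supply voltage cuts dynamic power to a quarter, which is why voltage scaling was the main lever against the power wall.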
1.7 Real Stuff: Manufacturing AMD Chips
AMD Barcelona: 65 nm process, 463 million transistors; each core has a 128 KB L1 cache and a 512 KB L2 cache, and all four cores share a 2 MB L3 cache.
Wafers and Dies
The semiconductor silicon and the chip manufacturing process
Manufacturing Process
• Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches)
• Multiple metal layers on the silicon enable connections between transistors
• The wafer is chopped into many dies; the size of the die determines yield and cost
Processor Technology Trends
• Shrinking of transistor sizes: 250 nm (1997), 130 nm (2002), 70 nm (2008), 35 nm (2014)
• Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!
• Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances)
• Wire delays do not scale down at the same rate as transistor delays
Assignments: P56, 1.1, 1.3.1-1.3.3, 1.3.1-1.4.3
END