Computer Organization AND Design
The Hardware/Software Interface
Xin Li (李新), Shandong University
Chapter 1: Computer Abstractions and Technology
Contents of Chapter 1
1.1 Introduction
1.2 Below Your Program
1.3 Under the Covers
1.4 Performance
1.5 The Power Wall
1.6 The Sea Change
1.7 Real Stuff: Manufacturing Chips
Computers have led to a third revolution for civilization
The following applications used to be "computer science fiction":
Automatic teller machines
Computers in automobiles
Laptop computers
The Human Genome Project
The World Wide Web
1.1 Introduction
Tomorrow's science fiction computer applications:
Cashless society: digital cash, predicted from 2004, failed
Automated intelligent highways: ITS, predicted from 2003, failed
Genuinely ubiquitous computing: embedded systems, predicted from 1999, still open
Will mobile phones go kilo-core? GPUs already have 1600 cores.
Cloud computing: ?
The influence of hardware on software:
In the past, memory size was very small, so programmers had to minimize memory use to make programs fast.
Nowadays, the hierarchical nature of memories and the parallel nature of processors mean programmers must understand computer organization more deeply.
The computer major:
Theory/Software: algorithms, principles of languages, ...
Hardware/System: organization, architecture, ...
Application: databases, Web, embedded systems, graphics, ...
SCI categories: HARDWARE & ARCHITECTURE; ARTIFICIAL INTELLIGENCE; CYBERNETICS; INFORMATION SYSTEMS; INTERDISCIPLINARY APPLICATIONS; SOFTWARE ENGINEERING; THEORY & METHODS
Hardware vs. software: which drives development?
Round 1: hardware wins (machine code/assembly)
Round 2: software wins (C/C++/Java)
Round 3: hardware wins (multicore/manycore)
Round 4: ?
Why do we need to learn hardware? CS vs. EE: what distinguishes a professionally trained computer scientist from other majors? Programming skill? Tools?
What is the threshold when students from non-computer majors work in IT?
Classes of computers and computer applications:
Personal computer: e.g. desktop, laptop
Server: high performance; e.g. mainframes, minicomputers, supercomputers, data centers; applications: WWW, search engines, weather forecasting
Embedded computer: a computer system with a dedicated function within a larger mechanical or electrical system; e.g. cell phones, microprocessors in cars/televisions
Embedded computers in a car
Growth of Sales of Embedded Computers
1.2 Below your programs
Hardware
Systems software
Applications software
A simplified view of hardware and software as hierarchical layers.
Systems software is aimed at programmers: e.g. operating systems, databases, compilers.
Applications software is aimed at users: e.g. Word, IE, QQ, WeChat.
Computer language:
Computers only understand electrical signals; the easiest signals to use are on and off. Binary numbers express machine instructions, e.g. 1000110010100000 means add two numbers. Machine code is very tedious to write.
Assembly language uses symbolic notation, e.g. add a, b, c  # a = b + c. The assembler translates it into machine instructions. Programmers still have to think like the machine.
Computer Language and Software System
The Instruction Set Architecture (ISA)
instruction set architecture
software
hardware
The interface description separating the software and hardware
High-level programming languages: notation closer to natural language. The compiler translates them into assembly language statements.
Advantages over assembly language: programmers can think in a more natural language; improved programming productivity; programs can be independent of hardware; subroutine libraries allow reusing programs.
Which is faster: assembly, C, C++, or Java? Generally, the lower the level, the faster.
Mouse:
The mechanical version: moving the mouse rolls the large ball inside; the ball makes contact with an x-wheel and a y-wheel; the rotation of the wheels determines the distance and direction the mouse moves.
The photoelectric (optical) version: better orientation and better precision.
1.3 Under the covers
The father of the mouse: Doug Engelbart (1925-2013)
The mouse, 1968
Display:
CRT (raster cathode ray tube) display: scans an image one line at a time, 30 to 75 times per second. Pixels and the bit map, from 512×340 up to 1560×1280. The more bits per pixel, the more colors can be displayed.
LCD (liquid crystal display): thin and low-power. The LCD pixel is not the source of light; rod-shaped molecules in a liquid form a twisting helix that bends light entering the display.
Hardware support for graphics: a raster refresh buffer (frame buffer) stores the bit map. The goal of the bit map is to faithfully represent what is on the screen.
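The frame buffer size follows directly from the resolution and the bits per pixel. A minimal Python sketch (the function names are my own; Python is used only for illustration):

```python
def frame_buffer_bytes(width, height, bits_per_pixel):
    # Total bits needed for the bit map, converted to bytes.
    return width * height * bits_per_pixel // 8

def displayable_colors(bits_per_pixel):
    # Each extra bit per pixel doubles the number of representable colors.
    return 2 ** bits_per_pixel
```

For the smallest resolution mentioned above, a 1-bit-deep 512×340 bit map needs 21,760 bytes; at 8 bits per pixel, 256 colors can be displayed.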
[Figure: a bit map in the frame buffer, with pixels at coordinates (X0, Y0) and (X1, Y1), driving the corresponding pixels on a raster scan CRT display]
The System Unit: what are the common components inside the system unit?
Processor
Memory modules
Expansion cards: sound card, modem card, video card, network interface card
Ports and connectors
Motherboard and the hardware on it:
Motherboard: thin, green, plastic, covered with dozens of small rectangles that contain integrated circuits (chips). Three pieces: the piece connecting to the I/O devices, the memory, and the processor.
Memory: the place to keep running programs and the data they need. Each memory board contains 8 integrated circuits. DRAM and cache.
Processor: adds numbers, tests numbers, signals I/O devices to activate, and so on. The CPU (central processor unit).
Program platform of the motherboard: UEFI, ASM/C/C++.
What is the motherboard?
Close-up of PC motherboard
Four ISA card slots
Four PCI card slots
Four SIMM slots
Two IDE connectors
Processor
Parallel/serial ports
Audio/MIDI
What is the difference between Slot 0 ... Slot 3? Speed: 0 > 1 > 2 > 3. Primary > slave.
CPU, CPU cooling fan
Memory modules, power supply, case
Floppy drive, hard disk
Graphics card, sound card
Optical drive
The five classic components of a computer
Abstractions: lower-level details are hidden from higher levels. The instruction set architecture is the interface between hardware and the lowest-level software; many implementations of varying cost and performance can run identical software.
A safe place for data: secondary memory. Main memory is volatile; secondary memory is nonvolatile.
Below the Program
C compiler
assembler
High-level language program (in C):
swap(int v[], int k) ...
Assembly language program (for MIPS):
swap: sll $2, $5, 2
add $2, $4, $2
lw $15, 0($2)
lw $16, 4($2)
sw $16, 0($2)
sw $15, 4($2)
jr $31
Machine (object) code (for MIPS):
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
100011 00010 01111 0000000000000000
100011 00010 10000 0000000000000100
101011 00010 10000 0000000000000000
101011 00010 01111 0000000000000100
000000 11111 00000 0000000000001000
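The fields of an R-type machine word like the ones above can be pulled apart with a few shifts and masks. A Python sketch (the helper name decode_rtype is mine; the bit layout is the standard MIPS R-type format):

```python
def decode_rtype(word):
    # Field layout of a 32-bit MIPS R-type instruction:
    # opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6)  & 0x1F,
        "funct":  word & 0x3F,
    }

# First instruction of the swap example: sll $2, $5, 2
sll_word = int("00000000000001010001000010000000", 2)
fields = decode_rtype(sll_word)
```

Decoding yields rt = 5, rd = 2, and shamt = 2: shift register $5 left by 2 and put the result in register $2, exactly the sll $2, $5, 2 of the assembly listing.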
Input Device Inputs Object Code
Processor
Control
Datapath
Memory
Devices
Input
Output
Network
Object Code Stored in Memory
Processor
Control
Datapath
Memory (holding the object code)
Devices
Input
Output
Network
Processor Fetches an Instruction
Processor
Control
Datapath
Memory (holding the object code)
Processor fetches an instruction from memory
Devices
Input
Output
Network
Control Decodes the Instruction
Processor
Control
Datapath
Memory; current instruction: 000000 00100 00010 0001000000100000
Control decodes the instruction to determine what to execute
Devices
Input
Output
Network
Datapath Executes the Instruction
Processor
Control
Datapath
Memory
contents of Reg #4 ADD contents of Reg #2; result put in Reg #2
Datapath executes the instruction as directed by control
000000 00100 00010 0001000000100000
Devices
Input
Output
Network
Integrated Circuits
Year | Technology used in computers | Relative performance / unit cost
1951 | Vacuum tube | 1
1965 | Transistor | 35
1975 | Integrated circuit | 900
1995 | Very large-scale integrated circuit | 2,400,000
Relative performance / unit cost of technologies used in computers
1962: SSI (Small-Scale Integration), 12 transistors
1966: MSI (Medium-Scale Integration), 100 to 1k transistors
1967-1973: LSI (Large-Scale Integration), 1k to 100k transistors
1977: VLSI (Very Large-Scale Integration), 30 mm², 150k transistors
1993: ULSI (Ultra Large-Scale Integration), 16M flash and 256M DRAM integrating 10M transistors
1994: GSI (Giga-Scale Integration), 1G DRAM integrating 100M transistors
2007: a 2-TFLOPS 80-core CPU
Growth of capacity per DRAM chip over time
[Figure: Kbit capacity per DRAM chip versus year of introduction, 1976 to 1996, rising from 16K through 64K, 256K, 1M, 4M, and 16M to 64M]
History of Computer Development
1946 ENIAC (Electronic Numerical Integrator and Calculator)
The first electronic computers:
ENIAC (Electronic Numerical Integrator and Calculator): J. Presper Eckert and John Mauchly; publicly known in 1946; 30 tons, 80 feet long, 8.5 feet high, several feet wide; 18,000 vacuum tubes.
EDVAC (Electronic Discrete Variable Automatic Computer): John von Neumann's memo about the stored-program computer; hence the von Neumann computer.
EDSAC (Electronic Delay Storage Automatic Calculator): operational in 1949; the first full-scale, operational, stored-program computer in the world.
Also: John Atanasoff's small-scale electronic computer in the early 1940s; a special-purpose machine by Konrad Zuse in Germany; Colossus, built in 1943; the Harvard architecture; the Whirlwind project.
Commercial developments:
Eckert-Mauchly Computer Corporation: formed in 1947; $1 million for each of the 48 computers.
IBM computers: the first one, the IBM 701, shipped in 1952; IBM invested $5 billion in System/360 in 1964.
Digital Equipment Corporation (DEC): the first commercial minicomputer, the PDP-8, in 1965; a low-cost design, under $20,000.
CDC 6600: the first supercomputer, built in 1963.
Cray Research, Inc.: the Cray-1 in 1976; the fastest, the most expensive, and the best cost/performance for scientific programs.
Personal computers:
Apple II in 1977: low cost, high volume, high reliability.
IBM Personal Computer: announced in 1981; the best-selling computer of any kind; Intel microprocessors and Microsoft operating systems became popular.
Computer generations:
First generation: 1950-1959, vacuum tubes, commercial electronic computers
Second generation: 1960-1968, transistors, cheaper computers
Third generation: 1969-1977, integrated circuits, minicomputers
Fourth generation: 1978-1997, LSI and VLSI, PCs and workstations
Fifth generation: 1998-?, both miniaturization and massive scale
Self-study course: new departures in computer hardware
Multicore: from 2006; IBM/Sun/AMD/Intel; core counts of 2, 4, 8, 16, 48. Has software made adequate provision for this?
Embedded systems: embedding computers into electronic products; pervasive/ubiquitous computing.
I/O device innovation: Wii, PS3, Xbox 360.
Communication: 3G/4G/WiMAX.
Computer performance tools: CPU, memory, disk; Task Manager, services, system info.
Performance metrics:
Response time (wall-clock time, or elapsed time): the time between the start and the completion of an event.
Execution time: the time the CPU spends computing, not including time spent waiting.
Throughput: the total amount of work done in a given time.
1.4 Performance
"X is faster than Y": the execution time on Y is longer than that on X.
"X is n times faster than Y": the execution time on Y divided by the execution time on X equals n.
"The throughput of X is 1.3 times higher than Y": the number of tasks completed per unit time on machine X is 1.3 times the number completed on Y.
If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B?
Example
The performance ratio is 15/10=1.5
A is therefore 1.5 times faster than B
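The ratio in this example is a one-line computation. A Python sketch (the helper name is mine, used only for illustration):

```python
def n_times_faster(time_y, time_x):
    # "X is n times faster than Y" means execution_time_Y / execution_time_X = n.
    return time_y / time_x

# Computer B takes 15 s, computer A takes 10 s.
ratio = n_times_faster(15, 10)
```

ratio comes out to 1.5: A is 1.5 times faster than B.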
Measuring Performance
Wall-clock time (response time, or elapsed time): includes disk accesses, memory accesses, input/output activities, operating system overhead, everything.
CPU time = user CPU time + system CPU time.
User CPU time: the CPU time spent in the program itself.
System CPU time: the CPU time spent in the operating system on the program's behalf.
Clock cycle (also called a tick): the time for one clock period, e.g. 250 ps (picoseconds).
Clock rate: the number of clock cycles in one second, e.g. 4 GHz (gigahertz).
Are the two reciprocals of each other?
Clock
Clock cycle time and clock rate are inverses: clock rate = 1 / clock cycle time.
Example
One program runs in 10 seconds on computer A, which has a 2 GHz clock. Computer B requires 1.2 times as many clock cycles as computer A for this program. To run this program in 6 seconds, what clock rate should the computer B supply?
CPU clock cycles_A = 10 s × 2×10^9 Hz = 2×10^10 cycles
CPU clock cycles_B = 1.2 × 2×10^10 = 2.4×10^10 cycles
Clock rate_B = 2.4×10^10 cycles / 6 s = 4×10^9 Hz = 4 GHz
To run the program in 6 seconds, B must have twice the clock rate of A.
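The same arithmetic in a short Python sketch (variable names are mine):

```python
clock_rate_a = 2e9               # 2 GHz
cycles_a = 10 * clock_rate_a     # 10 s at 2 GHz = 2e10 cycles
cycles_b = 1.2 * cycles_a        # B needs 1.2x as many cycles
clock_rate_b = cycles_b / 6      # rate needed to finish in 6 s
```

clock_rate_b comes out to 4×10^9 Hz, i.e. 4 GHz, twice the clock rate of A.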
Clock cycles per instruction (CPI): the average number of clock cycles per instruction for a program. Different instructions may take different amounts of time depending on what they do. CPI provides one way of comparing two different implementations of the same instruction set architecture.
Instruction set architecture (ISA): the number of instructions executed for a program will be the same if the program runs on two different implementations of the same ISA.
Instruction Performance
CPU time = Instruction count × CPI × Clock cycle time
CPU time = Instruction count × CPI / Clock rate
CPU Performance Equation
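The two equivalent forms of the equation can be written as a small Python sketch (function names are my own):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = instruction count x CPI / clock rate
    return instruction_count * cpi / clock_rate_hz

def cpu_time_from_cycle(instruction_count, cpi, cycle_time_s):
    # Equivalent form: CPU time = instruction count x CPI x clock cycle time
    return instruction_count * cpi * cycle_time_s
```

For example, 10^9 instructions at CPI 2.0 on a 2 GHz clock (cycle time 0.5 ns) take 1 second by either form.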
Example: Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
Let I be the number of instructions executed for the program:
CPU time_A = I × 2.0 × 250 ps = 500 × I ps
CPU time_B = I × 1.2 × 500 ps = 600 × I ps
Computer A is 1.2 times as fast as computer B for this program.
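A Python sketch of the comparison (the instruction count I is a hypothetical value; only the ratio matters):

```python
I = 10**6                  # hypothetical instruction count
time_a = I * 2.0 * 250     # CPU time of A in picoseconds
time_b = I * 1.2 * 500     # CPU time of B in picoseconds
ratio = time_b / time_a    # how much slower B is than A
```

ratio is 1.2 regardless of the value chosen for I, since I cancels out.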
Choosing Programs to Evaluate Performance
Five levels of programs: real applications; modified (or scripted) applications; kernels; toy benchmarks; synthetic benchmarks.
Desktop benchmarks:
CPU-intensive benchmarks: SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
Graphics-intensive benchmarks (SPEC2000):
SPECviewperf: used for benchmarking systems supporting the OpenGL graphics library
SPECapc: consists of applications that make extensive use of graphics
Server benchmarks:
SPECrate: processing rate of a multiprocessor
SPECSFS: file server benchmark
SPECWeb: Web server benchmark
Transaction-processing (TP) benchmarks from the TPC (Transaction Processing Performance Council): TPC-A (1985), TPC-C (1992), TPC-H, TPC-R, TPC-W
Embedded benchmarks: the EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced "embassy").
Quantitative Principles
Make the common case fast: perhaps the most important and pervasive principle of computer design. A fundamental law, called Amdahl's Law, can be used to quantify this principle.
Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Fraction_enhanced: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
Speedup_enhanced: the improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program.
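The law can be stated as a one-line function. A Python sketch (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup is limited by the fraction of time the fast mode is used.
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)
```

Note that even an infinitely fast enhancement cannot beat 1 / (1 - Fraction_enhanced): if only 40% of the time benefits, the overall speedup can never exceed 1/0.6 ≈ 1.67.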
Example 1.2: Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
Answer: Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.5625.
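A Python sketch of the computation, plugging the stated 40% computation fraction and 10x speedup into Amdahl's Law:

```python
fraction_enhanced = 0.40   # CPU busy with computation 40% of the time
speedup_enhanced = 10      # new CPU is 10x faster on that fraction
overall = 1 / ((1 - fraction_enhanced)
               + fraction_enhanced / speedup_enhanced)
# overall = 1 / (0.6 + 0.04) = 1 / 0.64 = 1.5625
```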
Example 1.3: A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark.
One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is simply to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that it can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
Answer: Speedup_FPSQR = 1 / (0.8 + 0.2/10) ≈ 1.22; Speedup_FP = 1 / (0.5 + 0.5/1.6) ≈ 1.23. Improving all FP instructions is slightly better.
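A Python sketch of the two alternatives, plugging the stated fractions and speedups into Amdahl's Law:

```python
# Alternative 1: speed up only FPSQR (20% of time) by 10x.
speedup_fpsqr = 1 / ((1 - 0.2) + 0.2 / 10)
# Alternative 2: speed up all FP instructions (50% of time) by 1.6x.
speedup_fp = 1 / ((1 - 0.5) + 0.5 / 1.6)
```

speedup_fpsqr ≈ 1.22 and speedup_fp ≈ 1.23: the broader but smaller improvement wins slightly, because it applies to a larger fraction of the execution time.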
The CPU Performance Equation
CPU performance depends on three characteristics: clock cycle time (or clock rate), clock cycles per instruction, and instruction count. It is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:
Clock cycle time—Hardware technology and organization
CPI—Organization and instruction set architecture
Instruction count—Instruction set architecture and compiler technology
Example 1.4: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.
Answer
Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better.
This is the same speedup we obtained using Amdahl’s Law:
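The CPI arithmetic for Example 1.4 can be reproduced in a short Python sketch (variable names are mine; the "other" frequency of 73% is 100% minus the two stated FP frequencies):

```python
f_fp, cpi_fp       = 0.25, 4.0    # FP ops other than FPSQR
f_sqr, cpi_sqr     = 0.02, 20.0   # FPSQR
f_other, cpi_other = 0.73, 1.33   # everything else

# Original average CPI: frequency-weighted sum over all instruction classes.
cpi_original = f_fp * cpi_fp + f_sqr * cpi_sqr + f_other * cpi_other
# Alternative 1: drop the CPI of FPSQR from 20 to 2.
cpi_new_fpsqr = cpi_original - f_sqr * (cpi_sqr - 2.0)
# Alternative 2: drop the average CPI of all FP ops (27% of instructions) to 2.5.
cpi_new_fp = (f_fp + f_sqr) * 2.5 + f_other * cpi_other
```

cpi_original ≈ 2.37, cpi_new_fpsqr ≈ 2.01, and cpi_new_fp ≈ 1.65, so the overall-FP alternative has the lower CPI and hence the better performance, matching the conclusion above.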
Principle of locality: programs tend to reuse data and instructions they have used recently; a program spends 90% of its execution time in only 10% of the code.
Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items whose addresses are near one another tend to be referenced close together in time.
Power Consumption Trends
Power = dynamic power + leakage power
Dynamic power ∝ activity × capacitance × voltage² × frequency
Capacitance per transistor and voltage are decreasing, but the number of transistors and the frequency are increasing at a faster rate.
Leakage power is also rising and will soon match dynamic power. Power consumption is already around 100 W in some high-performance processors today.
The power wall:
Power = K × (Capacitive load) × (Voltage)² × (Frequency switched)
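A Python sketch of the dynamic-power relation, using normalized, unitless values (the function name is mine):

```python
def dynamic_power(capacitive_load, voltage, frequency_hz, k=1.0):
    # P = K x (capacitive load) x (voltage)^2 x (frequency switched)
    return k * capacitive_load * voltage ** 2 * frequency_hz

p_full = dynamic_power(1.0, 1.0, 1.0)   # baseline, normalized units
p_low  = dynamic_power(1.0, 0.5, 1.0)   # same chip at half the voltage
```

Because voltage enters squared, halving the supply voltage cuts dynamic power to a quarter, which is why voltage scaling was the main lever against the power wall.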
1.7 Real Stuff: Manufacturing AMD Chips
AMD Barcelona: 65 nm process, 463 million transistors; each core has a 128 KB L1 cache and a 512 KB L2 cache, and all four cores share a 2 MB L3 cache.
Wafers and Dies
The semiconductor silicon and the chip manufacturing process
Manufacturing Process
• Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches)
• Multiple metal layers on the silicon enable connections between transistors
• The wafer is chopped into many dies; the size of the die determines yield and cost
Processor Technology Trends
• Shrinking of transistor sizes: 250 nm (1997), 130 nm (2002), 70 nm (2008), 35 nm (2014)
• Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!
• Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances)
• Wire delays do not scale down at the same rate as transistor delays
Assignments: P56, 1.1, 1.3.1-1.3.3, 1.3.1-1.4.3
END