-
Lecture 1:
Introduction
Dr. Eng. Amr T. Abdel-Hamid
Elect 707
Winter 2014
Computer Architecture
Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.
-
Dr. Amr Talaat
Elect 707
Computer Architecture and Applications
CPU History in a Flash
Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
• Processor is the new transistor?
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
• 125 mm2 chip, 0.065 micron CMOS = 2312 x (RISC II + FPU + Icache + Dcache)
– RISC II shrinks to ~0.02 mm2 at 65 nm
– Caches via DRAM or 1-transistor SRAM
-
Processor Performance Introduction
(figure: single-processor performance growth over time, showing the RISC era and the move to multi-processor)
-
Snapdragon 805 processor specs
-
-
Classes of Computers
Personal Mobile Device (PMD), e.g. smart phones, tablet computers: emphasis on energy efficiency and real-time
Desktop Computing: emphasis on price-performance
Servers: emphasis on availability, scalability, throughput
Clusters / Warehouse-Scale Computers: used for “Software as a Service (SaaS)”; emphasis on availability and price-performance
Sub-class: Supercomputers; emphasis on floating-point performance and fast internal networks
Embedded Computers: emphasis on price
-
Instruction Set Architecture: Critical Interface
The instruction set is the interface between software and hardware.
Properties of a good abstraction:
Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels
-
ISA vs. Computer Architecture
Old definition of computer architecture = instruction set design
Other aspects of computer design were called implementation
This insinuates implementation is uninteresting or less challenging
Our view is: computer architecture >> ISA
The architect’s job is much more than instruction set design; the technical hurdles today are more challenging than those in instruction set design
-
Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
(figure: creativity produces good, mediocre, and bad ideas; cost/performance analysis filters them)
-
Administrivia
Instructor: Dr. Amr T. Abdel-Hamid
Office: C3-320
Email: [email protected]
Office Hours: Monday, 3rd & 4th Lectures.
T. A: ???
-
Administrivia
-
Course Grading
Exams
Quizzes (3 quizzes, best 2 count): 10%
Midterm: 30%
Final exam: 40%
Project: 20%
Assignments: not graded
-
Project
Phase 0: Select your partner (17/9/2014)
Submit list of your group members (2-4 per group)
Submit a comparison between RISC and CISC processors
Phase 1:
.
.
.
.
Phase N: Project Implementation + Report (2 weeks before finals)
FINAL non-negotiable deadline
-
On-Time & Too-LATE Policy
In phases 0 and 1:
5% of project grade penalty per day for being late
In phases 2 to N:
No late presentation is possible.
Honor code
100% penalty for both the copier and the copy-giver of any report/code.
-
Quantitative Principles of Design
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl’s Law
5. The Processor Performance Equation
-
1) Taking Advantage of Parallelism
Increasing throughput of a server computer via multiple processors or multiple disks
Detailed HW design (DSD course shortly):
Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
Multiple memory banks searched in parallel in set-associative caches
Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
Not every instruction depends on its immediate predecessor, so executing instructions completely/partially in parallel is possible
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch),
2) Register Read (Reg),
3) Execute (ALU),
4) Data Memory Access (Dmem),
5) Register Write (Reg)
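The pipelining arithmetic above can be sketched with a small, hypothetical helper (function names are illustrative, not from the slides): filling the pipeline costs one pass through the stages, after which one instruction retires per cycle.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction uses all stages
    # before the next one begins.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # Fill the pipeline once (n_stages cycles), then retire one
    # instruction per cycle for the remaining instructions.
    return n_stages + (n_instructions - 1)

# Classic 5-stage pipeline (Ifetch, Reg, ALU, Dmem, Reg):
print(unpipelined_cycles(100, 5))  # 500
print(pipelined_cycles(100, 5))    # 104
```

For long instruction sequences the pipelined machine approaches one instruction per cycle, which is where the ideal CPI of 1 used later comes from.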
-
2) The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 30 years, HW has relied on locality for memory performance.
(figure: processor, cache ($), and memory)
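A toy model can make locality concrete. The sketch below (all names hypothetical, sizes arbitrary) simulates a direct-mapped cache and counts hits for a sequential access pattern versus a strided one that defeats both kinds of locality:

```python
def cache_hits(addresses, num_lines=64, line_size=16):
    # Toy direct-mapped cache: each slot remembers which memory
    # line it holds; hits are a rough proxy for locality.
    slots = [None] * num_lines
    hits = 0
    for addr in addresses:
        line = addr // line_size       # which memory line
        slot = line % num_lines        # which cache slot it maps to
        if slots[slot] == line:
            hits += 1                  # temporal/spatial reuse
        else:
            slots[slot] = line         # miss: fetch the line
    return hits

sequential = list(range(1024))             # strong spatial locality
strided = [i * 1024 for i in range(1024)]  # no reuse at all
print(cache_hits(sequential))  # 960
print(cache_hits(strided))     # 0
```

Sequential access hits 15 times out of every 16 references (one miss per 16-byte line); the large stride maps every reference to a new line, so the cache never helps.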
-
Levels of the Memory Hierarchy

Level       | Capacity     | Access Time             | Cost        | Staging Xfer Unit        | Managed by
Registers   | 100s bytes   | 300-500 ps (0.3-0.5 ns) |             | instr. operands, 1-8 B   | prog./compiler
L1 Cache    | 10s-100s KB  | ~1 ns - ~10 ns          | $1000s/GB   | blocks, 32-64 bytes      | cache controller
L2 Cache    | 10s-100s KB  | ~1 ns - ~10 ns          | $1000s/GB   | blocks, 64-128 bytes     | cache controller
Main Memory | GBytes       | 80-200 ns               | ~$100/GB    | pages, 4K-8K bytes       | OS
Disk        | 10s TBytes   | 10 ms (10,000,000 ns)   | ~$1/GB      | files, MBytes            | user/operator
Tape        | infinite     | sec-min                 | ~$1/GB      |                          |

Upper levels are faster; lower levels are larger.
-
3) Focus on the Common Case
Common sense guides computer design; since it is engineering, common sense is valuable
In making a design trade-off, favor the frequent case over the infrequent case
E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it 1st
E.g., if a database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st
The frequent case is often simpler and can be done faster than the infrequent case
E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
This may slow down overflow, but overall performance is improved by optimizing for the normal case
What is the frequent case, and how much is performance improved by making that case faster? => Amdahl’s Law
-
4) Amdahl’s Law

ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)
-
Amdahl’s Law example
New CPU 10X faster
I/O bound server, so 60% of time is spent waiting for I/O (only 40% of the time is enhanced)

Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
                = 1 / [(1 - 0.4) + 0.4 / 10]
                = 1 / 0.64
                = 1.56

• Apparently, it’s human nature to be attracted by 10X faster, vs. keeping in perspective that it’s just 1.6X faster
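The example can be checked in a few lines of Python. The helper below is an illustrative sketch, assuming the Amdahl formula as stated on the previous slide:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# New CPU 10x faster, but 60% of the time is I/O wait,
# so only 40% of execution time is enhanced:
s = amdahl_speedup(0.4, 10)
print(round(s, 2))  # 1.56
```

Pushing the CPU speedup to infinity only raises this to 1 / 0.6 = 1.67, which is the Speedup_maximum bound from the previous slide.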
-
5) Processor performance equation

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

What affects each factor:
              Inst Count   CPI   Cycle time
Program           X         X
Compiler          X        (X)
Inst. Set         X         X
Organization                X        X
Technology                           X
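As a minimal numeric sketch of the equation (hypothetical program parameters, illustrative names):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = instructions x (cycles / instruction) x (seconds / cycle)
    # where seconds/cycle = 1 / clock rate.
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1 billion instructions, CPI of 2.0,
# on a 1 GHz clock:
print(cpu_time(1_000_000_000, 2.0, 1_000_000_000))  # 2.0  (seconds)
```

Halving CPI or doubling the clock rate each halves CPU time, which is why the table above matters: different design levels attack different factors.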
-
5 Steps of MIPS Datapath (Figure A.2, Page A-8)
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back
(figure: single-cycle datapath with Next PC logic, adder, instruction memory, register file, sign extend, ALU, data memory, and muxes; signals include RS1, RS2, RD, Imm, LMD, Zero?, Next SEQ PC, and WB Data)
-
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back
(figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the stages; same datapath elements as the single-cycle version)
-
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
(figure: the same pipelined datapath as the previous slide)
• Data stationary control – local decode for each instruction phase / pipeline stage
-
Visualizing Pipelining (Figure A.2, Page A-8)
(figure: four instructions, in program order, flowing through Ifetch, Reg, ALU, DMem, Reg across clock cycles 1-7; a new instruction starts each cycle)
-
Pipelining is not quite that easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
-
One Memory Port / Structural Hazards (Figure A.4, Page A-14)
(figure: Load followed by Instr 1-4 flowing through the pipeline across cycles 1-7; with a single memory port, the Load’s DMem access in cycle 4 conflicts with Instr 3’s Ifetch)
-
One Memory Port / Structural Hazards (Similar to Figure A.5, Page A-15)
(figure: the same sequence, but Instr 3 is stalled one cycle and a bubble propagates through the pipeline)
How do you “bubble” the pipe?
-
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = [Ideal CPI x Pipeline depth / (Ideal CPI + Pipeline stall CPI)]
          x (Cycle Time_unpipelined / Cycle Time_pipelined)

For a simple RISC pipeline, Ideal CPI = 1:

Speedup = [Pipeline depth / (1 + Pipeline stall CPI)]
          x (Cycle Time_unpipelined / Cycle Time_pipelined)
-
Example: Dual-port vs. Single-port
Machine A: dual-ported memory (“Harvard Architecture”)
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster
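The comparison can be reproduced with a small helper, assuming the speedup equation from the previous slide (names and the depth value are illustrative):

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    # Speedup = depth / (1 + stall CPI) x (unpipelined / pipelined cycle time)
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5  # hypothetical pipeline depth; it cancels in the ratio
speedup_a = pipeline_speedup(depth, 0.0)        # dual-ported: no stalls
speedup_b = pipeline_speedup(depth, 0.4, 1.05)  # 40% loads stall 1 cycle,
                                                # but 1.05x faster clock
print(round(speedup_a / speedup_b, 2))  # 1.33
```

The lesson: a 5% faster clock does not compensate for a stall on every load when loads are 40% of the mix.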
-
Data Hazard on R1 (Figure A.6, Page A-17)

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

(figure: the five instructions, in program order, in the IF, ID/RF, EX, MEM, WB pipeline; add writes r1 in WB, but the following instructions read r1 in earlier cycles)
-
Three Generic Data Hazards
Read After Write (RAW)
InstrJ tries to read an operand before InstrI writes it

I: add r1,r2,r3
J: sub r4,r1,r3

Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
-
Three Generic Data Hazards
Write After Read (WAR)
InstrJ writes an operand before InstrI reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5
-
Three Generic Data Hazards
Write After Write (WAW)
InstrJ writes an operand before InstrI writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
Can’t happen in the MIPS 5-stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
Will see WAR and WAW in more complicated pipes
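The three hazard classes can be captured in a few lines. This is a hypothetical sketch that just inspects register names for a pair of instructions in program order; it is not the actual pipeline detection hardware:

```python
def classify_hazards(instr_i, instr_j):
    # Each instruction is (destination_register, set_of_source_registers).
    # instr_i is earlier in program order than instr_j.
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")   # J reads what I writes
    if dest_j in srcs_i:
        hazards.append("WAR")   # J writes what I reads
    if dest_i == dest_j:
        hazards.append("WAW")   # both write the same register
    return hazards

# I: add r1,r2,r3   J: sub r4,r1,r3  -> J reads r1 after I writes it
print(classify_hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))  # ['RAW']
```

In the simple 5-stage pipeline only RAW matters; WAR and WAW become real once instructions can write out of order.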
-
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

(figure: the same sequence, with forwarding paths carrying the add’s ALU result directly to the ALU inputs of the dependent instructions, avoiding stalls)
-
HW Change for Forwarding (Figure A.23, Page A-37)
(figure: datapath with forwarding muxes at the ALU inputs, fed from the ID/EX, EX/MEM, and MEM/WB pipeline registers, the register file, the immediate, and NextPC)
What circuit detects and resolves this hazard?
-
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)

add r1,r2,r3
lw  r4, 0(r1)
sw  r4,12(r1)
or  r8,r6,r9
xor r10,r9,r11

(figure: forwarding passes the add result to the lw’s address calculation, and the loaded value from the lw’s MEM stage to the sw’s MEM stage)
-
Data Hazard Even with Forwarding (Figure A.9, Page A-21)

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

(figure: the lw result is not available until the end of its MEM stage, which is too late to forward backward in time to the sub’s EX stage)
-
Data Hazard Even with Forwarding (Similar to Figure A.10, Page A-21)

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

(figure: the sub and the following instructions are stalled one cycle, inserting a bubble, so the loaded value can be forwarded to the sub’s EX stage)
How is this detected?
-
Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

(figure: the three instructions after the beq are fetched before the branch outcome is known in its MEM stage)
What do you do with the 3 instructions in between? How do you do it?
-
Branch Stall Impact
If CPI = 1 and 30% of instructions are branches,
a 3-cycle stall => new CPI = 1 + 0.3 x 3 = 1.9!
Two-part solution:
Determine branch taken or not sooner, AND
Compute taken branch address earlier
MIPS branch tests if register = 0 or ≠ 0
MIPS Solution:
Move the Zero test to the ID/RF stage
Adder to calculate new PC in the ID/RF stage
1 clock cycle penalty for branch versus 3
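The CPI arithmetic above can be sketched as follows (hypothetical helper, assuming stall cycles simply add to the ideal CPI):

```python
def effective_cpi(base_cpi, branch_fraction, branch_penalty):
    # Each branch adds `branch_penalty` stall cycles on average.
    return base_cpi + branch_fraction * branch_penalty

print(round(effective_cpi(1.0, 0.3, 3), 2))  # 1.9  (naive 3-cycle stall)
print(round(effective_cpi(1.0, 0.3, 1), 2))  # 1.3  (branch resolved in ID/RF)
```

Moving branch resolution into ID/RF cuts the branch contribution to CPI from 0.9 to 0.3 extra cycles per instruction.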
-
Pipelined MIPS Datapath (Figure A.24, page A-38)
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back
(figure: pipelined datapath with the branch Zero? test and the branch-target adder moved into the ID/RF stage, reducing the branch penalty)
• Interplay of instruction set design and cycle time.
-
Current Trends in Architecture
Cannot continue to leverage Instruction-Level Parallelism (ILP)
Single-processor performance improvement ended in 2003
New models for performance:
Data-level parallelism (DLP)
Thread-level parallelism (TLP)
Request-level parallelism (RLP)
-
Parallelism
Classes of parallelism in applications:
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)
Classes of architectural parallelism:
Instruction-Level Parallelism (ILP)
Vector architectures / Graphic Processor Units (GPUs)
Thread-Level Parallelism
Request-Level Parallelism
-
Trends in Technology
Integrated circuit technology:
Transistor density: 35%/year
Die size: 10-20%/year
Integration overall: 40-55%/year
DRAM capacity: 25-40%/year (slowing)
Flash capacity: 50-60%/year; 15-20X cheaper/bit than DRAM
Magnetic disk technology: 40%/year; 15-25X cheaper/bit than Flash,
300-500X cheaper/bit than DRAM
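These annual rates compound. A quick sketch (hypothetical helper) shows what 35%/year growth means over a few years: it doubles in roughly 2.3 years and quadruples in under 5.

```python
def projected_growth(annual_rate, years):
    # Compound improvement: (1 + rate) ** years
    return (1.0 + annual_rate) ** years

# Transistor density at 35%/year:
print(round(projected_growth(0.35, 4), 2))  # 3.32
print(round(projected_growth(0.35, 5), 2))  # 4.48
```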
-
Bandwidth and Latency
Bandwidth or throughput: total work done in a given time
10,000-25,000X improvement for processors
300-1200X improvement for memory and disks
Latency or response time: time between start and completion of an event
30-80X improvement for processors
6-8X improvement for memory and disks
-
Power and Energy
Problem: get power in, get power out
Thermal Design Power (TDP): characterizes sustained power consumption
Used as a target for the power supply and cooling system
Lower than peak power, higher than average power consumption
Clock rate can be reduced dynamically to limit power consumption
Energy per task is often a better measurement
-
Dynamic Energy and Power
Dynamic energy (per transistor switch from 0 -> 1 or 1 -> 0):
  Energy = 1/2 x Capacitive load x Voltage^2
Dynamic power:
  Power = 1/2 x Capacitive load x Voltage^2 x Frequency switched
Reducing clock rate reduces power, not energy
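The cubic payoff of scaling voltage and frequency together follows directly from the power equation. A minimal sketch, with hypothetical values for the capacitive load, voltage, and frequency:

```python
def dynamic_power(capacitive_load, voltage, frequency):
    # P_dynamic = 1/2 x C x V^2 x f
    return 0.5 * capacitive_load * voltage ** 2 * frequency

# Scaling both voltage and frequency down by 15% cuts power to
# 0.85^3 of the original, roughly a 39% reduction:
p0 = dynamic_power(1e-9, 1.0, 3e9)
p1 = dynamic_power(1e-9, 0.85, 0.85 * 3e9)
print(round(p1 / p0, 2))  # 0.61
```

Lowering frequency alone only scales power linearly and leaves energy per task unchanged, which is why voltage scaling is the lever that matters.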
-
Power
Intel 80386 consumed ~ 2 W
3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air
-
Reducing Power
Techniques for reducing power:
Do nothing well
Dynamic Voltage-Frequency Scaling (DVFS)
Low-power state for DRAM, disks
Turning off cores
-
Reading Assignment:
Chapter 1, Appendix B