Lecture 13: Modern Superscalar Pipelines
CprE 581 Computer Systems Architecture, Fall 2012
Readings
Textbook 3.8-3.13
“Complexity effective superscalar processors” (thesis) ch. 1 and 2,
http://jes.ece.wisc.edu/papers/isca97.subba.pdf
Copyright © 2012, Elsevier Inc. All rights reserved.
Multiple Issue and Static Scheduling
To achieve CPI < 1, need to complete multiple instructions per clock
Solutions:
• Statically scheduled superscalar processors
• VLIW (very long instruction word) processors
• Dynamically scheduled superscalar processors
Multiple Issue
Tomasulo Performance
Observe at the EX stage: how many cycles does it take to execute this code?
LW R2,45(R3)
ADD R6,R2,R4
SUB R10,R0,R6
ADD R10,R10,R12
Assume load takes 1 cycle, ALU 1 cycle
[Figure: Tomasulo-style pipeline block diagram. Fetch Unit and IM, Decode, Rename, Reorder Buffer, two RS feeding FU1 and FU2, L-buf, S-buf, DM, Regfile]
Tomasulo vs MIPS Pipeline
How many cycles on the 5-stage MIPS pipeline?
Why does the simple pipeline run faster?
[Figure: 5-stage MIPS pipeline (IF, ID, EX, MEM, WB) with stall check and data forwarding]
Review Tomasulo Inst Scheduling
Both instructions are in RS; no contention on CDB or FU
ADD R2,R2,45   # R2 => tag p, result = A
SUB R6,R2,R4   # R4 is ready, = B
Cycle 1: ADD starts at FU, producing A
Cycle 2: ADD broadcasts p + A; SUB matches on p and accepts A
Cycle 3: SUB starts execution, FU calculates A-B
A is produced at cycle 1, but consumed at cycle 3 -- unavoidable?
Review Data Forwarding
MIPS pipeline data forwarding: FU/MEM => FU
Why not in Tomasulo?
Cycle 2: forward A from FU output to FU input…
But tag broadcasting has one cycle delay!!
When is it known that A will be ready?
Cycle 1: A is to be ready
Cycle 2: A and its tag are broadcast
If tag is broadcast one cycle earlier …
[Figure: FU with a bypass path from its output back to its input; REG/ROB; ROB bypass]
Revised Scheduling
RS1: ADD R6,R2,R4
RS2: SUB R10,R0,R6
RS3: ADD R12,R10,R6
ADD(1) has been ready and selected
1. - ADD(1)’s tag is broadcast, and operands are sent to FU;
   - SUB is woken up and selected;
2. - SUB’s tag is broadcast, operands are sent to FU;
   - forwarding logic replaces 2nd FU operand with FU output;
   - ADD(2) is woken up, accepts FU output, and is selected
3. So on and so forth…
RS can be centralized or distributed
[Figure: RS1 to RS5 feeding SELECT logic and an FU; tag broadcast moved one cycle earlier]
How to address CDB contention?
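The timing difference between broadcasting the tag together with the result and broadcasting it one cycle early can be sketched with a very small model. This is only an illustration under the slide's assumptions (1-cycle ALU latency, no FU or CDB contention, a pure dependence chain); the function name `start_cycles` and the parameter `wakeup_delay` are made up for this sketch.

```python
def start_cycles(chain_length, wakeup_delay, first_start=1):
    """Cycle in which each instruction of a dependence chain starts at the FU.

    wakeup_delay is the number of cycles between a producer starting
    execution and its dependent consumer starting execution:
      2 -> tag is broadcast together with the result (original scheme)
      1 -> tag is broadcast one cycle early, result arrives via forwarding
    Assumes 1-cycle ALU latency and no FU or CDB contention.
    """
    return [first_start + i * wakeup_delay for i in range(chain_length)]


# ADD R6,R2,R4 -> SUB R10,R0,R6 -> ADD R12,R10,R6 (each depends on the previous one)
late = start_cycles(3, wakeup_delay=2)
early = start_cycles(3, wakeup_delay=1)
print("tag broadcast with result:    ", late)
print("tag broadcast one cycle early:", early)
```

With the early broadcast, dependent single-cycle instructions start in back-to-back cycles, which is the behavior of the forwarding MIPS pipeline.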
How to Handle Variable Latency?
One method: Use result shift register to track latency and control tag/data bus
For an FU of k-cycle latency that starts an operation at cycle n:
• Tag broadcast: cycle n+k-1
• Control data bus: cycle n+k
[Figure: RS1 to RS5 feeding SELECT logic]
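A result shift register can be sketched as a small array in which slot i means "a result will use the data bus i cycles from now". The class below is a minimal illustration, not a description of any particular processor; the names `ResultShiftRegister`, `reserve`, and `tick` are assumptions of this sketch.

```python
class ResultShiftRegister:
    """Tracks future result-bus use; slot i = data on the bus i cycles from now."""

    def __init__(self, max_latency):
        self.slots = [None] * (max_latency + 1)
        self.cycle = 0

    def reserve(self, tag, latency):
        """Try to start an op with the given latency (1..max_latency) this cycle.

        Returns False if another result already owns the bus in that future cycle,
        in which case the select logic has to pick a different instruction.
        """
        if self.slots[latency] is not None:
            return False
        self.slots[latency] = tag
        return True

    def tick(self):
        """Finish the current cycle: report bus activity, then shift by one."""
        events = {
            "cycle": self.cycle,
            "tag_broadcast": self.slots[1],  # result due next cycle: wake consumers now
            "data_bus": self.slots[0],       # result due this cycle: drive the data bus
        }
        self.slots = self.slots[1:] + [None]
        self.cycle += 1
        return events


rsr = ResultShiftRegister(max_latency=4)
rsr.reserve("p1", 3)                  # k = 3, started at cycle n = 0
for _ in range(4):
    print(rsr.tick())                 # tag at cycle n+k-1 = 2, data at cycle n+k = 3
```

Because each slot holds at most one tag, a reservation that fails is exactly a result-bus conflict detected at issue time rather than at completion time.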
Revised Pipeline Stages
[Figure: pipeline with stages Fetch, Dispatch/Rename, ROB, Reg, Wakeup/select, execute (FU, FU, bypass, D-cache), commit; RS]
• As efficient as MIPS pipeline (instruction throughput)
• With data forwarding and bypassing
SuperScalar Microarchitecture
[Figure: superscalar microarchitecture block diagram. Fetch unit, instruction cache, dispatch unit with 8-entry instruction queue, instruction dispatch buses, reservation stations, functional units (FPU, LSU, MCFSU, BPU, XSU1, XSU0), FP and GP registers, FP and GP operand buses, FP and GP result buses, data cache, completion unit with reorder buffer, instruction operation buses, result status buses, register numbers, branch correction, reorder buffer information]
Overview of Design
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,LOOP ;branch if not last element
Example
Example (No Speculation)
Example
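Dispatch Stage - Dependence Check Logic
[Figure: dispatch unit holding k instructions, each with fields op, D, S1, S2; comparators (C) check the S1 and S2 fields of each instruction against the D fields of all earlier instructions in the group]
#comp = 2(k-1) + 2(k-2) + … + 2 = k(k-1)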
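The comparator count can be checked by enumerating the comparisons directly. The sketch below assumes, as in the formula, two source fields per instruction and one comparison of each source against the destination of every earlier instruction in the group; `dependence_comparators` is a name chosen for this example.

```python
def dependence_comparators(k):
    """List every (consumer, source_field, producer) comparison in a k-wide group."""
    return [(i, s, j)
            for i in range(k)         # instruction i in the dispatch group
            for s in ("S1", "S2")     # its two source fields
            for j in range(i)]        # destination D of every earlier instruction j


for k in (2, 4, 8):
    n = len(dependence_comparators(k))
    print(f"k={k}: {n} comparators, k(k-1)={k * (k - 1)}")
```

The quadratic growth is the reason dispatch width is expensive to scale.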
Dispatch Stage – Rename Table
Rethink RS and ROB design
Data broadcasting to all RS stations:
• Broadcasting saves the reg-write to reg-read delay
• n child instructions can receive data simultaneously
However,
• RS and ROB may store duplicate values
• Not all n child instructions may execute at an FU in the next cycle
Rethink ROB Design
Does every ROB entry store register output?
One solution: Split ROB into ROB + Rename Registers
[Figure: results from FU go to Rename Regs and then to Arch. Regs; the ROB holds no register values]
Used in some PowerPC processors
Alternative: Register Mapping and Issue Queue
[Figure: RS entry (op, Qj, Qk, Vj, Vk) vs. IQ entry (i-type); ROB entry (dest, result, PC, valid); physical registers p1, p2, p3, …, p_n with busy bit; Arch. Regs become virtual regs; annotations "removed" and "replaced"]
Physical register: Central collection of all register values, architectural or temporary
Register Mapping Table
• Rename architectural register to physical register
• NO real architectural registers (now virtual registers)
• RS => issue queue
• Rename stage: allocate issue queue entry, allocate ROB, allocate physical register
• What is tag now?
[Figure: Mapping Table indexed by ra, rb, rc, producing pa, pb, pc; physical registers p1, p2, p3, …, p_n supply vala, valb; free list used to allocate new registers]
Mis-speculation Recovery
RS+ROB: no changes to arch. registers, so just clear pipeline and re-fetch
Correctness issue: software does not see incorrect register contents
Recovery for mapping approach: Roll back mapping table to the mis-speculation point
Architectural registers => virtual registers
[Figure: ROB entries associated with mapping 1, mapping 2 and the committed mapping; physical registers p1, p2, p3, …, p_n; mapping table status]
Alpha 21264: One mapping table per instruction in issue queue, with selective flushing
How to implement mapping table supporting recovery?
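One way to roll the mapping table back to the mis-speculation point is to snapshot it at each branch. The sketch below is a simplified software model, not the Alpha 21264 mechanism: the class `RenameTable`, its method names, and the allocation log used to return wrong-path registers to the free list are all assumptions made for illustration.

```python
class RenameTable:
    """Architectural-to-physical mapping with per-branch snapshots."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)
        self.free_list = list(free_list)
        self.alloc_log = []          # physical registers in allocation order
        self.snapshots = {}          # branch id -> (mapping copy, log position)

    def rename_dest(self, arch_reg):
        preg = self.free_list.pop(0)
        self.alloc_log.append(preg)
        self.mapping[arch_reg] = preg
        return preg

    def checkpoint(self, branch_id):
        self.snapshots[branch_id] = (dict(self.mapping), len(self.alloc_log))

    def recover(self, branch_id):
        """Roll back to the state captured at the mispredicted branch."""
        saved_map, log_pos = self.snapshots.pop(branch_id)
        self.free_list.extend(self.alloc_log[log_pos:])  # wrong-path allocations
        del self.alloc_log[log_pos:]
        self.mapping = saved_map
        # A fuller model would also discard snapshots of younger branches.


rt = RenameTable({"r1": "p1", "r2": "p2"}, ["p3", "p4", "p5"])
rt.rename_dest("r1")       # r1 -> p3, before the branch
rt.checkpoint("br0")
rt.rename_dest("r2")       # r2 -> p4, speculative
rt.rename_dest("r1")       # r1 -> p5, speculative
rt.recover("br0")          # branch mispredicted
print(rt.mapping, rt.free_list)
```

Recovery costs one table copy regardless of how many wrong-path instructions were renamed, which is the attraction of snapshots over undoing mappings one ROB entry at a time.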
“Data in ROB” Design (HP PA8000, Pentium Pro, Core2Duo)
• On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch)
• On completion, write to dest field and broadcast to src fields.
• On issue, read from ROB src fields
[Figure: Register File (holds only committed state) and Reorder Buffer with entries t1, t2, …, tn and fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; Load Unit, FUs and Store Unit return < t, result >; Commit]
Unified Physical Register File (MIPS R10K, Alpha 21264, Pentium 4)
• One regfile for both committed and speculative values (no data in ROB)
• During decode, instruction result allocated new physical register, source regs translated to physical regs through rename table
• Instruction reads data from regfile at start of execute (not in decode)
• Write-back updates reg. busy bits on instructions in ROB (assoc. search)
• Snapshots of rename table taken at every branch to recover mispredicts
• On exception, renaming undone in reverse order of issue (MIPS R10000)
[Figure: Rename Table (r1 -> ti, r2 -> tj) with snapshots for mispredict recovery; RegFile t1, t2, …, tn; Load Unit, FUs and Store Unit return < t, result >; ROB not shown]
Pipeline Design with Physical Regfile
[Figure: in-order front end (PC, Fetch, Decode & Rename, Reorder Buffer, Branch Prediction), out-of-order Execute (Branch Unit, ALU, MEM, Store Buffer, D$, Physical Reg. File), in-order Commit; Branch Resolution updates predictors and sends kill signals to the earlier stages]
Lifetime of Physical Registers
• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)
Before rename        After rename
ld r1, (r3)          ld P1, (Px)
add r3, r1, #4       add P2, P1, #4
sub r6, r7, r9       sub P3, Py, Pz
add r3, r3, r6       add P4, P2, P3
ld r6, (r1)          ld P5, (P1)
add r6, r6, r3       add P6, P5, P4
st r6, (r1)          st P6, (P1)
ld r6, (r11)         ld P7, (Pw)
When can we reuse a physical register? When next write of same architectural register commits
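The renamed sequence above can be reproduced with a short sketch of the rename step. It is an illustration only: instructions are modeled as (op, dest, sources) tuples with immediates dropped, Px, Py, Pz and Pw are the initial mappings implied by the slide, and Pa and Pb are hypothetical earlier mappings of r1 and r6 that the slide does not show.

```python
from collections import deque


def rename(instructions, rename_map, free_list):
    """Rename (op, dest, sources) tuples; dest is None for stores.

    Returns (op, pdest, psources, prev_pdest) tuples, where prev_pdest is the
    physical register that can be reused once this instruction commits.
    """
    renamed = []
    for op, dest, sources in instructions:
        psources = [rename_map[r] for r in sources]   # read current mappings first
        pdest = prev = None
        if dest is not None:
            prev = rename_map[dest]                   # old mapping, freed at commit
            pdest = free_list.popleft()               # allocate a new physical register
            rename_map[dest] = pdest
        renamed.append((op, pdest, psources, prev))
    return renamed


program = [
    ("ld",  "r1", ["r3"]),
    ("add", "r3", ["r1"]),           # immediate #4 omitted
    ("sub", "r6", ["r7", "r9"]),
    ("add", "r3", ["r3", "r6"]),
    ("ld",  "r6", ["r1"]),
    ("add", "r6", ["r6", "r3"]),
    ("st",  None, ["r6", "r1"]),
    ("ld",  "r6", ["r11"]),
]
# Px, Py, Pz, Pw follow the slide; Pa and Pb are hypothetical prior mappings.
initial_map = {"r1": "Pa", "r3": "Px", "r6": "Pb", "r7": "Py", "r9": "Pz", "r11": "Pw"}
free = deque(f"P{i}" for i in range(1, 8))

for entry in rename(program, initial_map, free):
    print(entry)
```

The last element of each tuple makes the reuse rule concrete: P2, for example, is listed as the previous mapping of the second `add r3`, so P2 becomes reusable when that instruction commits.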
Physical Register Management
ROB fields: use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd
(LPRd requires third read port on Rename Table for each instruction)
Instructions to rename:
ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)
Initial state:
• Rename Table: R1 -> P8, R3 -> P7, R6 -> P5, R7 -> P6
• Physical Regs: P5 holds R6, P6 holds R7, P7 holds R3, P8 holds R1 (all present, p)
• Free List: P0, P1, P3, P2, P4
Physical Register Management
After renaming ld r1, 0(r3):
• ROB: use=x, op=ld, p1=p, PR1=P7, Rd=r1, LPRd=P8, PRd=P0
• Rename Table: R1 -> P0 (was P8), R3 -> P7, R6 -> P5, R7 -> P6
• Free List: P1, P3, P2, P4
Physical Register Management
After renaming add r3, r1, #4:
• ROB adds: use=x, op=add, PR1=P0 (not yet present), Rd=r3, LPRd=P7, PRd=P1
• Rename Table: R1 -> P0, R3 -> P1 (was P7), R6 -> P5, R7 -> P6
• Free List: P3, P2, P4
Physical Register Management
After renaming sub r6, r7, r6:
• ROB adds: use=x, op=sub, p1=p, PR1=P6, p2=p, PR2=P5, Rd=r6, LPRd=P5, PRd=P3
• Rename Table: R1 -> P0, R3 -> P1, R6 -> P3 (was P5), R7 -> P6
• Free List: P2, P4
Physical Register Management
After renaming add r3, r3, r6:
• ROB adds: use=x, op=add, PR1=P1, PR2=P3 (neither present yet), Rd=r3, LPRd=P1, PRd=P2
• Rename Table: R1 -> P0, R3 -> P2 (was P1), R6 -> P3, R7 -> P6
• Free List: P4
Physical Register Management
After renaming ld r6, 0(r1):
• ROB adds: use=x, op=ld, PR1=P0 (not yet present), Rd=r6, LPRd=P3, PRd=P4
• Rename Table: R1 -> P0, R3 -> P2, R6 -> P4 (was P3), R7 -> P6
• Free List: empty
Physical Register Management
Execute & Commit, ld r1, 0(r3):
• ld executes (ex=x) and writes the new R1 value into P0 (p set)
• Instructions waiting on P0 see that source become present (p)
• ld commits; its LPRd, P8, returns to the Free List
Physical Register Management
Execute & Commit, add r3, r1, #4:
• add executes (ex=x) and writes the new R3 value into P1 (p set)
• Instructions waiting on P1 see that source become present (p)
• add commits; its LPRd, P7, returns to the Free List
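The whole walk-through can be replayed in a few lines. The sketch below models only what is needed to track register allocation and recycling: each ROB entry records PRd and LPRd, and committing the oldest entry returns its LPRd to the free list. The function names `rename_all` and `commit_oldest` and the dictionary layout are assumptions of this example; the initial rename table and free list are the ones from the slides.

```python
from collections import deque


def rename_all(program, rename_table, free_list):
    """Rename in program order and return the ROB as a deque of dict entries."""
    rob = deque()
    for op, rd, sources in program:
        prs = [rename_table[r] for r in sources]   # PR1/PR2: current source mappings
        lprd = rename_table[rd]                    # LPRd: mapping being overwritten
        prd = free_list.popleft()                  # PRd: newly allocated register
        rename_table[rd] = prd
        rob.append({"op": op, "Rd": rd, "PR": prs, "LPRd": lprd, "PRd": prd})
    return rob


def commit_oldest(rob, free_list):
    """Retire the oldest ROB entry and recycle the register it superseded."""
    entry = rob.popleft()
    free_list.append(entry["LPRd"])
    return entry


program = [("ld", "r1", ["r3"]), ("add", "r3", ["r1"]), ("sub", "r6", ["r7", "r6"]),
           ("add", "r3", ["r3", "r6"]), ("ld", "r6", ["r1"])]
table = {"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"}
free = deque(["P0", "P1", "P3", "P2", "P4"])

rob = rename_all(program, table, free)
for e in rob:
    print(e)
commit_oldest(rob, free)    # ld commits, P8 is recycled
commit_oldest(rob, free)    # add commits, P7 is recycled
print(list(free))
```

Reading LPRd at rename time is what costs the extra rename-table read port per instruction: the old mapping of Rd must be fetched before it is overwritten.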
Reorder Buffer Holds Active Instruction Window
[Figure: the instruction sequence below shown at cycle t and at cycle t + 1, with Commit, Execute and Fetch pointers advancing from older to newer instructions]
… (Older instructions)
ld r1, (r3)
add r3, r1, r2
sub r6, r7, r9
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)
ld r6, (r1)
… (Newer instructions)
Superscalar Register Renaming
• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers
[Figure: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) drive read addresses into the Rename Table; write ports update the mapping with registers from the Register Free List; read data yields Op, PDest, PSrc1, PSrc2 for each instruction]
Does this work?
Superscalar Register Renaming
[Figure: same rename datapath, with "=?" comparators between the Dest of Inst 1 and the Src1 and Src2 of Inst 2]
Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup.
MIPS R10K renames 4 serially-RAW-dependent insts/cycle
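The intra-group RAW check can be sketched as follows. All table lookups for the group use the mapping from before the group, as they would with parallel read ports, and the "=?" comparators then replace a stale lookup with the PDest of an earlier instruction in the same group. The function `rename_group` and the (op, dest, src1, src2) tuple layout are assumptions of this sketch.

```python
def rename_group(group, rename_map, free_list):
    """Rename a dispatch group of (op, dest, src1, src2) as if done in one cycle."""
    # Step 1: table reads and free-list allocations happen in parallel, so every
    # lookup sees the mapping from before this group.
    looked_up = [(rename_map[s1], rename_map[s2]) for _, _, s1, s2 in group]
    pdests = [free_list.pop(0) for _ in group]

    # Step 2: the "=?" comparators override a stale lookup when an earlier
    # instruction in the same group writes the source register (RAW).
    renamed = []
    for i, (op, _, s1, s2) in enumerate(group):
        psrc = list(looked_up[i])
        for j in range(i):                       # scan oldest to youngest so the
            for k, src in enumerate((s1, s2)):   # youngest matching producer wins
                if group[j][1] == src:
                    psrc[k] = pdests[j]
        renamed.append((op, pdests[i], psrc[0], psrc[1]))

    # Step 3: update the table in program order so the last writer wins (WAW).
    for (_, dest, _, _), pd in zip(group, pdests):
        rename_map[dest] = pd
    return renamed


table = {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4", "r5": "p5"}
free = ["p10", "p11"]
group = [("add", "r1", "r2", "r3"),    # writes r1
         ("sub", "r4", "r1", "r5")]    # reads r1 in the same cycle
print(rename_group(group, table, free))
```

Without step 2 the `sub` would read the old mapping p1 for r1 and miss the value produced by the `add` renamed in the same cycle.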
Scheduling with Issue Queue
Pipeline stages: FETCH, RENAME, SCHEDULE, REG, EXE/WB, COMMIT
[Figure: Fetch Unit and IM, Decode, Rename, Issue Queue, phy. regfile, FU1, FU2, L-buf, S-buf, DM, ROB]
Alpha 21264 Pipeline
Examples: Intel P6
Use RS + ROB
• 40-entry ROB
• 20-entry RS
• Register Alias Table
[Figure: P6 pipeline stages: Fetch, Decode, Rename, ROB Rd, …]
Example: Intel Pentium 4
Use issue queue + phy. regs
• 128 entries
Pipeline stages: Alloc, Rename, Rename, Queue, Schd, Schd, Schd, Disp, Disp, Reg, Reg, Ex
Generic Superscalar Processor Models
Issue queue based: Fetch, Rename, Wakeup/select, Regfile, execute (FU, FU, bypass, D-cache), commit
Reservation based: Fetch, Rename, ROB, Reg, Wakeup/select, execute (FU, FU, bypass, D-cache), commit
Source: Palacharla PhD thesis, 1998
Dynamic Scheduling in P6 (Pentium Pro, II, III)
Q: How to pipeline 1 to 17 byte, 80x86 instructions?
• P6 doesn’t pipeline 80x86 instructions
• P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS)
• Sends micro-operations to reorder buffer & reservation stations
• Many instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
14 clocks in total pipeline (~ 3 state machines)
Pentium III Die Photo
• EBL/BBL - Bus logic, Front, Back
• MOB - Memory Order Buffer
• Packed FPU - MMX Fl. Pt. (SSE)
• IEU - Integer Execution Unit
• FAU - Fl. Pt. Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Fl. Pt.
• RS - Reservation Station
• BTB - Branch Target Buffer
• IFU - Instruction Fetch Unit (+I$)
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer
1st Pentium III, Katmai: 9.5 M transistors, 12.3 * 10.4 mm in 0.25-micron process with 5 layers of aluminum
AMD Athlon
Similar to P6 microarchitecture (Pentium III), but more resources
• Transistors: PIII 24M v. Athlon 37M
• Die Size: 106 mm² v. 117 mm²
• Power: 30W v. 76W
• Cache: 16K/16K/256K v. 64K/64K/256K
• Window size: 40 v. 72 uops
• Rename registers: 40 v. 36 int + 36 Fl. Pt.
• BTB: 512 x 2 v. 4096 x 2
• Pipeline: 10-12 stages v. 9-11 stages
• Clock rate: 1.0 GHz v. 1.2 GHz
• Memory bandwidth: 1.06 GB/s v. 2.12 GB/s
Pentium 4
• Still translate from 80x86 to micro-ops
• P4 has better branch predictor, more FUs
• Instruction Cache holds micro-operations vs. 80x86 instructions
  - no decode stages of 80x86 on cache hit
  - called “trace cache” (TC)
• Faster memory bus: 400 MHz v. 133 MHz
• Caches
  - Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
  - Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
  - Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
• Clock rates: Pentium III 1 GHz v. Pentium 4 1.5 GHz
Pentium 4 features
• Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions
  - When used by programs?
  - Faster Floating Point: execute 2 64-bit FP per clock
  - Memory FU: 1 128-bit load, 1 128-bit store per clock to MMX regs
• Supporting RAMBUS DRAM
  - Bandwidth faster, latency same as SDRAM
  - Cost 2X-3X vs. SDRAM
• ALUs operate at 2X clock rate for many ops
• Pipeline doesn’t stall at this clock rate: uops replay
• Rename registers: 40 vs. 128; Window: 40 v. 126
• BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
Basic Pentium 4 Pipeline
1-2    trace cache next instruction pointer
3-4    fetch uops from Trace Cache
5      drive uops to alloc
6      alloc resources (ROB, reg, …)
7-8    rename arch. reg to 128 physical reg
9      put renamed uops into queue
10-12  write uops into scheduler
13-14  move up to 6 uops to FU
15-16  read registers
17     FU execution
18     compute flags e.g. for branch instructions
19     check branch output with branch prediction
20     drive branch check result to frontend
Stage names: TC Nxt IP (2), TC Fetch (2), Drive, Alloc, Rename (2), Queue, Schd (3), Disp (2), Reg (2), Ex, Flags, Br Chk, Drive
Block Diagram of Pentium 4 Microarchitecture
• BTB = Branch Target Buffer (branch predictor)
• I-TLB = Instruction TLB, Trace Cache = Instruction cache
• RF = Register File; AGU = Address Generation Unit
• "Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.s
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
Pentium 4 Die Photo
• 42M Xtors (PIII: 26M)
• 217 mm² (PIII: 106 mm²)
• L1 Execution Cache Buffer: 12,000 Micro-Ops
• 8KB data cache
• 256KB L2$
ARM Cortex-A8 Pipeline
ARM Cortex-A8
• Dual-issue, statically scheduled superscalar
• Dynamic branch predictor (512-entry, 2-way set-associative BTB; 4K global history buffer, indexed by BHR+PC)
• 13-stage pipeline
• 3 cycles – IF, 4 cycles – ID, 5 cycles int pipeline
Copyright © 2011, Elsevier Inc. All rights Reserved.
The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue stage, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.
The five-stage instruction decode of the A8. Multiply operations are always performed in ALU pipeline 0.
ARM Cortex-A8: Neon pipeline
The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. The eon benchmark deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1 and L2 generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.
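The estimation procedure described in the caption can be written out as a few lines of arithmetic. This is a sketch under the assumption that the stall components stack on top of the base CPI; the function name `cpi_breakdown` and every number in the example call are hypothetical and are not measurements from the figure.

```python
def cpi_breakdown(measured_cpi, base_cpi,
                  l1_misses_per_instr, l1_penalty,
                  l2_misses_per_instr, l2_penalty):
    """Split a measured CPI into base, memory-stall and pipeline-stall parts."""
    l1_stalls = l1_misses_per_instr * l1_penalty   # L1-generated stalls per instruction
    l2_stalls = l2_misses_per_instr * l2_penalty   # L2-generated stalls per instruction
    # Whatever is left after removing base CPI and memory stalls is attributed
    # to pipeline stalls (hazards plus minor effects).
    pipeline_stalls = measured_cpi - base_cpi - l1_stalls - l2_stalls
    return {"base": base_cpi, "pipeline": pipeline_stalls,
            "l1": l1_stalls, "l2": l2_stalls}


# Hypothetical inputs, chosen only to exercise the arithmetic.
parts = cpi_breakdown(measured_cpi=2.0, base_cpi=0.5,
                      l1_misses_per_instr=0.02, l1_penalty=10,
                      l2_misses_per_instr=0.004, l2_penalty=50)
print(parts)
```

Because the pipeline component is obtained by subtraction, any error in the miss rates or penalties lands in that term.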
The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.
Intel Core i7 Pipeline
Intel Core i7
• Out-of-order speculative µ-architecture.
• IF – multilevel branch prediction. RA stack. Misprediction penalty 15 cycles.
• Fetches 16 bytes from I-cache.
• 16B in predecode inst buffer. Macro-op fusion fuses insts such as a compare followed by a branch into a single op.
Intel Core i7
• Predecode breaks 16B chunk into x86 insts.
• µ-op decode – 3 simple decoders, one complex decoder reducing x86 inst into up to 4 µ-ops.
• Loop-stream detection (micro-fusion) for loops of less than 28 inst or 256B.
• Inst issue.
Intel Core i7
• i7 uses 36-entry centralized RS shared by 6 FUs. Up to 6 µ-ops can be selected per cycle.
• Results sent to RS and reg retirement unit.
The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away.
The CPI for the 19 SPECCPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of 0.25.
Workstation Microprocessors 3/2001
Source: Microprocessor Report, www.MPRonline.com
• Max issue: 4 instructions (many CPUs)
• Max rename registers: 128 (Pentium 4)
• Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
• Max Window Size (OOO): 126 instructions (Pent. 4)
• Max Pipeline: 22/24 stages (Pentium 4)
SPEC 2000 Performance 3/2001
Source: Microprocessor Report, www.MPRonline.com
[Figure: SPEC 2000 performance chart; annotated ratios 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]
Benchmarks: Pentium 4 v. PIII v. Athlon
• SPECbase2000
  - Int: P4@1.5 GHz: 524, PIII@1GHz: 454, AMD Athlon@1.2GHz: ?
  - FP: P4@1.5 GHz: 549, PIII@1GHz: 329, AMD Athlon@1.2GHz: 304
• WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
  - P4: 164, PIII: 167, AMD Athlon: 180
• Quake 3 Arena: P4 172, Athlon 151
• SYSmark 2000 composite: P4 209, Athlon 221
• Office productivity: P4 197, Athlon 209
• S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
Summary of Dynamic Scheduling
Pipeline stages
• Renaming (in-order)
• Schedule
• Commit (in-order)
Two organizations
• Mapping table + phy reg + issue queue + ROB; REN => SCHD => REG
• Reg alias table + RS + ROB, reg in RS and ROB; REN => REG => SCHD
Scheduling methods
• Tag broadcasting vs. scoreboarding (later)
CDC6600: introduces scoreboarding
Tomasulo: introduces renaming and tag broadcasting
Reorder buffer: provides in-order commit
Real OOO processors are very complicated (like a vehicle) and bring implementation variants, but all are rooted in those basic designs