evolution of microprocessors · this is all we need § can build any logic function (shannon) not...
TRANSCRIPT
Evolution of Microprocessors
Olivier Temam INRIA Saclay
What Is The Goal of A Microprocessor ?
§ Executing any algorithm: • Computations • Evaluating logic
expressions • Storing data
A=0 Label:
A=A+1 if (A<5) goto Label
This is All We Need
§ Can build any logic function (Shannon)
NOT
AND
OR
A Z = A’
A B
Z = A . B
A B
Z = A + B
A B Z 0 0 0 0 1 1 1 0 1 1 1 1
A B Z 0 0 0 0 1 0 1 0 0 1 1 1
A Z 0 1 1 0
0: false 1: true
Logic & Computational Circuits
A B C
Z
Z = C’.A.B’ + C’.A.B + C.A’.B + C.A.B
MU
X A
B
Z
c
R
S
r x y
R = x.y + r.(x+y)
S = r’(x’.y + x.y’) + r.(x.y + x’.y’)
+
y x
r
S
R
The Heart of the Processor: ALU
§ ALU: Arithmetic and Logic Unit § Embryo of instruction set:
• ADD, AND, NOT
MU
X
ALU
c1 c0
y x
r
+
R
c1 c0 ALU 0 0 ADD 0 1 AND
1 0 NOT
1 1 d
n
n X
Y
n Z
c1 c0
Connecting Several Operators
§ Barriers (Registers): separation and memorization
+ + + +
0 0 0 0
0 0 1 1 1 1 0 0
1 0 0 1
1 0 0 1 clock
T=cycle
Clock signal
0/1
1 0 0 1
+ +
0 0 1 1
0 1 1 1 0 0 1 1
+ 0 0 1 1 + 0 0 1 1 + 0 0 0 0
0 0 1 0 1 0 1
Complex Circuits
§ Multiplier as a function
+ + + +
+ + + +
+ + + +
+ + + +
0 0 1 1
0 1 1 1 0 0 1 1
+ 0 0 1 1 0 0 1 0 0 1
+ 0 0 1 1 0 0
0 1 0 1 0 1
+ 0 0 0 0 0 0 0
0 0 1 0 1 0 1
Complex Circuits
§ Multiplier as a sequence of operations
Intermediate result
Operand 2
Result
+
Operand 1
Operand 2
Result
+ Reset Wresultat
WOperand Shift Operand 1
Controlling Complex Circuits
Op1→Operand 1 Op2→Operand 2
0→Result
End Yes
WOperand
Reset
Op1+Result→Result
1
WResult
0
End ?
NOR (Op2) End ?
Shift Operands
No
Shift
Operand2i Operand2i
0
Converting an Algorithm into a Circuit
Step 1 Signal 1
Step 2 Signal 2
Flip-Flop D
Signal 1
Signal 2 →
→
x → x 1
Multiplier Control Circuit
Op1→Operand 1 Op2→Operand 2
0→Result
End Yes
WOperand
Reset
Op1+Result→Result
1
WResult
0
Shift Operands
No
Shift
End ?
Operand2i Operand2i
Flip-Flop D
Flip-Flop D
End ?
WOperand, Reset
End
WResult
Shift
Flip-Flop D
Generalization ?
Flip-Flop D
Flip-Flop D
Flip-Flop D
Operand 2
Result
+
Operand 1
Operand2i
End ?
WOperand, Reset
End
WResult
Shift
Reset Wresultat
WOperand Shift
NOR (Op2) End ?
Operand2i
Init
If (Op2i==1) goto Fin?
Result += Op1
If (End) goto …
Shift Op1, Op2
ALU
Op1→Operand 1 Op2→Operand 2
0→Result
Yes
Op1+Result→Result
1
0
Shift Operands
No
End ?
Operand2i
End
WOperand
Reset
WResult
Shift
Generalization ?
Init
If (Op2i==1) goto Fin?
Result += Op1
If (End) goto …
Shift Op1, Op2
Operand 2
Result
Operand 1
ALU
Mem
ory
Inst @
+ 1
IAS - 1946
Control
Mem
ory
+
IAS - 1946
AC MQ
DR Mem
ory
AR
Control
PC IR
+
Opcode Operand
Instruction
IAS - 1946
30 MQ
DR Mem
ory
AR
Control
PC IR
8
40 40
12
12
40 bits
+ 1
Data Reg.
Address Reg.
Instruction Reg. AC←AC+M(3) 35
AC←AC+M(3)
Opcode
35
35
28 3
3
28
28
58
AC←AC+M(3)
AR ← PC
DR ← M(AR)
IR ← DR(0:7) AR ← DR(8:19)
PC ← PC + 1
AC ← AC+M(X)
AC ← AC + DR
DR ← M(AR)
36
17
Processors (Intel 4004, 1971)
Intel 4004 - 1971
740kHz, 2250 transistors
Intel Ivy Bridge - 2013
3.5GHz, 1.4 Billion transistors
IAS – Instruction Set
§ Very few instructions
§ AC ← M(X) § M(X) ← AC § goto M(X) § if AC ≥ 0 goto M(X) § AC ← AC +/- M(X) § AC.MQ ← MQ ×/÷ M(X) § AC ← AC <</>> 1 § M(X,8:19) ← AC(0:11)
IBM S/360 – 1964 § Huge number of instructions
• type of data • methods for accessing memory • variable-size instructions • diversity of operations • compound instructions • …
Opcode
0 8 15 R1 R2
Opcode R1 X2
0 8 12 B2
16 S2
20 31
Opcode R1 R3
0 8 12 B2
16 S2
20 31
Opcode I2
0 8 B1
16 S1
20 31
Opcode L
0 8 B1
16 S1
20 31 B1 S1
36 47 32
B=Base S=Shift I=Immediate L=Byte size R=Register X=Index register Register
Bank
Control
IR
Register Bank
AR PC DR
MC
FPU ALU BCD
Memory
PSW
AC MQ
DR Mem
ory
AR Control
PC IR
+
§ CISC ➡ RISC IAS
IBM S/360
It’s All About the Technology
§ More transistors § Faster switching (faster clock)
Keyn
ote
Kath
y Ye
lick,
ISCA 2
008
Gate Length
Canal Width
Insulator Thickness
CISC vs. RISC
§ CISC: Complex Instruction Set Computer § RISC: Reduced Instruction Set Computer
Keyn
ote
Kath
y Ye
lick,
ISCA 2
008
RISC perf.
CISC perf.
Long Clock Cycle ≃ One Instruction
§ Fetch instruction § Decode instruction § Compute § Write result
But Speed = Short Cycles
§ Fetch instruction § Decode instruction § Compute § Write result
Fast Clock = Pipelining
§ Fetch instruction § Decode instruction § Compute § Write result
Instruction 1 Instruction 2 Instruction 3 Instruction 4
Inst 1 Inst 1 Inst 1 Inst 1 Inst 2 Inst 2 Inst 2 Inst 3 Inst 3 Inst 4
Architects More (Too ?) Ambitious
§ Superscalar processors: Several instructions in parallel
Inst 1 Inst 2 Inst 3 Inst 4
Inst 1 Inst 2 Inst 3 Inst 4
Inst 5 Inst 6 Inst 7 Inst 8
Processor Too Fast vs. Memory
§ Increasing Processor/Memory gap
1
10
100
1000
Memory (DRAM)
Processor
Gap ≈ 50% / year
Per
form
ance
Moore’s law
Years
Bridge Gap: Cache Memory
§ Instructions and data reused many times
Processor
Memory
Cache Smaller Faster
Performance (In Theory...)
§ But, cycle: 1.3x10-6 sec. ➡ 2.8x10-10 sec.
54.8! 62.6!
17.9!
75.2!
118.7!
150.1!
0!
50!
100!
150!
200!
!gcc!
�espresso! li! fpppp! doducd! tomcatv!
Inst
ruct
ions
per
cycl
e!Performance (…In Practice)
2.1 ! 1.9 ! 2.3 ! 2.6 ! 2.2 ! 2.6 !0.0 !
50.0 !
gcc! expresso! li! fpppp! doducd! tomcatv!
Inst
ruct
ions
per
cycl
e!
31
1st Technology Shock: Clock Scaling
§ Nb transistors keeps increasing Key
note
Kat
hy Y
elic
k, I
SCA 2
008
show low cost of silicon
Multi-Cores GPUs
20%!!
40%!!
2nd Shock: Dark Silicon
§ Cannot use all transistors § Multi-Cores ?
Emre
Oze
r, AR
M, C
onsu
ltatio
n m
eetin
g,
Brus
sels
, Nov
. 200
9
100%!
25%!10%!
#Transistors scaling: x4 x16 Power scaling:
@45nm freq @peak freq
2/3 1/3 1 2/3
Technology node 45nm 22nm 11nm
Exploitable silicon
Specialization: Towards Heterogeneous Multi-Cores
§ Huge energy ratio proc./specialized circ. § Compatible with Dark Silicon
Und
erst
andi
ng S
ourc
es o
f Ine
ffici
ency
in
Gen
eral
-Pur
pose
Chi
ps, R
ehan
Ham
eed
et a
l., IS
CA
2010
Accelerators
34
Which Accelerators ?
§ Flexibility/Efficiency tradeoff ?
FPGAs CGRAs
show different accelerators: tuned cores, GPUs, FPGAs, multi-purpose accelerators
GPUs More efficient
More specialized
Heterogeneous multi-cores
Thank You !