evolution of microprocessors · this is all we need § can build any logic function (shannon) not...

Evolution of Microprocessors

Olivier Temam INRIA Saclay

What Is The Goal of A Microprocessor ?

§  Executing any algorithm: •  Computations •  Evaluating logic

expressions •  Storing data

A=0 Label:

A=A+1 if (A<5) goto Label

This is All We Need

§  Can build any logic function (Shannon)

NOT

AND

OR

A Z = A’

A B

Z = A . B

A B

Z = A + B

A B Z 0 0 0 0 1 1 1 0 1 1 1 1

A B Z 0 0 0 0 1 0 1 0 0 1 1 1

A Z 0 1 1 0

0: false 1: true

Logic & Computational Circuits

A B C

Z

Z = C’.A.B’ + C’.A.B + C.A’.B + C.A.B

MU

X A

B

Z

c

R

S

r x y

R = x.y + r.(x+y)

S = r’(x’.y + x.y’) + r.(x.y + x’.y’)

+

y x

r

S

R

The Heart of the Processor: ALU

§  ALU: Arithmetic and Logic Unit §  Embryo of instruction set:

•  ADD, AND, NOT

MU

X

ALU

c1 c0

y x

r

+

R

c1 c0 ALU 0 0 ADD 0 1 AND

1 0 NOT

1 1 d

n

n X

Y

n Z

c1 c0

Connecting Several Operators

§  Barriers (Registers): separation and memorization

+ + + +

0 0 0 0

0 0 1 1 1 1 0 0

1 0 0 1

1 0 0 1 clock

T=cycle

Clock signal

0/1

1 0 0 1

+ +

0 0 1 1

0 1 1 1 0 0 1 1

+ 0 0 1 1 + 0 0 1 1 + 0 0 0 0

0 0 1 0 1 0 1

Complex Circuits

§  Multiplier as a function

+ + + +

+ + + +

+ + + +

+ + + +

0 0 1 1

0 1 1 1 0 0 1 1

+ 0 0 1 1 0 0 1 0 0 1

+ 0 0 1 1 0 0

0 1 0 1 0 1

+ 0 0 0 0 0 0 0

0 0 1 0 1 0 1

Complex Circuits

§  Multiplier as a sequence of operations

Intermediate result

Operand 2

Result

+

Operand 1

Operand 2

Result

+ Reset Wresultat

WOperand Shift Operand 1

Controlling Complex Circuits

Op1→Operand 1 Op2→Operand 2

0→Result

End Yes

WOperand

Reset

Op1+Result→Result

1

WResult

0

End ?

NOR (Op2) End ?

Shift Operands

No

Shift

Operand2i Operand2i

0

Converting an Algorithm into a Circuit

Step 1 Signal 1

Step 2 Signal 2

Flip-Flop D

Signal 1

Signal 2 →

→

x → x 1

Multiplier Control Circuit


0→Result

End Yes

WOperand

Reset

Op1+Result→Result

1

WResult

0

Shift Operands

No

Shift

End ?

Operand2i Operand2i

Flip-Flop D

Flip-Flop D

End ?

WOperand, Reset

End

WResult

Shift

Flip-Flop D

Generalization ?

Flip-Flop D

Flip-Flop D

Flip-Flop D

Operand 2

Result

+

Operand 1

Operand2i

End ?

WOperand, Reset

End

WResult

Shift

Reset Wresultat

WOperand Shift

NOR (Op2) End ?

Operand2i

Init

If (Op2i==1) goto Fin?

Result += Op1

If (End) goto …

Shift Op1, Op2

ALU


0→Result

Yes

Op1+Result→Result

1

0

Shift Operands

No

End ?

Operand2i

End

WOperand

Reset

WResult

Shift

Generalization ?

Init

If (Op2i==1) goto Fin?

Result += Op1

If (End) goto …

Shift Op1, Op2

Operand 2

Result

Operand 1

ALU

Mem

ory

Inst @

+ 1

IAS - 1946

Control

Mem

ory

+

IAS - 1946

AC MQ

DR Mem

ory

AR

Control

PC IR

+

Opcode Operand

Instruction

IAS - 1946

30 MQ

DR Mem

ory

AR

Control

PC IR

8

40 40

12

12

40 bits

+ 1

Data Reg.

Address Reg.

Instruction Reg. AC←AC+M(3) 35

AC←AC+M(3)

Opcode

35

35

28 3

3

28

28

58

AC←AC+M(3)

AR ← PC

DR ← M(AR)

IR ← DR(0:7) AR ← DR(8:19)

PC ← PC + 1

AC ← AC+M(X)

AC ← AC + DR

DR ← M(AR)

36

17

Processors (Intel 4004, 1971)

Intel 4004 - 1971

740kHz, 2250 transistors

Intel Ivy Bridge - 2013

3.5GHz, 1.4 Billion transistors

IAS – Instruction Set

§  Very few instructions

§  AC ← M(X) §  M(X) ← AC §  goto M(X) §  if AC ≥ 0 goto M(X) §  AC ← AC +/- M(X) §  AC.MQ ← MQ ×/÷ M(X) §  AC ← AC <</>> 1 §  M(X,8:19) ← AC(0:11)

IBM S/360 – 1964 §  Huge number of instructions

•  type of data •  methods for accessing memory •  variable-size instructions •  diversity of operations •  compound instructions •  …

Opcode

0 8 15 R1 R2

Opcode R1 X2

0 8 12 B2

16 S2

20 31

Opcode R1 R3

0 8 12 B2

16 S2

20 31

Opcode I2

0 8 B1

16 S1

20 31

Opcode L

0 8 B1

16 S1

20 31 B1 S1

36 47 32

B=Base S=Shift I=Immediate L=Byte size R=Register X=Index register Register

Bank

Control

IR

Register Bank

AR PC DR

MC

FPU ALU BCD

Memory

PSW

AC MQ

DR Mem

ory

AR Control

PC IR

+

§  CISC ➡ RISC IAS

IBM S/360

It’s All About the Technology

§  More transistors §  Faster switching (faster clock)

Keyn

ote

Kath

y Ye

lick,

ISCA 2

008

Gate Length

Canal Width

Insulator Thickness

CISC vs. RISC

§  CISC: Complex Instruction Set Computer §  RISC: Reduced Instruction Set Computer

Keyn

ote

Kath

y Ye

lick,

ISCA 2

008

RISC perf.

CISC perf.

Long Clock Cycle ≃ One Instruction

§  Fetch instruction §  Decode instruction §  Compute §  Write result

But Speed = Short Cycles


Fast Clock = Pipelining


Instruction 1 Instruction 2 Instruction 3 Instruction 4

Inst 1 Inst 1 Inst 1 Inst 1 Inst 2 Inst 2 Inst 2 Inst 3 Inst 3 Inst 4

Architects More (Too ?) Ambitious

§  Superscalar processors: Several instructions in parallel

Inst 1 Inst 2 Inst 3 Inst 4



Processor Too Fast vs. Memory

§  Increasing Processor/Memory gap

1

10

100

1000

Memory (DRAM)

Processor

Gap ≈ 50% / year

Per

form

ance

Moore’s law

Years

Bridge Gap: Cache Memory

§  Instructions and data reused many times

Processor

Memory

Cache Smaller Faster

Performance (In Theory...)

§  But, cycle: 1.3x10-6 sec. ➡ 2.8x10-10 sec.

54.8! 62.6!

17.9!

75.2!

118.7!

150.1!

0!

50!

100!

150!

200!

!gcc!

�espresso! li! fpppp! doducd! tomcatv!

Inst

ruct

ions

per

cycl

e!Performance (…In Practice)

2.1 ! 1.9 ! 2.3 ! 2.6 ! 2.2 ! 2.6 !0.0 !

50.0 !

gcc! expresso! li! fpppp! doducd! tomcatv!

Inst

ruct

ions

per

cycl

e!

31

1st Technology Shock: Clock Scaling

§  Nb transistors keeps increasing Key

note

Kat

hy Y

elic

k, I

SCA 2

008

show low cost of silicon

Multi-Cores GPUs

20%!!

40%!!

2nd Shock: Dark Silicon

§  Cannot use all transistors §  Multi-Cores ?

Emre

Oze

r, AR

M, C

onsu

ltatio

n m

eetin

g,

Brus

sels

, Nov

. 200

9

100%!

25%!10%!

#Transistors scaling: x4 x16 Power scaling:

@45nm freq @peak freq

2/3 1/3 1 2/3

Technology node 45nm 22nm 11nm

Exploitable silicon

Specialization: Towards Heterogeneous Multi-Cores

§  Huge energy ratio proc./specialized circ. §  Compatible with Dark Silicon

Und

erst

andi

ng S

ourc

es o

f Ine

ffici

ency

in

Gen

eral

-Pur

pose

Chi

ps, R

ehan

Ham

eed

et a

l., IS

CA

2010

Accelerators

34

Which Accelerators ?

§  Flexibility/Efficiency tradeoff ?

FPGAs CGRAs

show different accelerators: tuned cores, GPUs, FPGAs, multi-purpose accelerators

GPUs More efficient

More specialized

Heterogeneous multi-cores

Thank You !

evolution of microprocessors · this is all we need § can build any logic function (shannon) not...

Documents