dezső sima fall 2007 (ver. 2.2) dezső sima, 2007 multicore processors (1)

Dezső Sima

Fall 2007

(Ver. 2.2) Dezső Sima, 2007

Multicore Processors (1)

Foreword

In a relative short time interval of about two years multicore processors became the standard. As multiple cores provide a substantially higher processing performance than their predecessors, their memory and I/O subsystems need also be largely enhanced to avoid processing bottlenecks. Consequently, while discussing multicore processors the focus of interest shifts from their microarchitecture to their macroarchitecture incorporating their cache hierarchy, the way how memory and I/O controllers are attached and how on-chip interconnects needed are implemented.Accordingly, the first nine Sections are devoted to the design space of the macroachitecture of multicore processors. Particular Sections deal with the main dimensions that span the design space and identify how occuring concepts became implemented in major multicore lines.Section 10 is a huge repository, providing relevant materials and facts published in connection with major multicore lines. In a concrete course appropriate parts of the repository can be selected according to the scope and attendance of the course. Each part of the repository concludes with a list of available literature to allow a more detailed study of a multicore line or processor of interest.

2 Overview of MCPs•

5 Layout of L2 caches•

9 Basic alternatives of the macro-architecture of MCPs•

7 Layout of the memory and I/O architecture•

Overview

6 Layout of L3 caches•

1 Introduction•

3 The design space of the macroarchitecture of MCPs•

8 Implementation of the on-chip interconnections•

4 Layout of the cores•

10 Multicore repository•

1. Introduction

Figure 1.1: Single-stream performance vs. cost

Source: Marr T.T. et al. „Hyper-Threading Technology Architecture and MicroarchitectureIntel Technology Journal, Vol. 06, Issue 01, Febr 14, 2002, pp. 4-16

1. Introduction (1)

1. Introduction (2)

Figure 1.2: Historical evolution of transistor counts, clock speed, power and ILP

For single core processors:

Increasing number of transistors diminishing return in performance

Moore’s law:

Transistor count: doubles approximately every 24 month

Since ~ 2003/2004:

Use available surplus transistors for multiple cores

1. Introduction (3)

The era of single core processors the era of multi core processors

1. Introduction (4)

The speed of change:

illustrated by Intel’s roadmap revealed in 2005:

1. Introduction (5)

1. Introduction (6)

Figure 1.3: Intel’s roadmap revealed in 2005

1. Introduction (7)

Expected increase of the core count:

As far as Moore’s law remains valid,

transistor count is doubling approximately every 24 month,

accordingly

core count is doubling approximately every 24 month as well.

2. Overview of MCPs

Figure 2.1: Intel’s dual-core mobile processor lines

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

Core Duo

1/06

(Yonah Duo)

151 mtrs./31 W65 nm/90 mm2

Core 2 Duo

7/06

281 mtrs./34 W

(Merom)T5000/T7000

65 nm/143 mm2 T2000

2. Overview of MCPs (1)

Figure 2.2: Intel’s multi-core processor lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

230 mtrs./95 W

Pentium D 800(Smithfield)

5/05

90 nm/2*103 mm2

Core 2 Duo E6000/X6800(Conroe)

291 mtrs./65/75 W

7/06

65 nm/143 mm2

Pentium D 900(Presler)

2*188 mtrs./130 W65 nm/2*81 mm2

1/06

Core 2 Extrem QX6700(Kentsfield)

2*291 mtrs./130 W

11/06

65 nm/2*143 mm2

Pentium EE 955/965(Presler)

2*188 mtrs./130 W

1/06

65 nm/2*81 mm2

2-way MT/core

Pentium EE 840(Smithfield)

230 mtrs./130 W

5/05

90 nm/2*103 mm2

2-way MT/core


Figure 2.3: Intel’s dual-core Xeon UP-line

8CST

QCMT

QCST

DCMT

DCST

20061Q 2Q

20071Q 2Q 3Q 4Q3Q 4Q

Xeon 3000

9/06

291 mtrs./65 W 65 nm/143 mm2

(Conroe)


Figure 2.4.: Intel’s dual-core Xeon DP-lines

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

Xeon DP 2.8

10/05

2*169 mtrs./135 W 90 nm/2*135 mm2

(Paxville DP)

Xeon 5300

11/06

2*291 mtrs./80/120 W 65 nm/2*143 mm2

(Clovertown)

Xeon 5100

6/06

291 mtrs./65/80 W 65 nm/143 mm2

(Woodcrest)

2-way MT/core2*188 mtrs./95/130 W

Xeon 5000

6/06

65 nm/2*81 mm2

(Dempsey)

2-way MT/core


Figure 2.5.: Intel’s dual-core Xeon MP-lines

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

Xeon 7000

11/05

2*169 mtrs./95/150 W 90 nm/2*135 mm2

(Paxville MP)

2-way MT/core1328 mtrs./95/150 W

Xeon 7100

8/06

65 nm/435 mm2

(Tulsa)

2-way MT/core


Figure 2.6.: Intel’s dual-core EPIC-based server line

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q

20061Q 2Q 3Q 4Q3Q 4Q

2-way MT/core

9x00(Montecito)

1720 mtrs./104 W

7/06

90 nm/596 mm2


Figure 2.7.: AMD’s multi-core desktop processor lines

8CST

Athlon64 QC(Barcelona)

9/2007

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

3800/4200/4600(Manchester S. E2)

154 mtrs./89/110 W

5/05

90 nm/2*147 mm2

3800-4800(Toledo S. E6)

233 mtrs./89/110 W

5/05

90 nm/199 mm2

4000-5000(Brisbane S. G1)

154 mtrs./65 W

12/06

65 nm/126 mm2

3600-5600(Windsor S. F2/F3)

154/227 mtrs./65/89 W

5/06

90 nm/183/230 mm2

20071Q 2Q 3Q 4Q

65 nm/291 mm2


Figure 2.8.: AMD’s dual-core high-end desktop/entry level server Athlon 64 FX line

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

Athlon 64 FX(Toledo/Windsor)

233/227 mtrs./110/125 W

1/06

90 nm/199/230 mm2


Figure 2.9.: AMD’s dual-core Opteron UP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

233 mtrs./110 W

Opteron 100

8/05

90 nm/199 mm2

(Denmark)Opteron 1000

8/06

227 mtrs./103/125 W 90 nm/230 mm2

(Santa Ana)


Figure 2.10.: AMD’s dual-core Opteron DP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

227 mtrs./95/110 W

Opteron 2000

8/06

90 nm/230 mm2

(Santa Rosa)Opteron 200

3/05

233 mtrs./95 W 90 nm/199 mm2

(Italy)


Figure 2.11.: AMD’s dual-core Opteron MP-lines

8CST

QCMT

QCST

DCMT

DCST

2005 20061Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q

243 mtrs./95/119 W

Opteron 8000

8/06

90 nm/220 mm2

(Santa Rosa)Opteron 800

4/05

233 mtrs./95 W 90 nm/199 mm2

(Egypt)


Figure 2.12.: IBM’s multi-core server lines

POWER4 POWER4+

POWER5 POWER5+

Cell BE

POWER6

8CST

174 mtrs./115/125 W

10/01 11/02

QCMT

QCST

DCMT

DCST

2001 20023Q 4Q 3Q 4Q

~~ ~~

5/04

20041Q 2Q

~~

10/05

276 mtrs./70 W

3Q 4Q

2006

234 mtrs./95 W

20061Q 2Q 3Q 4Q

5/2007

750 mtrs./~100W

20071Q 2Q

180 nm/412 mm2

90 nm/230 mm2

90 nm/221 mm2

65 nm/341 mm2

276 mtrs./80W (est.)130 nm/389 mm2

184 mtrs./70 W130 nm/380 mm2

2-way MT/core 2-way MT/core

(PPE:2-way MT)

2-way MT/core

(SSEs: no MT)


Figure 2.13.: Sun’s and Fujitsu’s multi-core server lines

8CST

QCMT

QCST

DCMT

DCST

20041Q 2Q

20051Q 2Q 3Q 4Q

UltraSPARC IV+

9/05

295 mtrs./90 W

(Panther)

20071Q 2Q 3Q 4Q

20081Q 2Q

66 mtrs./108 W

UltraSPARC IV

7/04

(Jaguar)

~~3Q

8CMTUltraSPARC T1

11/2005

279 mtrs./63 W

(Niagara)UltraSPARC T2

2007

(Niagara II)

APL SPARC64 VI

2007

540 mtrs./120 W

(Olympus)

APL SPARC64 VII

2008

(Jupiter)

130 nm/352 mm2 90 nm/335 mm2

90 nm/421 mm2

65 nm/464 mm2

65 nm/342 mm290 nm/379 mm2

80 mtrs./32 W

Gemini

4/04

130 nm/206 mm2

3Q

4-way MT/core 8-way MT/core

2-way MT/core

4Q

~120 W2-way MT/core

72 W (est.)


Figure 2.14.: HP’s PA-8x00 server line

8CST

QCMT

QCST

DCMT

DCST

20041Q 2Q

20051Q 2Q 3Q 4Q

PA 8800

2/2004

300 mtrs./55 W

(Mako)PA 8900

5/05

317 mtrs.

(Shortfin)

3Q 4Q

130 nm/366 mm2 130 nm/366 mm2


Figure 2.15.: RMI’s multi-core XLR-line (scalar RISC)

8CST

QCMT

QCST

DCMT

DCST

20051Q 2Q 3Q 4Q

20061Q 2Q 3Q 4Q

8CMT

XLR 5xx

5/05

333 mtrs./10-50 W90 nm/~220 mm2

4-way MT/core


3. Design space of the macro architecture of MCPs

Building blocks of multicore processors (MC)

Cores• L2 cache(s) (L2)•

FSB controller (FSB c.)

•

•

L3 cache(s), if available (L3)•

Memory controller (M. c.)•

Interconnection network (IN)•

L2 I L2 D

L3

L2 I L2 D

L3

IN

FSB

FSB c.

C C

or

3. Design space of the macro architecture of MCPs (1)

Bus controller (B. c.)

Layout of L2 caches

Layout of the cores

Layout of the I/O andmemory architecture

Macro architecture of multi-core (MC) processors

Layout of L3 caches (if available)

3. Design space of the macro architecture of MCPs (2)

Layout of the interconnections

4. Layout of the cores

Layout of L2 caches

Layout of the cores



5. Layout of the cores (1)

Layout of the on-chip interconnections


Physical implementationBasic layout Advanced features

Layout of the cores



HMC(Heterogeneous MC)

Basic layout

SMC(Symmetrical MC)

Cell BE (1 PPE+8 SPE)All other MCs


Virus protection

Advanced features of the cores

Remote management

Power/clock speed management

Data protection

Support ofvirtualisation

Details on the next slide!


12

Pentium M-based line: Core Duo T2000Prescott based desktop lines:Pentium D 8xx, EE 840

3

Xeon lines: Paxwille DP, MP

Cedar Mill based desktop lines: Pentium D 9xx, EE 955,965

4 Core basd mobile line: T5xxx/T7xxx (Merom)desktop line: Core 2 Duo E6xx, X6xxx, QX67xx

Xeon lines: 5000 (Dempsey), 7100 (Tulsa)

Xeon lines: 3000, 5100 (Woodcrest), 5300 (Clovertown)

Intel

Netburst

Prescott-based2

Cedar Mill-based3

Core-based4

Itanium 2

AMD

Athlon 64 XR

Athlon 64 FX

Opteron

x00 lines

x000 lines

Pentium M-based1

ED-bitP Vanderpool (partly)

La Grande AMT2

Nx-bit Cool 'n' Quiet Pacifica (partly)

Presidio

ED-bit EIST 5/02

ED-bit

ED-bit

EIST (partly)

EIST (partly)

EIST (partly) Vanderpool (partly)

DBS Silvervale

Nx-bit

Nx-bit

Nx-bit

Cool 'n' Quiet

PowerNow!

PowerNow!

Pacifica

AMD - V

AMD - V

Power/clock speed management

Support of virtualisation

Data protection

Remote management

Virus protection

Table 4.1 : Comparison of advanced features provided by Intel’s and AMD’s MC lines. For a brief introduction of the related Intel techniques see the appropriate Intel processor descriptions

EIST: Enhanced Intel SpeedStep Technology

First implemented in Intel’s mobile and server platforms,It allows the system to dynamically adjust processor voltage and core frequency,to decrease average power consumption and thus average heat production.This technology is similar to AMD’s Cool’n’Quiet and PowerNow! techniques. For more information see http://en.wikipedia.org/wiki/SpeedStep

The ED bit with OS support is now a widely used technique to prevent certain classes of malicious buffer overflow attacks. It is used e.g. in Intel’s Itanium, Pentium M, Pentium 4 Prescottand subsequent processors as well as in AMD’s x86-64 line (designated as the NX bit (No Execute bit). In buffer overflow attacks a malicious worm inserts its code into another program’s data spaceand creates a flood of code that overwhelms the processor,allowing the worm to propagate itself to the network, and to other computers.

The Execute Disable bit allows the processor to classify areas in memory by where application code can be executed and where it cannot. Actually, the ED bit is implementedas the 63. bit of the Page Table Entry. If its value is 1, code cannot be executed from that page.When a malicious worm attempts to insert code in the buffer,the processor disables code execution, preventing damage and worm propagation. To became active the ED feature need to be supported by the OS.For more information see: http://en.wikipedia.org/wiki/XD_Bit

ED: Execute Disable bit (often designated as the NX bit (No Execute bit))

Brief description of the advanced features pointed out while using Intel’s notations (1)


Virtualisation techniques allow a platform to run multiple operating systems and applications in independent partitions. The hardver support improves the performance and robustness of traditional software-based virtualization solutions. Intel introduced hardware virtualisation support first in its Itanium line (called the Silvervale tecnique)followed by its Presler based and subsequent processors (designated as the Vanderpool technique).AMD uses this technique in its recent lines - dubbed as Pacifica - since Mai 2006. For more informationsee http://en.wikipedia.org/wiki/Virtualization_Technology

VT: Virtualization Technology

It is a Trusted Execution Technology to protect users from software-based attacks aimed at stealing vitaldata, initiated by Intel. It includes a set of hardware enhancements intended to provide multiple separatedexecution environments.For more information see Intel® Trusted Execution TechnologyPreliminary Architecture Specification at http://www.intel.com/technology/security/ orhttp://en.wikipedia.org/wiki/LaGrande_Technology

Intel’s Active Management Technology is a feature of its vPro technology. It allows to managethe computer system remotely, even if the computer is switched off or the hard drive has failed. By means of this technology system administrators can e.g. remotely download software updates or fix and hail system problems. It can also be utilized to protect the network by proactively blocking incoming threats. For more information see Intel Active Management Technology at http://www3.intel.com/cd/network/communications/emea/eng/203372.htm

La Grande Technology

AMT2

4. Layout of the cores (7)Brief description of the advanced features pointed out while using Intel’s notations

(2)

http://www.intel.com/technology/security/downloads/315168.htm

http://www.intel.com/technology/security/downloads/315168.htm

http://www.intel.com/technology/security/or




Multi-chip implementation

Physical implementation

Monolithic implementation

Paxville MP

Pentium D 9xx

Paxville DP

Kentsfield

Yonah Duo

Pentium D 8xx

Tulsa

Montecito

AMD, IBM, Sun

Dempsey

Clovertown

Fujitsu, HP andRMI

Woodcrest

Merom Conroe

Pentium EE 840

All DCs of

Pentium EE 955/965


5. Layout of L2 caches

Layout of L2 caches

Layout of the cores



5. Layout of L2 caches (1)

Layout of the on-chip interconnections


Inclusion policy

Allocation to the cores

Banking policy

Layout of L2 caches

Use by instructions/data

Integration of L2 caches to the proc. chip


Shared L2 cache for all cores

Allocation of L2 caches to the cores

Private L2 cache for each core

Athlon 64 X2 (2005)

IN (Xbar)

Memory

System Request Queue

B. c. M. c.

HT-bus

L2 L2

Core Core

Core Duo (2006)Core 2 Duo (2006)

L2 contr.

Core

L2

Core

FSB c.

FSB

Examples:


Shared L2 cache for all cores

Allocation of L2 caches to the cores

Private L2 cache for each core

POWER4 (2001)

Montecito (2006)

UltraSPARC IV (2004)

Athlon 64 X2 and AMD's Opteron lines

(2005/2006)

POWER5 (2005)

Core Duo (Yonah),Core 2 Duo (Core)based lines (2006)

UltraSPARC T1 (2005)UltraSPARC T2 (2005)

POWER6 (2007)

PA 8800 (2004)PA 8900 (2005)

Smithfield, Presler,Irwindale and Cedar Millbased lines (2005/2006)

SPARC64 VI (2007)SPARC64 VII (2008)

Cell BE (2006)

XLR (2005)

UltraSPARC IV+ (2005)


IBM: The effect of changing from shared L2 to private L2 • inprove latency

• eliminating any impact for multicore competition for capacity.

Source: IBM JRD Nov 2007

Inclusion policy


Banking policy

Attaching L2 caches to MCPs




Lines replaced (victimized) in the L1 are written to L2.References to data missing in L1 but available in L2 initiate reloading the pertaining cache line to L1.Reloaded data is deleted from L2L2 operates usually as a write back cache (only modified data that is replaced in the L2 is written back to the memory),Unmodified data that is replaced in L2 is deleted.

L1

M.c.

Replaced (victimised)lines

Modified data(low traffic)

Lines missing in L1 are reloaded and deleted from L2

L2

Memory

Data missing in L1/L2

(high traffic)

L1

L2

Memory

Exclusive L2

Inclusion policy of L2 caches

Inclusive L2

(Victim cache)


Figure 5.1: Implementation of exclusive L2 caches

Source: Zheng, Y., Davis, B.T., Jordan, M.: “ Performance evaluation of exclusive cache hierarchies”, 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS),

2004, pp. 89-96.


Exclusive L2


Inclusive L2

Most implementations Athlon 64X2 (2005)Dual-core Opteron lines (2005)

Expected trend


Inclusion policy


Banking policy





Unified instr./data cache(s)


Split instr./data cache(s)

Expected trend

Typical implementation Montecito (2006)


Inclusion policy


Banking policy





Single-banked implementation

Banking policy

Multi-banked implementation

Typical implementation


UltraSPARC T1 (2005) (Niagara)(8 cores/4xL2 banks)

Core

X-bar

Core

L2 L2

Memory

M. c.

Memory

M. c.

The 128-byte long L2 cache lines are hashed acrossthe 3 modules. Hashing is performed by modulo 3arithmetric applied on a large number of real address bits.

The four L2 modules are interleaved at 64-byte blocks.Mapping of addresses to the banks:

Addr.067

0 21

Modulo 3∑0

256

64128

196

Mapping of addresses to the banks:

POWER4 (2001)POWER5 (2005)

Core

X-bar

Core

L2 L2L2

Fabric Bu SContr.IN (Fabric Bus Contr.)

L3 tags/contr.B. c.

GX bus

Figure 5.2: Alternatives of mapping addresses to memory modules

Mapping of addresses to memory modules


Inclusion policy


Banking policy





Partial integration of L2

Integration to the processor chip

Full integration of L2

All other lines considered

UltraSPARC IV (2004) UltraSPARC V (2005)

Expected trend

PA 8800 (2004)PA 8900 (2005)

On-chip L2 tags/control,off-chip data Entire L2 on-chip


6. Layout of L3 caches

Layout of L2 caches

Layout of the cores






Inclusion policy

Allocation to the L2 cache(s)

Banking policy

Layout of L3 caches




Shared L3 cache for all L2s

Allocation of L3 caches to the L2 caches

Private L3 cache for each L2

Montecito (2006)


POWER5 (2005) POWER4 (2001)

Athlon 64 X2–Barcelona (2007)


Inclusion policy


Banking policy

Layout of L3 caches




Lines replaced (victimized) in the L2 are written to L3.References to data missing in L2 but available in L3 initiate reloading the pertaining cache line to L2.Reloaded data is deleted from L3L3 operates usually as a write back cache (only modified data that is replaced in the L3 is written back to the memory),Unmodified data that is replaced in L3 is deleted.

L2

M.c.



Lines missing in L2 are reloaded

and deleted from L3

L3

Memory


(high traffic)

L2

L3

Memory

Exclusive L3


Inclusive L3

(Victimcache)


Exclusive L3


Inclusive L3

Montecito (2006)


POWER4 (2001) POWER5 (2005)

Athlon 64 X2–Barcelona (2007)

Expected trend


POWER6 (2007)

Inclusion policy


Banking policy

Layout of L3 caches




Unified instr./data cache(s)


Split instr./data caches

All multicore processorsunveiled until now hold

both instructions and data


Inclusion policy


Banking policy

Layout of L3 caches




Single-banked implementation

Banking policy

Multi-banked implementation

All multicore processorsunveiled until now are multi-banked


Inclusion policy


Banking policy

Layout of L3 caches




L3 partially integrated

Integration to the processor chip

L3 fully integrated

POWER4 (2001)


POWER5 (2005)

Montecito (2006)

Expected trend

POWER6 (2007)

On-chip L3 tags/control,off-chip data Entire L3 on-chip


Examples of partially integrated L3 caches:

Core

L3 tags/contr.

L3 data

Interconn. network

M. c..

Memory

B. c.

Fire Planebus

Core

L2

Fabric Bus Contr.

L2

L2

L2

Memory

M. c.

L3 tags/contr.

L3 tags/contr.

L3 tags/contr.

L3 data

L3 data

L3 data

POWER5 (2005): UltraSPARC IV+ (2005):



Figure 6.1: Cache architecture of MC processors(Not differentiating between inclusive/exclusive alternatives in the figure, but denotingit in the examples)

L2 private

L2 partially integrated,no L3

Cache architecture

L2 shared

L2 private

L2 shared

L2 fully integrated,no L3

L2 private

L2 shared

L2 fully integrated, L3 partially integrated

L2 private

L2 shared

L2 fully integrated,L3 fully integrated

L3 private

L3 shared

L3private

L3shared

L3 private

L3shared

L3 private

L3shared

PA-8800

UltraSPARC IV

Montecito

POWER6

Core Duo(Yonah)

Core 2 Duo based proc-s

Cell BEUltraSPARC T1/T2SPARC 64 VI/VII

XLR

POWER4POWER5(excl.)

UltraSPARC IV+(excl.)

PA-8900

Smithfield/Presler based proc.s

Athlon 64X2 (excl.)Opteron dual core

(excl.) Gemini


Examples for cache architectures (1)

L2 partially integrated, privateno L3


Core

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2 data L2 data

L2 tags/contr. L2 tags/contr.

L2 partially integrated, sharedno L3

PA-8800 (2004) PA-8900 (2005)

L2 data

CoreL2 tags/contr.Core

FSB c.

FSB

UltraSPARC T1 (2005)

L2L2 M. c.

B. c.

L2

L2

L2Core 7

M. c.

M. c.

M. c.

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBus

L2 fully integrated, privateno L3

L2 fully integrated, sharedno L3

Athlon 64 X2 (2005)

IN (Xbar)

Memory


B. c. M. c.

HT-bus

L2 L2

Core Core

Core Duo (2006)Core 2 Duo (2006)

L2 contr.

Core

L2

Core

FSB c.

FSB





Core

L3 tags/contr.

L3 data

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2

POWER4 (2001)

IN (Fabric Bus Contr.)

M. c.

Memory

B. c. L3 tagscontr.

L3 data

L2 L2 L2

GX-bus

Chip-to-chip/Mem.-to-Mem.interconn.

L2 fully integrated, sharedL3 partially integrated


Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Montecito (2006)

L2 fully integrated, private,L3 fully integrated



7. Layout of memory and I/O architecture

Layout of L2 caches

Layout of the cores



7. Layout of memory and I/O architecture (1)




The concept of connection

Layout of the I/O and memory connection

Point of attachment Physical implementations of the M. c.


Connection via the FSB

The concept of connection

Connection via the M. c and B. c

Used typically in connection with off-chip M. c.-s

Used in connection with on-chip M. c.-s

Examples:

Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Core 2 Duo Montecito

IN

Core

L2

Core

FSB c.

FSB

L2L2 M. c..

B. c..

L2

L2

L2Core 7

M. c..

M. c..

M. c..

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBus UltraSPARC T1


The highest cache level (via an IN)

The point of attachment

The IN connecting the two highest cache levels

L2 (private/shared)

L3 (private/shared)

The IN connecting the cores and a shared L2

The IN connecting a shared L2 and a shared L3

L2

IN

L2 L3

IN

L3

IN1

L2 L2

CC

IN

L3 L3 L3

L2 L2 L2

(2-level caches) (3-level caches) (2-level caches) (3-level caches)

Examples:



(high traffic)

Memory

L2

M.c.



Lines missing in L2 are reloaded and deleted from L3

L3

Memory

L2

IN

L2

L3 L3

M.c.

L3 L3

M.c.

Memory

Memory

L3

L2

Montecito (2006)POWER4 (2001)

UltraSPARC IV+ (2004)POWER5 (2004)

Exclusive L3


Inclusive L3

Cache inclusivity vs. memory attachment


The highest cache level (via an IN)

The point of attachment

The IN connecting the two highest cache levels

L2 (private/shared)

L3 (private/shared)

The IN connecting the cores and a shared L2

The IN connecting a shared L2 and a shared L3

L2

IN

L2 L3

IN

L3

IN1

L2 L2

CC

IN

L3 L3 L3

L2 L2 L2

(2-level caches) (3-level caches) (2-level caches) (3-level caches)

The M. c is connected usually in this way if the highest level cache is exclusive.

The M. c is connected usually in this way if the highest level cache is inclusive.

Examples:


Core

L2 I L2 D

L3

Core

L2 I L2 D

L3

FSB c.

FSB

Montecito (2006)

IN (Xbar)

Memory


B. c. M. c.

HT-bus

L2 L2

Core Core

L2 contr.

Core

L2

Core

FSB c.

FSB

Athlon 64 X2 (2005)Core 2 Duo (2006)

Examples for attaching memory and I/O via the highest cache level

In case of a two-level cache hierarchy In case of a three-level cache hierarchy


UltraSPARC T1 (2005) UltraSPARC IV+ (2005)

Examples for attaching memory and I/O via the interconnection network connecting the two highest levels of the cache hierarcy

In case of a two-level cache hierarchy In case of a three-level cache hierarchy

Core

L3 tags/contr.

L3 data

Interconn. network

M. c.

Memory

B. c.

Fire Planebus

Core

L2

L2L2 M. c.

B. c.

L2

L2

L2Core 7

M. c.

M. c.

M. c.

Core 0

X

b

a

r

Memory

Memory

Memory

Memory

JBus

(exclusive L3)

Off-chip memory controller On-chip memory controller

Integration of the memory controller to the processor chip

POWER4 (2001)


POWER5 (2005)

Montecito (2006?)

UltraSPARC T1 (2005)


Athlon 64 X2 (2005)Presler (2005)Smithfield (2005)

PA-8800 (2004)PA-8900 (2005)

Core (2006)Yonah Duo (2006)

Expected trend


Longer access times,but provides independency of

memory technology

Shortens access times,but causes dependency of

memory technology and speed

8. Layout of the on-chip interconnections

Layout of L2 caches

Layout of the cores



1. Layout of interconnections (1)




IN1

L2 L2

C CL2

IN

L2

L3 L3

L2/L3

IN

L2/L3

B. c. M. c. FSB c.

I/O bus FSBMemory

ChipModule

Chip

Chip

Chipor

Between private or shared L2 modules

and shared L3 modules

Between cores and shared L2 modules

Between private or shared L2/L3 modules and the B. c./M. c. or

alternatively the FSB c.

On-chip interconnections

Chip-to-chip and module-to-module

interconnects


Arbiter/Multi-port cache implementations Crossbar

Implementation of interconnections

Ring

For a small number ofsources/destinations

For a larger number ofsources/destinations

Cell BE (2006)XRI (2005)

eg. to connect dual-coresto shared L2 caches

UltraSPARC T1 (2005)UltraSPARC T2 (2007)

Quantitative aspects, such as the number of sources/destinations or bandwidth requirement affect which implementation alternative is the most beneficial.

9. Basic alternatives of macro-architectures of MC processors

9. Basic alternatives of macro-architectures of MC processors (1)

Basic dimensions spanning the design space

• Kind of related cache architecture

• Layout of the memory and I/O connection

• Layout of the IN(s) involved

• Cache inclusivity affects the layout of the memory connectionInterrelationships between particular dimensions


Considering a selected part of the design space of the macroarchitecture of MC processors

Assumptions:

• L2 is supposed to be partially integrated on the chip (i.e. only the tags are on the chip). This is not indicated in the design space tree but denoted in the examples. • We do not consider how the involved INs is implemented.• In case of an off-chip M. controller the use of an FSB is assumed.

• A two-level cache hierarchy is considered (no L3).


Off-chip M. c. On-chip M. c.

Design space of the macroarchitecture of MC processors (in case of a two-level cache hierarchy)

L2 private L2 shared L2 private L2 shared

Part of the design-space specifying the point of attachment

for the memory and I/O


Core 2 Duo based processors (2006)(2 cores)

SPARC64 VI (2007)(2 cores)

SPARC64 VII (2008)(4 cores)

Smithfield/Presler based processors (2005/2006)(2 cores)

PA-8800 (2004)(2 cores, L2 data off-chip)

FSB c.

FSB

IN1

L2 L2

IN2

L2

CC C

FSB c.

IN

L2 L2 L2

FSB

CC C

L2 private L2 shared

M . c. off-chip

Design space of the macroarchitecture of MC processors (2-level hierarchy) (1)

9. Basic alternatives of macro-architectures of MC processors (5)Design space of the macroarchitecture of MC processors (2-level hierarchy) (2)

M. c. on-chip

L2 private L2 shared

Attaching both I/O and M. c. to L2 via IN2

Attaching M. c. to L2 via IN2 and I/O to IN1

Attaching both I/O and M.c. via IN1

Attaching I/O. to L2 via IN2 and M.c. to IN1

UltraSPARC T1 (2005) (8 cores)

Cell BE (2006)(8 cores)

Opteron dual core (2 cores),(2005)

UltraSPARC IV(2 cores, L2 data off-chip), (2004)

Athlon 64 X2 (2005)(2 cores)

IN

L2 L2

B. c.

I/O-bus Memory

M. c.

CC

IN1

L2

B. c.

I/O-bus Memory

M. c.

IN2

L2

CC

I/O-bus

IN1

L2

B. c.

Memory

M. c.

IN2

L2

CC

IN1

L2B. c.

I/O-bus

Memory

M. c.

L2

CC

IN1

L2

M. c.

Memory

I/O-bus

B. c.

IN2

L2

CC

dezső sima fall 2007 (ver. 2.2) dezső sima, 2007 multicore processors (1)

Documents