Dezső Sima
Fall 2007
(Ver. 2.2) Dezső Sima, 2007
Multicore Processors (1)
Foreword
In a relatively short period of about two years, multicore processors became the standard. As multiple cores provide substantially higher processing performance than their predecessors, their memory and I/O subsystems also need to be largely enhanced to avoid processing bottlenecks. Consequently, when discussing multicore processors, the focus of interest shifts from their microarchitecture to their macroarchitecture, which covers their cache hierarchy, the way memory and I/O controllers are attached, and how the required on-chip interconnects are implemented.
Accordingly, the first nine Sections are devoted to the design space of the macroarchitecture of multicore processors. Particular Sections deal with the main dimensions that span the design space and identify how the occurring concepts were implemented in major multicore lines.
Section 10 is a large repository, providing relevant materials and facts published in connection with major multicore lines. In a concrete course, appropriate parts of the repository can be selected according to the scope and audience of the course. Each part of the repository concludes with a list of available literature to allow a more detailed study of a multicore line or processor of interest.
Overview
• 1 Introduction
• 2 Overview of MCPs
• 3 The design space of the macroarchitecture of MCPs
• 4 Layout of the cores
• 5 Layout of L2 caches
• 6 Layout of L3 caches
• 7 Layout of the memory and I/O architecture
• 8 Implementation of the on-chip interconnections
• 9 Basic alternatives of the macro-architecture of MCPs
• 10 Multicore repository
1. Introduction
Figure 1.1: Single-stream performance vs. cost
Source: Marr T.T. et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16
1. Introduction (1)
1. Introduction (2)
Figure 1.2: Historical evolution of transistor counts, clock speed, power and ILP
For single core processors:
Increasing number of transistors → diminishing return in performance
Moore’s law:
Transistor count: doubles approximately every 24 months
Since ~ 2003/2004:
Use available surplus transistors for multiple cores
1. Introduction (3)
The era of single-core processors → the era of multicore processors
1. Introduction (4)
The speed of change:
illustrated by Intel’s roadmap revealed in 2005:
1. Introduction (5)
1. Introduction (6)
Figure 1.3: Intel’s roadmap revealed in 2005
1. Introduction (7)
Expected increase of the core count:
As long as Moore’s law remains valid,
the transistor count doubles approximately every 24 months;
accordingly,
the core count doubles approximately every 24 months as well.
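The doubling rule stated here can be turned into a small projection. The sketch below is only an illustration of the arithmetic; the function name, the starting point (2 cores in 2006) and the rule that only completed doubling periods count are assumptions, not figures from the slides.

```python
def projected_core_count(base_cores: int, base_year: int, target_year: int,
                         doubling_months: int = 24) -> int:
    """Project a core count, assuming one doubling every `doubling_months`."""
    if target_year < base_year:
        raise ValueError("target_year must not precede base_year")
    elapsed_months = (target_year - base_year) * 12
    # Only completed doubling periods are counted (an assumption of this sketch).
    doublings = elapsed_months // doubling_months
    return base_cores * 2 ** doublings


# Hypothetical starting point: a dual-core part in 2006.
for year in (2006, 2008, 2010, 2012):
    print(year, projected_core_count(2, 2006, year))
```

Under these assumptions a dual-core baseline grows by a factor of eight over six years, which is the pace the roadmap slides hint at.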
2. Overview of MCPs
Figure 2.1: Intel’s dual-core mobile processor lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Core Duo T2000 (Yonah Duo), 1/06: 151 mtrs., 31 W, 65 nm, 90 mm2
Core 2 Duo T5000/T7000 (Merom), 7/06: 281 mtrs., 34 W, 65 nm, 143 mm2
2. Overview of MCPs (1)
Figure 2.2: Intel’s multi-core processor lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Pentium D 800 (Smithfield), 5/05: 230 mtrs., 95 W, 90 nm, 2*103 mm2
Pentium EE 840 (Smithfield), 5/05: 230 mtrs., 130 W, 90 nm, 2*103 mm2, 2-way MT/core
Pentium D 900 (Presler), 1/06: 2*188 mtrs., 130 W, 65 nm, 2*81 mm2
Pentium EE 955/965 (Presler), 1/06: 2*188 mtrs., 130 W, 65 nm, 2*81 mm2, 2-way MT/core
Core 2 Duo E6000/X6800 (Conroe), 7/06: 291 mtrs., 65/75 W, 65 nm, 143 mm2
Core 2 Extreme QX6700 (Kentsfield), 11/06: 2*291 mtrs., 130 W, 65 nm, 2*143 mm2
2. Overview of MCPs (2)
Figure 2.3: Intel’s dual-core Xeon UP-line
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2006-2007
Xeon 3000 (Conroe), 9/06: 291 mtrs., 65 W, 65 nm, 143 mm2
2. Overview of MCPs (3)
Figure 2.4.: Intel’s dual-core Xeon DP-lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Xeon DP 2.8 (Paxville DP), 10/05: 2*169 mtrs., 135 W, 90 nm, 2*135 mm2, 2-way MT/core
Xeon 5000 (Dempsey), 6/06: 2*188 mtrs., 95/130 W, 65 nm, 2*81 mm2, 2-way MT/core
Xeon 5100 (Woodcrest), 6/06: 291 mtrs., 65/80 W, 65 nm, 143 mm2
Xeon 5300 (Clovertown), 11/06: 2*291 mtrs., 80/120 W, 65 nm, 2*143 mm2
2. Overview of MCPs (4)
Figure 2.5.: Intel’s dual-core Xeon MP-lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Xeon 7000 (Paxville MP), 11/05: 2*169 mtrs., 95/150 W, 90 nm, 2*135 mm2, 2-way MT/core
Xeon 7100 (Tulsa), 8/06: 1328 mtrs., 95/150 W, 65 nm, 435 mm2, 2-way MT/core
2. Overview of MCPs (5)
Figure 2.6.: Intel’s dual-core EPIC-based server line
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
9x00 (Montecito), 7/06: 1720 mtrs., 104 W, 90 nm, 596 mm2, 2-way MT/core
2. Overview of MCPs (6)
Figure 2.7.: AMD’s multi-core desktop processor lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2007
3800/4200/4600 (Manchester S. E2), 5/05: 154 mtrs., 89/110 W, 90 nm, 2*147 mm2
3800-4800 (Toledo S. E6), 5/05: 233 mtrs., 89/110 W, 90 nm, 199 mm2
3600-5600 (Windsor S. F2/F3), 5/06: 154/227 mtrs., 65/89 W, 90 nm, 183/230 mm2
4000-5000 (Brisbane S. G1), 12/06: 154 mtrs., 65 W, 65 nm, 126 mm2
Athlon64 QC (Barcelona), 9/2007: 65 nm, 291 mm2
2. Overview of MCPs (7)
Figure 2.8.: AMD’s dual-core high-end desktop/entry level server Athlon 64 FX line
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Athlon 64 FX (Toledo/Windsor), 1/06: 233/227 mtrs., 110/125 W, 90 nm, 199/230 mm2
2. Overview of MCPs (8)
Figure 2.9.: AMD’s dual-core Opteron UP-lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Opteron 100 (Denmark), 8/05: 233 mtrs., 110 W, 90 nm, 199 mm2
Opteron 1000 (Santa Ana), 8/06: 227 mtrs., 103/125 W, 90 nm, 230 mm2
2. Overview of MCPs (9)
Figure 2.10.: AMD’s dual-core Opteron DP-lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Opteron 200 (Italy), 3/05: 233 mtrs., 95 W, 90 nm, 199 mm2
Opteron 2000 (Santa Rosa), 8/06: 227 mtrs., 95/110 W, 90 nm, 230 mm2
2. Overview of MCPs (10)
Figure 2.11.: AMD’s dual-core Opteron MP-lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2005-2006
Opteron 800 (Egypt), 4/05: 233 mtrs., 95 W, 90 nm, 199 mm2
Opteron 8000 (Santa Rosa), 8/06: 243 mtrs., 95/119 W, 90 nm, 220 mm2
2. Overview of MCPs (11)
Figure 2.12.: IBM’s multi-core server lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2001-2007
POWER4, 10/01: 174 mtrs., 115/125 W, 180 nm, 412 mm2
POWER4+, 11/02: 184 mtrs., 70 W, 130 nm, 380 mm2
POWER5, 5/04: 276 mtrs., 80 W (est.), 130 nm, 389 mm2, 2-way MT/core
POWER5+, 10/05: 276 mtrs., 70 W, 90 nm, 230 mm2, 2-way MT/core
Cell BE, 2006: 234 mtrs., 95 W, 90 nm, 221 mm2 (PPE: 2-way MT; SPEs: no MT)
POWER6, 5/2007: 750 mtrs., ~100 W, 65 nm, 341 mm2, 2-way MT/core
2. Overview of MCPs (12)
Figure 2.13.: Sun’s and Fujitsu’s multi-core server lines
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST, 8CMT) vs. quarter of introduction, 2004-2008
Gemini, 4/04: 80 mtrs., 32 W, 130 nm, 206 mm2
UltraSPARC IV (Jaguar), 7/04: 66 mtrs., 108 W, 130 nm, 352 mm2
UltraSPARC IV+ (Panther), 9/05: 295 mtrs., 90 W, 90 nm, 335 mm2
UltraSPARC T1 (Niagara), 11/2005: 279 mtrs., 63 W, 90 nm, 379 mm2, 4-way MT/core (8CMT)
UltraSPARC T2 (Niagara II), 2007: 72 W (est.), 65 nm, 342 mm2, 8-way MT/core (8CMT)
APL SPARC64 VI (Olympus), 2007: 540 mtrs., 120 W, 90 nm, 421 mm2, 2-way MT/core
APL SPARC64 VII (Jupiter), 2008: ~120 W, 65 nm, 464 mm2, 2-way MT/core
2. Overview of MCPs (13)
Figure 2.14.: HP’s PA-8x00 server line
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST) vs. quarter of introduction, 2004-2005
PA 8800 (Mako), 2/2004: 300 mtrs., 55 W, 130 nm, 366 mm2
PA 8900 (Shortfin), 5/05: 317 mtrs., 130 nm, 366 mm2
2. Overview of MCPs (14)
Figure 2.15.: RMI’s multi-core XLR-line (scalar RISC)
Axes: processor class (DCST, DCMT, QCST, QCMT, 8CST, 8CMT) vs. quarter of introduction, 2005-2006
XLR 5xx, 5/05: 333 mtrs., 10-50 W, 90 nm, ~220 mm2, 4-way MT/core (8CMT)
2. Overview of MCPs (15)
3. Design space of the macro architecture of MCPs
Building blocks of multicore processors (MC)
• Cores (C)
• L2 cache(s) (L2)
• L3 cache(s), if available (L3)
• Interconnection network (IN)
• Memory controller (M. c.)
• FSB controller (FSB c.) or Bus controller (B. c.)
Diagram: two cores (C), each with split L2 I/L2 D caches and an L3, connected through an IN and the FSB c. to the FSB.
3. Design space of the macro architecture of MCPs (1)
Macro architecture of multi-core (MC) processors
• Layout of the cores
• Layout of L2 caches
• Layout of L3 caches (if available)
• Layout of the I/O and memory architecture
• Layout of the interconnections
3. Design space of the macro architecture of MCPs (2)
4. Layout of the cores
Layout of L2 caches
Layout of the cores
Layout of the I/O and memory architecture
Layout of L3 caches (if available)
4. Layout of the cores (1)
Layout of the on-chip interconnections
Macro architecture of multi-core (MC) processors
Layout of the cores
• Basic layout
• Advanced features
• Physical implementation
4. Layout of the cores (2)
4. Layout of the cores (3)
Basic layout
• SMC (Symmetrical MC): all other MCs
• HMC (Heterogeneous MC): Cell BE (1 PPE + 8 SPE)
4. Layout of the cores (4)
Advanced features of the cores
• Virus protection
• Data protection
• Remote management
• Power/clock speed management
• Support of virtualisation
Details on the next slide!
4. Layout of the cores (5)
1 Pentium M-based line: Core Duo T2000
2 Prescott-based desktop lines: Pentium D 8xx, EE 840; Xeon lines: Paxville DP, MP
3 Cedar Mill-based desktop lines: Pentium D 9xx, EE 955, 965; Xeon lines: 5000 (Dempsey), 7100 (Tulsa)
4 Core-based mobile line: T5xxx/T7xxx (Merom); desktop line: Core 2 Duo E6xx, X6xxx, QX67xx; Xeon lines: 3000, 5100 (Woodcrest), 5300 (Clovertown)
Intel lines: Pentium M-based (1); Netburst: Prescott-based (2), Cedar Mill-based (3); Core-based (4); Itanium 2
AMD lines: Athlon 64 X2; Athlon 64 FX; Opteron x00 lines; Opteron x000 lines
Feature | Intel | AMD
Virus protection | ED-bit | NX-bit
Power/clock speed management | EIST (partly in several lines), DBS (Itanium 2) | Cool 'n' Quiet, PowerNow!
Support of virtualisation | Vanderpool (partly), Silvervale (Itanium 2) | Pacifica (partly), AMD-V
Data protection | La Grande | Presidio
Remote management | AMT2 | -
Table 4.1 : Comparison of advanced features provided by Intel’s and AMD’s MC lines. For a brief introduction of the related Intel techniques see the appropriate Intel processor descriptions
EIST: Enhanced Intel SpeedStep Technology
First implemented in Intel’s mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency to decrease average power consumption and thus average heat production. This technology is similar to AMD’s Cool’n’Quiet and PowerNow! techniques. For more information see http://en.wikipedia.org/wiki/SpeedStep
ED: Execute Disable bit (often designated as the NX bit (No Execute bit))
The ED bit with OS support is now a widely used technique to prevent certain classes of malicious buffer overflow attacks. It is used e.g. in Intel’s Itanium, Pentium M, Pentium 4 Prescott and subsequent processors as well as in AMD’s x86-64 line (where it is designated as the NX bit (No Execute bit)). In buffer overflow attacks a malicious worm inserts its code into another program’s data space and creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers.
The Execute Disable bit allows the processor to classify areas in memory by where application code can be executed and where it cannot. The ED bit is implemented as bit 63 of the Page Table Entry. If its value is 1, code cannot be executed from that page. When a malicious worm attempts to insert code in the buffer, the processor disables code execution, preventing damage and worm propagation. To become active, the ED feature needs to be supported by the OS. For more information see: http://en.wikipedia.org/wiki/XD_Bit
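The role of bit 63 in a page table entry can be illustrated with a few lines of bit arithmetic. This is a minimal sketch on plain integers, assuming a 64-bit entry in which only bit 63 matters for the example; the function names and the sample entry value are hypothetical and no real page tables are touched.

```python
ED_BIT = 1 << 63  # bit 63 of the 64-bit page table entry, as stated in the text


def is_executable(pte: int) -> bool:
    """Return True if code may be executed from the page described by `pte`."""
    return (pte & ED_BIT) == 0


def mark_no_execute(pte: int) -> int:
    """Return a copy of `pte` with the ED/NX bit set (value 1 = no execution)."""
    return pte | ED_BIT


# Hypothetical entry: some frame address bits plus a low flag bit.
data_page = 0x0000_0000_0012_3001
protected = mark_no_execute(data_page)
print(is_executable(data_page), is_executable(protected))
```

An operating system that supports the feature would set this bit for stack and heap pages, so that code injected into a buffer cannot be executed.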
Brief description of the advanced features pointed out while using Intel’s notations (1)
4. Layout of the cores (6)
VT: Virtualization Technology
Virtualisation techniques allow a platform to run multiple operating systems and applications in independent partitions. Hardware support improves the performance and robustness of traditional software-based virtualisation solutions. Intel introduced hardware virtualisation support first in its Itanium line (called the Silvervale technique), followed by its Presler-based and subsequent processors (designated as the Vanderpool technique). AMD has used this technique, dubbed Pacifica, in its recent lines since May 2006. For more information see http://en.wikipedia.org/wiki/Virtualization_Technology
La Grande Technology
It is a Trusted Execution Technology, initiated by Intel, to protect users from software-based attacks aimed at stealing vital data. It includes a set of hardware enhancements intended to provide multiple separated execution environments. For more information see Intel® Trusted Execution Technology Preliminary Architecture Specification at http://www.intel.com/technology/security/ or http://en.wikipedia.org/wiki/LaGrande_Technology
AMT2
Intel’s Active Management Technology is a feature of its vPro technology. It allows the computer system to be managed remotely, even if the computer is switched off or the hard drive has failed. By means of this technology system administrators can e.g. remotely download software updates or fix and heal system problems. It can also be utilized to protect the network by proactively blocking incoming threats. For more information see Intel Active Management Technology at http://www3.intel.com/cd/network/communications/emea/eng/203372.htm
Brief description of the advanced features pointed out while using Intel’s notations (2)
4. Layout of the cores (7)
Physical implementation
• Monolithic implementation: Pentium D 8xx, Pentium EE 840, Paxville DP, Paxville MP, Yonah Duo, Merom, Conroe, Woodcrest, Tulsa, Montecito, and all DCs of AMD, IBM, Sun, Fujitsu, HP and RMI
• Multi-chip implementation: Pentium D 9xx, Pentium EE 955/965, Dempsey, Kentsfield, Clovertown
4. Layout of the cores (8)
5. Layout of L2 caches
Layout of L2 caches
Layout of the cores
Layout of the I/O and memory architecture
Layout of L3 caches (if available)
5. Layout of L2 caches (1)
Layout of the on-chip interconnections
Macro architecture of multi-core (MC) processors
Layout of L2 caches
• Allocation to the cores
• Inclusion policy
• Use by instructions/data
• Banking policy
• Integration of L2 caches to the proc. chip
5. Layout of L2 caches (2)
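Allocation of L2 caches to the cores
Examples:
• Private L2 cache for each core: Athlon 64 X2 (2005)
  Diagram: two cores, each with its own L2, connected through the System Request Queue and the IN (Xbar) to the M. c. (memory) and the B. c. (HT-bus).
• Shared L2 cache for all cores: Core Duo (2006), Core 2 Duo (2006)
  Diagram: two cores connected through the L2 contr. to a shared L2; the FSB c. connects to the FSB.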
5. Layout of L2 caches (3)
Allocation of L2 caches to the cores
• Private L2 cache for each core: Smithfield, Presler, Irwindale and Cedar Mill based lines (2005/2006); Athlon 64 X2 and AMD's Opteron lines (2005/2006); Montecito (2006); UltraSPARC IV (2004); POWER6 (2007)
• Shared L2 cache for all cores: POWER4 (2001); POWER5 (2005); UltraSPARC IV+ (2005); UltraSPARC T1 (2005); UltraSPARC T2 (2007); Core Duo (Yonah) and Core 2 Duo (Core) based lines (2006); PA 8800 (2004); PA 8900 (2005); SPARC64 VI (2007); SPARC64 VII (2008); Cell BE (2006); XLR (2005)
5. Layout of L2 caches (4)
IBM: The effect of changing from shared L2 to private L2:
• improves latency
• eliminates any impact of multicore competition for capacity.
Source: IBM JRD Nov 2007
Inclusion policy
Allocation to the cores
Banking policy
Attaching L2 caches to MCPs
Use by instructions/data
Integration of L2 caches to the proc. chip
5. Layout of L2 caches (5)
Inclusion policy of L2 caches
• Inclusive L2
  Diagram: L1, L2 and memory form a chain; data is loaded from memory into L2 and from L2 into L1.
• Exclusive L2 (victim cache)
  - Lines replaced (victimized) in the L1 are written to the L2.
  - References to data missing in L1 but available in L2 initiate reloading the pertaining cache line to L1. The reloaded line is deleted from L2.
  - Data missing in both L1 and L2 is fetched from memory through the M. c. (high traffic).
  - L2 usually operates as a write-back cache: only modified data that is replaced in the L2 is written back to the memory (low traffic). Unmodified data that is replaced in L2 is deleted.
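The behaviour of an exclusive (victim) L2 described above can be sketched as a tiny simulation. This is a hedged illustration, not a model of any particular processor: the class name, the LRU replacement in both levels and the tracking of whole line numbers instead of data are all assumptions made for brevity.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair; each cache maps line number -> dirty flag (LRU order)."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_lines, self.l2_lines = l1_lines, l2_lines
        self.writebacks = []  # modified lines written back to memory

    def access(self, line: int, write: bool = False) -> str:
        if line in self.l1:
            self.l1.move_to_end(line)
            self.l1[line] = self.l1[line] or write
            return "L1 hit"
        if line in self.l2:
            dirty = self.l2.pop(line)  # the reloaded line is deleted from L2
            source = "L2 hit"
        else:
            dirty, source = False, "memory"  # miss in both levels
        self._fill_l1(line, dirty or write)
        return source

    def _fill_l1(self, line: int, dirty: bool) -> None:
        if len(self.l1) >= self.l1_lines:
            victim, victim_dirty = self.l1.popitem(last=False)
            self._fill_l2(victim, victim_dirty)  # L1 victims go to L2
        self.l1[line] = dirty

    def _fill_l2(self, line: int, dirty: bool) -> None:
        if len(self.l2) >= self.l2_lines:
            victim, victim_dirty = self.l2.popitem(last=False)
            if victim_dirty:  # only modified lines reach memory
                self.writebacks.append(victim)
        self.l2[line] = dirty
```

A line is never held in both levels at once, so the effective capacity is the sum of the two caches; that is the main attraction of the exclusive scheme.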
5. Layout of L2 caches (6)
Figure 5.1: Implementation of exclusive L2 caches
Source: Zheng, Y., Davis, B.T., Jordan, M.: “Performance evaluation of exclusive cache hierarchies”, 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004, pp. 89-96.
5. Layout of L2 caches (7)
Inclusion policy of L2 caches
• Inclusive L2: most implementations
• Exclusive L2: Athlon 64 X2 (2005), dual-core Opteron lines (2005)
Expected trend
5. Layout of L2 caches (8)
Inclusion policy
Allocation to the cores
Banking policy
Attaching L2 caches to MCPs
Use by instructions/data
Integration of L2 caches to the proc. chip
5. Layout of L2 caches (9)
Use by instructions/data
• Unified instr./data cache(s): typical implementation
• Split instr./data cache(s): Montecito (2006)
Expected trend
5. Layout of L2 caches (10)
Inclusion policy
Allocation to the cores
Banking policy
Attaching L2 caches to MCPs
Use by instructions/data
Integration of L2 caches to the proc. chip
5. Layout of L2 caches (11)
Banking policy
• Single-banked implementation
• Multi-banked implementation: typical implementation
5. Layout of L2 caches (12)
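Mapping of addresses to memory modules
UltraSPARC T1 (2005) (Niagara) (8 cores/4 L2 banks)
Diagram: the cores are connected by an X-bar to the L2 banks; the L2 banks attach through M. c.-s to the memory.
The four L2 modules are interleaved at 64-byte blocks.
Mapping of addresses to the banks: address bits 6 and 7 select the bank, so the blocks starting at addresses 0, 64, 128 and 192 fall into four different banks and address 256 maps to the first bank again.
POWER4 (2001), POWER5 (2005)
Diagram: the cores are connected by an X-bar to three L2 modules, which attach to the IN (Fabric Bus Contr.), the L3 tags/contr. and the B. c. (GX bus).
The 128-byte long L2 cache lines are hashed across the 3 modules. Hashing is performed by modulo 3 arithmetic applied on a large number of real address bits.
Mapping of addresses to the banks: the selected address bits are summed (∑) and reduced modulo 3, giving module 0, 1 or 2.
Figure 5.2: Alternatives of mapping addresses to memory modules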
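Both mapping schemes can be written down in a few lines. The first function follows the stated 64-byte interleaving over four banks directly. The second is only a stand-in for the modulo 3 hashing: the slides do not give the exact hash over the real address bits, so taking the 128-byte line index modulo 3 is an assumption of this sketch, as are all names used.

```python
BLOCK_BYTES = 64    # interleaving granularity of the four-bank scheme
NUM_BANKS = 4
LINE_BYTES = 128    # L2 line size of the three-module scheme
NUM_MODULES = 3


def interleaved_bank(address: int) -> int:
    """Bank selected by address bits 6 and 7 (64-byte interleaving over 4 banks)."""
    return (address // BLOCK_BYTES) % NUM_BANKS


def hashed_module(address: int) -> int:
    """Simplified modulo 3 mapping of 128-byte lines (assumed hash, see lead-in)."""
    return (address // LINE_BYTES) % NUM_MODULES


for addr in (0, 64, 128, 192, 256):
    print(addr, interleaved_bank(addr), hashed_module(addr))
```

With a power-of-two bank count the bank number is just a bit field of the address, whereas three modules need a true modulo operation, which is why a hashing circuit is mentioned for that case.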
5. Layout of L2 caches (13)
Inclusion policy
Allocation to the cores
Banking policy
Attaching L2 caches to MCPs
Use by instructions/data
Integration of L2 caches to the proc. chip
5. Layout of L2 caches (14)
Integration to the processor chip
• Partial integration of L2 (on-chip L2 tags/control, off-chip data): UltraSPARC IV (2004), UltraSPARC V (2005), PA 8800 (2004), PA 8900 (2005)
• Full integration of L2 (entire L2 on-chip): all other lines considered
Expected trend
5. Layout of L2 caches (15)
6. Layout of L3 caches
Layout of L2 caches
Layout of the cores
Layout of the I/O and memory architecture
Layout of L3 caches (if available)
6. Layout of L3 caches (1)
Layout of the interconnections
Macro architecture of multi-core (MC) processors
Layout of L3 caches
• Allocation to the L2 cache(s)
• Inclusion policy
• Use by instructions/data
• Banking policy
• Integration of L3 caches to the proc. chip
6. Layout of L3 caches (2)
Allocation of L3 caches to the L2 caches
• Private L3 cache for each L2: Montecito (2006)
• Shared L3 cache for all L2s: POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005), Athlon 64 X2–Barcelona (2007)
6. Layout of L3 caches (3)
Inclusion policy
Allocation to the L2 cache(s)
Banking policy
Layout of L3 caches
Use by instructions/data
Integration of L3 caches to the proc. chip
6. Layout of L3 caches (4)
Inclusion policy of L3 caches
• Inclusive L3
  Diagram: L2, L3 and memory form a chain; data is loaded from memory into L3 and from L3 into L2.
• Exclusive L3 (victim cache)
  - Lines replaced (victimized) in the L2 are written to the L3.
  - References to data missing in L2 but available in L3 initiate reloading the pertaining cache line to L2. The reloaded line is deleted from L3.
  - Data missing in both L2 and L3 is fetched from memory through the M. c. (high traffic).
  - L3 usually operates as a write-back cache: only modified data that is replaced in the L3 is written back to the memory (low traffic). Unmodified data that is replaced in L3 is deleted.
6. Layout of L3 caches (5)
Inclusion policy of L3 caches
• Inclusive L3: Montecito (2006), POWER4 (2001)
• Exclusive L3: UltraSPARC IV+ (2005), POWER5 (2005), POWER6 (2007), Athlon 64 X2–Barcelona (2007)
Expected trend
6. Layout of L3 caches (6)
Inclusion policy
Allocation to the L2 cache(s)
Banking policy
Layout of L3 caches
Use by instructions/data
Integration of L3 caches to the proc. chip
6. Layout of L3 caches (7)
Use by instructions/data
• Unified instr./data cache(s): the L3 caches of all multicore processors unveiled until now hold both instructions and data
• Split instr./data caches
6. Layout of L3 caches (8)
Inclusion policy
Allocation to the L2 cache(s)
Banking policy
Layout of L3 caches
Use by instructions/data
Integration of L3 caches to the proc. chip
6. Layout of L3 caches (9)
Banking policy
• Single-banked implementation
• Multi-banked implementation: the L3 caches of all multicore processors unveiled until now are multi-banked
6. Layout of L3 caches (10)
Inclusion policy
Allocation to the L2 cache(s)
Banking policy
Layout of L3 caches
Use by instructions/data
Integration of L3 caches to the proc. chip
6. Layout of L3 caches (11)
Integration to the processor chip
• L3 partially integrated (on-chip L3 tags/control, off-chip data): POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005), POWER6 (2007)
• L3 fully integrated (entire L3 on-chip): Montecito (2006)
Expected trend
6. Layout of L3 caches (12)
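Examples of partially integrated L3 caches:
UltraSPARC IV+ (2005):
Diagram: two cores with an L2; an interconn. network links them to the on-chip L3 tags/contr. (with off-chip L3 data), the M. c. (memory) and the B. c. (Fire Plane bus).
POWER5 (2005):
Diagram: two cores with three L2 modules attached to the Fabric Bus Contr.; three on-chip L3 tags/contr. blocks, each with off-chip L3 data; the M. c. connects to the memory.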
6. Layout of L3 caches (13)
6. Layout of L3 caches (14)
Figure 6.1: Cache architecture of MC processors (the figure does not differentiate between inclusive/exclusive alternatives, but the examples denote it)
Cache architecture
• L2 partially integrated, no L3
  - L2 private: UltraSPARC IV
  - L2 shared: PA-8800, PA-8900
• L2 fully integrated, no L3
  - L2 private: Smithfield/Presler based proc.s, Athlon 64 X2 (excl.), Opteron dual core (excl.), Gemini
  - L2 shared: Core Duo (Yonah), Core 2 Duo based proc.s, Cell BE, UltraSPARC T1/T2, SPARC64 VI/VII, XLR
• L2 fully integrated, L3 partially integrated
  - L2 private, L3 shared: POWER6
  - L2 shared, L3 shared: POWER4, POWER5 (excl.), UltraSPARC IV+ (excl.)
• L2 fully integrated, L3 fully integrated
  - L2 private, L3 private: Montecito
6. Layout of L3 caches (15)
Examples for cache architectures (1)
L2 partially integrated, private, no L3
UltraSPARC IV (2004)
Diagram: two cores, each with on-chip L2 tags/contr. and off-chip L2 data; an interconn. network links them to the M. c. (memory) and the B. c. (Fire Plane bus).
L2 partially integrated, shared, no L3
PA-8800 (2004), PA-8900 (2005)
Diagram: two cores sharing on-chip L2 tags/contr. with off-chip L2 data; the FSB c. connects to the FSB.
Examples for cache architectures (2)
L2 fully integrated, private, no L3
Athlon 64 X2 (2005)
Diagram: two cores, each with its own L2, connected through the System Request Queue and the IN (Xbar) to the M. c. (memory) and the B. c. (HT-bus).
L2 fully integrated, shared, no L3
Core Duo (2006), Core 2 Duo (2006)
Diagram: two cores connected through the L2 contr. to a shared L2; the FSB c. connects to the FSB.
UltraSPARC T1 (2005)
Diagram: cores 0 to 7 connected by an X-bar to four L2 banks, each with its own M. c. to the memory; the B. c. connects to the JBus.
6. Layout of L3 caches (16)
6. Layout of L3 caches (17)
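Examples for cache architectures (3)
L2 fully integrated, shared, L3 partially integrated
UltraSPARC IV+ (2005)
Diagram: two cores with a shared L2; an interconn. network links them to the on-chip L3 tags/contr. (with off-chip L3 data), the M. c. (memory) and the B. c. (Fire Plane bus).
POWER4 (2001)
Diagram: three L2 modules attached to the IN (Fabric Bus Contr.), which connects to the on-chip L3 tags/contr. (with off-chip L3 data), the M. c. (memory), the B. c. (GX-bus) and the chip-to-chip/mem.-to-mem. interconn.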
Examples for cache architectures (4)
L2 fully integrated, private, L3 fully integrated
Montecito (2006)
Diagram: two cores, each with split L2 I/L2 D caches and its own L3; the FSB c. connects to the FSB.
6. Layout of L3 caches (18)
7. Layout of memory and I/O architecture
Layout of L2 caches
Layout of the cores
Layout of the I/O and memory architecture
Layout of L3 caches (if available)
7. Layout of memory and I/O architecture (1)
Layout of the interconnections
Macro architecture of multi-core (MC) processors
7. Layout of memory and I/O architecture (2)
Layout of the I/O and memory connection
• The concept of connection
• Point of attachment
• Physical implementations of the M. c.
7. Layout of memory and I/O architecture (3)
The concept of connection
• Connection via the FSB: used typically in connection with off-chip M. c.-s
  Examples: Core 2 Duo (two cores, shared L2, FSB c. to the FSB); Montecito (two cores, each with L2 I/L2 D and L3, FSB c. to the FSB)
• Connection via the M. c. and B. c.: used in connection with on-chip M. c.-s
  Example: UltraSPARC T1 (cores 0 to 7, X-bar, four L2 banks each with its own M. c. to the memory, B. c. to the JBus)
7. Layout of memory and I/O architecture (4)
The point of attachment
• The highest cache level (via an IN)
  - L2 (private/shared) (2-level caches)
  - L3 (private/shared) (3-level caches)
• The IN connecting the two highest cache levels
  - The IN connecting the cores and a shared L2 (2-level caches)
  - The IN connecting a shared L2 and a shared L3 (3-level caches)
Examples:
7. Layout of memory and I/O architecture (5)
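Cache inclusivity vs. memory attachment
Inclusion policy of L3 caches
• Inclusive L3: Montecito (2006), POWER4 (2001)
  Diagram: the memory attaches through the M. c. to the L3, which in turn supplies the L2.
• Exclusive L3: UltraSPARC IV+ (2005), POWER5 (2004)
  Diagram: the memory attaches through the M. c. to the IN between L2 and L3. Lines replaced (victimised) in L2 go to L3; lines missing in L2 are reloaded from L3 and deleted from it; data missing in both L2 and L3 comes from the memory (high traffic); modified data replaced in L3 is written back to the memory (low traffic).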
7. Layout of memory and I/O architecture (6)
The point of attachment
• The highest cache level (via an IN). The M. c. is usually connected in this way if the highest level cache is inclusive.
  - L2 (private/shared) (2-level caches)
  - L3 (private/shared) (3-level caches)
• The IN connecting the two highest cache levels. The M. c. is usually connected in this way if the highest level cache is exclusive.
  - The IN connecting the cores and a shared L2 (2-level caches)
  - The IN connecting a shared L2 and a shared L3 (3-level caches)
Examples:
7. Layout of memory and I/O architecture (7)
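Examples for attaching memory and I/O via the highest cache level
• In case of a two-level cache hierarchy: Athlon 64 X2 (2005), Core 2 Duo (2006)
  Diagram, Athlon 64 X2: two cores, each with its own L2, connected through the System Request Queue and the IN (Xbar) to the M. c. (memory) and the B. c. (HT-bus).
  Diagram, Core 2 Duo: two cores connected through the L2 contr. to a shared L2; the FSB c. connects to the FSB.
• In case of a three-level cache hierarchy: Montecito (2006)
  Diagram: two cores, each with split L2 I/L2 D caches and its own L3; the FSB c. connects to the FSB.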
7. Layout of memory and I/O architecture (8)
Examples for attaching memory and I/O via the interconnection network connecting the two highest levels of the cache hierarchy
• In case of a two-level cache hierarchy: UltraSPARC T1 (2005)
  Diagram: cores 0 to 7 connected by an X-bar to four L2 banks, each with its own M. c. to the memory; the B. c. connects to the JBus.
• In case of a three-level cache hierarchy: UltraSPARC IV+ (2005) (exclusive L3)
  Diagram: two cores with an L2; an interconn. network links them to the on-chip L3 tags/contr. (with off-chip L3 data), the M. c. (memory) and the B. c. (Fire Plane bus).
Integration of the memory controller to the processor chip
• Off-chip memory controller: longer access times, but provides independence from memory technology
  Examples: POWER4 (2001), Montecito (2006?), Presler (2005), Smithfield (2005), PA-8800 (2004), PA-8900 (2005), Core (2006), Yonah Duo (2006)
• On-chip memory controller: shortens access times, but causes dependence on memory technology and speed
  Examples: UltraSPARC IV (2004), UltraSPARC IV+ (2005), POWER5 (2005), UltraSPARC T1 (2005), Athlon 64 X2 (2005)
Expected trend
7. Layout of memory and I/O architecture (9)
8. Layout of the on-chip interconnections
Layout of L2 caches
Layout of the cores
Layout of the I/O and memory architecture
Layout of L3 caches (if available)
8. Layout of interconnections (1)
Layout of the interconnections
Macro architecture of multi-core (MC) processors
8. Layout of interconnections (2)
On-chip interconnections
• Between cores and shared L2 modules
• Between private or shared L2 modules and shared L3 modules
• Between private or shared L2/L3 modules and the B. c./M. c. or alternatively the FSB c.
Chip-to-chip and module-to-module interconnects
8. Layout of interconnections (3)
Implementation of interconnections
• Arbiter/Multi-port cache implementations: for a small number of sources/destinations, e.g. to connect dual-cores to shared L2 caches
• Crossbar: for a larger number of sources/destinations. Examples: UltraSPARC T1 (2005), UltraSPARC T2 (2007)
• Ring: for a larger number of sources/destinations. Examples: Cell BE (2006), XLR (2005)
Quantitative aspects, such as the number of sources/destinations or the bandwidth requirement, affect which implementation alternative is the most beneficial.
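One way to make the quantitative aspects concrete is a first-order count of the hardware each alternative needs. The sketch below uses a rough textbook-style cost model that is not taken from the slides: a full crossbar needs one crosspoint per source/destination pair, while a bidirectional ring needs one link per node but a message may take up to half the ring in hops. All function names are hypothetical.

```python
def crossbar_crosspoints(sources: int, destinations: int) -> int:
    """Crosspoints of a full crossbar: one per source/destination pair."""
    return sources * destinations


def ring_links(nodes: int) -> int:
    """Links of a ring: each node connects to its successor."""
    return nodes


def ring_max_hops(nodes: int) -> int:
    """Worst-case hop count on a bidirectional ring (shorter direction taken)."""
    return nodes // 2


# Hypothetical configuration: 8 cores and 4 L2 banks on a crossbar,
# versus a ring with 9 attached units.
print(crossbar_crosspoints(8, 4), ring_links(9), ring_max_hops(9))
```

The crossbar cost grows with the product of sources and destinations, while the ring grows linearly at the price of multi-hop latency, which is one reason the choice depends on the number of attached units and the bandwidth needed.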
9. Basic alternatives of macro-architectures of MC processors
9. Basic alternatives of macro-architectures of MC processors (1)
Basic dimensions spanning the design space
• Kind of related cache architecture
• Layout of the memory and I/O connection
• Layout of the IN(s) involved
Interrelationships between particular dimensions
• Cache inclusivity affects the layout of the memory connection
9. Basic alternatives of macro-architectures of MC processors (2)
Considering a selected part of the design space of the macroarchitecture of MC processors
Assumptions:
• L2 may be only partially integrated on the chip (i.e. only the tags are on the chip). This is not indicated in the design space tree but denoted in the examples.
• We do not consider how the involved INs are implemented.
• In case of an off-chip M. controller the use of an FSB is assumed.
• A two-level cache hierarchy is considered (no L3).
9. Basic alternatives of macro-architectures of MC processors (3)
Design space of the macroarchitecture of MC processors (in case of a two-level cache hierarchy)
• Off-chip M. c.: L2 private or L2 shared
• On-chip M. c.: L2 private or L2 shared
Part of the design space specifying the point of attachment for the memory and I/O
9. Basic alternatives of macro-architectures of MC processors (4)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (1)
M. c. off-chip
• L2 private: Smithfield/Presler based processors (2005/2006) (2 cores)
• L2 shared: Core 2 Duo based processors (2006) (2 cores), SPARC64 VI (2007) (2 cores), SPARC64 VII (2008) (4 cores), PA-8800 (2004) (2 cores, L2 data off-chip)
In both cases the L2 side attaches through the FSB c. to the FSB.
9. Basic alternatives of macro-architectures of MC processors (5)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (2)
M. c. on-chip
• L2 private: Athlon 64 X2 (2005) (2 cores), Opteron dual core (2005) (2 cores), UltraSPARC IV (2004) (2 cores, L2 data off-chip)
  Diagram: each core (C) has its own L2; an IN connects the L2s to the B. c. (I/O-bus) and the M. c. (memory).
• L2 shared: UltraSPARC T1 (2005) (8 cores), Cell BE (2006) (8 cores)
  Alternatives shown, with IN1 and IN2 as labelled in the diagrams:
  - Attaching both I/O and M. c. to L2 via IN2
  - Attaching M. c. to L2 via IN2 and I/O to IN1
  - Attaching both I/O and M. c. via IN1
  - Attaching I/O to L2 via IN2 and M. c. to IN1