Computer Architecture II
Paulo Marques
Departamento de Eng. Informática
Universidade de [email protected]
2004/2005

4. Examples of Some Current Processors
4.1. IA-32 Architecture

» The x86 isn't all that complex; it just doesn't make a lot of sense «
Mike Johnson, Leader of the 80x86 design at AMD, Microprocessor Report (1994)

A brief history...
1978: The Intel 8086 is announced (16-bit architecture)
1980: The 8087 floating-point coprocessor is added
1982: The 80286 increases the address space to 24 bits and adds instructions
1985: The 80386 extends the architecture to 32 bits and adds new addressing modes
1989-1995: The 80486, Pentium and Pentium Pro add a few instructions (mostly designed for higher performance)
1997: 57 new "MMX" instructions are added (Pentium II)
1999: The Pentium III adds another 70 instructions (SSE)
2001: Another 144 instructions (SSE2)
2003: AMD extends the architecture to increase the address space to 64 bits, widens all registers to 64 bits and makes other changes (AMD64)
2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions
The problem of "legacy" and "backward compatibility"

Overview
Complexity:
  Instructions can be 1 to 17 bytes long
  One operand always acts as both source and destination
  One operand can come from memory
  Complex addressing modes
What "saved" the architecture over the years:
  The most frequent instructions are not hard to implement
  Compilers do not generate the slow instructions and do not use the slow parts of the architecture
  Internally the processor was converted to a RISC architecture, keeping only a front-end that decodes the complex instructions into simple RISC µOPs
  ... Market volume

Registers (FP not shown)

Instructions
Two-operand instructions (e.g. ADD AX, BX)
Different source/destination types:
  Register/Register
  Register/Immediate
  Register/Memory
  Memory/Register
  Memory/Immediate
Multiple addressing modes:
  Absolute (e.g. MOV AX, [1000])
  Register indirect (e.g. MOV AX, [SI])
  Base mode with 8/16/32-bit displacement (e.g. MOV AX, [SI+100])
  Indexed (e.g. MOV AX, [SI+BX])
  Based indexed (e.g. MOV AX, [SI+BX+100])
  Base + scaled index (address = BaseReg + 2^Scale * IndexReg)
  Base + scaled index with displacement (as above + displacement)
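
The addressing modes listed above all reduce to one formula, address = BaseReg + 2^Scale * IndexReg + displacement, with some terms left at zero. The following is a minimal sketch of that arithmetic only; the function name, the 32-bit wraparound and the register contents are assumptions made for illustration and do not model segmentation or the real encoding.

```python
def effective_address(base=0, index=0, scale=0, displacement=0):
    """Return base + 2**scale * index + displacement, wrapped to 32 bits."""
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale must be 0..3 (factor 1, 2, 4 or 8)")
    return (base + (index << scale) + displacement) & 0xFFFFFFFF

# Hypothetical register contents, chosen only for this example.
regs = {"SI": 0x2000, "BX": 0x0010}

absolute = effective_address(displacement=1000)                        # [1000]
reg_indirect = effective_address(base=regs["SI"])                      # [SI]
base_disp = effective_address(base=regs["SI"], displacement=100)       # [SI+100]
indexed = effective_address(base=regs["SI"], index=regs["BX"])         # [SI+BX]
based_indexed = effective_address(regs["SI"], regs["BX"], 0, 100)      # [SI+BX+100]
scaled = effective_address(base=regs["SI"], index=regs["BX"], scale=2) # base + 4*index
```

Every mode is the same adder with different inputs, which is one reason the common cases are cheap to implement in hardware.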

Multiple addressing modes

Instructions (just a few...)
In many cases, the registers are not general purpose!

Instruction encoding

Extensions to the IA-32 architecture
MMX, SSE and SSE2 instructions
They consist of:
  MMX: operations on integer vectors (64-bit vectors holding 8-, 16- or 32-bit numbers)
  SSE: operations on single-precision floating-point vectors (vectors of 4 IEEE 754 floats)
  SSE2: operations on double-precision floating-point vectors (vectors of 2 IEEE 754 doubles) + an extension of the integer vectors (128-bit vectors holding 8-, 16-, 32- or 64-bit numbers)
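
To make "operations on integer vectors" concrete, here is a small sketch that simulates a lane-wise add on a 64-bit word holding eight 8-bit numbers, in the spirit of an MMX packed add with wraparound. The helper names `pack`, `unpack` and `packed_add` are invented for this example; real code would use the processor instructions or compiler intrinsics.

```python
def pack(values, lane_bits):
    """Pack small integers into one word, first value in the lowest lane."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * lane_bits)
    return word

def unpack(word, lane_bits, total_bits=64):
    """Split a packed word back into its lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask for shift in range(0, total_bits, lane_bits)]

def packed_add(a, b, lane_bits, total_bits=64):
    """Add two packed words lane by lane; carries never cross a lane boundary."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of the lane
    return result

a = pack([1, 2, 3, 4, 5, 6, 7, 255], 8)
b = pack([10, 20, 30, 40, 50, 60, 70, 1], 8)
byte_lanes = unpack(packed_add(a, b, 8), 8)

c = pack([0xFFFF, 1, 2, 3], 16)
d = pack([1, 1, 1, 1], 16)
word_lanes = unpack(packed_add(c, d, 16), 16)
```

The last byte lane wraps from 255 + 1 to 0 without disturbing its neighbours, which is the difference between a packed add and an ordinary 64-bit add.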

4. Examples of Some Current Processors
4.2. Intel Pentium 4

IA-32 instructions and µOPs
All modern implementations of the IA-32 architecture convert the original instructions into a sequence of micro-instructions. Intel calls these µOPs.
µOPs are quite similar to RISC instructions: constant size, uniform format, etc.
An IA-32 instruction is at least 1 µOP. A complex instruction can correspond to hundreds of them (!) (e.g. REP MOVSB)
[Figure: MOV AX, [1000] decoded into µOP 1, µOP 2, µOP 3, µOP 4]
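
As a rough illustration of the idea, the sketch below splits a two-operand arithmetic instruction with a memory operand into load, compute and store steps. The µOP format, the temporary names `tmp0`/`tmp1` and the function `split_into_uops` are hypothetical; Intel's real µOP encoding is not public in this form, and the sketch covers only simple arithmetic instructions.

```python
from collections import namedtuple

# Hypothetical µOP format: operation, destination, two sources.
MicroOp = namedtuple("MicroOp", "op dest src1 src2")

def is_memory(operand):
    """Memory operands are written in brackets, e.g. [1000]."""
    return operand.startswith("[") and operand.endswith("]")

def split_into_uops(mnemonic, dest, src):
    """Split one two-operand arithmetic instruction into simple steps."""
    uops = []
    left, right = dest, src
    if is_memory(src):
        uops.append(MicroOp("LOAD", "tmp0", src, None))
        right = "tmp0"
    if is_memory(dest):
        uops.append(MicroOp("LOAD", "tmp1", dest, None))
        left = "tmp1"
    # The destination is also the first source (two-operand form).
    uops.append(MicroOp(mnemonic, left, left, right))
    if is_memory(dest):
        uops.append(MicroOp("STORE", dest, left, None))
    return uops

reg_reg = split_into_uops("ADD", "AX", "BX")       # one µOP
mem_dest = split_into_uops("ADD", "[1000]", "AX")  # load, add, store
```

Each resulting step touches either memory or the ALU, never both, which is what makes the µOPs RISC-like.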

Some characteristics of the Pentium 4 (2000)
Speculative-execution pipeline with several functional units (NetBurst architecture)
  20-stage pipeline
  7 functional units
  Up to 126 µOPs in flight in the pipeline (of which 48 LOADs and 24 STOREs)
  Completes up to 3 µOPs per clock cycle
  ALUs run at twice the clock speed
Use of a Trace Cache
Two Branch Target Buffers
  Front-end: 4K entries
  Trace cache: 512 entries
Use of Register Renaming (8 registers → 128) in addition to a Re-order Buffer
  Register renaming eliminates name dependencies
  The re-order buffer guarantees the commit order of instructions
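
The way renaming removes name dependencies can be sketched in a few lines: every write to an architectural register is given a fresh physical register, so a later write no longer conflicts with earlier reads or writes of the same name. This is a simplified textbook model, not the Pentium 4's actual mechanism; the function `rename`, the tuple format and the lack of register reclamation are assumptions of the example.

```python
def rename(instructions, num_physical=128):
    """Map architectural registers to physical ones.

    Each instruction is (dest, src1, src2) using architectural names.
    Every destination gets a fresh physical register, which removes
    WAR and WAW (name) dependencies while keeping true RAW dependencies.
    Physical registers are never freed in this sketch.
    """
    mapping = {}                       # architectural name -> physical register
    free = list(range(num_physical))   # pool of unused physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        sources = []
        for src in (src1, src2):
            if src not in mapping:     # first use: give it a home
                mapping[src] = free.pop(0)
            sources.append(mapping[src])
        mapping[dest] = free.pop(0)    # fresh register for every write
        renamed.append((mapping[dest], sources[0], sources[1]))
    return renamed

program = [
    ("EAX", "EBX", "ECX"),  # EAX = EBX op ECX
    ("EDX", "EAX", "EBX"),  # reads EAX: true dependency on the first instruction
    ("EAX", "ESI", "EDI"),  # rewrites EAX: WAW with the first, WAR with the second
]
renamed = rename(program)
```

After renaming, the third instruction writes a different physical register from the first, so it can execute before the second instruction has read the old value.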

Pentium 4 overview

Pipeline layout

Trace Cache
A trace cache is a sophisticated version of an (L1) instruction cache.
When the trace cache is accessed with the address of a given IA-32 instruction, one of 3 things happens:
  The translation of the instruction is in the cache. Up to 3 µOPs are produced. The 3 may represent between 1 and 3 IA-32 instructions, so the IA-32 PC advances by 1 to 3 instructions.
  The translation of the instruction is in the cache, but it needs more than 4 µOPs. For these "complex instructions", control is passed to a program in a micro-ROM until the complete sequence has been produced.
  The translation is not in the cache. In this case, the IA-32 decoder is used to translate the instruction, and the result is placed in the cache.
Note that the next time the instruction is executed, it will typically already be decoded in the cache.
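
The three outcomes described above can be sketched as a lookup keyed by instruction address. This is a deliberately simplified model: the class `TraceCache`, the `toy_decoder` callable and the per-instruction keying are assumptions of the example, and a real trace cache stores multi-instruction traces rather than one entry per instruction.

```python
MAX_UOPS_PER_ACCESS = 3   # from the slide: up to 3 µOPs are produced
MICROCODE_THRESHOLD = 4   # from the slide: more than 4 µOPs go to the micro-ROM

class TraceCache:
    def __init__(self, decoder):
        self.entries = {}       # instruction address -> list of µOPs
        self.decoder = decoder  # hypothetical callable: address -> list of µOPs

    def access(self, address):
        """Return (outcome, µOPs) for one access."""
        if address not in self.entries:
            # Miss: use the IA-32 decoder and place the result in the cache.
            self.entries[address] = self.decoder(address)
            return "miss", self.entries[address]
        uops = self.entries[address]
        if len(uops) > MICROCODE_THRESHOLD:
            # Complex instruction: the micro-ROM produces the full sequence.
            return "microcode", uops
        # Hit: deliver up to 3 µOPs (any remainder is not modelled here).
        return "hit", uops[:MAX_UOPS_PER_ACCESS]

# Toy decoder with made-up translations for two addresses.
TRANSLATIONS = {0x10: ["u1"], 0x20: ["u%d" % i for i in range(6)]}

def toy_decoder(address):
    return TRANSLATIONS[address]

cache = TraceCache(toy_decoder)
first = cache.access(0x10)     # not yet cached
second = cache.access(0x10)    # already decoded
cache.access(0x20)             # fills the entry for the complex instruction
complex_hit = cache.access(0x20)
```

The second access to the same address skips the decoder entirely, which is the point made in the note above.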

Trace Cache (2)
The trace cache stores sequences of executed instructions that extend past branches

Detailed view of the Pentium 4 (2000)

Pentium 4 Die

4. Examples of Some Current Processors
4.3. AMD Opteron (& Athlon64)

Top processors on SPEC2000 (July/04)
[Chart: CPU INTEGER PERFORMANCE. Axis: CPUINT2000, 0 to 1800. Processors shown: Intel Pentium4 HT 3.4GHz Extreme Edition (Mar/04); AMD Opteron 150 2.4GHz (May/04); Intel Xeon 3.2GHz (Feb/04); Fujitsu SPARC64 V 1.9GHz (Jun/04); Itanium2 1.5GHz (Dec/03); IBM POWER4+ 1.9GHz (May/04); Alpha 21264C 1.2GHz (Nov/02); PowerMac G5 2.0GHz (Dec/03)***]

Top processors on SPEC2000 (July/04)
[Chart: CPU FLOATING POINT PERFORMANCE. Axis: CFP2000, 0 to 2500. Processors shown: HP / Itanium2 1.5GHz (Feb/04); Fujitsu SPARC64 V 1.9GHz (Jun/04); IBM POWER4+ 1.7GHz (May/04); AMD Opteron 248 2.2GHz (May/04); Pentium4 HT 3.4GHz Extreme Edition (Mar/04); Alpha 21364 1.2GHz (May/03); AMD Athlon FX-51 2.2GHz (Sep/03); Xeon 3.2GHz (Apr/04)]

Processor Market
The PC market has led Intel and AMD to really boost the integer performance of their processors
  To the point that they have largely surpassed the performance available in classical RISC chips
Floating-point performance is increasing, although RISC/Vector/VLIW processors still have an edge
  No consumer need in the PC market
  Scientific workstations need FP performance
In the server market, what matters is not so much peak performance but throughput and reliability
  Xeon systems
  Itanium
  POWER4+

64-bit World
64-bit machines have been available for a long time in the scientific and business markets
  e.g. SPARCv9, Alpha, POWER4+, ...
What does 64-bit bring?
  Increased address space (32-bit: 4 GByte max; 64-bit: 16,384 PByte!)
  Increased dynamic range for variables (32-bit int: 0 to 4294967295; 64-bit int: 0 to 18446744073709551615)
64-bit does not bring increased performance automatically!
  It may have the opposite effect: memory traffic can double when going from 32-bit to 64-bit!
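
The figures on this slide follow directly from powers of two, and so does the memory-traffic warning: a pointer or native integer takes twice as many bytes. A short sketch of the arithmetic, using binary units (1 GByte = 2^30 bytes, 1 PByte = 2^50 bytes) as an assumption about how the slide counts; the function names are invented for the example.

```python
GIB = 2 ** 30   # bytes in one GByte (binary)
PIB = 2 ** 50   # bytes in one PByte (binary)

def address_space_bytes(bits):
    """Number of distinct byte addresses with the given pointer width."""
    return 2 ** bits

def unsigned_max(bits):
    """Largest unsigned integer representable in the given width."""
    return 2 ** bits - 1

def pointer_array_bytes(count, pointer_bits):
    """Memory taken by an array of pointers of the given width."""
    return count * pointer_bits // 8

space32_gbyte = address_space_bytes(32) // GIB
space64_pbyte = address_space_bytes(64) // PIB
max32 = unsigned_max(32)
max64 = unsigned_max(64)

# The same 1000-entry pointer table in 32-bit and 64-bit code.
table32 = pointer_array_bytes(1000, 32)
table64 = pointer_array_bytes(1000, 64)
```

The doubling applies to pointers and 64-bit integers; data that stays 8, 16 or 32 bits wide does not grow, so real programs usually see less than a full doubling.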

Main contenders in the 64-bit server market
SPARCv9 (Sun and Fujitsu)
  Future uncertain. Mostly used in the high-end market; keeps going partly because of its installed base.
Intel Itanium2
  Future uncertain. AMD's processors are much better and Intel EM64T is a copy of AMD64. Poor performance for its price when compared with the competition.
AMD64 Opteron (and Athlon64)
  Has taken the lead in the market by proposing an architecture that runs both 32-bit and 64-bit applications with good performance. Superior memory bandwidth. Problem: IT'S NOT INTEL!
Intel's Extended Memory 64 Processors
  Intel licensed the AMD technology and has launched an architecture that is exactly (or almost) equal. It is currently available in high-end Xeon machines.
Note: IBM POWER4+ still dominates the high-end multi-way server market

AMD64 - Dual Mode
AMD has proposed an architecture which allows the execution of both 32-bit and 64-bit applications (x86-64)
No need to recompile old applications
32-bit applications execute with the same performance
64-bit applications take advantage of a larger address space, more registers, etc.
Operating system support:
  Linux (SuSE, Redhat, ...)
  Windows Server 2003 (beta)
  Solaris (2nd half of 2004)
  FreeBSD & NetBSD
  "Java 1.5"
[Figure: a 64-bit operating system (e.g. Linux64 or Windows2003-64) running both a "legacy" 32-bit application (4GB memory limit) and a 64-bit application]
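
The Instruction Set Architecture
[Figure: AMD64 registers and instructions]
  Registers in x86: GPRs EAX (bits 31..0, with AX in bits 15..0 split into AH and AL, AL being bits 7..0) through EDI; EIP; x87 registers (bits 79..0); XMM0 to XMM7 (bits 127..0)
  Registers added by AMD64: GPRs widened to 64 bits (RAX, bits 63..0); R8 to R15; XMM8 to XMM15
  Instructions: IA-32 instructions + new prefixes; Next 64-bit mode instructions
  (INTEL's look alike!)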
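
The overlapping register names in the figure (RAX, EAX, AX, AH, AL) are views of the same 64 bits. The sketch below models one such register with bit masks; the class `Reg` and its method names are invented for the example. It also encodes one AMD64 rule worth knowing: in 64-bit mode a 32-bit write zero-extends into the full register, while 16-bit and 8-bit writes leave the other bits untouched.

```python
MASK64 = (1 << 64) - 1

class Reg:
    """One 64-bit general-purpose register viewed as RAX/EAX/AX/AH/AL."""

    def __init__(self, value=0):
        self.rax = value & MASK64

    @property
    def eax(self):
        return self.rax & 0xFFFFFFFF

    @property
    def ax(self):
        return self.rax & 0xFFFF

    @property
    def ah(self):
        return (self.rax >> 8) & 0xFF

    @property
    def al(self):
        return self.rax & 0xFF

    def write_eax(self, value):
        # A 32-bit write clears the upper 32 bits (zero extension).
        self.rax = value & 0xFFFFFFFF

    def write_ax(self, value):
        # A 16-bit write preserves bits 63..16.
        self.rax = (self.rax & ~0xFFFF & MASK64) | (value & 0xFFFF)

    def write_al(self, value):
        # An 8-bit write preserves bits 63..8.
        self.rax = (self.rax & ~0xFF & MASK64) | (value & 0xFF)

r = Reg(0x1122334455667788)
views = (r.eax, r.ax, r.ah, r.al)
r.write_al(0xFF)
after_al = r.rax
r.write_eax(1)
after_eax = r.rax
```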

Why More Registers?
[Figure: Number of Registers Each Function in the Program Needs]
Question: If processors do register renaming, why do we need more programmer-visible registers?

AMD Opteron Architecture
The memory controller is included in the CPU
  6.4GB/sec
HyperTransport
  Point-to-point link for high-speed circuits
  Standard (international consortium)
  3x 6.4GB/sec inter-processor connections
  Up to 19.2GB/s peak aggregate bandwidth
  (AMD Athlon64 only has one HyperTransport link)
[Figure: AMD Opteron™ processor architecture. Blocks: AMD64 Core; L1 Instruction Cache; L1 Data Cache; L2 Cache; DDR Memory Controller (directly to memory); HyperTransport™ technology (to other processors/devices)]

Difference to traditional systems
[Figure: two block diagrams]
  Opteron system: Opteron CPUs, DDR memory, PCI-X bridges, I/O hubs, PCI, PCI-X, IDE/FDC/USB etc., other CPUs or devices
  Traditional system: CPUs, North Bridge, South Bridge, DDR, PCI-X bridges, PCI, PCI-X, IDE/FDC/USB etc., other CPUs or devices

AMD64 Core (Opteron - Hammer)
Superscalar, out-of-order, multi-issue processor
10 execution units
  3 integer ALUs
  3 FP ALUs
  3 address calculation units
  1 load/store unit
12-stage pipeline
  17 stages for FP
IA-32 instructions are translated into MacroOps (MOPs)
  Single-part MOPs: arithmetic operations or memory accesses
  Two-part MOPs: an arithmetic operation and a memory access
Dynamic branch prediction
  Local history table + global history table (16K entries)
  Branch Target Buffer: 2K branches
Integrated DDR memory controller
Opteron’s Core
Moving Instructions from Memory to Cache
When code is first moved into the Athlon's L1 instruction cache, the processor's predecode logic examines the newly cached lump of code in order to detect individual instruction boundaries, and it marks those boundaries with a small amount of "metadata" so that the front end has less work to perform. The predecode logic also marks static branches.
This predecoding process moves some of the front-end work to an earlier portion of the pipeline, speeding the actual fetch and decode phases later. The drawback is that the extra metadata eats up valuable L1 I-cache space.
[Figure: Processor, Instruction Cache, Memory]
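
The predecode step can be pictured as a pass over a freshly cached block that records one marker bit per byte wherever an instruction starts. The sketch below uses a made-up variable-length encoding in which the low two bits of the first byte give the instruction length; this is not the x86 encoding, whose length rules are far more involved, and the names `toy_length` and `predecode` are hypothetical.

```python
def toy_length(opcode):
    """Instruction length in bytes for a made-up encoding (NOT x86)."""
    return 1 + (opcode & 0x03)

def predecode(block):
    """Return one marker per byte: 1 where an instruction starts, else 0."""
    marks = [0] * len(block)
    position = 0
    while position < len(block):
        marks[position] = 1
        position += toy_length(block[position])
    return marks

# Ten bytes of toy code: instructions of length 2, 1, 4 and 3.
code = bytes([0x01, 0xAA, 0x00, 0x03, 0x10, 0x20, 0x30, 0x02, 0x55, 0x66])
marks = predecode(code)
instruction_starts = [i for i, bit in enumerate(marks) if bit]
```

With the markers in place, the fetch stage can find several instruction boundaries at once instead of decoding lengths serially; the cost is the extra storage for `marks`, which in this toy model is one bit per code byte.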

Processor Frontend
[Figure: front-end block diagram]
  16 bytes are read at a time (about 5 IA-32 instructions)
  FastPath Decoder (instructions that translate into at most 2 MOPs): max 3 IA-32 instructions per clock, max 3 MOPs per clock
  Micro ROM (everything else): max 1 IA-32 instruction per clock, max 3 MOPs per clock
  Issue slots (3 instructions)
Opteron’s Pipeline
Opteron’s Die

Reading material
Computer Architecture: A Quantitative Approach
  Section 3.10
  Appendix D
Articles
  Jon "Hannibal" Stokes, "The Pentium 4 and the G4e: an Architectural Comparison: Part I", in Ars Technica, July 2001, http://arstechnica.com/articles/paedia/cpu/p4andg4e.ars/1
  Jon "Hannibal" Stokes, "The Pentium 4 and the G4e: an Architectural Comparison: Part II", in Ars Technica, July 2001, http://arstechnica.com/articles/paedia/cpu/p4andg4e2.ars
  Jon "Hannibal" Stokes, "Inside AMD's Hammer: the 64-bit architecture behind the Opteron and Athlon 64", in Ars Technica, January 2005, http://arstechnica.com/articles/paedia/cpu/amd-hammer-1.ars
  Viktor Kartunov, "Facts & Assumptions about the Architecture of AMD Opteron and Athlon 64", in Digit-Life, http://www.digit-life.com/articles2/amd-hammer-family/index.html