7/27/2019 Superscalar Architectures
1/36
Superscalar Pipeline Architectures
By: Matthew Osborne, Philip Ho, Xun Chen
April 19, 2004
-
Superscalar Architecture
Relatively new; first appeared in the early 1990s
Builds on the concept of pipelining
Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)
Allows the instruction execution rate to exceed the clock rate (CPI of less than 1)
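As a rough illustration of what "CPI of less than 1" means, the sketch below computes cycles per instruction (CPI) and its inverse, instructions per cycle (IPC). The cycle and instruction counts are hypothetical, chosen only to show the arithmetic.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average number of clock cycles spent per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average number of instructions completed per clock cycle."""
    return instructions / cycles


# Hypothetical run of 1000 instructions (numbers are assumptions).
scalar_cpi = cpi(cycles=1000, instructions=1000)      # one instruction per cycle
superscalar_cpi = cpi(cycles=500, instructions=1000)  # two instructions per cycle
superscalar_ipc = ipc(cycles=500, instructions=1000)
```

Here the scalar pipeline has a CPI of 1.0, while the superscalar case has a CPI of 0.5, which is the same as an IPC of 2.0.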
-
Overview of Selected Superscalar Architectures
Intel
MIPS
PowerPC T 1000 Architectures
Hobbes: A Multi-threaded superscalar
-
Intel Superscalar Architecture
According to Sara Sarimento, in her essay "Recent History of Intel Architecture - A Refresher":
- Intel's first use of a superscalar architecture was in its Pentium processor
- Instruction-level parallelism: instructions that are independent of one another's outcomes execute concurrently, to utilize more of the available hardware resources and increase instruction throughput.
-
Intel P5 Microarchitecture
Used in the initial Pentium processor
Could execute up to 2 instructions simultaneously
Instructions were sent through the pipeline in order: if the next two instructions had a dependency issue, only one instruction (pipe) would be executed and the second execution unit (pipe) went unused for that clock cycle.
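The in-order pairing behaviour described above can be sketched as follows. This is a simplified model, not the actual P5 pairing rules: it assumes each instruction has one destination register and some source registers, and that two adjacent instructions can issue together unless the second reads or overwrites the first one's destination. The `Instr` type, `can_pair` and `schedule_in_order` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def can_pair(first: Instr, second: Instr) -> bool:
    """Assumed rule: no read-after-write or write-after-write on first.dest."""
    raw = first.dest in second.srcs
    waw = first.dest == second.dest
    return not (raw or waw)


def schedule_in_order(program):
    """Return one list of instruction indices per clock cycle (at most two)."""
    cycles = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            cycles.append([i, i + 1])   # both pipes used
            i += 2
        else:
            cycles.append([i])          # second pipe goes unused this cycle
            i += 1
    return cycles


program = [
    Instr("r1", ("r2", "r3")),  # 0
    Instr("r4", ("r1", "r5")),  # 1: reads r1, so it depends on 0
    Instr("r6", ("r7", "r8")),  # 2: independent of 1
    Instr("r9", ("r2", "r3")),  # 3
]
schedule = schedule_in_order(program)
```

With this program, instruction 0 issues alone because instruction 1 depends on it, instructions 1 and 2 issue together, and instruction 3 issues alone, for three cycles in total.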
-
Intel P6 Microarchitecture
- Used in the Pentium Pro, Pentium II and Pentium III processors
- 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the Out-of-Order Execution unit
- 10-stage instruction pipeline used in this architecture
-
Intel P6 Microarchitecture
Out-of-order instruction execution: instructions without data dependency issues are executed out of order for a higher level of hardware utilization
Scheduler unit resolves data dependency issues between individual instructions
Re-Order Buffer puts instructions back in order before writing them back to memory
Up to 3 instructions can be retired concurrently to memory
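The role of the Re-Order Buffer can be sketched with a small model: instructions finish executing in any order, but they leave the buffer strictly in program order, at most three per cycle. The function name `retire_in_order`, the completion times, and the assumption that an instruction may retire in the same cycle it completes are all illustrative choices, not details of the P6 design.

```python
from collections import deque


def retire_in_order(completion_cycle, retire_width=3):
    """completion_cycle[i] is the cycle in which instruction i finishes executing.

    Returns a dict mapping each cycle to the instructions retired in it.
    """
    rob = deque(range(len(completion_cycle)))  # entries held in program order
    retired = {}
    cycle = 0
    while rob:
        cycle += 1
        batch = []
        # Only the oldest entry may retire, and only once it has completed.
        while (rob and len(batch) < retire_width
               and completion_cycle[rob[0]] <= cycle):
            batch.append(rob.popleft())
        if batch:
            retired[cycle] = batch
    return retired


# Hypothetical completion times: instructions 1 and 3 finish first.
done = [3, 1, 2, 1, 4]
retired = retire_in_order(done)
```

Although instructions 1 and 3 complete in cycle 1, nothing retires until instruction 0 completes in cycle 3; instructions 0 to 2 then retire together, and 3 and 4 retire in cycle 4.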
-
Intel NetBurst MicroArchitecture
- New architecture used for the Intel Pentium 4 and Xeon processors
-
Intel NetBurst Microarchitecture
Changes from the P6 architecture:
Only one instruction decoder present
Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place
Increased number of pipeline stages to 20
Improved branch prediction algorithms
ALUs operate twice as quickly as their P6 counterparts
-
Intel NetBurst Microarchitecture
Execution Trace Cache
Alleviates delays in fetching and translating CISC instructions to their appropriate µops
Instructions are now decoded by a translation engine, with the resulting µops stored as traces (sequences of µops) in the Execution Trace Cache.
Traces are stored along the path of predicted program execution flow, with results of branches in the code integrated into this path
Delivers up to 3 µops to the core of the Execution Unit per clock cycle
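The idea of caching decoded µops can be sketched as a toy model. It assumes traces are keyed by their starting address and that an instruction with a memory operand decodes into a load plus an arithmetic µop. `TraceCache`, `toy_decode`, `deliver` and the instruction encoding are hypothetical and do not reflect the real NetBurst structures.

```python
class TraceCache:
    """Stores already-decoded micro-op sequences, keyed by start address."""

    def __init__(self, decode):
        self.decode = decode
        self.traces = {}
        self.decodes = 0  # how many times the decoder had to run

    def lookup(self, start_address, predicted_path):
        if start_address not in self.traces:
            # Miss: decode the predicted instruction path once and keep it.
            self.decodes += 1
            self.traces[start_address] = [
                uop for instr in predicted_path for uop in self.decode(instr)
            ]
        return self.traces[start_address]


def toy_decode(instr):
    """Assumed decoding: a memory source operand becomes a separate load."""
    op, dest, src = instr
    if src.startswith("["):
        return [("load", "tmp", src), (op, dest, "tmp")]
    return [instr]


def deliver(trace, width=3):
    """Split a trace into groups of at most `width` micro-ops per cycle."""
    return [trace[i:i + width] for i in range(0, len(trace), width)]


cache = TraceCache(toy_decode)
path = [("add", "eax", "[x]"), ("sub", "ebx", "ecx"), ("add", "edx", "[y]")]
first = cache.lookup(0x1000, path)
second = cache.lookup(0x1000, path)   # hit: no decoding needed
groups = deliver(first)
```

The second lookup returns the stored trace without invoking the decoder again, which is the delay the trace cache is meant to remove; the five µops are then delivered in groups of three and two.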
-
Intel NetBurst Microarchitecture
Branch Prediction
Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible
Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy
Downside: despite the improved prediction algorithm, mispredicted branches are one of the biggest costs of this architecture, because its instruction pipeline is longer than in previous architectures.
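A back-of-the-envelope model shows why a longer pipeline makes mispredictions more expensive. It assumes the misprediction penalty is roughly equal to the pipeline depth and uses a simple additive CPI model; the branch fraction and misprediction rates below are hypothetical, not measured values for any Intel processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall cycles added by mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches.
ten_stage = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=10)
twenty_stage_same_predictor = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=20)
twenty_stage_better_predictor = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
```

Under these assumptions, doubling the pipeline depth with an unchanged predictor raises the CPI from about 1.2 to about 1.4, and the predictor has to halve its misprediction rate just to get back to about 1.2.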
-
MIPS Superscalar Architecture
MIPS is a RISC instruction platform, versus Intel's CISC instruction platform (this made the design of a superscalar architecture easier than for Intel's CISC platform)
The first MIPS processor with a superscalar architecture was the 64-bit MIPS R8000, released in 1994.
-
MIPS R8000 Features
Superscalar
Can support/process 4 in-order instructions each cycle
Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)
Designed for peak performance with floating-point operations
-
MIPS R8000 Pitfalls
Integer operation performance limited
Very high cost
As a result of these two key factors:
The R8000 was only in the marketplace for about a year.
This processor was mainly used only in the scientific community
-
MIPS R10000 Processor
Superscalar pipeline architecture for the R10000 processor.
Diagram courtesy of the R10000 Microprocessor User's Manual.
http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html
-
MIPS R10000 Processor
5 Execution Pipelines
- Load/Store Unit
- Two Integer ALUs
- Floating Point Adder
- Floating Point Multiplier
Can process up to 4 out-of-order instructions simultaneously
Base architecture core that all successor MIPS processors have been built from
-
PowerPC
Direct descendant of the IBM 801, RT PC and RS/6000
All are RISC
RS/6000 was the first of these to be superscalar
PowerPC 601 superscalar design is similar to the RS/6000
Later versions extend the superscalar concept
-
PowerPC 601 Pipeline
-
PowerPC 601 General View
-
PowerPC storage model
Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types.
Handles string operations for multi-byte strings up to 128 bytes
32-bit PowerPC implementations support a 4-GB effective address space.
64-bit PowerPC implementations support a 16-exabyte effective address space.
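The address-space figures above follow directly from the address width, as this short check illustrates. It assumes byte addressing and binary units (1 GB taken as 2^30 bytes, 1 exabyte as 2^60 bytes); the names are illustrative.

```python
GIB = 2 ** 30   # bytes in a gigabyte (binary units assumed)
EIB = 2 ** 60   # bytes in an exabyte (binary units assumed)

# Widths of the data types listed above, in bits.
DATA_TYPE_BITS = {"byte": 8, "halfword": 16, "word": 32, "doubleword": 64}


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given width."""
    return 2 ** address_bits


space_32 = address_space_bytes(32) // GIB   # size in gigabytes
space_64 = address_space_bytes(64) // EIB   # size in exabytes
```

A 32-bit effective address reaches 4 GB and a 64-bit effective address reaches 16 exabytes, matching the slide.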
-
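General-purpose registers (GPR)
User Instruction Set Architecture specifies that all implementations have 32 GPRs
GPRs are the source and destination of all integer operations
No lookup is done for GPR0's contents (in some instructions, naming GPR0 as an operand selects the value zero rather than the register's contents).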
-
Floating-point registers (FPR)
All implementations have 32 FPRs.
FPRs are the source and destination operands of all floating-point operations.
They contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values.
-
Special-purpose registers (SPR)
Give status and control of resources within the processor core.
SPRs that applications can read and write without support from a system service include the Count Register, the Link Register and the Integer Exception Register.
SPRs that can only be read by applications with support from a system service include the Time Base and other timers.
-
Hobbes
A multi-threaded architecture attempts to increase pipeline utilization by concurrently executing instructions from different threads.
The architecture chosen was an aggressive, speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set.
The Hobbes architecture combines multi-threading with superscalar issue, with the supposition that the strengths of one should offset the weaknesses of the other.
By supporting superscalar issue from more than one thread, the architecture overcomes the lack of instruction-level parallelism that plagues other superscalar structures.
-
Multi-threaded Architectures
Multi-threaded processors can concurrently execute instructions from more than one thread.
The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads.
Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle.
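The traditional round-robin strategy can be sketched as follows. The model assumes one instruction issues per cycle and that a thread with nothing left to issue simply wastes its slot; the thread names and instruction labels are hypothetical.

```python
from collections import deque


def round_robin_issue(threads):
    """threads maps a thread name to its instructions in program order.

    Returns a list of (cycle, thread, instruction) tuples.
    """
    queues = {name: deque(instrs) for name, instrs in threads.items()}
    order = list(queues)
    trace = []
    cycle = 0
    while any(queues.values()):
        name = order[cycle % len(order)]   # switch to the next thread each cycle
        if queues[name]:
            trace.append((cycle, name, queues[name].popleft()))
        cycle += 1                          # an empty thread wastes its slot
    return trace


trace = round_robin_issue({"T0": ["a0", "a1", "a2"], "T1": ["b0"]})
```

In this run thread T1 finishes early, so cycle 3 issues nothing. Idle slots of this kind are one motivation for issuing from several threads in the same cycle instead.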
-
The Thread Unit of Hobbes
The Thread Unit contains all of the elements required to support a single thread.
It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage.
-
The Thread Unit
Instruction fetch is performed by reading an entire cache line of four words and storing it in the fetch buffer.
Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available.
Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified.
The register file is very similar to that found on the R2000. The register file has two write ports, and both of these may be from the same thread.
Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit.
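The decode-then-stall behaviour of a single thread can be sketched as below. The model is an illustration under stated assumptions, not the Hobbes implementation: one instruction is decoded per cycle in program order, every result becomes available a fixed `latency` cycles after its instruction enters the issue buffer, and registers never written by the program are always ready. The function name and the register names are hypothetical.

```python
def issue_buffer_cycles(program, latency=2):
    """program is a list of (dest, srcs) pairs in program order.

    Returns the cycle in which each instruction enters the issue buffer.
    """
    ready_at = {}   # register -> cycle in which its value becomes available
    cycle = 0
    entered = []
    for dest, srcs in program:
        cycle += 1  # decode the next instruction, strictly in program order
        operands_ready = max((ready_at.get(s, 0) for s in srcs), default=0)
        cycle = max(cycle, operands_ready)   # stall until operands are available
        entered.append(cycle)
        ready_at[dest] = cycle + latency
    return entered


program = [
    ("r1", ("r2",)),   # no pending operands
    ("r3", ("r1",)),   # must wait for r1
    ("r4", ("r5",)),   # independent, but queued behind the stalled instruction
]
entered = issue_buffer_cycles(program)
```

The second instruction stalls until `r1` is available, and the third, although independent, cannot pass it because each thread issues in program order. Instructions from other threads could use those stalled cycles.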
-
The Execution Units of Hobbes
The Hobbes architecture has an almost identical set of execution units to an out-of-order superscalar processor.
The characteristics of the execution units approximately correspond to those of the R2000/R2010.
Execution Units
Integer: 2 ALUs, Shifter, Multiply / Divide, Load / Store, Data cache interface
FP: FP Convert, FP Add, FP Multiply, FP Divide
-
Superscalar Architecture
Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction
This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do.
For superscalar architectures to experience speed-up over traditional pipelined architectures, they require the average level of available instruction-level parallelism to be greater than one.
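One common way to estimate the available instruction-level parallelism is to divide the instruction count by the length of the longest dependency chain. The sketch below does this under simplifying assumptions: unit latency, only true (read-after-write) dependences, unlimited execution units, and a non-empty program. The function name and register names are hypothetical.

```python
def average_ilp(program):
    """program is a non-empty list of (dest, srcs) pairs in program order."""
    depth = []         # length of the dependency chain ending at each instruction
    last_writer = {}   # register -> index of the latest instruction writing it
    for i, (dest, srcs) in enumerate(program):
        producers = [depth[last_writer[s]] for s in srcs if s in last_writer]
        depth.append(1 + max(producers, default=0))
        last_writer[dest] = i
    return len(program) / max(depth)


independent = [("r1", ("a",)), ("r2", ("b",)), ("r3", ("c",)), ("r4", ("d",))]
chain = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

ilp_independent = average_ilp(independent)
ilp_chain = average_ilp(chain)
```

Four independent instructions give an ILP of 4.0, so a wide machine can overlap them. A four-instruction dependency chain gives an ILP of 1.0, in which case extra issue width yields no speed-up over a traditional pipeline.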
-
References
Hennessy, John L. and Patterson, David A. Computer Organization and Design: The Hardware/Software Interface. San Francisco: Morgan Kaufmann Publishers, 1998.
Sarimento, Sara. Recent History of Intel Architecture - A Refresher. 17 April 2004. Intel Corporation, www.intel.com. 18 April 2004. http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm
Zhou & Martonosi. Augmenting Modern Superscalar Architectures with Configurable Extended Instructions. 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf
Kish & Preiss. Hobbes: A Multi-Threaded Superscalar Architecture. 19 April 2004. http://www.brpreiss.com/page75.html
R10000 Processor User's Manual. 9 Dec 1996. SGI Corporation. 22 April 2004. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1
MIPS Architecture. 17 April 2004. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Main_Page. 23 April 2004. http://en.wikipedia.org/wiki/MIPS_architecture
Mapleson, Ian. Indigo 2 and Power Indigo 2 Technical Report. Silicon Graphics. 23 April 2004. http://sgi.cartsys.net/i2sec7.html
Power PC Architecture. 23 April 2004. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html