superscalar architectures


Superscalar Pipeline Architectures

By: Matthew Osborne, Philip Ho, Xun Chen

April 19, 2004



Superscalar Architecture

- Relatively new in commercial microprocessors, first appearing around 1990
- Builds on the concept of pipelining
- Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)
- Allows the instruction execution rate to exceed the clock rate (CPI of less than 1)
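To make the "CPI of less than 1" point concrete, here is a minimal arithmetic sketch. The function name and the 100 MHz clock are illustrative assumptions, not figures from the slides.

```python
def instructions_per_second(clock_hz, cpi):
    """Instruction rate for a given clock rate and average cycles per instruction."""
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    return clock_hz / cpi

# Hypothetical 100 MHz clock, for illustration only.
scalar_rate = instructions_per_second(100e6, 1.0)       # one instruction per cycle
superscalar_rate = instructions_per_second(100e6, 0.5)  # two instructions per cycle
```

With a CPI of 0.5 the instruction rate is twice the clock rate, which is only possible when more than one instruction completes per cycle.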



Overview of Selected Superscalar Architectures

- Intel
- MIPS
- PowerPC T 1000 Architectures
- Hobbes: A Multi-threaded superscalar architecture



Intel Superscalar Architecture

According to Sara Sarimento, in her essay "Recent History of Intel Architecture - A Refresher":

- Intel's first use of a superscalar architecture was in its Pentium processor
- Instruction-level parallelism: instructions that are independent of one another's outcome execute concurrently to utilize more of the available hardware resources and increase instruction throughput



Intel P5 Microarchitecture

- Used in the initial Pentium processor
- Could execute up to 2 instructions simultaneously
- Instructions were sent through the pipeline in order: if the next two instructions had a dependency issue, only one instruction (pipe) would be executed and the second execution unit (pipe) went unused for that clock cycle
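The in-order pairing behaviour described above can be sketched as a small cycle counter. This is a simplified model under stated assumptions: the `Instr` type and the dependency test (a register written by the first instruction being read or written by the second) are illustrative, and the real pairing rules of the processor are more restrictive than this.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction

def depends(first, second):
    """True if `second` reads or overwrites the register `first` writes."""
    return first.dest in second.srcs or first.dest == second.dest

def dual_issue_cycles(program):
    """Count issue cycles for an in-order machine with two pipes."""
    cycles = 0
    i = 0
    while i < len(program):
        # Pair the next two instructions only when they are independent;
        # otherwise the second pipe stays idle for this cycle.
        if i + 1 < len(program) and not depends(program[i], program[i + 1]):
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles

program = [
    Instr("a", ("b", "c")),
    Instr("d", ("a",)),     # depends on the previous instruction
    Instr("e", ("f",)),
    Instr("g", ("h",)),
]
print(dual_issue_cycles(program))  # prints 3
```

Four fully independent instructions would issue in two cycles; the single dependency in this example costs one extra cycle because issue is strictly in order.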




Intel P6 Microarchitecture

- Used in the Pentium Pro, II and III processors
- 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (μops) for the Out-of-Order Execution unit
- 10-stage instruction pipeline utilized in this architecture




- Out-of-order instruction execution: instructions without data dependency issues are executed out of order for a higher level of hardware utilization
- Scheduler unit resolves data dependency issues between individual instructions
- Re-Order Buffer puts instructions back in program order before their results are committed to registers and memory
- Up to 3 instructions can be retired concurrently
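The role of the Re-Order Buffer can be illustrated with a minimal sketch: instructions may complete in any order, but they leave the buffer strictly in program order, at most three per cycle. The class and method names below are hypothetical and model only the ordering rule, not the actual hardware structure.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: out-of-order completion, in-order retirement."""

    def __init__(self, retire_width=3):
        self.retire_width = retire_width
        self.entries = deque()   # instruction tags in program order, oldest first
        self.done = set()        # tags whose execution has finished

    def dispatch(self, tag):
        self.entries.append(tag)

    def complete(self, tag):
        self.done.add(tag)

    def retire(self):
        """Retire up to `retire_width` finished instructions from the head."""
        retired = []
        while (self.entries and len(retired) < self.retire_width
               and self.entries[0] in self.done):
            tag = self.entries.popleft()
            self.done.discard(tag)
            retired.append(tag)
        return retired

rob = ReorderBuffer()
for tag in range(5):
    rob.dispatch(tag)
rob.complete(2)
rob.complete(1)
print(rob.retire())  # prints [] because instruction 0 has not finished
rob.complete(0)
print(rob.retire())  # prints [0, 1, 2]
```

Instructions 1 and 2 finished early but had to wait for instruction 0, which is what keeps the visible machine state consistent with program order.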



Intel NetBurst Microarchitecture

- New architecture used for the Intel Pentium 4 and Xeon processors



Intel NetBurst Microarchitecture

Changes from the P6 architecture:

- Only one instruction decoder present
- Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place
- Increased number of pipeline stages to 20
- Improved branch prediction algorithms
- ALUs operate twice as quickly as their P6 counterparts




Execution Trace Cache

- Alleviates delays in fetching and translating CISC instructions into their corresponding μops
- Instructions are now decoded by a translation engine, with the resulting μops stored as traces (sequences of μops) in the Execution Trace Cache
- Traces are stored along the path of predicted program execution flow, with the results of branches in the code integrated into this path
- Delivers up to 3 μops to the core of the Execution Unit per clock cycle
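The idea of caching decoded micro-operations can be sketched as a lookup table keyed by the starting address of a trace, so that decoding happens only on a miss. Everything here is an illustrative assumption: the `TraceCache` class, the `decode` callback standing in for the translation engine, and the string micro-ops. It does not model branch-path integration or replacement policy.

```python
class TraceCache:
    """Toy trace cache: decode once, then replay the stored micro-ops."""

    def __init__(self, decode, width=3):
        self.decode = decode   # hypothetical translation engine: address -> list of micro-ops
        self.width = width     # micro-ops delivered per cycle
        self.traces = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        """Return the trace starting at `address`, decoding only on a miss."""
        if address in self.traces:
            self.hits += 1
        else:
            self.misses += 1
            self.traces[address] = self.decode(address)
        return self.traces[address]

    def deliver(self, address):
        """Yield the trace in groups of at most `width` micro-ops, one group per cycle."""
        trace = self.fetch(address)
        for i in range(0, len(trace), self.width):
            yield trace[i:i + self.width]

def fake_decode(address):
    # Stand-in decoder that produces seven placeholder micro-ops.
    return [f"uop_{address:x}_{k}" for k in range(7)]

cache = TraceCache(fake_decode)
first_pass = list(cache.deliver(0x100))   # miss: the decoder runs
second_pass = list(cache.deliver(0x100))  # hit: the decoder is bypassed
```

On the second pass the decoder is never called, which is the delay the trace cache is meant to remove from the fetch path.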




Branch Prediction

- Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible
- Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy
- Downside: despite the improved prediction algorithm, mispredicted branches are one of the biggest costs of this architecture, because its instruction pipeline is longer than that of previous architectures
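Why a longer pipeline makes mispredictions more expensive can be shown with the usual back-of-the-envelope CPI model. The branch fraction, misprediction rates, and the choice of using the pipeline depth as the flush penalty are all illustrative assumptions, not measurements from the source.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average CPI once branch misprediction stalls are added to the base CPI."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload: 20% branches. Penalty crudely taken as the pipeline depth.
cpi_10_stage = effective_cpi(1.0, 0.20, 0.10, 10)   # shorter pipeline, 10% mispredicted
cpi_20_stage = effective_cpi(1.0, 0.20, 0.10, 20)   # longer pipeline, same predictor
cpi_20_better = effective_cpi(1.0, 0.20, 0.05, 20)  # longer pipeline, mispredictions halved
```

Under these assumptions, doubling the penalty doubles the stall component of the CPI, and the predictor has to halve its misprediction rate just to break even with the shorter pipeline.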



MIPS Superscalar Architecture

- MIPS is a RISC instruction platform, versus Intel's CISC instruction platform (this made the design of a superscalar architecture easier than for Intel's CISC platform)
- The first MIPS processor with a superscalar architecture was the 64-bit MIPS R8000, released in 1994



MIPS R8000 Features

- Superscalar
- Can support/process 4 in-order instructions each cycle
- Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)
- Designed for peak performance with floating-point operations



MIPS R8000 Pitfalls

- Integer operation performance was limited
- Very high cost

As a result of these two key factors:

- The R8000 was in the marketplace for only about a year
- This processor was mainly used in the scientific community



MIPS R10000 Processor

Superscalar pipeline architecture for the R10000 processor. Diagram courtesy of the R10000 Microprocessor User's Manual.

http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html



MIPS R10000 Processor

5 execution pipelines:

- Load/Store Unit
- Two Integer ALUs
- Floating Point Adder
- Floating Point Multiplier

Can process up to 4 out-of-order instructions simultaneously

Base architecture core from which all successor MIPS processors have been built
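How a set of independent instructions maps onto the five pipelines can be sketched with a greedy issue loop. This is a toy model under assumptions: the unit counts and the limit of 4 instructions per cycle are taken from the slide, data dependencies are ignored, and the unit names and function are hypothetical.

```python
from collections import Counter

# Unit counts as listed on the slide: five pipelines in total.
UNITS = {"load_store": 1, "int_alu": 2, "fp_add": 1, "fp_mul": 1}

def issue_cycles(kinds, width=4):
    """Cycles needed to issue independent instructions, at most `width` per cycle."""
    for kind in kinds:
        if kind not in UNITS:
            raise ValueError(f"unknown instruction kind: {kind}")
    pending = list(kinds)
    cycles = 0
    while pending:
        free = Counter(UNITS)   # every unit is available at the start of a cycle
        issued = 0
        remaining = []
        for kind in pending:
            if issued < width and free[kind] > 0:
                free[kind] -= 1
                issued += 1
            else:
                remaining.append(kind)   # no free unit: wait, but let later ones pass
        pending = remaining
        cycles += 1
    return cycles

mix = ["int_alu", "int_alu", "int_alu", "fp_mul", "load_store"]
print(issue_cycles(mix))  # prints 2
```

In the example the third integer instruction finds both ALUs busy, so the floating-point multiply and the load behind it issue ahead of it. That reordering is what distinguishes this design from the in-order R8000.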



PowerPC

- Direct descendant of the IBM 801, RT PC and RS/6000
- All are RISC
- The RS/6000 was the first of these to be superscalar
- The PowerPC 601 superscalar design is similar to that of the RS/6000
- Later versions extend the superscalar concept



PowerPC 601 Pipeline



PowerPC 601 General View



PowerPC storage model

- Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types
- Handles string operations for multi-byte strings up to 128 bytes
- 32-bit PowerPC implementations support a 4-GB effective address space
- 64-bit PowerPC implementations support a 16-exabyte effective address space
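The two address-space figures follow directly from the address width, assuming byte addressing and binary units (1 GB taken as 2^30 bytes, 1 exabyte as 2^60 bytes). A quick check, with illustrative names:

```python
GIB = 2 ** 30   # bytes in one gigabyte (binary)
EIB = 2 ** 60   # bytes in one exabyte (binary)

def address_space_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits

print(address_space_bytes(32) // GIB)  # prints 4
print(address_space_bytes(64) // EIB)  # prints 16
```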



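General-purpose registers (GPR)

- The User Instruction Set Architecture specifies that all implementations have 32 GPRs
- GPRs are the source and destination of all integer operations
- No lookup is done for GPR0's contents: when register 0 is named as the base register in an address calculation, the value zero is used instead of the register's contents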
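The GPR0 remark refers to the PowerPC convention that a base-register field of 0 in a load/store address calculation means the literal value zero rather than the contents of GPR0. A minimal sketch of a base-plus-displacement calculation for a 32-bit implementation follows; the function name and register-file representation are illustrative assumptions.

```python
MASK_32 = 0xFFFFFFFF

def effective_address(gprs, ra, displacement):
    """Base-plus-displacement address; a base field of 0 means zero, not GPR0."""
    base = 0 if ra == 0 else gprs[ra]
    return (base + displacement) & MASK_32

gprs = [0] * 32
gprs[0] = 0xDEAD       # whatever GPR0 holds is ignored when it is named as the base
gprs[3] = 0x1000

print(hex(effective_address(gprs, 0, 8)))  # prints 0x8
print(hex(effective_address(gprs, 3, 8)))  # prints 0x1008
```

This gives the instruction set a way to form absolute addresses without dedicating a hardwired zero register.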



Floating-point registers (FPR)

- All implementations have 32 FPRs
- FPRs are the source and destination operands of all floating-point operations
- Contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values



Special-purpose registers (SPR)

- Give status and control of resources within the processor core
- SPRs that can be read and written by applications without support from a system service include the Count Register, the Link Register and the Integer Exception Register
- SPRs that can only be read by applications with support from a system service include the Time Base and other timers



Hobbes

- A multi-threaded architecture attempts to increase pipeline utilization by concurrently executing instructions from different threads
- The architecture chosen was an aggressive, speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set
- The Hobbes architecture combines multi-threading with superscalar issue, on the supposition that the strengths of one should offset the weaknesses of the other
- By supporting superscalar issue from more than one thread, the architecture overcomes the lack of instruction-level parallelism that plagues other superscalar structures



Multi-threaded Architectures

- Multi-threaded processors can concurrently execute instructions from more than one thread
- The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads
- Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle
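The round-robin strategy of traditional multi-threaded designs can be sketched as a queue of thread contexts that is rotated every cycle. The function and the thread and instruction names are hypothetical, and each thread is assumed to issue exactly one instruction on its turn.

```python
from collections import deque

def round_robin_issue(threads):
    """Interleave one instruction per cycle, switching thread every cycle.

    `threads` maps a thread name to its instruction list in program order.
    """
    queues = deque((name, deque(instrs))
                   for name, instrs in threads.items() if instrs)
    order = []
    while queues:
        name, instrs = queues.popleft()
        order.append((name, instrs.popleft()))   # this cycle belongs to `name`
        if instrs:
            queues.append((name, instrs))        # back of the line for the next turn
    return order

schedule = round_robin_issue({"A": ["a0", "a1", "a2"], "B": ["b0"]})
print(schedule)  # prints [('A', 'a0'), ('B', 'b0'), ('A', 'a1'), ('A', 'a2')]
```

Only one instruction issues per cycle in this scheme, regardless of how many execution units are idle. That limitation is the gap that combining multi-threading with superscalar issue is meant to close.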



The Thread Unit of Hobbes

- The thread unit contains all of the elements required to support a single thread
- It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage



The Thread Unit

- Instruction fetch is performed by reading an entire cache line of four words and storing it in the fetch buffer
- Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available
- Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified
- The register file is very similar to that found on the R2000. The register file has two write ports, and both of these may be used by the same thread
- Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit
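The fetch, decode and issue-buffer flow described above can be sketched as follows. This is a simplified model under assumptions: instructions are `(dest, srcs)` tuples, the program starts on a cache-line boundary, operand readiness is given as a set of register names, and the branch adder and thread state storage are left out. The class and method names are hypothetical.

```python
from collections import deque

LINE_WORDS = 4   # one cache line holds four instruction words

class ThreadUnit:
    """Toy thread unit: line-sized fetch, in-order decode, stall on operands."""

    def __init__(self, program):
        self.program = program        # list of (dest, srcs) tuples in program order
        self.pc = 0
        self.fetch_buffer = deque()
        self.issue_buffer = deque()

    def fetch(self):
        """Refill an empty fetch buffer with one whole cache line."""
        if not self.fetch_buffer:
            line = self.program[self.pc:self.pc + LINE_WORDS]
            self.fetch_buffer.extend(line)
            self.pc += len(line)

    def decode(self, ready_regs):
        """Move the oldest instruction to the issue buffer if its operands are ready."""
        if not self.fetch_buffer:
            return False
        _dest, srcs = self.fetch_buffer[0]
        if all(src in ready_regs for src in srcs):
            self.issue_buffer.append(self.fetch_buffer.popleft())
            return True
        return False   # stalled; younger instructions wait because issue is in order

unit = ThreadUnit([("r1", ()), ("r2", ("r1",)), ("r3", ()), ("r4", ()), ("r5", ())])
unit.fetch()
moved_first = unit.decode(set())      # r1 has no operands, so it moves on
stalled = not unit.decode(set())      # r2 waits for r1 and blocks r3 and r4
```

The stall on the second instruction also holds back the independent instructions behind it, which is why the design relies on other threads to keep the execution units busy.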



The Execution Units of Hobbes

- The Hobbes architecture has an almost identical set of execution units to the out-of-order superscalar processor
- The characteristics of the execution units approximately correspond to those of the R2000/R2010

Execution Units

- Integer: 2 ALUs, Shifter, Multiply/Divide, Load/Store, Data cache interface
- FP: FP Convert, FP Add, FP Multiply, FP Divide



Superscalar Architecture

- Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction
- This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do
- For superscalar architectures to achieve a speed-up over traditional pipelined architectures, the average level of available instruction-level parallelism must be greater than one
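The last point can be expressed as an idealized bound: sustained instructions per cycle cannot exceed either the issue width or the parallelism available in the program. The sketch below assumes a perfect machine with no cache misses or branch penalties, and the function name is illustrative.

```python
def ideal_speedup(issue_width, available_ilp):
    """Upper bound on speed-up over a scalar pipeline that issues one instruction per cycle."""
    scalar_ipc = min(1.0, available_ilp)
    superscalar_ipc = min(float(issue_width), available_ilp)
    return superscalar_ipc / scalar_ipc

print(ideal_speedup(4, 2.5))  # prints 2.5
print(ideal_speedup(4, 1.0))  # prints 1.0
print(ideal_speedup(2, 3.0))  # prints 2.0
```

With an average parallelism of one or less the extra issue slots go unused and the bound stays at 1. Above that, the speed-up is capped by whichever is smaller: the issue width or the available parallelism.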



References

Hennessy, John L. and Patterson, David A. Computer Organization and Design: The Hardware/Software Interface. San Francisco: Morgan Kaufmann Publishers, 1998.

Sarimento, Sara. Recent History of Intel Architecture - A Refresher. 17 April 2004. Intel Corporation, www.intel.com. 18 April 2004. http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm

Zhou & Martonosi. Augmenting Modern Superscalar Architectures with Configurable Extended Instructions. 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf

Kish & Preiss. Hobbes: A Multi-Threaded Superscalar Architecture. 19 April 2004. http://www.brpreiss.com/page75.html

R10000 Processor User's Manual. 9 Dec 1996. SGI Corporation. 22 April 2004. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1

MIPS Architecture. 17 April 2004. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Main_Page. 23 April 2004. http://en.wikipedia.org/wiki/MIPS_architecture

Mapleson, Ian. Indigo 2 and Power Indigo 2 Technical Report. Silicon Graphics. 23 April 2004. http://sgi.cartsys.net/i2sec7.html

Power PC Architecture. 23 April 2004. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html