superscalar architectures

Upload: bijay-mishra

Post on 02-Apr-2018


  • 7/27/2019 Superscalar Architectures

    1/36

    Superscalar Pipeline Architectures

    By: Matthew Osborne, Philip Ho, Xun Chen

    April 19, 2004


    Superscalar Architecture

    - Relatively new; first appeared in the early 1990s
    - Builds on the concept of pipelining
    - Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)
    - Allows the instruction execution rate to exceed the clock rate (CPI of less than 1)
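The CPI claim above can be made concrete with a little arithmetic. The following is a minimal sketch; the function names and the cycle counts are illustrative assumptions, not measurements from any processor discussed here.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average instructions completed per clock cycle (the inverse of CPI)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical run: 1000 instructions on an ideal scalar pipeline
# versus a superscalar machine that finishes them in 400 cycles.
scalar_cpi = cpi(cycles=1000, instructions=1000)
superscalar_cpi = cpi(cycles=400, instructions=1000)
print(scalar_cpi, superscalar_cpi, ipc(400, 1000))
```

A CPI below 1 is only possible when more than one instruction completes in some cycles, which is what the multiple execution units provide.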


    Overview of Selected Superscalar Architectures

    Intel

    MIPS

    PowerPC T 1000 Architectures

    Hobbes: A Multi-threaded Superscalar Architecture


    Intel Superscalar Architecture

    According to Sara Sarimento, in her essay "Recent History of Intel Architecture - A Refresher":

    - Intel's first use of a superscalar architecture was its Pentium processor
    - Instruction-level parallelism: instructions that are independent of one another's outcome execute concurrently, to utilize more of the available hardware resources and increase instruction throughput


    Intel P5 Microarchitecture

    - Used in the initial Pentium processor
    - Could execute up to 2 instructions simultaneously
    - Instructions were sent through the pipeline in order: if the next two instructions had a dependency issue, only one instruction (pipe) would be executed, and the second execution unit (pipe) went unused for that clock cycle
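The in-order pairing rule above can be sketched as a toy cycle counter. This is a simplified model under stated assumptions: the `Instr` type and both function names are hypothetical, every instruction takes one cycle, and the only pairing restriction modeled is a register dependency between the two adjacent instructions (the real P5 pairing rules are not described in these slides).

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def depends_on(second: Instr, first: Instr) -> bool:
    """True if `second` reads or rewrites the register that `first` writes."""
    return first.dest in second.srcs or first.dest == second.dest


def in_order_dual_issue_cycles(program: Sequence[Instr]) -> int:
    """Count cycles when at most two adjacent instructions issue per cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        # The second pipe is used only if the next instruction is independent.
        paired = i + 1 < len(program) and not depends_on(program[i + 1], program[i])
        i += 2 if paired else 1
    return cycles


program = [
    Instr("a", ("x", "y")),
    Instr("b", ("a",)),      # needs "a": cannot pair with the instruction before it
    Instr("c", ("x",)),
    Instr("d", ("y",)),
]
print(in_order_dual_issue_cycles(program))  # 3 cycles: [a], [b, c], [d]
```

With four fully independent instructions the same function returns 2, which is the best case for two pipes.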


    Intel P6 Microarchitecture

    - Used in the Pentium Pro, Pentium II and Pentium III processors
    - 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the Out-of-Order Execution unit
    - 10-stage instruction pipeline utilized in this architecture


    Intel P6 Microarchitecture

    - Out-of-order instruction execution: instructions without data dependency issues are executed out of order for a higher level of hardware utilization
    - The scheduler unit resolves data dependency issues between individual instructions
    - The Re-Order Buffer puts instructions back in program order before their results are written back to memory
    - Up to 3 instructions can be retired concurrently per clock cycle
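The re-order buffer behaviour described above (out-of-order completion, in-order retirement, at most 3 retirements per cycle) can be sketched as follows. The `ReorderBuffer` class and its method names are hypothetical and say nothing about how the P6 hardware is actually organized.

```python
from collections import deque


class ReorderBuffer:
    """Toy re-order buffer: entries complete in any order, retire in program order."""

    def __init__(self, retire_width: int = 3):
        self.retire_width = retire_width
        self.entries = deque()  # kept in program order; each entry is [tag, done]

    def allocate(self, tag):
        """Reserve a slot when an instruction enters the out-of-order core."""
        self.entries.append([tag, False])

    def complete(self, tag):
        """Mark an instruction as executed; this may happen out of order."""
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Retire finished instructions from the head, up to retire_width per cycle."""
        retired = []
        while (self.entries and self.entries[0][1]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()[0])
        return retired


rob = ReorderBuffer()
for tag in range(5):
    rob.allocate(tag)
for tag in (1, 2, 3, 4):      # younger instructions finish first
    rob.complete(tag)
first = rob.retire()          # nothing retires: instruction 0 is still pending
rob.complete(0)
second = rob.retire()         # oldest three retire together
third = rob.retire()          # the remaining two retire next cycle
print(first, second, third)
```

The oldest unfinished instruction blocks everything behind it, which is how program order is restored.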


    Intel NetBurst Microarchitecture

    - New architecture used for the Intel Pentium 4 and Xeon processors


    Intel NetBurst Microarchitecture

    Changes from the P6 architecture:

    - Only one instruction decoder present
    - Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place
    - Number of pipeline stages increased to 20
    - Improved branch prediction algorithms
    - ALUs operate twice as quickly as their P6 counterparts


    Intel NetBurst Microarchitecture

    Execution Trace Cache

    - Alleviates delays in fetching and translating CISC instructions to their appropriate µops
    - Instructions are now decoded by a translation engine, with the resulting µops stored as traces (sequences of µops) in the Execution Trace Cache
    - Traces are stored along the path of predicted program execution flow, with the results of branches in the code integrated into this path
    - Delivers up to 3 µops to the core of the Execution Unit per clock cycle
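The idea of caching decoded traces can be illustrated with a very small model. Everything here is a hypothetical sketch: the `TraceCache` class, the `deliver` helper and the stand-in decoder are assumptions for illustration, and the real trace cache's indexing, capacity and replacement policy are not covered by these slides.

```python
class TraceCache:
    """Toy trace cache: remembers decoded micro-op sequences by start address."""

    def __init__(self, decode):
        self.decode = decode   # hypothetical slow decoder: address -> list of micro-ops
        self.traces = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, addr):
        """Return the trace starting at addr, decoding only on a miss."""
        if addr in self.traces:
            self.hits += 1
        else:
            self.misses += 1
            self.traces[addr] = self.decode(addr)
        return self.traces[addr]


def deliver(trace, width=3):
    """Yield the trace in groups of at most `width` micro-ops, one group per cycle."""
    for i in range(0, len(trace), width):
        yield trace[i:i + width]


def fake_decode(addr):
    # Stand-in for the translation engine; always produces five micro-ops.
    return [f"uop_{addr:x}_{k}" for k in range(5)]


cache = TraceCache(fake_decode)
cache.lookup(0x100)                      # miss: decoded and stored
groups = list(deliver(cache.lookup(0x100)))  # hit: no decode needed
print(cache.hits, cache.misses, [len(g) for g in groups])
```

On a hit the decoder is bypassed entirely, which is the delay the slide says the trace cache alleviates.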


    Intel NetBurst Microarchitecture

    Branch Prediction

    - Branch targets are predicted based on their linear address using branch prediction logic, and fetched as soon as possible
    - Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy
    - Downside: despite the improved prediction algorithm, mispredicted branches are one of the biggest costs of this architecture, because its instruction pipeline is longer than in previous architectures
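The cost of a misprediction in a deeper pipeline can be estimated with a common back-of-the-envelope model. This is a sketch under assumptions: the function name is hypothetical, the penalty is taken to scale with pipeline depth (10 versus 20 cycles), and the branch fraction and misprediction rate are made-up illustrative values rather than measured figures.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Base CPI plus the average stall cycles added by mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Illustrative numbers only: 20% branches, 5% of them mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
deep = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
print(shallow, deep)
```

With the same predictor accuracy, doubling the penalty doubles the stall component, so a longer pipeline needs a better predictor just to break even.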


    MIPS Superscalar Architecture

    - MIPS is a RISC instruction platform, versus Intel's CISC instruction platform (this made the design of a superscalar architecture easier than on Intel's CISC platform)
    - The first MIPS processor with a superscalar architecture was the 64-bit MIPS R8000, released in 1994


    MIPS R8000 Features

    - Superscalar
    - Can process up to 4 instructions, in order, each cycle
    - Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)
    - Designed for peak performance with floating-point operations


    MIPS R8000 Pitfalls

    - Integer operation performance limited
    - Very high cost

    As a result of these two key factors:

    - The R8000 was only in the marketplace for about a year
    - This processor was mainly used only in the scientific community


    MIPS R10000 Processor

    Superscalar pipeline architecture for the R10000 processor.
    Diagram courtesy of the R10000 Microprocessor User's Manual:
    http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html


    MIPS R10000 Processor

    5 Execution Pipelines

    - Load/Store Unit

    - Two Integer ALUs

    - Floating Point Adder

    - Floating Point Multiplier

    Can process up to 4 instructions simultaneously, out of order

    Base architecture core from which all successor MIPS processors have been built


    PowerPC

    - Direct descendant of the IBM 801, RT PC and RS/6000
    - All are RISC
    - The RS/6000 was the first of these to be superscalar
    - The PowerPC 601 superscalar design is similar to the RS/6000
    - Later versions extend the superscalar concept


    PowerPC 601 Pipeline


    PowerPC 601 General View


    PowerPC storage model

    - Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types
    - Handles string operations for multi-byte strings up to 128 bytes
    - 32-bit PowerPC implementations support a 4-GB effective address space
    - 64-bit PowerPC implementations support a 16-exabyte effective address space
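The two address-space figures follow directly from the address width, assuming byte addressing and binary units (1 GB taken as 2^30 bytes, 1 exabyte as 2^60 bytes). A quick check, with a hypothetical helper name:

```python
GIB = 1 << 30   # bytes in one gigabyte (binary)
EIB = 1 << 60   # bytes in one exabyte (binary)


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given effective-address width."""
    return 1 << address_bits


print(address_space_bytes(32) // GIB)  # 4
print(address_space_bytes(64) // EIB)  # 16
```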


    General-purpose registers (GPR)

    - The User Instruction Set Architecture specifies that all implementations have 32 GPRs
    - GPRs are the source and destination of all integer operations
    - No lookup is done for GPR0's contents
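One reading of the GPR0 remark is the PowerPC addressing convention in which a base-register field of 0 in a load or store means the literal value 0 rather than the contents of GPR0. The sketch below assumes that reading; the function name, the register values and the 32-bit wrap-around are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def effective_address(gpr, ra: int, displacement: int) -> int:
    """Base-plus-displacement address where base register field 0 means literal zero.

    gpr is a list of 32 register values; displacement is a signed offset.
    """
    base = 0 if ra == 0 else gpr[ra]   # GPR0's contents are never looked up here
    return (base + displacement) & MASK32


gpr = [0] * 32
gpr[0] = 0x1000   # ignored when used as the base field
gpr[1] = 0x2000
print(hex(effective_address(gpr, 0, 8)))    # 0x8
print(hex(effective_address(gpr, 1, 8)))    # 0x2008
print(hex(effective_address(gpr, 1, -4)))   # 0x1ffc
```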


    Floating-point registers (FPR)

    - All implementations have 32 FPRs
    - FPRs are the source and destination operands of all floating-point operations
    - They contain 32-bit and 64-bit signed and unsigned integer values, and single-precision and double-precision floating-point values


    Special-purpose registers (SPR)

    - Give status and control of resources within the processor core
    - SPRs that applications can read and write without support from a system service include the Count Register, the Link Register and the Integer Exception Register
    - SPRs that applications can read only with support from a system service include the Time Base and other timers


    Hobbes

    - A multi-threaded architecture attempts to increase pipeline utilization by concurrently executing instructions from different threads.
    - The architecture chosen was an aggressive, speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set.
    - The Hobbes architecture combines multi-threading with superscalar issue, with the supposition that the strengths of one should offset the weaknesses of the other.
    - By supporting superscalar issue from more than one thread, the architecture overcomes the lack of instruction-level parallelism that plagues other superscalar structures.


    Multi-threaded Architectures

    - Multi-threaded processors can concurrently execute instructions from more than one thread.
    - The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads.
    - Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle.
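The traditional round-robin strategy described above can be sketched as a simple scheduler. This is an illustrative model, not the Hobbes design: the function name is hypothetical, each thread is just a list of instruction labels, and a thread with nothing left to issue simply wastes its slot.

```python
from collections import deque


def round_robin_issue(threads):
    """Issue one instruction per cycle, moving to the next thread every cycle.

    Returns a list of (cycle, thread_id, instruction) tuples.
    """
    queues = [deque(t) for t in threads]
    schedule = []
    cycle = 0
    tid = 0
    while any(queues):
        if queues[tid]:
            schedule.append((cycle, tid, queues[tid].popleft()))
        # If the selected thread had nothing to issue, its slot is wasted.
        cycle += 1
        tid = (tid + 1) % len(queues)
    return schedule


schedule = round_robin_issue([["a0", "a1", "a2"], ["b0"]])
print(schedule)
```

In this run thread 1 runs out after one instruction, so cycle 3 issues nothing; combining multi-threading with superscalar issue, as Hobbes does, is one way to avoid leaving such slots idle.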


    The Thread Unit of Hobbes

    - The Thread Unit contains all of the elements required to support a single thread.
    - It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage.


    The Thread Unit

    - Instruction fetch is performed by reading an entire cache line of four words and storing it in the fetch buffer.
    - Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available.
    - Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified.
    - The register file is very similar to that found on the R2000. The register file has two write ports, and both of these may be from the same thread.
    - Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit.


    The Execution Units of Hobbes

    - The Hobbes architecture has an almost identical set of execution units to an out-of-order superscalar processor.
    - The characteristics of the execution units approximately correspond to those of the R2000/R2010.

    Execution Units

    - Integer: 2 ALUs, Shifter, Multiply/Divide, Load/Store, Data cache interface
    - FP: FP Convert, FP Add, FP Multiply, FP Divide


    Superscalar Architecture

    - Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction.
    - This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do.
    - For superscalar architectures to achieve a speed-up over traditional pipelined architectures, the average level of available instruction-level parallelism must be greater than one.
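The ILP requirement above can be expressed as a simple upper-bound model. This is a sketch under idealized assumptions (no cache misses, no branch penalties, unit latencies); the function names are hypothetical, and real speed-ups are lower than this bound.

```python
def ideal_ipc(issue_width: int, available_ilp: float) -> float:
    """Upper bound on instructions per cycle: limited by hardware width and by ILP."""
    if issue_width < 1 or available_ilp <= 0:
        raise ValueError("issue_width must be >= 1 and available_ilp must be > 0")
    return min(float(issue_width), available_ilp)


def speedup_over_scalar(issue_width: int, available_ilp: float) -> float:
    """Ideal speed-up relative to a single-issue pipeline running the same code."""
    return ideal_ipc(issue_width, available_ilp) / ideal_ipc(1, available_ilp)


print(speedup_over_scalar(4, 1.0))   # 1.0: no parallelism to exploit, no gain
print(speedup_over_scalar(4, 2.5))   # 2.5: limited by the program
print(speedup_over_scalar(4, 8.0))   # 4.0: limited by the issue width
```

When the average ILP is exactly one, the extra issue slots stay empty and the superscalar machine performs like a scalar pipeline.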


    References

    Hennessy, John L. and Patterson, David A. Computer Organization and Design: The Hardware/Software Interface. San Francisco: Morgan Kaufmann Publishers, 1998.

    Sarimento, Sara. Recent History of Intel Architecture - A Refresher. 17 April 2004. Intel Corporation, www.intel.com. 18 April 2004. http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm

    Zhou & Martonosi. Augmenting Modern Superscalar Architectures with Configurable Extended Instructions. 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf

    Kish & Preiss. Hobbes: A Multi-Threaded Superscalar Architecture. 19 April 2004. http://www.brpreiss.com/page75.html

    R10000 Processor User's Manual. 9 Dec 1996. SGI Corporation. 22 April 2004. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1

    MIPS Architecture. 17 April 2004. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Main_Page. 23 April 2004. http://en.wikipedia.org/wiki/MIPS_architecture

    Mapleson, Ian. Indigo 2 and Power Indigo 2 Technical Report. Silicon Graphics. 23 April 2004. http://sgi.cartsys.net/i2sec7.html

    Power PC Architecture. 23 April 2004. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html
