
Processor Microarchitectural Units (Intel Pentium 4 used as an example)

Pentium 4 Pipeline/Instruction flow:

Instruction flow inside an Intel Pentium 4 processor typically consists of the following stages:

•  Prefetch – anticipate what data will be used next and pull it into the cache before it is needed.

•  L2 cache read – the 2nd-level cache is read during the prefetch stage.

•  Instruction decode – interpret each instruction from the L2 cache.

•  Branch predict – make a prediction and provide the target address.

•  Trace cache write – decoded uops, including any uop branches, are written into the trace cache. They are written in the expected order of execution, not necessarily the order in which the macroinstructions appear in memory.

•  Microbranch predict – determine which uop should enter the pipeline next.

•  Micro-op fetch and drive – some uops read from the trace cache are actually pointers to uop routines stored in the microcode ROM. The Pentium 4 pipeline allows for two "drive" cycles in which no computation is performed and data simply travels from one part of the die to another.

•  Allocation – while the uops in the trace cache still reflect the original program order, it is important to record this order before the uops enter the out-of-order portion of the pipeline. Each uop is allocated an entry in the reorder buffer (ROB).

•  Register rename – the register alias table (RAT) is updated to determine which physical registers hold the uop's source data and which will be used to store its result.

•  Schedule & dispatch – when the oldest uop is read from the memory queue, it is loaded into the memory scheduler. Uops can dispatch before older uops if their sources are ready first.

•  Register file read – the values used in computations are those stored in the register files.

•  Execute & calculate flags – flag values store information about the result, such as whether it was 0, negative, or an overflow. Any of the flag values can be a condition for a later branch uop.

•  Retire – upon retirement, the uop's results are committed to the current correct architectural state by updating the retirement RAT, and all the resources allocated to the instruction are released.

[Figure: Pentium 4 pipeline/instruction flow. Blocks, in order: Prefetch; L2 cache read; Instr. decode; Branch predict / µBranch predict; Trace cache write; Micro-op fetch and drive; Allocation / Reg. rename; Schedule / Dispatch; Reg. file read; Execute/calculate; Retire. In the figure, the number in parentheses after each block is its number of pipeline stages.]

Prefetch

A technique used in microprocessors to speed up the execution of a program by reducing wait states.

L2 cache read

Larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the fastest, level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If that smaller cache misses, the next fastest cache (level 2, L2) is checked, and so on, before external memory is checked.
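The lookup order described above can be sketched in a few lines of Python. This is only an illustration: the function name `lookup`, the use of plain dictionaries as caches, and the fill-on-miss policy are assumptions made for the example, and capacity limits and eviction are ignored entirely.

```python
def lookup(address, levels, memory):
    """Check each cache level in order, fastest first; fall back to memory.

    levels: list of (name, dict) pairs ordered from fastest to slowest.
    Returns (value, name_of_the_level_that_hit).
    """
    for i, (name, cache) in enumerate(levels):
        if address in cache:
            value = cache[address]
            # Fill the faster levels that missed so the next access hits sooner.
            for _, faster in levels[:i]:
                faster[address] = value
            return value, name
    # Every cache level missed: go to external memory and fill all levels.
    value = memory[address]
    for _, cache in levels:
        cache[address] = value
    return value, "memory"


# Hypothetical contents: L1 starts empty, L2 already holds address 0x40.
l1, l2 = {}, {0x40: "add"}
memory = {0x40: "add", 0x80: "sub"}
levels = [("L1", l1), ("L2", l2)]

first = lookup(0x40, levels, memory)    # misses L1, hits L2
second = lookup(0x40, levels, memory)   # now hits L1
third = lookup(0x80, levels, memory)    # misses both caches
```

The second access to the same address is served by L1 because the first access filled it on the way back, which is the behaviour that makes the small, fast cache worthwhile.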

Instruction decode

The opcode fetched from memory is decoded for the following steps and moved to the appropriate registers.

Branch predict

A digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch predictor keeps records of whether branches are taken or not taken. When it encounters a conditional jump that has been seen several times before, it can base the prediction on that history. The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
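One common textbook way to keep such a record is a 2-bit saturating counter per branch. The sketch below is not a description of the Pentium 4's actual predictor; the class name `TwoBitPredictor`, the initial counter value, and the branch address are assumptions chosen for illustration. It covers only the "taken more often than not" case; recognizing a pattern such as "taken every second time" would require keeping a history of recent outcomes as well.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch (0-1: not taken, 2-3: taken)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, branch_addr):
        # Unseen branches start at 1 ("weakly not taken"), an arbitrary choice.
        return self.counters.get(branch_addr, 1) >= 2

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 1)
        self.counters[branch_addr] = min(c + 1, 3) if taken else max(c - 1, 0)


def accuracy(predictor, branch_addr, outcomes):
    """Predict each outcome, then train on it; return the fraction predicted correctly."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)
    return correct / len(outcomes)


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
hits = accuracy(TwoBitPredictor(), 0x400, loop_branch)
```

With this outcome sequence the predictor is wrong on the first iteration (it has no history yet) and on the final fall-through, and right on the eight iterations in between.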

Trace cache write

A trace cache deals with the fetch bandwidth lost to branches. It overcomes this partial fetch problem by storing logically contiguous instructions (instructions that are adjacent in the instruction stream) in physically contiguous storage. This way, the trace cache can deliver multiple blocks of instructions each cycle, even when those blocks are not contiguous in memory. Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces. A dynamic trace ("trace path") contains only instructions whose results are actually used, and eliminates instructions following taken branches (since they are not executed).

Microinstructions

Before translation the machine language instructions are called macroinstructions, and the smaller steps after translation are called microinstructions.

Most CISC architecture macroinstructions can be translated into four or fewer uops. These translations are performed by decode logic on the processor. However, some macroinstructions could require dozens of uops. The translations for these macroinstructions are typically stored in a read-only memory (ROM) built into the processor called the microcode. The microcode ROM contains programs written in uops for executing complex macroinstructions. In addition, the microcode contains programs to handle special events like resetting the processor and handling interrupts and exceptions.

Microbranch predict

The processor actually maintains two instruction pointers. One holds the address of the next macroinstruction to be read from the L2 cache by the instruction prefetch. The other holds the address of the next uop to be read from the trace cache. If the last uop fetched was not a branch, the uop pointer is simply incremented to point to the next group of uops in the trace cache. If the last uop fetched was a branch, its address is sent to a trace cache BTB for prediction.

Micro-op fetch and drive

Designers attempt to create a floorplan where blocks that communicate often are placed close together, but inevitably every block cannot be right next to every other block with which it might communicate. The presence of drive cycles in the pipeline shows how transistor speeds have increased to the point where now simple wire delay is an important factor in determining a processor's frequency.

Allocation

The allocator allocates a reorder buffer (ROB) entry:

- Tracks the completion status of a µop
- Remembers the most current version of each register in the Register Alias Table (RAT)
- A new instruction knows where to get the correct current instance
- Allocates the ROB and Register File (RF) separately
- On retirement, no result data values are actually moved from one physical structure to another

Register rename

This stage renames logical registers to the physical register space. In the NetBurst architecture there are 128 registers with unique names. Any reference to an original IA-32 general-purpose register is renamed to one of the internal physical registers. Renaming also removes false register name dependencies between instructions, allowing the processor to execute more instructions in parallel. Parallel execution helps keep all resources busy.
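How renaming removes false dependencies can be shown with a minimal sketch. The 128-entry physical register count comes from the text; everything else (the class name `Renamer`, the four logical registers, the simple free list) is an assumption for illustration, and the sketch never returns registers to the free list as a real design would at retirement.

```python
LOGICAL = ["eax", "ebx", "ecx", "edx"]  # subset of IA-32 registers, for brevity


class Renamer:
    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))  # free physical registers
        # RAT: logical register name -> physical register holding its latest value
        self.rat = {name: self.free.pop(0) for name in LOGICAL}

    def rename(self, dest, sources):
        # Sources read the most recent mapping of each logical register.
        phys_sources = [self.rat[s] for s in sources]
        # The destination always gets a fresh physical register, so a later
        # write to the same logical name cannot clobber an earlier value.
        phys_dest = self.free.pop(0)
        self.rat[dest] = phys_dest
        return phys_dest, phys_sources


r = Renamer()
results = [
    r.rename("eax", ["ebx", "ecx"]),  # eax = ebx + ecx
    r.rename("edx", ["eax"]),         # edx = eax + 1  (true dependency on the line above)
    r.rename("eax", ["ecx", "ecx"]),  # eax = ecx + ecx (reuses the name eax)
]
```

The first and third instructions both write `eax`, yet they receive different physical registers (4 and 6 here), so the third can execute without waiting for the second to read the older value. Only the true dependency, the second instruction reading physical register 4, remains.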

Schedule and dispatch

Several individual µop schedulers, each handling different types of µops for the various execution units, are tied to four different dispatch ports.

When a uop is dispatched, its destination register and minimum latency are used to update the scoreboard showing which registers have ready data. This means that dependent uops may be scheduled too soon if a uop takes longer than expected, for instance if a load misses in the cache. Uops that are scheduled too soon will have to be replayed, going through dispatch again to receive their correct source data.
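The scoreboard-and-replay behaviour can be sketched with one producer (a load) and one dependent uop. The function name `run`, the register name `r1`, and the latency values are illustrative assumptions, not Pentium 4 figures.

```python
def run(min_latency, actual_latency):
    """Dispatch a load at cycle 0, then a dependent add that reads its result.

    Returns the list of (event, cycle) pairs for the dependent uop.
    """
    scoreboard = {}
    events = []

    # Cycle 0: the load is dispatched. The scoreboard records the earliest
    # cycle at which its destination could be ready (the minimum latency).
    scoreboard["r1"] = 0 + min_latency
    actually_ready = 0 + actual_latency

    # The dependent uop is dispatched as soon as the scoreboard says r1 is ready.
    cycle = scoreboard["r1"]
    events.append(("dispatch add", cycle))

    if cycle < actually_ready:
        # The load took longer than expected (e.g. a cache miss), so the add
        # read stale data and must go through dispatch again.
        events.append(("replay add", actually_ready))
    return events


cache_hit = run(min_latency=2, actual_latency=2)
cache_miss = run(min_latency=2, actual_latency=7)
```

On a hit the dependent uop is dispatched once; on a miss it is dispatched optimistically and then replayed once the data has really arrived.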

Register file read

Most register files have at least two read/output ports and one write/input port to accommodate sending two values to the ALU and receiving one result. To control a read port we need to be able to specify a register number for the register to be read. The width/number of bits read equals the number of bits per register.
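A register file with two read ports and one write port can be modelled as follows. This is a minimal sketch: the class name `RegisterFile`, the register count of 8, and the 32-bit width are assumptions for the example.

```python
class RegisterFile:
    def __init__(self, num_registers=8, width=32):
        self.mask = (1 << width) - 1   # keeps stored values to `width` bits
        self.regs = [0] * num_registers

    def read(self, reg_a, reg_b):
        """Two read ports: each takes a register number and returns its contents."""
        return self.regs[reg_a], self.regs[reg_b]

    def write(self, reg, value):
        """One write port: store one result, truncated to the register width."""
        self.regs[reg] = value & self.mask


rf = RegisterFile()
rf.write(1, 20)
rf.write(2, 22)
x, y = rf.read(1, 2)   # two operands sent to the ALU in one step
rf.write(3, x + y)     # one result written back
```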

Execute & calculate flags

The units marked as (clock x2) can execute two instructions per clock cycle. Ports 0 and 1 can each send two microinstructions per clock cycle to these units. The maximum number of microinstructions that can be dispatched per clock cycle is six (2 on port 0, 2 on port 1, 1 on port 2 and 1 on port 3). Complex instructions may take several cycles. For example, if port 1 is executing a complex floating-point instruction that takes several cycles, the port 1 dispatch unit won’t stall; instead, it will keep sending simple instructions to the ALU while the FPU is busy.

During pipeline stage 18 (after dispatch), the flags register is updated.
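To make the flag calculation concrete, here is a small sketch of an addition that also produces zero, negative, and overflow flags. The function name `add_with_flags` and the dictionary of flags are assumptions for illustration; operands are treated as unsigned bit patterns of the given width.

```python
def add_with_flags(a, b, bits=32):
    """Add two `bits`-wide values and compute zero, negative, and overflow flags."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    result = (a + b) & mask
    flags = {
        "zero": result == 0,
        "negative": bool(result & sign),
        # Signed overflow: both operands share a sign that the result does not.
        "overflow": bool(~(a ^ b) & (a ^ result) & sign),
    }
    return result, flags


# Largest positive 32-bit value plus one wraps to the most negative value.
wrapped, wrapped_flags = add_with_flags(0x7FFFFFFF, 1)

# 5 + (-5), with -5 written as its two's-complement bit pattern.
zero_sum, zero_flags = add_with_flags(5, 0xFFFFFFFB)

# A later "jump if zero" branch uop would use the zero flag as its condition.
branch_taken = zero_flags["zero"]
```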

Retire

The retirement logic is what reorders the instructions, executed in an out-of-order manner, back to the original program order. This retirement logic receives the completion status of the executed instructions from the execution units and processes the results so that the proper architectural state is committed (or retired) according to the program order. This logic also reports branch history information to the branch predictors at the front end of the machine so they can train with the latest known-good branch-history information.
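In-order retirement from a reorder buffer can be sketched as a queue that only releases entries from its head. The class name `ReorderBuffer`, the entry layout, and the uop names are hypothetical; committing results to architectural state and reporting branch history are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest entry at the left

    def allocate(self, name):
        """Allocate an entry in program order; execution marks it done later."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Retire completed uops from the head; stop at the first incomplete one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
a, b, c = (rob.allocate(n) for n in ("uop_a", "uop_b", "uop_c"))

# Execution completes out of order: c and b finish before a.
c["done"] = True
b["done"] = True
first = rob.retire()    # nothing retires while the oldest uop is incomplete

a["done"] = True
second = rob.retire()   # all three now retire, in program order
```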
