
Page 1: Chapter 5 - Future Processors to Use Fine-Grain Parallelism

Page 2: Microprocessors: today (Y2K++)

Chip technology 2000/01:
– 0.18-µm CMOS technology, 10 to 100 M transistors per chip, 600 MHz to 1.4 GHz cycle rate

Example processors:
– Intel Pentium III: 7.5 M transistors, 0.18-µm (Bi-)CMOS, up to 1 GHz
– Intel Pentium 4: ?? transistors, 0.18-µm (Bi-)CMOS, up to 1.4 GHz, uses a trace cache!
– Intel IA-64 Itanium (already announced for 2000?): 0.18-µm CMOS, 800 MHz; successor McKinley (announced for 2001): > 1 GHz
– Alpha 21364: 100 M transistors, 0.18-µm CMOS (1.5 volt, 100 watt), ? MHz
– HAL SPARC uses trace cache and value prediction
– Alpha 21464 will be 4-way simultaneous multithreaded
– Sun MAJC will be a two-processor chip, each processor a 4-way block-multithreaded VLIW

Page 3: Trends and principles in the giga chip era

Forecasting the effects of technology is hard:

“Everything that can be invented has been invented.” US Commissioner of Patents, 1899.

“I think there is a world market for about five computers.” Thomas J. Watson Sr., IBM founder, 1943.

Page 4: Microprocessors: tomorrow (Y2K-2012)

Moore's Law: the number of transistors per chip doubles every two years.

SIA (Semiconductor Industry Association) prognosis 1998:

Year of 1st shipment    1997   1999   2002   2005   2008   2011   2014
Local clock (GHz)       0.75   1.25   2.1    3.5    6      10     16.9
Across chip (GHz)       0.75   1.2    1.6    2      2.5    3      3.674
Chip size (mm²)         300    340    430    520    620    750    901
Dense lines (nm)        250    180    130    100    70     50     35
Number of chip I/O      1515   1867   2553   3492   4776   6532   8935
Transistors per chip    11 M   21 M   76 M   200 M  520 M  1.4 B  3.62 B
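
As a rough plausibility check, plain doubling every two years from the 1997 table entry lands in the same range as the SIA transistor forecast. A minimal sketch (the 1997 baseline and the two-year doubling period come from the slide above; everything else is illustrative):

```cpp
// Compare a naive Moore's-law projection against the SIA forecast years.
#include <cmath>
#include <cstdio>

int main() {
    const double base_year = 1997.0, base_transistors = 11e6;   // 11 M transistors in 1997
    for (int y : {1999, 2002, 2005, 2008, 2011, 2014}) {
        double projected = base_transistors * std::pow(2.0, (y - base_year) / 2.0);
        std::printf("%d: ~%.0f M transistors (doubling every two years)\n", y, projected / 1e6);
    }
    // 2014: 11 M * 2^8.5 is roughly 4000 M, in the same range as the 3.62 B SIA forecast.
    return 0;
}
```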

Page 5: Design challenges

Performance can be improved along three axes: increasing clock speed, increasing the amount of work that can be performed per cycle, and reducing the number of instructions needed to perform a task.

Today's general trend toward more complex designs is opposed by wiring delay within the processor chip as the main technological problem.

At higher clock rates with sub-quarter-micron designs, on-chip interconnecting wires cause a significant portion of the delay time in circuits.

Especially global interconnects within a processor chip cause problems at higher clock rates.

Maintaining the integrity of a signal as it moves from one end of the chip to the other becomes more difficult.

Copper metallization is worth a 20% to 30% reduction in wiring delay.

Page 6: Application and economy-related trends

Applications:

– generally: user interactions like video, audio, voice recognition, speech processing, and 3D graphics,
– large data sets and huge databases,
– large data mining applications,
– transaction processing,
– huge EDA applications like CAD/CAM software,
– virtual reality computer games,
– signal processing and real-time control.

Colwell from Intel: the real threat for processor designers is shipping 30 million CPUs only to discover they are imperfect and cause a recall.

Economies of scale:
– Fabrication plants now cost about $2 billion, a factor of ten more than a decade ago. Manufacturers can only sustain such development costs if larger markets with greater economies of scale emerge.

Workloads will concentrate on the human-computer interface. Multimedia workloads will grow and influence architectures.

Page 7: Architectural challenges and implications

Preserve object-code compatibility (may be avoided by a virtual machine that targets run-time ISAs).

It is necessary to find ways of expressing and exposing more parallelism to the processor. It is doubtful whether enough ILP is available.

Buses will probably scale; expect much wider buses in future.

Memory bottleneck: memory latency may be solved by a combination of technological improvements in memory chip technology and by applying advanced memory hierarchy techniques (other authors disagree).

Power consumption matters for mobile computers and appliances.

Soft errors caused by cosmic rays or gamma radiation may be countered with fault-tolerant design throughout the chip.

Page 8: Possible solutions

– a focus of processor chips on particular market segments; multimedia pushes desktop personal computers, while high-end microprocessors will serve specialized applications;
– integrate functionalities into systems on a chip;
– partition a microprocessor into a client chip part that focuses on general user interaction, enhanced by server chip parts that are tailored for special applications;
– a CPU core that works like a large ASIC block and allows system developers to instantiate various devices on a chip with a simple CPU core;
– and reconfigurable on-chip parts that adapt to application requirements.

Functional partitioning becomes more important!

Page 9: Future processor architecture principles

Speed-up of a single-threaded application:
– Trace cache
– Superspeculative
– Advanced superscalar

Speed-up of multi-threaded applications:
– Chip multiprocessors (CMPs)
– Simultaneous multithreading

Speed-up of a single-threaded application by multithreading:
– Multiscalar processors
– Trace processors
– DataScalar

Exotics:
– Processor-in-memory (PIM) or intelligent RAM (IRAM)
– Reconfigurable
– Asynchronous

Page 10: Future processors to use coarse-grain parallelism

Today's microprocessors utilize instruction-level parallelism by a deep instruction pipeline and by the superscalar or VLIW multiple-issue techniques.

Today's (2001) technology: approx. 40 M transistors per chip; in future (2012): 1.4 G transistors per chip.

What next?

Two directions:
– Increase of single-thread performance --> use of more speculative instruction-level parallelism
– Increase of multi-thread (multi-task) performance --> utilize thread-level parallelism in addition to instruction-level parallelism

A "thread" in this lecture means a HW thread, which can be a SW (POSIX) thread, a process, ...

Far future (??): increase of single-thread performance by use of speculative instruction-level and thread-level parallelism.

Page 11: Processor techniques that speed up single-threaded applications

A trace cache fetches from cached dynamic instruction sequences instead of from the static code in the I-cache.

Advanced superscalar processors scale current designs up to issue 16 or 32 instructions per cycle.

Superspeculative processors enhance wide-issue superscalar performance by speculating aggressively at every point.

Page 12: The trace cache

The trace cache is a new paradigm for caching instructions.

A trace cache is a special I-cache that captures dynamic instruction sequences, in contrast to the I-cache, which contains static instruction sequences.

Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions.

Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage.

A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches.

(Figure: instruction layout in the I-cache vs. in the trace cache)
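
A minimal behavioral sketch of the idea, assuming a 16-instruction line size and ignoring associativity, branch-outcome matching, and partial hits (all real design parameters):

```cpp
// Toy trace cache: maps a fetch address to a stored dynamic instruction sequence.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Trace {
    std::vector<uint64_t> insn_addrs;   // dynamic sequence, possibly spanning taken branches
};

class TraceCache {
    std::unordered_map<uint64_t, Trace> lines_;   // indexed by the trace's starting address
    static constexpr std::size_t kMaxInsns = 16;  // trace line size (assumed)
public:
    // Accessed like an I-cache with the starting address of the next instruction block;
    // a hit delivers a whole dynamic trace in a single fetch.
    const Trace* lookup(uint64_t fetch_addr) const {
        auto it = lines_.find(fetch_addr);
        return it == lines_.end() ? nullptr : &it->second;
    }
    // The fill unit latches processed instructions and finalizes a segment when it
    // can be expanded no further; finalized segments are written into the cache.
    void fill(uint64_t start_addr, const std::vector<uint64_t>& dynamic_seq) {
        Trace t;
        for (uint64_t a : dynamic_seq) {
            if (t.insn_addrs.size() == kMaxInsns) break;   // limited by the trace line size
            t.insn_addrs.push_back(a);
        }
        lines_[start_addr] = t;
    }
};

int main() {
    TraceCache tc;
    // Dynamic sequence that crosses a taken branch (the gap between 0x400004 and 0x400100).
    tc.fill(0x400000, {0x400000, 0x400004, 0x400100, 0x400104});
    if (const Trace* t = tc.lookup(0x400000))
        std::printf("trace hit: %zu instructions in one fetch\n", t->insn_addrs.size());
    return 0;
}
```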

Page 13: The trace cache (2)

Each line stores a snapshot, or trace, of the dynamic instruction stream.

Trace construction is off the critical path.

As a group of instructions is processed, it is latched into the fill unit.

The fill unit maximizes the size of the segment and finalizes a segment when it can be expanded no further.
– The number of instructions within a trace is limited by the trace cache line size.
– Finalized segments are written into the trace cache.

Instructions can be sent from the trace cache into the reservation stations (??) without having to undergo a large amount of processing and rerouting.
– It is under research whether the instructions in the trace cache are
  • fetched but not yet decoded,
  • decoded but not yet renamed,
  • or decoded and partly renamed.
– Trace cache placement in the microarchitecture depends on this decision.

Page 14: The trace cache - performance

Three applications from the SPECint95 benchmarks are simulated on a 16-wide issue machine with perfect branch prediction.

Page 15: Superspeculative processors

Idea: instructions generate many highly predictable result values in real programs. Speculate on source operand values and begin execution without waiting for the result of the previous instruction. Speculate about true data dependences!!

Reasons for the existence of value locality:
– Due to register spill code, the reuse distance of many shared values is very short in processor cycles. Many stores do not even make it out of the store queue before their values are needed again.
– Input sets often contain data with little variation (e.g., sparse matrices or text files with white spaces).
– A compiler often generates run-time constants due to error-checking, switch statement evaluation, and virtual function calls.
– The compiler also often loads program constants from memory rather than using immediate operands.
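
To make value locality concrete, here is a toy last-value predictor of the kind such designs build on; the table organization, confidence threshold, and example PC are assumptions, not the actual superspeculative hardware:

```cpp
// Last-value predictor: predict an instruction's result from the value it produced last time.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct Entry { uint64_t last = 0; int conf = 0; };

class LastValuePredictor {
    std::unordered_map<uint64_t, Entry> table_;   // indexed by instruction PC
public:
    // Predict only when the same value has been seen repeatedly (value locality).
    bool predict(uint64_t pc, uint64_t& value) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.conf < 2) return false;
        value = it->second.last;
        return true;
    }
    // Train with the architecturally correct result; a real back end would also
    // trigger recovery here on a misprediction.
    void train(uint64_t pc, uint64_t actual) {
        Entry& e = table_[pc];
        if (e.last == actual) { if (e.conf < 3) ++e.conf; }
        else { e.last = actual; e.conf = 0; }
    }
};

int main() {
    LastValuePredictor vp;
    uint64_t pc = 0x400123, v;
    for (uint64_t result : {7u, 7u, 7u, 7u}) {          // e.g. a loop-invariant load value
        if (vp.predict(pc, v)) std::printf("predicted %llu\n", (unsigned long long)v);
        vp.train(pc, result);
    }
    return 0;
}
```

A real design would predict many instructions per cycle and couple the predictor to the misspeculation recovery machinery described on the following slides.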

Page 16: Strong- vs. weak-dependence model

Strong-dependence model for program execution: a total instruction ordering of a sequential program.
– Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.
– Dependences are never allowed to be violated and are enforced during instruction processing.
– To date, most machines enforce such dependences in a rigorous fashion.
– This traditional model is overly rigorous and unnecessarily restricts available parallelism.

Weak-dependence model:
– specifies that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.
– Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strong-dependence model.

Page 17: Implementation of a weak-dependence model

The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions in order to speculate aggressively past them.

The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.
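
The split can be illustrated with a toy speculate-then-verify loop; the names and the single-instruction scope are purely illustrative:

```cpp
// Weak-dependence idea in miniature: the front end executes with a predicted operand,
// the back end later checks against the real operand and squashes/re-executes on a mismatch,
// so no misspeculated result ever reaches permanent machine state.
#include <cstdio>

int execute_add(int src, int imm) { return src + imm; }

int main() {
    int real_src = 41;        // produced late by a previous instruction
    int predicted_src = 40;   // front-end value prediction (wrong on purpose)
    int speculative = execute_add(predicted_src, 1);

    // Back end: strong-dependence check before commit.
    if (predicted_src != real_src) {
        std::printf("misspeculation: squash and re-execute\n");
        speculative = execute_add(real_src, 1);   // recovery
    }
    std::printf("committed result = %d\n", speculative);
    return 0;
}
```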

Page 18: Superflow processor

The Superflow processor speculates on
– instruction flow: two-phase branch predictor combined with trace cache,
– register data flow: dependence prediction, i.e. predict the register value dependence between instructions,
  • source operand value prediction,
  • constant value prediction,
  • value stride prediction: speculate on constant, incremental increases in operand values (see the sketch below),
  • dependence prediction predicts inter-instruction dependences,
– memory data flow: prediction of load values, of load addresses, and alias prediction.

Superflow simulations: 7.3 IPC for the SPEC95 integer suite, up to 9 instructions per cycle when 32 instructions are potentially issued per cycle.

With dependence and value prediction, a three-cycle issue nearly matches the performance of a single-issue dispatch.
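
A minimal sketch of the value stride prediction listed above; the table layout, confidence handling, and the strided-load example are assumptions, not the Superflow design:

```cpp
// Stride value predictor: predict the next value as last value + constant stride.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct StrideEntry { uint64_t last = 0; int64_t stride = 0; int conf = 0; };

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table_;   // indexed by instruction PC
public:
    bool predict(uint64_t pc, uint64_t& value) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.conf < 2) return false;
        value = it->second.last + it->second.stride;     // constant incremental increase
        return true;
    }
    void train(uint64_t pc, uint64_t actual) {
        StrideEntry& e = table_[pc];
        int64_t observed = (int64_t)(actual - e.last);
        if (observed == e.stride) { if (e.conf < 3) ++e.conf; }
        else { e.stride = observed; e.conf = 0; }
        e.last = actual;
    }
};

int main() {
    StridePredictor sp;
    uint64_t pc = 0x400200, v;
    for (uint64_t addr = 0x1000; addr < 0x1000 + 8 * 64; addr += 64) {   // strided address stream
        if (sp.predict(pc, v) && v == addr) std::printf("hit %#llx\n", (unsigned long long)v);
        sp.train(pc, addr);
    }
    return 0;
}
```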

Page 19: Superflow processor proposal

(Figure: Superflow processor proposal block diagram - instruction flow: I-cache, trace cache, branch predictor, fetch, decode, and instruction buffer; register dataflow: reservation stations feeding integer, floating-point, multimedia, and memory units, with a reorder buffer and retire stage; memory dataflow: store queue and D-cache.)

Page 20: Advanced superscalar processors - characteristics

– aggressive speculation, such as a very aggressive dynamic branch predictor,
– a large trace cache,
– very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),
– a large number of reservation stations to accommodate 2,000 instructions,
– 24 to 48 highly optimized, pipelined functional units,
– sufficient on-chip data cache, and
– sufficient resolution and forwarding logic.

Page 21: Advanced superscalar processor

Page 22: Requirements and solution

Delivering optimal instruction bandwidth requires:

– a minimal number of empty fetch cycles,
– a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle,
– and a minimal number of cycles in which the instructions fetched are subsequently discarded.

Consuming this instruction bandwidth requires:
– sufficient data supply,
– and sufficient processing resources to handle the instruction bandwidth.

Suggestions:
– an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses),
– a large trace cache for providing a logically contiguous instruction stream,
– an aggressive multi-hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling.

Page 23: Multi-hybrid branch predictor

Hybrid predictors comprise several predictors, each targeting different classes of branches.

Idea: each predictor scheme works best for a different branch type. McFarling already combined two predictors (see also PowerPC 620, Alpha 21364, ...).

As predictors increase in size, they often take more time to react to changes in a program (warm-up time).

A hybrid predictor with several components can solve this problem by using component predictors with shorter warm-up times while the larger predictors are warming up.

Examples of predictors with shorter warm-up times are two-level predictors with shorter histories as well as smaller dynamic predictors.

The Multi-Hybrid uses a set of selection counters for each entry in the branch target buffer, in the trace cache, or in a similar structure, keeping track of the predictor currently most accurate for each branch and then using the prediction from that predictor for that branch.
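
A behavioral sketch of the selection-counter mechanism, assuming two toy component predictors and simple saturating counters (the real Multi-Hybrid components, counter widths, and update policy differ):

```cpp
// Multi-hybrid selection: per-branch counters track which component has been most
// accurate, and that component's prediction is used for the branch.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

using Predictor = std::function<bool(uint64_t pc)>;   // returns taken / not taken

class MultiHybrid {
    std::vector<Predictor> components_;
    // One saturating selection counter per component, per branch (e.g., kept in the BTB entry).
    std::unordered_map<uint64_t, std::vector<int>> counters_;

    std::vector<int>& counters_for(uint64_t pc) {
        return counters_.try_emplace(pc, std::vector<int>(components_.size(), 0)).first->second;
    }
public:
    explicit MultiHybrid(std::vector<Predictor> c) : components_(std::move(c)) {}

    bool predict(uint64_t pc) {
        auto& cnt = counters_for(pc);
        std::size_t best = 0;
        for (std::size_t i = 1; i < cnt.size(); ++i)
            if (cnt[i] > cnt[best]) best = i;          // currently most accurate component
        return components_[best](pc);
    }

    void update(uint64_t pc, bool taken) {
        auto& cnt = counters_for(pc);
        for (std::size_t i = 0; i < components_.size(); ++i) {
            bool correct = components_[i](pc) == taken;
            cnt[i] += correct ? 1 : -1;
            if (cnt[i] > 3) cnt[i] = 3;
            if (cnt[i] < 0) cnt[i] = 0;
        }
    }
};

int main() {
    // Two toy components standing in for, e.g., a small bimodal and a large two-level predictor.
    MultiHybrid mh({[](uint64_t) { return true; },
                    [](uint64_t pc) { return (pc & 1) == 0; }});
    uint64_t pc = 0x400300;
    for (int i = 0; i < 5; ++i) { bool p = mh.predict(pc); mh.update(pc, true); std::printf("%d\n", p); }
    return 0;
}
```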

Page 24: Instruction supply and out-of-order fetch

An in-order fetch processor, upon encountering a trace cache miss, waits until the miss is serviced before fetching any new segments.

But an out-of-order fetch processor temporarily ignores the segment associated with the miss, attempting to fetch, decode, and issue the segments that follow it.

After the miss has been serviced, the processor decodes and issues the ignored segment.

Higher performance can be achieved by fetching instructions that—in terms of a dynamic instruction trace—appear after a mispredicted branch, but are not control-dependent upon that branch.

In the event of a mispredict, only instructions control-dependent on the mispredicted branch are discarded.

Out-of-order fetch provides a way to fetch control-independent instructions.

Page 25: Data supply

A 16-wide-issue processor will need to execute about eight loads/stores per cycle.

The primary design goal of the data-cache hierarchy is to provide the necessary bandwidth to support eight loads/stores per cycle.

The size of a single, monolithic, multi-ported, first-level data cache would likely be so large that it would jeopardize the cycle time.

Because of this, we expect the first-level data cache to be replicated to provide the required ports.

Further features of the data supply system:
– A bigger, second-level data cache with lower port requirements.
– Data prefetching.
– Processors will predict the addresses of loads, allowing loads to be executed before the computation of the operands needed for their address calculation.
– Processors will predict dependences between loads and stores, allowing them to predict that a load is always dependent on some older store (see the sketch below).
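
A toy sketch of the load/store dependence prediction mentioned in the last bullet: loads that once caused a memory-order violation are remembered and made to wait next time. The PC-indexed set and the permanent marking are assumptions, not a concrete proposal from the lecture:

```cpp
// Memory dependence predictor: only loads with a history of violations wait for older stores.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

class MemDepPredictor {
    std::unordered_set<uint64_t> dependent_loads_;   // load PCs predicted to depend on an older store
public:
    bool must_wait(uint64_t load_pc) const { return dependent_loads_.count(load_pc) != 0; }
    // Called when a speculatively executed load turns out to have read stale data.
    void record_violation(uint64_t load_pc) { dependent_loads_.insert(load_pc); }
};

int main() {
    MemDepPredictor mdp;
    uint64_t load_pc = 0x400400;
    std::printf("first time, speculate: wait=%d\n", mdp.must_wait(load_pc));  // issue the load early
    mdp.record_violation(load_pc);                                            // memory-order violation detected
    std::printf("after violation: wait=%d\n", mdp.must_wait(load_pc));        // now the load waits
    return 0;
}
```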

Page 26: Execution core

If the fetch engine is providing 16 to 32 instructions per cycle, then the execution core must consume instructions just as rapidly.

To avoid unnecessary delays due to false dependencies, logical registers must be renamed.

To compensate for the delays imposed by the true data dependencies, instructions must be executed out of order.

Patt et al. envision an execution core comprising 24 to 48 functional units supplied with instructions from large reservation stations and having a total storage capacity of 2,000 or more instructions.

Functional units will be partitioned into clusters of three to five units. Each cluster will maintain an individual register file. Each functional unit has its own reservation station.

Instruction scheduling will be done in stages.
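
For the register renaming mentioned above, a minimal rename-table sketch (register counts and the free-list policy are assumptions; recovery and register reclamation are omitted):

```cpp
// Rename table mapping logical to physical registers; removes false (WAR/WAW) dependences.
#include <array>
#include <cstdio>
#include <vector>

constexpr int kLogicalRegs = 32;

struct Renamer {
    std::array<int, kLogicalRegs> map{};
    std::vector<int> free_list;
    explicit Renamer(int physical_regs) {
        for (int i = 0; i < kLogicalRegs; ++i) map[i] = i;            // identity mapping at reset
        for (int p = kLogicalRegs; p < physical_regs; ++p) free_list.push_back(p);
    }
    int rename_src(int logical) const { return map[logical]; }        // sources read the current mapping
    int rename_dest(int logical) {                                    // destinations get a fresh register
        int phys = free_list.back();
        free_list.pop_back();
        map[logical] = phys;
        return phys;
    }
};

int main() {
    Renamer r(64);
    // r1 = r2 + r3 ;  r2 = r1 + r4
    int s2 = r.rename_src(2);
    int s3 = r.rename_src(3);
    int d1 = r.rename_dest(1);
    std::printf("add p%d, p%d, p%d\n", d1, s2, s3);
    int s1 = r.rename_src(1);
    int s4 = r.rename_src(4);
    int d2 = r.rename_dest(2);
    std::printf("add p%d, p%d, p%d\n", d2, s1, s4);
    return 0;
}
```

After renaming, the second instruction's write to r2 no longer conflicts with the first instruction's read of r2; only the true dependence through r1 remains.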

Page 27: Conclusions

Patt et al. argue that the highest-performance computing system (when one-billion-transistor chips are available) will contain a single processor on each processor chip.

All this makes sense, however, only
– if CAD tools can be improved to design such chips,
– if algorithms and compilers can be redesigned to take advantage of such powerful dynamically scheduled engines.

Page 28: Processors that use thread-level speculation to boost single-threaded programs

Multiscalar processors divide a program into a collection of tasks that are distributed to a number of parallel processing units under control of a single hardware sequencer.

Trace processors facilitate high ILP and a fast clock by breaking up the processor into multiple distinct cores (similar to multiscalar!) and breaking up the program into traces (dynamic sequences of instructions).

DataScalar processors run the same sequential program redundantly across multiple processors using distributed data sets.

Page 29: Multiscalar processors

A program is represented as a control flow graph (CFG), where basic blocks are nodes and arcs represent flow of control.

Program execution can be viewed as walking through the CFG with a high level of parallelism.

A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task.

The primary constraint: it must preserve the sequential program semantics.

A program is statically partitioned into tasks, which are marked by annotations of the CFG.

The tasks are distributed to a number of parallel PEs within a processor. Each PE fetches and executes instructions belonging to its assigned task.
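
A loose behavioral sketch of the task-sized walk: a sequencer predicts the next task and hands it to a free PE, while tasks retire in order from the head of the circular queue. The task graph, PE count, and scheduling here are illustrative, not the actual multiscalar microarchitecture:

```cpp
// Toy multiscalar-style sequencer: speculative task-sized steps, in-order task commit.
#include <cstdio>
#include <deque>
#include <map>

struct Task { int id; int predicted_next; };   // one task = several basic blocks of the CFG

int main() {
    // Static task graph produced by the compiler (annotations on the CFG); -1 ends the walk.
    std::map<int, Task> tasks = { {0, {0, 1}}, {1, {1, 2}}, {2, {2, 3}}, {3, {3, -1}} };
    const int num_pes = 2;
    std::deque<int> active;                    // circular queue of PEs: head holds the oldest task

    int next = 0;
    while (next != -1) {
        if ((int)active.size() == num_pes) {   // all PEs busy: retire the non-speculative head task
            std::printf("commit task %d\n", active.front());
            active.pop_front();
        }
        std::printf("assign task %d to a free PE\n", next);
        active.push_back(next);
        next = tasks[next].predicted_next;     // speculative next-task prediction
    }
    while (!active.empty()) { std::printf("commit task %d\n", active.front()); active.pop_front(); }
    return 0;
}
```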

Page 30: Multiscalar mode of execution

(Figure: a small CFG with basic blocks A, B/C, D, and E; tasks A, B, D, and E are assigned to PE 0, PE 1, PE 2, and PE 3, with data values forwarded from one PE to the next.)

Page 31: Multiscalar processor

(Figure: a sequencer with head and tail pointers distributes tasks over a unidirectional ring to processing elements 1 to n; each PE contains an I-cache, a processing unit, and a register file, and an interconnect links the PEs to the ARB and the D-cache data banks.)

Page 32: Proper resolution of inter-task data dependences

This concerns in particular data that is passed between instructions via registers and via memory.

To maintain a sequential appearance, a twofold strategy is employed.
– First, each processing element adheres to sequential execution semantics for the task assigned to it.
– Second, a loose sequential order is enforced over the collection of processing elements, which in turn imposes a sequential order on the tasks.

The sequential order on the processing elements is maintained by organizing the elements into a circular queue.

The appearance of a single logical register file is maintained, although copies are distributed to each parallel PE.
– Register results are dynamically routed among the many parallel processing elements with the help of compiler-generated masks (see the sketch below).

For memory operations: an address resolution buffer (ARB) is provided to hold speculative memory operations and to detect violations of memory dependences.
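
A toy illustration of the compiler-generated masks mentioned above: each task carries a mask of the registers it may produce, and only those values are forwarded to the successor PE on the ring. The mask encoding and register count are assumptions; the ARB and speculation recovery are omitted:

```cpp
// Mask-driven register forwarding between multiscalar-style processing elements.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kRegs = 8;

struct PE {
    std::array<int, kRegs> regs{};        // this PE's copy of the register file
    uint32_t create_mask = 0;             // registers this task may produce (from the compiler)
    void forward_to(PE& next) const {     // route produced values to the successor task
        for (int r = 0; r < kRegs; ++r)
            if (create_mask & (1u << r)) next.regs[r] = regs[r];
    }
};

int main() {
    PE pe0, pe1;
    pe0.regs = {1, 2, 3, 4, 5, 6, 7, 8};
    pe0.create_mask = 0b00000110;         // task 0 produces r1 and r2
    pe0.regs[1] = 42;                     // task 0 executes and writes its results
    pe0.regs[2] = 43;
    pe0.forward_to(pe1);                  // successor task receives only the masked registers
    std::printf("PE1 sees r1=%d r2=%d r3=%d\n", pe1.regs[1], pe1.regs[2], pe1.regs[3]);
    return 0;
}
```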

Page 33: The multiscalar paradigm has at least two forms of speculation

Control speculation, which is used by the task sequencer, and

Data dependence speculation, which is performed by each PE.

It could also use other forms of speculation, such as data value speculation, to alleviate inter-task dependences.

Page 34: Alternatives with thread-level speculation

Static thread-level speculation: multiscalar.

Dynamic thread-level speculation: "threads" are generated at run time from:
– loops,
– subprogram invocations,
– specific start or stop instructions,
– traces: trace processors.

DataScalar.

Page 35: Trace processors

The focus of a Trace processor is the combination of the Trace cache with multiscalar.

Idea: Create subsystems similar in complexity to today's superscalar processors and combine replicated subsystems into a full processor.

A trace is a sequence of instructions that potentially covers several basic blocks starting at any point in the dynamic instruction stream.

A trace cache is a special I-cache that captures dynamic instruction sequences in contrast to the I-cache that contains static instruction sequences.

Page 36: Trace processor

(Figure: a trace cache, filled by trace construction, instruction preprocessing, and branch prediction and steered by next-trace and data-value prediction, feeds processing elements 1 to n; each PE has an instruction buffer, functional units, and local registers, and all PEs share the global registers.)

Page 37: Trace processor

Instruction fetch hardware fetches instructions from the I-cache and simultaneously generates traces of 8 to 32 instructions, including predicted conditional branches.

Traces are built as the program executes, and they are stored in a trace cache.

A trace fetch unit reads traces from the trace cache and parcels them out to the parallel PEs.

Next-trace prediction speculates on the next traces to be executed.
– Next-trace prediction predicts multiple branches per cycle.
– Data prediction speculates on the trace's input data values. Constant value prediction is 80% correct for gcc.

A trace cache miss causes a trace to be built through conventional instruction fetching with branch prediction.

The trace processor is similar to multiscalar except for its use of hardware-generated dynamic traces rather than compiler-generated static tasks.

Page 38: DataScalar processors

The DataScalar model of execution runs the same sequential program redundantly across multiple processors.

The data set is distributed across physical memories that are tightly coupled to their distinct processors.

Each processor broadcasts operands that it loads from its local memory to all other processors.

Instead of explicitly accessing a remote memory, processors wait until the requested value is broadcast.

Stores are completed only by the processor that owns the operand and are dropped by the others.
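
A behavioral sketch of this execution model, with ownership, broadcast, and store-dropping reduced to plain function calls (the ownership map and two-node setup are illustrative):

```cpp
// DataScalar in miniature: every node runs the same program; loads from communicated
// memory are serviced by the owning node and broadcast, stores complete only at the owner.
#include <cstdio>
#include <map>
#include <vector>

struct Node {
    int id;
    std::map<int, int> local_mem;                         // this node's communicated pages
    bool owns(int addr) const { return local_mem.count(addr) != 0; }
};

// One-way "broadcast bus": the owner supplies the operand, all other nodes just consume it.
int datascalar_load(std::vector<Node>& nodes, int addr) {
    for (Node& n : nodes)
        if (n.owns(addr)) return n.local_mem[addr];       // owner broadcasts; non-owners wait
    return 0;                                             // (replicated data would be read locally)
}

void datascalar_store(std::vector<Node>& nodes, int addr, int value) {
    for (Node& n : nodes)
        if (n.owns(addr)) n.local_mem[addr] = value;      // only the owner completes the store;
                                                          // non-owners drop it, so writes never
                                                          // appear on the global bus
}

int main() {
    std::vector<Node> nodes = { {0, {{0x100, 7}}}, {1, {{0x200, 9}}} };
    int a = datascalar_load(nodes, 0x100);                // owned by node 0, broadcast to node 1
    int b = datascalar_load(nodes, 0x200);                // owned by node 1, broadcast to node 0
    datascalar_store(nodes, 0x100, a + b);                // completed by node 0 only
    std::printf("sum stored at 0x100 = %d\n", nodes[0].local_mem[0x100]);
    return 0;
}
```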

Page 39: Address space

The address space is divided into a replicated and a communicated section:
– The communicated section holds values that exist only in single copies and are owned by the respective processor.
– The communicated section of the address space is distributed among the processors.
– Replicated pages are mapped in each processor's local memory.
– Access to a replicated page requires no data communication.

Page 40: Accesses of DataScalar processors to memory

(Figure: processors 1 and 2, each with a local memory split into a replicated and a communicated section; each processor's own loads and stores hit its local communicated section, and loaded operands are broadcast to the other processor.)

Page 41: Why DataScalar?

The main goal of the DataScalar model of execution is the improvement of memory system performance by introducing redundancy in execution by replicating processors and part of the data storage.

Since all physical memory is local to at least one processor, a request for a remote operand is never sent.

Memory access latency and bus traffic are reduced.

All communication is one-way.

Writes never appear on the global bus.

Page 42: Execution mode

The processors execute the same program in slightly different time steps.

The lead processor runs slightly ahead of the others, especially when it is broadcasting while the others wait for the broadcasted value.

When the program execution accesses an operand that is not owned by the lead processor, a lead change occurs.

All processors stall until the new lead processor catches up and broadcasts its operands.

The capability that each processor may run ahead on computation that involves operands owned by the processor is called datathreading.

Page 43: Pros and cons

The DataScalar model is primarily a memory system optimization intended for codes that are performance-limited by the memory system and difficult to parallelize.

Simulation results: the DataScalar model of execution works best with codes for which traditional parallelization techniques fail.

Six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable but more traditional memory system.

Current technological parameters do not make DataScalar systems a cost-effective alternative to today's microprocessors.

Page 44: Pros and cons (2)

For a DataScalar system to be more cost-effective than the alternatives, the following three conditions must hold:
– Processing power must be cheap; the dominant cost of each node should be memory.
– Remote memory accesses should be slower than local memory accesses.
– Broadcasts should not be prohibitively expensive.

Three possible candidates:
– IRAM: because remote memory accesses to other IRAM chips will be more expensive than on-chip memory accesses.
– Extending the concept of CMP: CMPs access operands from a processor-local memory faster than requesting an operand from a remote processor memory across the chip, due to wiring delays.
– NOWs (networks of workstations): an alternative to paging, provided that broadcasts are efficient.