ibmpoweriv

8/3/2019 IBMpowerIV

http://ixbtlabs.com/articles/ibmpower4/

1. Instruction Level Parallelism

Most modern processors share a similar architecture: a speculative superscalar design with out-of-order execution. This applies to both RISC and X86 processors. In this approach several functional units (FUs) operate simultaneously in the processor and, whenever possible, pick up instructions from a special buffer into which they are placed after decoding. The advantage is that parallelization doesn't depend on the programmer (at least in higher-level languages), and there is no need for the special algorithms and language constructs used to develop programs for computers with several processors. One might think that by increasing the number of FUs it's possible to achieve a very high degree of Instruction Level Parallelism (ILP). That's true to some degree, but the superscalar architecture has a lot of limitations which grow as the number of execution units increases. For example:

1. Register dependences - the number of registers needed to keep the FUs sufficiently loaded must grow quadratically with the number of FUs. The X86 ISA with its 8 GPRs makes it hard to raise the ILP further by just adding FUs, which is why the visible registers are renamed into a much greater number of hidden ones. RISC architectures look better, but they also have to use register renaming. This is not a cure-all as far as performance is concerned, and it makes the chip more complicated. Theoretically the ILP of superscalar processors has a high limit (tens of instructions per clock for many of the programs included in the SPEC tests), but in reality even a much lower degree of parallelism is out of reach.

2. Rapidly growing processor complexity (complicated development, debugging and testing) raises the costs and the lead time, which is not made up for by the performance gain.

3. Increasing requirements for the L1 cache - to "feed" a great number of FUs the cache must have a high throughput and a large capacity. The cost is longer delays, which hurt performance. The number of register ports has to be increased as well.

According to some estimates, doubling the ILP of modern superscalar processors would require about 128 GPRs and 8 ALUs plus 8 load/store units. This will probably be realized in future IA-64 chips, but even now the same speed-up can be achieved for a great many applications in simpler ways.
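To make the register-dependence limit more concrete, here is a minimal sketch that estimates the ILP available in a short instruction sequence. It is an idealized model and not a description of any real processor: it assumes unlimited FUs, one-cycle latency for every instruction and perfect register renaming, so only true (read-after-write) dependences count. The function name `ilp_limit` and the tuple format are hypothetical.

```python
def ilp_limit(program):
    """Estimate ILP as instruction count divided by dataflow depth.

    `program` is a list of (dest, sources) tuples in program order.
    Assumptions of this sketch: unlimited FUs, unit latency, and ideal
    register renaming, so only true dependences delay an instruction.
    """
    if not program:
        return 0.0
    ready = {}   # register name -> cycle after which its value is available
    depth = 0    # length of the longest dependence chain seen so far
    for dest, sources in program:
        start = max((ready.get(src, 0) for src in sources), default=0)
        finish = start + 1
        ready[dest] = finish
        depth = max(depth, finish)
    return len(program) / depth


# Four independent instructions can all issue in the same cycle.
independent = [("r1", ()), ("r2", ()), ("r3", ()), ("r4", ())]
# A chain in which each instruction needs the previous result.
chain = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

print(ilp_limit(independent), ilp_limit(chain))
```

In this model the independent sequence reaches an ILP of 4 while the dependent chain stays at 1, no matter how many FUs are added.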

2. Thread Level Parallelism

Various Thread Level Parallelism (TLP) technologies are one way to boost the performance of superscalar processors. Processors which use this approach execute several instruction streams simultaneously. Multithreaded programs benefit from TLP because it makes use of the parallelism that already exists in programs optimized for multiprocessor systems. Using some variation of SMT (Simultaneous Multithreading), for example Hyper-Threading from Intel, is a temporary solution. This technology loads the FUs more effectively and optimizes memory access in existing superscalar architectures, while the threads share the FUs of a single processor. The performance gain on the Xeon processors (which are, in fact, Pentium 4) is estimated at 10-30% depending on the program. Another example of an SMT implementation is the IBM RS64 IV family of PowerPC server processors, the predecessor of the POWER4 in the pSeries 6000 (RS/6000) and iSeries 400 (AS/400) systems. [1]

On the whole, SMT is a logical improvement of modern processors that is simple to implement.

Chip MultiProcessing (CMP) is a more radical approach in which several processor cores are located on one die. The technology has now reached the level where two complex superscalar processors and a sufficient amount of cache fit on one chip. The result is an SMP system, and since the cores are located on the same die, the data rate between the processors can be much higher than with any external buses, switches etc. Interestingly, in the early 90s Intel considered developing such a processor: the 786 chip codenamed Micro-2000 was to have 4 cores and 2 additional vector processors and to work at 250 MHz. Just compare that with the Pentium 4 :).

The POWER4 consists of 2 identical processor cores which implement the PowerPC AS instruction set. The die measures about 400 mm², is based on IBM's 0.18 micron copper SOI CMOS 8S2 technology with 7 metallization layers, works at 1.1 and 1.3 GHz, and is the fastest microprocessor available today. There is also a variation of the POWER4 with one processor on the die. HP and SUN are also going to release CMP processors soon, based on a 0.13 micron fab process. It's possible that AMD will follow this path as well.

3. IBM POWER4 - introduction

This processor is designed for maximum performance and targets the hi-end server and supercomputer market; it is meant for SMP systems with up to 32 processors. Much attention was given to developing high-performance communication between processors and memory. The POWER4 is highly fault-tolerant: critical failures do not hang the system; instead, interrupts are generated and then processed by the system. The POWER4 was developed to run both commercial (server) applications and scientific and technical ones efficiently. Note that earlier the IBM POWER/PowerPC processors were divided into scientific and server lines, POWER and RS64 respectively. The POWER4 suits a wide range of hi-end applications and uses every current way of boosting performance (within the PowerPC instruction set). You won't find cut-down caches or missing FUs here. The chip's design looks unusual; later we will see why the clock speed jumped from 600 MHz for the RS64 IV to 1.3 GHz for the POWER4.

4. POWER4 die

The POWER4 houses 2 processors, each with its own L1 caches for data and instructions. The die has a single L2 cache of 1450 KBytes managed by 3 separate controllers which are connected to the cores via a CIU (Core Interface Unit). The controllers work independently, and each can process 32 bytes per clock. Each processor connects to the CIU over two separate 256-bit buses, one for instruction fetch and one for data loads, plus a separate 64-bit bus for storing results; the L2 cache has a bandwidth of about 100 GBytes/s. The L2 cache system looks well balanced and very powerful. Each processor has a special unit to support noncacheable operations (Noncacheable Unit). The L3 controller and the memory controller are located on the die as well. The L3 cache, which works at 1/3 of the processor's speed, and the memory are connected via two 128-bit buses operating at 1/3 of the processor's frequency. The throughput of the memory interface is about 11 GBytes/s. Data flows coming from the memory, from the L2 and L3 caches and from the chips' buses are controlled by the Fabric Controller:

The 32-bit GX Bus, running at 1/3 of the processor's frequency, connects the chip to the I/O subsystem (e.g., via a PCI bridge) and to a switch when multiple nodes containing POWER4 chips are combined into clusters.

5. SMP capabilities

4 POWER4 chips can be packed into one module, forming an 8-processor SMP. Within a module the POWER4 chips are connected via 4 128-bit buses which operate at 1/2 of the processor's speed. They are implemented as 6 unidirectional buses, three in one direction and three in the other, and their total throughput is about 35 GBytes/s.
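The bandwidth figures quoted for the die and for the module can be sanity-checked with simple arithmetic: number of buses times bus width times bus clock. The sketch below is only a back-of-the-envelope check. It assumes the 1.1 GHz clock speed (the lower of the two mentioned), counts 1 GByte as 10^9 bytes, and treats every listed bus as transferring data on every bus clock. The helper `peak_gb_per_s` is a hypothetical name.

```python
def peak_gb_per_s(buses, width_bits, clock_hz):
    """Raw peak of `buses` parallel buses in GBytes/s (1 GByte = 10**9 bytes)."""
    return buses * (width_bits // 8) * clock_hz / 1e9


CORE_HZ = 1.1e9  # assumption: the lower of the two quoted clock speeds

# L2: three controllers, each moving 32 bytes (256 bits) per core clock.
l2 = peak_gb_per_s(3, 256, CORE_HZ)
# L3/memory: two 128-bit buses at 1/3 of the core clock.
l3_mem = peak_gb_per_s(2, 128, CORE_HZ / 3)
# Chip-to-chip inside a module: four 128-bit buses at 1/2 of the core clock.
chip_to_chip = peak_gb_per_s(4, 128, CORE_HZ / 2)

print(f"L2 {l2:.1f}, L3/memory {l3_mem:.1f}, chip-to-chip {chip_to_chip:.1f}")
```

Under these assumptions the results are roughly 105.6, 11.7 and 35.2 GBytes/s, which is in line with the rounded figures of about 100, 11 and 35 GBytes/s given in the text.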


Take a look at the 4-chip module (such pieces of silicon can be found, for example, in a 32-processor pSeries 690 Model 681 server):

IBM emphasizes that a central switch for connecting the 4 POWER4 chips is replaced with multiple fast independent point-to-point buses. AMD is going to use this approach in its future systems based on the Hammer processor.

Multimodule systems can be built from up to 4 modules, which gives a 16-chip SMP system or, taking into account the 2 processors on each POWER4 chip, a 32-processor one. Two unidirectional 64-bit buses connect each module to the other modules; they form a ring topology:


The POWER4 is not meant for SMP systems with more than 32 processors.

6. POWER4 core

The POWER4 core differs considerably from its predecessors because it uses an approach applied in modern X86 processors: PowerPC instructions are translated into internal operations, which are then formed into groups. Each POWER4 processor is thus a superscalar core with speculative out-of-order execution. There are 8 pipelined execution units in all: two identical floating-point units, each able to perform a multiplication and an addition per clock (i.e. up to 4 floating-point operations per clock in total), two load/store units, two integer execution units, a branch execution unit and a logical execution unit. Floating-point division and square root operations are not pipelined and can hurt the POWER4's performance considerably. Look at the pipeline of the POWER4:

The integer pipeline has 17 stages! Compare that with the 5-stage pipeline of previous IBM chips. Let's dwell on some interesting features of the POWER4 core.

The L1 cache can deliver up to 8 instructions per clock to the front end of the pipeline from the address held in the IFAR register, whose contents are determined by the branch prediction unit. The instructions are then decoded and cracked, and groups are formed. Grouping minimizes the logic needed to track a large number of in-flight instructions. A group contains up to five internal instructions referred to as IOPs. In the decode stages the instructions are placed sequentially in a group: the oldest instruction goes into slot 0, the next oldest into slot 1, and so on. Slot 4 is reserved for branch instructions only. To reach a high clock speed the POWER4 cracks PowerPC instructions into a greater number of simpler instructions, which are then combined into groups and executed. An instruction split into 2 IOPs is said to be cracked; an instruction split into more than 2 IOPs is called a millicoded instruction.
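The slot rules described above can be illustrated with a small sketch. It is not IBM's actual group formation logic: the `Iop` class and `form_groups` function are hypothetical, and the sketch assumes that a branch always closes the group it lands in and that unused slots simply stay empty (`None`).

```python
from dataclasses import dataclass

GROUP_SLOTS = 5   # slots 0..4, as described in the text
BRANCH_SLOT = 4   # slot 4 holds branch instructions only


@dataclass
class Iop:
    name: str
    is_branch: bool = False


def form_groups(iops):
    """Pack IOPs, oldest first, into groups of up to five slots.

    Assumptions of this sketch: a branch goes to slot 4 and closes the
    group; non-branch IOPs fill slots 0..3 in program order.
    """
    groups = []
    current = [None] * GROUP_SLOTS
    next_slot = 0
    for iop in iops:
        if iop.is_branch:
            current[BRANCH_SLOT] = iop
            groups.append(current)
            current = [None] * GROUP_SLOTS
            next_slot = 0
            continue
        if next_slot == BRANCH_SLOT:
            # Slots 0..3 are full and this IOP is not a branch: start a new group.
            groups.append(current)
            current = [None] * GROUP_SLOTS
            next_slot = 0
        current[next_slot] = iop
        next_slot += 1
    if any(slot is not None for slot in current):
        groups.append(current)
    return groups


stream = [Iop("a"), Iop("b"), Iop("c"), Iop("d"), Iop("e"),
          Iop("br", is_branch=True), Iop("f")]
for group in form_groups(stream):
    print([slot.name if slot else "-" for slot in group])
```

With this input the five non-branch IOPs cannot share one group, because the fifth slot accepts branches only, so the stream ends up in three groups.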

Register renaming is widely used in the POWER4: in particular, the 32 GPRs are renamed into 80 internal registers and the 32 FPRs into 72. It's clear that many once attractive features of the PowerPC are outdated, and the processor gets new units which transform instructions into a form more convenient for execution.

A processor with such a long pipeline needs an effective branch prediction algorithm. For dynamic prediction the POWER4 uses 2 algorithms plus an additional table which tracks which of them works better for a particular branch instruction. The dynamic prediction can be overridden by a special bit in the branch instruction. By the way, a similar feature appeared in the X86 line with the Pentium 4.

There are 3 types of buffers which speed up the translation of virtual addresses into physical ones: a translation lookaside buffer (TLB) with 1024 entries, a segment lookaside buffer (SLB), which is a fully associative cache with 64 entries, and an effective-to-real address table (ERAT). The ERAT is divided into two tables, one for data and one for instructions, with 128 entries each.


More detailed information on the POWER4 architecture and on program optimization can be found in IBM's guide [3].

7. Caches and memory

The following table shows key data on the memory subsystem:

Component            | Organization                                          | Capacity per chip
L1 Instruction Cache | Direct mapped, 128-byte line managed as 4 32-byte sectors | 128 KBytes (64 KBytes per processor)
L1 Data Cache        | 2-way, 128-byte line                                  | 64 KBytes (32 KBytes per processor)
L2                   | 8-way, 128-byte line                                  | 1.41 MBytes
L3                   | 8-way, 512-byte line managed as 4 128-byte sectors    | 32 MBytes
Memory               | -                                                     | 0 - 16 GBytes
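The organization column translates directly into the number of sets in each cache: capacity divided by line size gives the number of lines, and dividing that by the associativity gives the number of sets. A minimal sketch follows, with `num_sets` as a hypothetical helper and a direct-mapped cache treated as 1-way. The L2 is left out because its quoted capacity is not a power of two, so the result would depend on the exact byte count.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache (direct mapped = 1 way)."""
    lines = capacity_bytes // line_bytes
    return lines // ways


# Per-processor L1 caches and the per-chip L3, using the table's figures.
l1_instruction = num_sets(64 * KB, ways=1, line_bytes=128)
l1_data = num_sets(32 * KB, ways=2, line_bytes=128)
l3 = num_sets(32 * MB, ways=8, line_bytes=512)

print(l1_instruction, l1_data, l3)
```

With the table's figures this gives 512 sets for the L1 instruction cache, 128 for the L1 data cache and 8192 for the L3.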

The latency of the L1 cache is 4 cycles (2 cycles for the Pentium 4, 4 for the Athlon). The L2 cache uses the MESI protocol to maintain coherence, and its average latency is 12 cycles (18 for the Pentium 4, 20 for the Athlon), although it can sometimes rise to 20 cycles. The L3 cache controller, the memory controller and the tag directory are integrated into the chip, while the cache itself consists of two 16 MByte eDRAM chips mounted on a separate module and divided into 8 banks of 2 MBytes. An important feature of the L3 cache is that the separate caches of up to 4 POWER4 chips can be combined (128 MBytes in total), which allows address interleaving to be used to speed up access.

The L3 cache is connected to the memory controller via two bidirectional 64-bit buses which operate at 1/3 of the processor's speed. The memory (200 MHz DDR SDRAM) is connected to the controller via two ports, each consisting of 4 32-bit buses working at 400 MHz. So the memory throughput with both ports in use is a bit over 11 GBytes/s (the respective figure for the not yet released Intel McKinley is 6.4 GBytes/s). Each chip has its own bus to the L3 cache and memory.

The POWER4 has a hardware prefetch unit which loads data into the L1 cache from anywhere in the memory hierarchy, and there are instructions which allow this process to be controlled in software.

8. Happy End

CPU                   | CPU MHz | CINT2000 base/peak | CFP2000 base/peak
AMD Athlon XP 1900+   | 1600    | 677/701            | 588/634
HP PA-8700            | 750     | 568/604            | 526/576
IBM POWER4 (1 CPU)    | 1300    | 790/814            | 1098/1169
Intel Itanium         | 800     | 314/314            | 645
Intel Pentium 4       | 2200    | 771/784            | 766/777
SUN UltraSPARC III-Cu | 900     | 470/533            | 629/731

The Alpha is fading, and the POWER4 has no competitors left as far as processor power is concerned. The new Pentium 4 looks impressive next to the pale Itanium and PA-8700, despite all the talk of an aging IA-32 and an advanced IA-64. Will the McKinley with its more powerful IA-64 implementation be able to stand up to the POWER4, at least in computational tests? Will SMT be integrated into the individual POWER4 processors? What kind of CMP chips are other companies going to unveil?
