Slide #1, Friday, October 10, 1997

Alpha 21172 Core Logic Chip Set, Jerry Huang


Alpha 21172 Inside Out

Zhihui Huang (Jerry)

University of Michigan




Components

One 21172-CA chip
– Control, I/O, address chip (CIA)
– 388 pins, plastic ball grid array (PBGA)

Four 21172-BA chips
– Data switch chip (DSW)
– 208 pins, plastic quad flat pack (PQFP)




Data Paths

64-bit data path between CIA and DSW
– iod

128-bit data path between 21164 and DSW
– cpu_dat

256-bit memory data path between DSW and memory
– mem_dat

The slowest part has the widest bus.




3-way Interface

[Figure: the 21164, the 21172, and DRAM 1 to DRAM 8, with DSW0 to DSW3 in between. Labels: 64-bit PCI bus, 64-bit IOD bus, addr<39:4>, RAS, CAS, memadr<11:0>, control.]




Memory

[Figure: DRAM 1 to DRAM 8]

The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.

[Figure: SIMM 1 to SIMM 4 on a 128-bit bus]

4 SIMMs fill a data bus of 128 bits.

[Figure: SIMM 1 to SIMM 8 on a 256-bit bus]

With SIMM 5 to SIMM 8 added, the data bus is 256 bits.

Needs a jumper.
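The SIMM counts and bus widths above can be checked with a little arithmetic. This is only a sketch: the 32 data bits per SIMM is an assumption derived from "4 SIMMs fill 128 bits", and the function name `memory_bus_width` is made up for illustration.

```python
# Assumption: every SIMM contributes the same number of data bits,
# derived from the slide's statement that 4 SIMMs fill 128 bits.
DATA_BITS_PER_SIMM = 128 // 4  # 32

def memory_bus_width(simm_count):
    """Return the memory data bus width in bits for a one-bank SIMM setup."""
    if simm_count not in (4, 8):
        # The slide only describes 4-SIMM and 8-SIMM configurations.
        raise ValueError("only 4 or 8 SIMMs are described")
    return simm_count * DATA_BITS_PER_SIMM

for count in (4, 8):
    print(count, "SIMMs ->", memory_bus_width(count), "bits")
```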




Memory block

[Figure: DRAM 1 to DRAM 8 connected through DSW0 to DSW3 by a 256-bit bus]

A 256-bit block is composed of bit slices across all 8 SIMMs.

The arrangement of the slices is interleaved within the 4 DSWs.

[Figure labels, bit slices of the 128-bit bus: 15:0, 31:16, 47:32, 63:48, 79:64, 95:80, 111:96, 127:112]
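The slice labels in the figure step upward from 15:0 in 16-bit units. A minimal sketch that generates such ranges, assuming eight equal 16-bit slices cover the 128-bit bus (so the last two slices would be 111:96 and 127:112). How the slices are assigned to individual DSWs is not stated on the slide and is not modeled here; `slice_ranges` is a hypothetical helper name.

```python
def slice_ranges(bus_bits=128, slice_bits=16):
    """Return (high_bit, low_bit) pairs for equal slices of a data bus."""
    if bus_bits % slice_bits != 0:
        raise ValueError("bus width must be a multiple of the slice width")
    return [(low + slice_bits - 1, low) for low in range(0, bus_bits, slice_bits)]

# Print the slices in the same "high:low" notation the figure uses.
print(" ".join(f"{high}:{low}" for high, low in slice_ranges()))
```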

As you just saw, the 4 DSWs together provide the lower 128 bits of the memory bus.

For the 256-bit configuration, the DSWs also provide the upper part of the bus.

It is better to use the 256-bit configuration, or you pay the full price for the DSWs and only use half of the resources.

It may be clear now why it is a one-bank scheme in which all the SIMMs have the same size.




Bcache and Memory

3rd-level cache for the 21164

Attributes
– optional, external, physical, synchronous SRAM
– direct-mapped, write-back, write-allocate

256-bit or 512-bit block

Cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes

Supports up to 512 MB of memory
– 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36

Write-back: a cache architecture in which data is only written to main memory when it is forced out of the cache. Opposite of write-through.

The Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All Bcache and main memory FILLs or write transactions are of the selected block size.

Direct-mapped: a cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n, then the bottom n address bits correspond to an offset within a cache entry. If the cache can hold 2^m entries, then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry.

In this scheme, there is no choice of which block to flush on a cache miss, since there is only one place for any block to go. This simple scheme has the disadvantage that if the program alternately accesses different addresses which map to the same cache location, then it will suffer a cache miss on every access to these locations. This kind of cache conflict is quite likely on a multi-processor.

Write-allocate: a cache line is allocated when written data misses the cache.
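The address split described for a direct-mapped cache can be sketched as follows. The function name `split_address` and the example sizes (a 2-Mbyte cache with 64-byte lines) are illustrative assumptions, not figures for any particular board.

```python
def split_address(addr, line_size, num_lines):
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    line_size is 2**n bytes and num_lines is 2**m entries.
    """
    for value in (line_size, num_lines):
        if value <= 0 or value & (value - 1):
            raise ValueError("line_size and num_lines must be powers of two")
    n = line_size.bit_length() - 1          # bottom n bits: offset in the line
    m = num_lines.bit_length() - 1          # next m bits: cache location
    offset = addr & (line_size - 1)
    index = (addr >> n) & (num_lines - 1)
    tag = addr >> (n + m)                   # remaining top bits: stored as TAG
    return tag, index, offset

# Hypothetical example: 2-Mbyte cache, 64-byte lines -> 32768 lines.
LINES = (2 * 1024 * 1024) // 64
print(split_address(0x000000, 64, LINES))
print(split_address(0x200000, 64, LINES))  # same index, different tag
```

Two addresses that are exactly one cache size apart share an index and differ only in the tag, which is the conflict case the definition warns about.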

In the PC164

ECC protected




PCI features

Supports 64-bit PCI bus width

Supports 64-bit PCI addressing (DAC cycles)

Accepts PCI fast back-to-back cycles
– addr, data0, data1, data2, ..., addr_again!
– Frame# is only deasserted for one cycle to allow the last transfer to finish

Issues PCI fast back-to-back cycles in dense address space

[Timing diagram: clk, Frame#, and the addr, data, data sequence]




CIA Transactions

21164 memory read miss

21164 memory read miss with victim

21164 I/O read

21164 I/O write

DMA read

DMA read (prefetch)

DMA write




DSW Data Paths

[Figure: DSW data paths between the 21164/Bcache (SYS), memory (MEM), and the IOD bus. Shown: DMA buffer sets 0 and 1, each with PCI, MEM, and Flush buffers; the victim path; the read miss path; the instruction queue. IO paths are not shown.]




DSW Buffers

DMA Buffer Sets (0 and 1)
– PCI buffer for PCI DMA write data
– Memory buffer for memory data
– Flush buffer for system bus data

[Figure: DMA buffer sets 0 and 1, each with a PCI buffer and a Flush buffer]




DMA Writes

Data arrives in the PCI buffer

Memory buffer loaded at the same time

Bcache line flushed and Flush buffer loaded

3 sources merged and data written back to memory

[Figure: DMA buffer set 0 with its PCI, MEM, and Flush buffers, connected to the IOD bus, memory, and the 21164/Bcache]

As you just saw, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the MEM buffer to be loaded from memory, and the flush buffer to be loaded from the system bus, all at the same time.

Then the 3 sources are merged and written back to main memory.
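One way to picture the three-source merge is sketched below. The slide does not state the merge priority, so the ordering here is an assumption: flushed Bcache data, if any, supersedes the copy read from memory, and the bytes actually written by the PCI master supersede both. The function name and the byte-enable list are hypothetical.

```python
def merge_dma_write(mem_buf, flush_buf, pci_buf, pci_byte_enables):
    """Merge the three DMA buffers into the block written back to memory.

    Assumed priority (not confirmed by the slide): PCI > Flush > MEM.
    flush_buf is None when the Bcache had nothing to flush for this block.
    """
    # Start from the newest full copy of the block.
    merged = bytearray(flush_buf if flush_buf is not None else mem_buf)
    # Overlay only the bytes the PCI master actually wrote.
    for i, enabled in enumerate(pci_byte_enables):
        if enabled:
            merged[i] = pci_buf[i]
    return bytes(merged)

mem = bytes([0x00] * 8)
flush = bytes([0x11] * 8)
pci = bytes([0x22] * 8)
enables = [True] * 4 + [False] * 4
print(merge_dma_write(mem, flush, pci, enables).hex())
```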




21164 Read Transaction

If the read hits in the Bcache, no memory access is required.

[Figure: read miss path between the 21164/Bcache (SYS) and memory (MEM). HIT!! Read data goes back to the CPU.]




21164 Read Miss

If a read does not hit in the Bcache, a memory access is involved.

[Figure: Miss!! A command goes from the 21164 to the 21172 CIA, and a command goes on to the 21172 BA. Read data moves from memory (MEM) to the system bus (SYS) and back to the CPU.]




Read Miss With Victim

Two scenarios
– write data with a different address tag into a valid cache line (write allocate!!)
– read data with a different address tag into a valid cache line (read allocate!!)

[Figure: Miss!! Commands pass between the 21164, the 21172 CIA, and the DSW. Write data moves from the system bus (SYS) toward memory (MEM), and the data is merged.]

"Read missed block" and "Write victim block" are indivisible in the logic design.
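The two scenarios can be reproduced with a toy model of a direct-mapped, write-back, write-allocate cache. This is a sketch under stated assumptions, not the 21164 or 21172 logic: the class and the returned strings are made up, and the model treats a replaced line as a victim only when it is dirty, which is the usual write-back convention rather than something the slide spells out.

```python
class DirectMappedCache:
    """Toy direct-mapped, write-back, write-allocate cache (tags only)."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = [None] * num_lines  # each entry: [tag, dirty] or None

    def access(self, addr, is_write):
        block = addr // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        line = self.lines[index]
        if line is not None and line[0] == tag:
            line[1] = line[1] or is_write   # a write hit marks the line dirty
            return "hit"
        # Miss: the line is allocated for reads and for writes alike.
        # Assumption: the old line must be written back only if it is dirty.
        victim = line is not None and line[1]
        self.lines[index] = [tag, is_write]
        return "miss with victim" if victim else "miss"

cache = DirectMappedCache(num_lines=4, line_size=64)
print(cache.access(0, True))     # write allocate into an empty line
print(cache.access(0, False))    # same block again
print(cache.access(256, False))  # same index, different tag, old line dirty
```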




Traffic Jam on MEM bus

[Figure: the DSW data paths diagram (DMA buffer sets 0 and 1 with PCI, MEM, and Flush buffers; SYS; instruction queue; IO paths not shown), with the parts that use the MEM bus circled. Callouts on the circled parts: "Cause read miss with victim" and "Cause read miss".]

Let’s think about this scenario: during the PCI DMA transfer, memory READs and WRITEs are happening at the same time.

All the circled parts compete for this resource.

Don’t forget that instruction fetch uses memory too.




How Fast can DMA be?

[Figure: DMA buffer sets 0 and 1 with PCI and Flush buffers, and SYS]

2 fetches and 2 writes to memory per DMA
– 64 bytes / 240 ns = 266 Mbytes/s (60 ns DRAM, 256-bit bus)
– 8 bytes / 30 ns = 266 Mbytes/s (33 MHz PCI, 64-bit bus)

33 MHz PCI has the same speed as DRAM!! Can we really do this??

Overhead, retries, read lines, read line with victim, and instruction fetch all share the same bandwidth!! It turns out that in the worst case, 17 MBytes/s is achieved, just above the bottom line.
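The two peak figures on this slide can be reproduced with a few lines of arithmetic. Reading 240 ns as four 60 ns DRAM accesses (2 fetches plus 2 writes) is an interpretation of the slide, and 30 ns is the slide's rounded cycle time for 33 MHz PCI. A Mbyte is taken as 10^6 bytes.

```python
def mbytes_per_s(num_bytes, time_ns):
    """Transfer rate in Mbytes/s, taking 1 Mbyte as 10**6 bytes."""
    # bytes per nanosecond times 1000 gives Mbytes per second
    return num_bytes / time_ns * 1000.0

DRAM_ACCESS_NS = 60          # from the slide: 60 ns DRAM
ACCESSES_PER_DMA = 2 + 2     # assumption: 2 fetches + 2 writes, one access each
dram_time_ns = DRAM_ACCESS_NS * ACCESSES_PER_DMA

dram_rate = mbytes_per_s(64, dram_time_ns)  # memory side
pci_rate = mbytes_per_s(8, 30)              # 64-bit PCI, one transfer per cycle

# Truncate to whole Mbytes/s, as the slide does.
print(int(dram_rate), int(pci_rate))
```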




Performance of the MB2PCI

Worst case
– 29.9 MBytes/s: no interference
– 25.5 MBytes/s: read line, instruction fetch
– 17.5 MBytes/s: read line, read line with victim, instruction fetch

Best case
– 95 MBytes/s: no interference
– 80 MBytes/s: read line, instruction fetch
– 72 MBytes/s: read line, read line with victim, instruction fetch
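To put these rates in perspective, the sketch below expresses each one as a percentage of the 266 Mbytes/s peak quoted on the "How Fast can DMA be?" slide. The rates are copied from this slide; the dictionary layout and the function name are illustrative only.

```python
PEAK_MBYTES_PER_S = 266.0  # peak figure quoted on the previous slide

# Rates as listed on this slide, in Mbytes/s.
REPORTED = {
    "worst case": [29.9, 25.5, 17.5],
    "best case": [95.0, 80.0, 72.0],
}

def percent_of_peak(rate, peak=PEAK_MBYTES_PER_S):
    """Return a rate as a percentage of the peak rate."""
    return 100.0 * rate / peak

for case, rates in REPORTED.items():
    for rate in rates:
        print(f"{case}: {rate} Mbytes/s is {percent_of_peak(rate):.1f}% of peak")
```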




Conclusion

If we want to improve
– Use a 256-bit cache block instead of 512-bit
– Is there a next version of the 21172 chip that supports a 512-bit memory bus?
– Are there DRAM chips faster than 60 ns?
– Can we afford a 64M Bcache (SRAM)?

There is a trade-off here: with a smaller block, the 21164 will generate more cache miss cycles and may slow down.

On the other hand, for a DMA transfer in which only 128 bits of data are transferred, there is no more 512-bit memory read overhead. There is only a 256-bit read now, which improves the worst-case performance.
