Slide #1 Friday, October 10, 1997
Alpha 21172 Core Logic Chip Set Jerry Huang
Alpha 21172 Inside Out
Zhihui Huang (Jerry)
University of Michigan
Slide #2 Friday, October 10, 1997
Components
One 21172-CA chip
– Control, I/O, address chip (CIA)
– 388 pins, plastic ball grid array (PBGA)
Four 21172-BA chips
– Data switch chip (DSW)
– 208 pins, plastic quad flat pack (PQFP)
Slide #3 Friday, October 10, 1997
Data Paths
64-bit data path between CIA and DSW
– iod
128-bit data path between 21164 and DSW
– cpu_dat
256-bit memory data path between DSW and memory
– mem_dat
The slowest part has the widest bus
Slide #4 Friday, October 10, 1997
3-way Interface
[Figure: block diagram of the 21164, the 21172, and DRAM 1 through DRAM 8. Labels: DSW0, DSW1, DSW2, DSW3; 64-bit PCI bus; 64-bit IOD bus; addr<39:4>; RAS; CAS; memadr<11:0>; control.]
Slide #5 Friday, October 10, 1997
Memory
[Figure: DRAM 1 through DRAM 8, populated as SIMM 1 through SIMM 8.]
The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.
4 SIMMs (SIMM 1 to SIMM 4) fill a data bus of 128 bits.
8 SIMMs (SIMM 1 to SIMM 8) fill a data bus of 256 bits.
Needs a jumper
Slide #6 Friday, October 10, 1997
Memory block
[Figure: DRAM 1 through DRAM 8 connected to DSW0, DSW1, DSW2, and DSW3 over the 256-bit memory bus. Slice labels on the lower 128 bits: 15:0, 31:16, 47:32, 63:48, 79:64, 95:80, 111:96, 127:112.]
A 256-bit block is composed of bit slices across all 8 SIMMs. The slices are interleaved across the 4 DSWs.
As you just saw, the 4 DSWs together provide the lower 128 bits of the memory bus. For the 256-bit configuration, the DSWs also provide the upper part of the bus.
It is better to use the 256-bit configuration; otherwise you pay the full price for the DSWs and use only half of their resources.
It may be clear now why this is a one-bank scheme in which all the SIMMs have the same size.
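The slice labels in the figure can be generated with a short sketch. It assumes that each of the 8 SIMMs contributes one 16-bit slice to each 128-bit half of the block, and that the upper half is sliced the same way as the lower half. The second point is an assumption, since the slide labels only the lower half. The name `slice_ranges` is hypothetical.

```python
def slice_ranges(num_simms=8, slice_bits=16, base=0):
    """Return (high_bit, low_bit) for the slice each SIMM contributes.

    Assumption: slices are contiguous and equal in width, as the
    labels 15:0, 31:16, ... on the slide suggest.
    """
    return [
        (base + slice_bits * (i + 1) - 1, base + slice_bits * i)
        for i in range(num_simms)
    ]


lower_half = slice_ranges()          # bits 127:0
upper_half = slice_ranges(base=128)  # bits 255:128 (assumed layout)

for simm, (hi, lo) in enumerate(lower_half, start=1):
    print(f"SIMM {simm}: {hi}:{lo}")
```

Under these assumptions the 8 slices tile each 128-bit half with no gaps, which is why every SIMM has to be present and the same size.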
Slide #7 Friday, October 10, 1997
Bcache and Memory
3rd-level cache for the 21164
Attributes
– optional, external, physical, synchronous SRAM
– direct-mapped, write-back, write-allocate
256-bit or 512-bit block
Cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
Supports up to 512 MB of memory
– 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36
Write-back: a cache architecture in which data is only written to main memory when it is forced out of the cache. Opposite of write-through.
Block size: the Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All Bcache and main memory FILLs or write transactions are of the selected block size.
Direct-mapped: a cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n, then the bottom n address bits correspond to an offset within a cache entry. If the cache can hold 2^m entries, then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry.
In this scheme, there is no choice of which block to flush on a cache miss, since there is only one place for any block to go. This simple scheme has the disadvantage that if the program alternately accesses different addresses that map to the same cache location, it will suffer a cache miss on every access to these locations. This kind of cache conflict is quite likely on a multi-processor.
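The address split described above can be sketched in a few lines. The function name `split_address` and the example parameters are illustrative assumptions: a 64-byte line (n = 6) and a 2 Mbyte cache, which gives 2^15 entries (m = 15). The slide lists several cache sizes and does not single out this one.

```python
def split_address(addr, line_bits, index_bits):
    """Split a physical address into (tag, index, offset).

    line_bits  = n, where the line size is 2**n bytes
    index_bits = m, where the cache holds 2**m lines
    """
    offset = addr & ((1 << line_bits) - 1)
    index = (addr >> line_bits) & ((1 << index_bits) - 1)
    tag = addr >> (line_bits + index_bits)
    return tag, index, offset


# Hypothetical configuration: 64-byte lines, 2 Mbyte cache.
N, M = 6, 15
a = split_address(0x0000_1040, N, M)
b = split_address(0x0020_1040, N, M)  # same address plus 2 Mbytes
print(a, b)
```

The two addresses land on the same index with different tags, so alternating between them would evict the line on every access. This is the conflict case the definition warns about.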
Write-allocate: a cache line is allocated when write data misses the cache.
In the PC164
ECC protected
Slide #8 Friday, October 10, 1997
PCI features
Supports 64-bit PCI bus width
Supports 64-bit PCI addressing (DAC cycles)
Accepts PCI fast back-to-back cycles
– addr, data0, data1, data2, ..., addr_again!
– Frame# is deasserted for only one cycle to allow the last transfer to finish
Issues PCI fast back-to-back cycles in dense address space
[Figure: timing diagram showing clk, Frame#, and an addr, data, data sequence.]
Slide #9 Friday, October 10, 1997
CIA Transactions
21164 memory read miss
21164 memory read miss with victim
21164 I/O read
21164 I/O write
DMA read
DMA read (prefetch)
DMA write
Slide #10 Friday, October 10, 1997
DSW Data Paths
[Figure: DSW data paths among the 21164, the Bcache, and memory. Labels: SYS, MEM, IOD, PCI, Flush, DMA 0, DMA 1, Victim Path, Read Miss Path, Instruction Queue. IO paths not shown.]
Slide #11 Friday, October 10, 1997
DSW Buffers
DMA Buffer Sets (0 and 1)
– PCI buffer for PCI DMA write data
– Memory buffer for memory data
– Flush buffer for system bus data
[Figure: DMA 0 and DMA 1 buffer sets. Labels: IOD, MEM, PCI, Flush.]
Slide #12 Friday, October 10, 1997
DMA Writes
Data arrives in the PCI buffer
Memory buffer loaded at the same time
Bcache line flushed and Flush buffer loaded
3 sources merged and data written back to memory
[Figure: DMA 0 buffer set with the memory, the 21164, and the Bcache. Labels: IOD, MEM, PCI, Flush.]
As you just saw, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the MEM buffer to be loaded from memory, and the Flush buffer to be loaded from the system bus, all at the same time.
Then the 3 sources are merged and written back to main memory.
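The merge step can be illustrated with a toy model. The slides say only that the three sources are merged. The priority order used here is an assumption made for the sketch: PCI write data overrides the flushed Bcache line, which overrides the copy read from memory. The 8-byte line and the name `merge_dma_write` are also hypothetical.

```python
def merge_dma_write(mem_line, flush_line, pci_bytes):
    """Merge the three DMA-write sources into one line.

    mem_line   : bytes read from main memory (MEM buffer)
    flush_line : bytes flushed from the Bcache (Flush buffer), or None
                 if the line was not in the cache
    pci_bytes  : {offset: value} for bytes written by the PCI device

    Assumed priority: PCI data > flushed cache data > memory data.
    """
    base = flush_line if flush_line is not None else mem_line
    merged = bytearray(base)
    for offset, value in pci_bytes.items():
        merged[offset] = value
    return bytes(merged)


mem_line = bytes([0x00] * 8)
flush_line = bytes([0x11] * 8)
result = merge_dma_write(mem_line, flush_line, {0: 0xAA, 1: 0xBB})
print(result.hex())
```

In this model a partial-line DMA write leaves the untouched bytes at their most recent value, whether that value was in the cache or in memory.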
Slide #13 Friday, October 10, 1997
21164 Read Transaction
If the read hits in the Bcache, no memory access is required.
[Figure: 21164, Bcache, and memory with the Read Miss Path (SYS, MEM). Labels: HIT!!, Read data, Data back to CPU.]
Slide #14 Friday, October 10, 1997
21164 Read Miss
If a read does not hit in the Bcache, a memory access is involved.
[Figure: 21164, Bcache, and memory with the Read Miss Path (SYS, MEM). Labels: Miss!!, 21172 CIA, Command, 21172 BA, Read data, Data back to CPU.]
Slide #15 Friday, October 10, 1997
Read Miss With Victim
Two scenarios
– write data with a different address tag into a valid cache line (write allocate!!)
– read data with a different address tag into a valid cache line (read allocate!!)
[Figure: 21164, Bcache, and memory with the Read Miss Path (SYS, MEM) and the Victim Path. Labels: Miss!!, 21172 CIA, Command, Write data, Merge data.]
Read Missed Block and Write Victim Block are indivisible in the logic design.
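The difference between a plain read miss and a read miss with victim can be shown with a toy direct-mapped, write-back, write-allocate cache. This is a sketch and not the 21164 or 21172 logic. It assumes a victim is written back only when the evicted line is dirty, and it uses a hypothetical 16-line cache with 64-byte lines.

```python
class ToyBcache:
    """Direct-mapped, write-back, write-allocate cache model (tags only)."""

    def __init__(self, line_bits=6, index_bits=4):
        self.line_bits = line_bits
        self.index_bits = index_bits
        self.lines = {}  # index -> (tag, dirty)

    def access(self, addr, is_write):
        index = (addr >> self.line_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.line_bits + self.index_bits)
        entry = self.lines.get(index)
        if entry is not None and entry[0] == tag:
            # Hit: a write marks the line dirty, nothing goes to memory.
            self.lines[index] = (tag, entry[1] or is_write)
            return "hit"
        # Miss: the block is fetched for reads and for writes (allocate).
        victim_needed = entry is not None and entry[1]
        self.lines[index] = (tag, is_write)
        return "read miss with victim" if victim_needed else "read miss"


cache = ToyBcache()
trace = [
    cache.access(0x0000, is_write=False),  # cold miss
    cache.access(0x0000, is_write=True),   # hit, line becomes dirty
    cache.access(0x0400, is_write=False),  # same index, new tag, dirty victim
    cache.access(0x0800, is_write=False),  # same index, clean line replaced
]
print(trace)
```

The third access is the case on this slide: the fill of the missed block and the write-back of the victim happen as one unit.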
Slide #16 Friday, October 10, 1997
Traffic Jam on MEM bus
[Figure: DSW data paths among the 21164, the Bcache, and memory. Labels: SYS, MEM, IOD, PCI, Flush, DMA 0, DMA 1, Victim Path, Read Miss Path, Instruction Queue. IO paths not shown. Callouts: "Cause read miss with victim", "Cause read miss".]
Let's think about this scenario: during a PCI DMA transfer, memory READs and WRITEs are happening at the same time.
All the circled parts compete for this resource.
Don't forget that instruction fetch uses memory too.
Slide #17 Friday, October 10, 1997
How Fast can DMA be?
[Figure: DMA 0 and DMA 1 buffer sets. Labels: SYS, IOD, MEM, PCI, Flush.]
2 fetches and 2 writes to memory/DMA
– 64 bytes / 240 ns = 266 Mbytes/s
– 8 bytes / 30 ns = 266 Mbytes/s
60 ns DRAM, 256-bit bus
33 MHz PCI, 64-bit bus
33 MHz PCI has the same speed as DRAM!! Can we really do this??
Overhead, retries, read lines, read lines with victim, and instruction fetch all share the same bandwidth!! It turns out that in the worst case 17 MBytes/s is achieved, just above the bottom line.
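The two peak figures on this slide can be checked with a few lines of arithmetic. The sketch assumes that the 240 ns comes from 4 memory accesses (2 fetches and 2 writes) at 60 ns each, and it uses the slide's rounded 30 ns PCI clock period. The name `mbytes_per_s` is hypothetical.

```python
def mbytes_per_s(num_bytes, nanoseconds):
    """Convert bytes moved in a given time to 10**6 bytes per second."""
    return num_bytes / nanoseconds * 1000


DRAM_ACCESS_NS = 60      # 60 ns DRAM, from the slide
ACCESSES_PER_DMA = 4     # assumed: 2 fetches + 2 writes
PCI_CYCLE_NS = 30        # 33 MHz PCI, rounded as on the slide

mem_peak = mbytes_per_s(64, ACCESSES_PER_DMA * DRAM_ACCESS_NS)
pci_peak = mbytes_per_s(8, PCI_CYCLE_NS)

print(f"memory side: {mem_peak:.1f} Mbytes/s")
print(f"PCI side:    {pci_peak:.1f} Mbytes/s")
```

Both come out at about 266.7 Mbytes/s, which the slide truncates to 266. These are peak rates with no overhead, so they are an upper bound and not an expected throughput.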
Slide #18 Friday, October 10, 1997
Performance of the MB2PCI
Worst case
– 29.9 MBytes/s (no interference)
– 25.5 MBytes/s (read line, instruction fetch)
– 17.5 MBytes/s (read line, read line with victim, instruction fetch)
Best case
– 95 MBytes/s (no interference)
– 80 MBytes/s (read line, instruction fetch)
– 72 MBytes/s (read line, read line with victim, instruction fetch)
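To put these numbers in context, the sketch below expresses each one as a fraction of the 64 bytes / 240 ns peak from the previous slide. Pairing each figure with a traffic condition follows the order in which they appear on the slide, which is an assumption about the original layout.

```python
PEAK_MBYTES_PER_S = 64 / 240 * 1000  # about 266.7, from the previous slide

# (condition, worst case, best case) in MBytes/s, in slide order.
measured = [
    ("no interference", 29.9, 95.0),
    ("read line, instruction fetch", 25.5, 80.0),
    ("read line, read line with victim, instruction fetch", 17.5, 72.0),
]

fractions = [
    (name, worst / PEAK_MBYTES_PER_S, best / PEAK_MBYTES_PER_S)
    for name, worst, best in measured
]

for name, worst_frac, best_frac in fractions:
    print(f"{name}: worst {worst_frac:.1%}, best {best_frac:.1%} of peak")
```

Even the best case with no competing traffic reaches only about a third of the theoretical peak, and the most contended worst case is below a tenth of it.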
Slide #19 Friday, October 10, 1997
Conclusion
If we want to improve
– use a 256-bit cache block instead of a 512-bit one
– Will a next version of the 21172 chip support a 512-bit memory bus?
– Are there DRAM chips faster than 60 ns?
– Can we afford a 64M Bcache (SRAM)?
There is a trade-off here: with a smaller block, the 21164 will generate more cache miss cycles and may slow down.
On the other hand, for a DMA transfer in which only 128 bits of data are transferred, there is no longer a 512-bit memory read overhead; there is only a 256-bit read. This improves the worst-case performance.