A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005


A Study of Cyclops64 Crossbar Architecture and Performance

Yingping Zhang
April, 2005

Overview

1. Background
2. Architecture of C64 Crossbar
3. Performance Simulation
4. Test Results
5. Performance Analysis
6. Conclusion
7. Future Work

Background

1. What is Cyclops64?

Cyclops64 (C64), also called Blue Gene/C, is part of the IBM Blue Gene project.

It is a supercomputer based on a cellular architecture. Each chip consists of 75 to 80 custom-designed 64-bit processors. Each processor will have two thread units, two integer units, and a floating-point unit.

C64 is expected to reach 1000 teraflops and to be one of the fastest supercomputers in the world.

The architecture was conceived by Cray award winner Monty Denneau. Verification testing and system software development are being done by our CAPSL group.

2. What is the project goal?

Study the architecture and performance of the C64 interconnection network, the crossbar (as part of verification testing).

Cyclops64 Chip

[Figure: block diagram of the Cyclops64 chip. Blocks shown: 80 C64 processors (each with two thread units, TU, and a floating-point unit, FP), 16 ICaches (one per group of 5 processors), the crossbar, the Host IF (FIFO, 64-bit x 64), the DDR2 SDRAM controller with its DDR2 SDRAM DIMMs, the ASw (a part of the 3D cube network) linking to the other C64 chips, and an FPGA block with Mickey tree, Gbit Ethernet, and disk interfaces (Mickey tree and Gbit Ethernet also with DMA).]

Crossbar port assignment:

• Ports 0-79: C64 processors
• Ports 80-83: mpg ICache
• Ports 84, 85: Host IF
• Ports 86-89: DRAM controller
• Ports 90-95: ASw

* The configuration pins are connected to all modules except the DDR and the crossbar.
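The port assignment on this slide can be written down as a small lookup table. The sketch below is only an illustration: the name `port_owner` and the label strings are made up for this example, and only the port ranges come from the slide.

```python
# Port ranges as listed on the chip diagram; the labels are illustrative.
PORT_RANGES = [
    (0, 79, "C64 processor"),
    (80, 83, "ICache"),
    (84, 85, "Host IF"),
    (86, 89, "DRAM controller"),
    (90, 95, "ASw"),
]


def port_owner(port: int) -> str:
    """Return the kind of module attached to a crossbar port (0-95)."""
    for low, high, name in PORT_RANGES:
        if low <= port <= high:
            return name
    raise ValueError(f"port {port} is outside the 96-way crossbar")


# The ranges together cover all 96 ports exactly once.
total_ports = sum(high - low + 1 for low, high, _ in PORT_RANGES)
```

For example, `port_owner(82)` falls in the ICache range and `port_owner(90)` in the ASw range.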

Architecture of C64 Crossbar

1. On-chip crossbar: provides communication inside a single chip.

2. 96-way crossbar: 96 input ports and 96 output ports. Each port can connect to any other port and to itself. Any communication among processors, ICaches, SRAM, DRAM, and ASwitches has to go through the crossbar.

3. Pipelined crossbar: 7 pipeline stages. When fully pipelined, each port delivers one packet per cycle, so the peak bandwidth of the crossbar per cycle = number of ports * length of the packet.
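The bandwidth formula can be turned into a short calculation. In the sketch below the packet length (64 bits) and the clock frequency (500 MHz) are placeholder values chosen only to make the arithmetic concrete. They are assumptions, not figures from the slides, and the function names are hypothetical.

```python
def peak_bandwidth_bits_per_cycle(ports: int, packet_bits: int) -> int:
    """Peak crossbar bandwidth per cycle: every port moves one packet."""
    return ports * packet_bits


def peak_bandwidth_gbytes_per_s(ports: int, packet_bits: int, clock_hz: float) -> float:
    """Same quantity converted to gigabytes per second for a given clock."""
    return peak_bandwidth_bits_per_cycle(ports, packet_bits) * clock_hz / 8 / 1e9


# ASSUMED values, for illustration only.
PORTS = 96                # from the slide: 96-way crossbar
ASSUMED_PACKET_BITS = 64  # hypothetical packet length
ASSUMED_CLOCK_HZ = 500e6  # hypothetical clock frequency

bits_per_cycle = peak_bandwidth_bits_per_cycle(PORTS, ASSUMED_PACKET_BITS)
gbytes_per_s = peak_bandwidth_gbytes_per_s(PORTS, ASSUMED_PACKET_BITS, ASSUMED_CLOCK_HZ)
```

With these assumed values the peak works out to 6144 bits per cycle; substituting the real packet length and clock gives the actual figure.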

Crossbar Architecture

[Figure: block diagram of the 96-port crossbar (Port# 96). Each port slice shows SrcSplit, SrcCtl, LC, FIFO, TarCtl, Arbiter, MUX (with Sel), and TarCombine blocks, plus TUnitA and TUnitB units, attached to the C64|MP|CORE, with Req and Ack signals between them. Signal widths labeled in the figure: 102+2, 102, 96, 95, 92, 10, 9, and 3.]

Crossbar Architecture

[Figure: block diagram of the 96-port crossbar (SrcSplit, SrcCtl, LC, FIFO, TarCtl, Arbiter, MUX, TarCombine, TUnitA, TUnitB; Port# 96), annotated with the numbers 1 to 7, matching the seven pipeline stages.]

Performance Simulation

1. Performance Measurement
   Latency: the time required for a packet to traverse the network from source to destination.
   Throughput: the rate at which packets are delivered by the network for a particular traffic pattern.

2. Workloads
   Synthetic: random distributed vs. Poisson distributed
   Application-driven: Hello_World, Matrix_Cthread, Laplace_Cthread, Heat_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset

3. Simulators
   Csim_crossbar
   LAST
   (Both designed by Fei Chen at CAPSL)
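The two metrics defined on this slide can be computed from a packet trace as sketched below. The `Packet` record and its field names are hypothetical and do not reflect the output format of Csim_crossbar or LAST. Normalizing throughput to packets per port per cycle is also an assumption, chosen so that a fully loaded port corresponds to a throughput of 1.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Packet:
    inject_cycle: int  # cycle in which the packet entered the crossbar
    eject_cycle: int   # cycle in which it left at the destination port


def average_latency(packets: Sequence[Packet]) -> float:
    """Mean number of cycles from source to destination."""
    return sum(p.eject_cycle - p.inject_cycle for p in packets) / len(packets)


def throughput(packets: Sequence[Packet], ports: int, cycles: int) -> float:
    """Delivered packets per port per cycle (assumed normalization)."""
    return len(packets) / (ports * cycles)


# Tiny hand-made trace: two packets at the 7-cycle minimum, one delayed.
trace = [Packet(0, 7), Packet(1, 8), Packet(2, 11)]
latency = average_latency(trace)
rate = throughput(trace, ports=3, cycles=10)
```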

Parameter Configuration

PARAMETERS
   Workloads
      Synthetic
         Temporal Characteristics [1]: Uniform Random, Poisson
         Spatial Distributions [2]: Uniform Random, Permutation (Neighbor & Tornado)
      Application-Driven Benchmarks
   Arbitration Schemes
      Uniformly Random
      Matrix
      Circular
      Segmented Matrix
      Fixed Priority

[1] Describes the generation probability of messages over time.
[2] Determines the communication paths between the sources and destinations.
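The spatial and temporal parameters can be illustrated with a few generator functions. The slides do not define the patterns, so the sketch below assumes the common textbook conventions: neighbor sends to port `src + 1`, tornado sends `ceil(n/2) - 1` ports ahead, "uniform random" in time means an independent injection decision per cycle, and Poisson means exponentially distributed gaps. All function names are hypothetical.

```python
import random
from typing import List


def neighbor_dest(src: int, n_ports: int) -> int:
    """Permutation traffic: each source sends to the next port (assumed definition)."""
    return (src + 1) % n_ports


def tornado_dest(src: int, n_ports: int) -> int:
    """Permutation traffic: send ceil(n/2) - 1 ports ahead (assumed definition)."""
    return (src + (n_ports + 1) // 2 - 1) % n_ports


def uniform_random_dest(n_ports: int, rng: random.Random) -> int:
    """Spatially uniform traffic: every port is equally likely."""
    return rng.randrange(n_ports)


def bernoulli_arrivals(rate: float, cycles: int, rng: random.Random) -> List[int]:
    """Cycles in which a source injects, deciding independently each cycle."""
    return [c for c in range(cycles) if rng.random() < rate]


def poisson_arrivals(rate: float, cycles: int, rng: random.Random) -> List[float]:
    """Arrival times of a Poisson process with `rate` > 0 packets per cycle."""
    times, t = [], rng.expovariate(rate)
    while t < cycles:
        times.append(t)
        t += rng.expovariate(rate)
    return times
```

Because neighbor and tornado map every source to a distinct destination, no two sources ever compete for the same output port.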

Test Results: Latency - Synthetic Workloads

• Latency of the uniform random pattern grows without bound when the injection rate exceeds 0.6.
• Latency of permutation traffic stays constant at 7 cycles.

Test Results: Throughput - Synthetic Workloads (Cont)

• Uniform workload with the permutation traffic pattern has linear throughput.
• The network is stable.

Test Results: Contention - Synthetic Workloads (Cont)

• Permutation traffic has zero contention.
• Uniform distribution has more contention than Poisson distribution.

Performance Analysis One - Synthetic Workloads

The minimum latency of the crossbar is 7 cycles.

The crossbar is a stable network because its throughput does not degrade beyond the saturation point.

Contention at the output ports delays message transfer; permutation traffic has zero contention.

The uniformly random workload with permutation traffic has the best performance: when the injection rate reaches 1.0, its throughput reaches 1.
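The link between output contention and latency can be seen in a toy model. The sketch below is not Csim_crossbar or LAST. It assumes one FIFO per input port, a random choice among inputs competing for the same output, and a fixed 7-cycle pipeline delay added to any queueing time. The function `simulate` and its parameters are hypothetical.

```python
import random
from collections import deque

PIPELINE_DEPTH = 7  # minimum latency reported in the slides


def simulate(n_ports, rate, dest_fn, cycles, seed=0):
    """Return (average latency, throughput per port per cycle, contention count)."""
    rng = random.Random(seed)
    fifos = [deque() for _ in range(n_ports)]  # one FIFO per input port (assumed)
    delivered = total_latency = conflicts = 0
    for cycle in range(cycles):
        # Injection: each source creates a packet with probability `rate`.
        for src in range(n_ports):
            if rng.random() < rate:
                fifos[src].append((dest_fn(src, n_ports, rng), cycle))
        # Collect head-of-queue requests per output port.
        requests = {}
        for src, fifo in enumerate(fifos):
            if fifo:
                requests.setdefault(fifo[0][0], []).append(src)
        # Each output grants one requester; the losers wait (contention).
        for srcs in requests.values():
            winner = rng.choice(srcs)
            conflicts += len(srcs) - 1
            _, born = fifos[winner].popleft()
            delivered += 1
            total_latency += PIPELINE_DEPTH + (cycle - born)
    avg = total_latency / delivered if delivered else float("nan")
    return avg, delivered / (n_ports * cycles), conflicts


def neighbor(src, n, rng):
    return (src + 1) % n   # permutation traffic


def uniform(src, n, rng):
    return rng.randrange(n)  # uniform random traffic


perm_result = simulate(16, 1.0, neighbor, 200)
rand_result = simulate(16, 1.0, uniform, 500)
```

In this model permutation traffic never sees two requests for one output, so it stays at the 7-cycle minimum with a throughput of 1 even at full load, while uniform random traffic loses grants to contention and saturates well below 1. The exact saturation point of the real crossbar depends on details this toy model does not capture.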

Test Results: Latency - Arbitration Schemes

• The Fixed Priority scheme is the worst case: its latency grows without bound at an injection rate of 0.5.
• The other schemes have very similar latency behavior.

Test Results: Throughput - Arbitration Schemes (Cont)

• The Fixed Priority scheme is the worst case: the network saturates at an injection rate of 0.5.
• The other schemes have very similar throughput behavior.

Performance Analysis Two - Arbitration Schemes

The SLRU, PLRU, CIRC, and RAND arbitration schemes show very similar performance under the uniformly random traffic pattern.

The Fixed Priority arbitration scheme shows the worst performance under the same conditions.
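One way to see why a fixed-priority arbiter can behave worse is to compare it with a circular arbiter on the same request pattern. The slides only name the schemes, so the sketch below uses generic textbook versions of the two policies. It is not the C64 hardware logic, and the names `fixed_priority_grant` and `CircularArbiter` are hypothetical.

```python
from typing import List, Optional


def fixed_priority_grant(requests: List[bool]) -> Optional[int]:
    """Always grant the lowest-numbered requesting input."""
    for i, wants in enumerate(requests):
        if wants:
            return i
    return None


class CircularArbiter:
    """Round-robin arbiter: the search starts just after the last winner."""

    def __init__(self, n_inputs: int) -> None:
        self.n_inputs = n_inputs
        self.start = 0

    def grant(self, requests: List[bool]) -> Optional[int]:
        for offset in range(self.n_inputs):
            i = (self.start + offset) % self.n_inputs
            if requests[i]:
                self.start = (i + 1) % self.n_inputs
                return i
        return None


# Four inputs that all request the same output on every cycle.
always = [True, True, True, True]
fixed_grants = [fixed_priority_grant(always) for _ in range(5)]
arbiter = CircularArbiter(4)
circular_grants = [arbiter.grant(always) for _ in range(5)]
```

Under constant requests the fixed-priority arbiter grants input 0 every time, so the other inputs never advance, while the circular arbiter rotates through all four inputs.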

Test Results - Application-Driven Benchmarks

Application      Number of   Forward        Reverse        Forward           Reverse
                 Packets     Latency (Avg)  Latency (Avg)  Throughput (Avg)  Throughput (Avg)
Hello_World      5110        7.35           19.74          0.002             0.002
Heat_Cthread     7975863     46.00          4034.00        0.002             0.001
Matrix_Cthread   110218      21.59          939.00         0.002             0.002
Cnet_get_nb      10162       7.538          53.552         0.001             0.002
Cnet_put_nb      10052       7.619          50.027         0.001             0.002
Dev_Align        8924        7.286          37.381         0.002             0.002
Dev_Reset        10148       7.617          50.413         0.001             0.002

• Average reverse latency increases very quickly as the number of packets increases.
• Forward and reverse traffic have different latency behavior.

Performance Analysis - Application-Driven Benchmarks

The C64 architecture classifies traffic into:

• Class 0 (forward traffic): messages sent out from processors, such as load requests and stores.
• Class 1 (reverse traffic): messages sent back to processors, such as load returns.

Reverse transfer delay is much larger than forward transfer delay.

Forward and reverse transfers have similar throughput.
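The gap between the two traffic classes can be quantified from the benchmark table as a reverse-to-forward latency ratio. The sketch below copies the average latencies from that table; the helper name `reverse_to_forward_ratio` is made up for this example.

```python
# (average forward latency, average reverse latency) from the results table
LATENCY = {
    "Hello_World": (7.35, 19.74),
    "Heat_Cthread": (46.00, 4034.00),
    "Matrix_Cthread": (21.59, 939.00),
    "Cnet_get_nb": (7.538, 53.552),
    "Cnet_put_nb": (7.619, 50.027),
    "Dev_Align": (7.286, 37.381),
    "Dev_Reset": (7.617, 50.413),
}


def reverse_to_forward_ratio(app: str) -> float:
    """How many times larger the reverse latency is than the forward latency."""
    forward, reverse = LATENCY[app]
    return reverse / forward


# Benchmarks ordered from the largest to the smallest ratio.
ranked = sorted(LATENCY, key=reverse_to_forward_ratio, reverse=True)
for app in ranked:
    print(f"{app:15s} {reverse_to_forward_ratio(app):8.1f}x")
```

The ratio is above 1 for every benchmark and is largest for the two benchmarks with the most packets, Heat_Cthread and Matrix_Cthread.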

Conclusion

For Synthetic Workloads

Verified:
• The C64 crossbar is a stable network.
• The minimum latency of the C64 crossbar is 7 cycles.

Discovered:
• The traffic pattern, including temporal characteristics and spatial distribution, has a significant effect on crossbar performance.
• Permutation spatial traffic has the best latency behavior: it stays at the minimum latency of 7 cycles because it has zero contention.
• Uniform random distributed workload has better throughput behavior.
• The Fixed Priority arbitration scheme has the worst performance; the others are very similar.

For Application-Driven Workloads

Discovered:
• Forward and reverse traffic have different latency behavior but similar throughput behavior.
• Reverse traffic has worse latency behavior than forward traffic.

Future Work

Synthetic Workloads
• Investigate arbitration schemes under different traffic patterns.

Application-Driven Workloads
• Investigate the performance behavior of the C64 crossbar under different configuration constraints:
   - Number of thread units used
   - Number of memory banks involved
• Investigate the performance behavior of the C64 crossbar under different arbitration schemes.

Summary of Performance Analyses

Documentation

Acknowledgements

Fei Chen
Yuhei
Dimitri
Joseph
Ted
Prof. Gao
All people in the CAPSL group

Questions?

Thanks!