
IEEE Workshop on HSLN, 16 Nov 2004

Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications

David Bueno, Adam Leko, Chris Conger,

Ian Troxel, and Alan D. George

High-Performance Networking (HPN) Group

HCS Research Laboratory

University of Florida


Presentation Outline

Introduction
Background
  RapidIO Technology Overview
  Ground Moving Target Indicator (GMTI) Overview
Simulation and Modeling Environment
Experimental Setup
Results
Conclusions


Introduction

Big impetus to provide more processing power on-board satellites
  More powerful radiation-hardened components available
  Strive to reduce downlink requirements

Today’s satellite systems typically built around an expensive custom or COTS bus interconnect (e.g. cPCI, VME)
  Scalability, bandwidth, and latency limitations
  Solutions tend to be “one-off” designs with much non-recurring engineering

High on-board data rate requirements and desire to reduce custom design make COTS-based embedded networks an attractive solution

Image courtesy http://www.afa.org

RapidIO (RIO) a leading contender
  High-performance, switched embedded interconnect
  Scales to connect many nodes
  Better bisection bandwidth than bus-based technologies
  Less “hand-coded” synchronization and arbitration required

Example: A 64-bit, 33 MHz cPCI bus provides ~2 Gbps of throughput, while a single 8-bit DDR 250 MHz RapidIO endpoint provides ~4 Gbps of throughput. Even a modest RIO system of such links provides tens of gigabits per second of aggregate throughput with many non-blocking links.
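The two figures in the example follow from width × clock × transfers per clock. A minimal sketch of that arithmetic, assuming raw peak rates with no protocol overhead (the function name is illustrative, not from the presentation):

```python
def peak_throughput_gbps(width_bits, clock_mhz, ddr=False):
    """Raw peak rate of a parallel link or bus, ignoring all protocol overhead."""
    transfers_per_clock = 2 if ddr else 1  # DDR moves data on both clock edges
    return width_bits * clock_mhz * transfers_per_clock / 1000.0

cpci_gbps = peak_throughput_gbps(64, 33)            # 64-bit, 33 MHz cPCI bus
rio_gbps = peak_throughput_gbps(8, 250, ddr=True)   # 8-bit DDR 250 MHz RapidIO link

print(f"cPCI: {cpci_gbps:.3f} Gbps, RapidIO: {rio_gbps:.1f} Gbps")
```

The bus figure is shared by every device on the bus, whereas each switched RapidIO link offers its rate independently, which is where the aggregate advantage comes from.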


Background – RapidIO

Relatively new technology, with limited research to date
  White paper out January 2002
  First specification document published June 2002

A set of formal specifications, published by RapidIO Trade Organization (RTO)

Support from many companies
  Motorola, IBM, TI, Xilinx, Lucent, Agere, Analog Devices, Ericsson, Altera, among others

Several are offering products
  Xilinx, Motorola, Redswitch, and Praesum

RapidIO Specification Documents (c/o RTO)

Document                   Description
Multicast Specification    Defines a method for RIO switch-based multicast
Streaming Specification    Defines a method for protocol-independent encapsulation of payloads up to 64K bytes
system bringup spec.pdf    Provides standard approaches for RapidIO system bring-up, device enumeration, routing table management, software and hardware abstraction layers and APIs
fcspec.pdf                 Flow control spec. rev1.0
errata1.pdf                Revision 1.2 Errata 1
hipspec.pdf                HIP doc. rev1.0
errspec.pdf                Error spec. rev1.2
RapidIO.pdf                Main spec. rev1.2
serial.book.pdf            Serial spec. rev1.2
inter-op.pdf               Inter-operability spec. rev1.2
oview.pdf                  Spec. overview rev1.2
gsmlspec.pdf               GSM spec. rev1.2


Background – RapidIO

Three-layered, embedded system interconnect architecture
  Logical – memory mapped I/O, message passing, and global shared memory
  Transport – routing based on packet destination ID
  Physical – serial and 8- or 16-bit parallel at 250, 500, or 1000 MHz
Point-to-point, packet-switched interconnect
Targeted for inter-processor and inter-board embedded interconnects
Peak single-link throughput ranging from 2 to 64 Gb/s
Focus on 16-bit parallel LVDS RIO implementation for satellite systems

Image courtesy G. Shippen, “RapidIO Technical Deep Dive 1: Architecture & Protocol,” Motorola Smart Network Developers Forum, 2003.


Background – RapidIO

Uses Low-Voltage Differential Signaling (LVDS) to minimize power
Employs fabrics in form of Multistage Interconnection Networks (MINs) to allow communication between arbitrary devices
Two types of packets: control and data
Message-passing logical layer
  Provides traditional message-passing interface with mailbox-style delivery
  Request and response messages between endpoints
  Supports 26 message priorities with segmenting up to 4096B

[Figure: (a) Receiver- and (b) Transmitter-controlled flow control. In (a), the transmitter sends Write 0 to Write 4, the receiver returns Ack 0, Ack 1, and Rtry 2, and the transmitter resends Write 2 to Write 4. In (b), each Ack or Idle from the receiver reports the number of buffers available.]

Physical layer
  Only supports 4 priority levels
  Error detection
    Supported directly via CRC for regular packets
    Inverted bitwise replication of symbols for control packets
  Error recovery accomplished via Go-Back-N sliding window retransmission of damaged packets
  Transmitter or receiver flow control supported at link level
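The retry exchange in the receiver-controlled case can be sketched as a Go-Back-N loop. This is a simplified illustration, not the presentation's simulation model: it assumes the transmitter sends a full window before it sees any response, and that a rejected packet is accepted on its second attempt. The function and parameter names are hypothetical.

```python
def go_back_n(num_packets, window, reject_once):
    """Return the order in which packet IDs appear on the link.

    reject_once: IDs the receiver refuses (e.g. no free buffer) the first
    time they arrive; the transmitter then goes back and resends from there.
    """
    pending_rejects = set(reject_once)
    on_link = []
    base = 0  # oldest packet not yet accepted by the receiver
    while base < num_packets:
        burst = list(range(base, min(base + window, num_packets)))
        on_link.extend(burst)
        for pkt in burst:
            if pkt in pending_rejects:
                pending_rejects.discard(pkt)  # assumed to succeed on the retry
                break                         # pkt and all later packets are dropped
            base = pkt + 1                    # accepted in order
    return on_link

print(go_back_n(5, 5, {2}))
```

With five writes and a retry on packet 2, the link carries packets 3 and 4 twice, which is the wasted transmission that transmitter-controlled flow control avoids by tracking free receiver buffers.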


Background – GMTI

Space-based RADAR: GMTI detects and tracks moving targets on ground
  Important use in military applications
  Typified by large data sets and high computation requirements
Algorithm decomposed into multiple sub-tasks
Incoming data set viewed as 3-dimensional “data cube”
  Size of each cube dictated by Coherent Processing Interval (CPI)
Each task has an ideal dimension for partitioning and processing
  If partitioned along optimum dimension for a particular task, no inter-processor communication necessary during processing of that task
  Data reorganized in-between tasks if necessary by performing a corner-turn
Size of resulting data is orders of magnitude smaller than incoming data
  Completing processing on-board greatly reduces amount of downlink throughput required from satellite to Earth

[Figure: GMTI algorithm flow and processing task breakdown. Stages: Receive Cube, Pulse Compression, Doppler Processing, Space-Time Adaptive Processing (STAP), Constant False Alarm Rate (CFAR), Send Results. Annotations: Corner Turn, Partitioned along range dimension, Partitioned along pulse dimension.]

[Figure: Data cube dimensions (beams, ranges, pulses)]
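A corner turn is a distributed transpose: every processor sends a distinct block to every other processor. The sketch below illustrates the data movement on a single 2-D slice of the cube, assuming the column count divides evenly among processors; the list-of-lists layout and the function name are illustrative only.

```python
def corner_turn(slabs):
    """slabs[i] holds the rows owned by processor i (row partitioning).

    Returns a list where entry j holds the whole columns owned by
    processor j after the turn (column partitioning).
    """
    p = len(slabs)
    cols_per_proc = len(slabs[0][0]) // p  # assumes even divisibility
    result = []
    for j in range(p):
        lo, hi = j * cols_per_proc, (j + 1) * cols_per_proc
        # Receiver j gathers one personalized block from every sender i.
        gathered = [row[lo:hi] for i in range(p) for row in slabs[i]]
        # Transpose so that each column is contiguous on its new owner.
        result.append([list(col) for col in zip(*gathered)])
    return result

# 4 x 4 slice whose element value encodes (row, column) as 10*row + column
slice_2d = [[10 * r + c for c in range(4)] for r in range(4)]
turned = corner_turn([slice_2d[0:2], slice_2d[2:4]])
print(turned[0])
```

Each processor keeps only 1/P of its own data, so roughly (P − 1)/P of the cube crosses the network in every corner turn.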


GMTI – Parallel Partitioning

Straightforward partitioning
  Entire system works in parallel on a single data cube
  All-to-all personalized communication used to perform corner-turn
  Result latency must be ≤ 1 CPI
Staggered partitioning
  Processors work in small groups, one data cube for each group
  Incoming data cubes are sent to groups via round-robin distribution
  Result latency must be ≤ N × CPI
    N = number of processor groups
Pipelined partitioning
  Processors work in small groups, each group responsible for one stage of algorithm
  Corner turns “for free”
  Result latency must be ≤ N × CPI
    N = number of stages

[Figure: Timing diagrams for straightforward, pipelined (PC, DP, STAP, CFAR stages), and staggered partitioning, showing PE #1 to PE #4 processing Data Cube 0 to Data Cube 5 over CPI 0 to CPI 4]
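The latency bounds for the three partitionings can be written down directly. A minimal sketch, using the 256 ms CPI quoted later in the presentation; the function names and the choice of four groups or stages are illustrative assumptions.

```python
CPI_MS = 256  # coherent processing interval used in the experiments

def latency_deadline_ms(scheme, cpi_ms=CPI_MS, groups=1, stages=1):
    """Upper bound on result latency for each partitioning scheme."""
    if scheme == "straightforward":
        return cpi_ms            # whole system finishes each cube within 1 CPI
    if scheme == "staggered":
        return groups * cpi_ms   # N = number of processor groups
    if scheme == "pipelined":
        return stages * cpi_ms   # N = number of pipeline stages
    raise ValueError(f"unknown scheme: {scheme}")

def staggered_group(cube_index, groups):
    """Round-robin assignment of incoming data cubes to processor groups."""
    return cube_index % groups

print(latency_deadline_ms("staggered", groups=4))
print([staggered_group(i, 3) for i in range(6)])
```

All three schemes must still sustain one cube per CPI on average; only the time allowed for an individual result differs.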


Simulation & Modeling Environment

Modeling library created using Mission Level Designer (MLD), a commercial discrete-event simulation modeling tool from MLDesign Technologies
  C++-based, block-level, hierarchical modeling tool
Algorithm modeling accomplished via custom C++ primitives
  Created different processor models for different phases of the algorithm
  Processor model approximates vector DSP processor
Our model library includes:
  RIO central-memory switch
  Compute node with RIO endpoint
  GMTI traffic source/sink
  RIO logical message-passing layer
  Transport and parallel physical layers

[Figure: Model of Compute Node with RIO Endpoint]


RapidIO Models

Key features of Endpoint model
  Message-passing logical layer
  Transport layer
  Parallel physical layer
  Transmitter- and receiver-controlled flow control
  Error detection and recovery
  Priority scheme for buffer management
  Adjustable link speed and width
  Adjustable priority thresholds and queue lengths
Key features of Central-memory switch model
  Selectable cut-through or store-and-forward routing
  TDM model for memory access (approximated with average delay)
  Adjustable priority thresholds based on free switch memory
  Adjustable link rates, etc. similar to endpoint model

[Figure: Model of RIO Central-Memory Switch]


System Models

High throughput requirements for data source and data redistribution in pipelined partitioning require non-blocking connectivity between all nodes and data sources
Custom network topologies created for 8-, 12-, 16-, and 24-processor systems
Network topologies favored communication patterns of pipelined partitioning scheme
  Most communication-intensive partitioning scheme
  Algorithm performance of other schemes not sensitive to topologies
24-node topology shown below (others similar)

[Figure: 24-node topology with Data Source. Grey = Switch, Red = Pulse compression node, Blue = Doppler node, Green = STAP node, Orange = CFAR node]


Simulation Experiment Setup

Built system models with 8, 12, 16, 24 compute nodes
For each experiment, 8 CPIs worth of data sent to processors and processed
Key simulation parameters
  16-bit parallel RapidIO
  250 MHz DDR clock rate
  4.6 Gb/s incoming GMTI data rate
  10 KB switch central memory size
  Cut-through routing on/off
  Transmitter- or receiver-controlled flow control
Key simulation outputs
  CPI completion latency
  Average packet latency
  System and application bandwidth
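These parameters imply a data volume per CPI and a raw link capacity. The sketch below works through that arithmetic, assuming data arrives continuously at the stated rate, taking the 256 ms CPI quoted later in the presentation, and using decimal megabytes; the variable names are illustrative.

```python
RATE_MBPS = 4600   # 4.6 Gb/s incoming GMTI data rate
CPI_MS = 256       # CPI length quoted with the real-time deadline

# Mb/s x ms = kilobits, so multiply by 1000 to get bits per CPI.
bits_per_cpi = RATE_MBPS * CPI_MS * 1000
megabytes_per_cpi = bits_per_cpi / 8 / 1e6

# Raw rate of one 16-bit, 250 MHz DDR parallel RapidIO link, before overhead.
link_gbps = 16 * 250 * 2 / 1000.0
source_share_of_link = (RATE_MBPS / 1000.0) / link_gbps

print(f"{megabytes_per_cpi:.1f} MB per CPI, "
      f"{source_share_of_link:.1%} of one link's raw rate")
```

Under these assumptions the source alone fills more than half of a single link's raw rate, before any redistribution traffic between processing stages is counted.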


System Bandwidth Measurements

Overall system b/w = total bytes transferred ÷ total simulated time
Application b/w = total payload transferred ÷ total simulated time
Pipelined method requires most redistribution of data, consumes most bandwidth
Straightforward and staggered methods are comparable
Gap between pipelined bars indicates communication inefficiencies

[Figure: Bar chart of data rate (Gbps), axis 0 to 12, comparing overall system bandwidth and application bandwidth]
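Both bandwidth metrics share one formula and differ only in which bytes are counted. A minimal sketch with made-up totals; the byte counts and simulated time below are hypothetical and are not results from the presentation.

```python
def bandwidth_gbps(total_bytes, simulated_seconds):
    """Average bandwidth over a simulation run, in gigabits per second."""
    return total_bytes * 8 / simulated_seconds / 1e9

# Hypothetical run: all bytes on the wire versus payload bytes only.
all_bytes = 2_000_000_000
payload_bytes = 1_800_000_000
sim_time_s = 2.0

system_bw = bandwidth_gbps(all_bytes, sim_time_s)
application_bw = bandwidth_gbps(payload_bytes, sim_time_s)
print(f"system {system_bw:.1f} Gbps, application {application_bw:.1f} Gbps")
```

The difference between the two values is bandwidth spent on headers, padding, and retransmissions rather than on application data.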


Packet Overhead Efficiency

Communication efficiency = total payload transferred ÷ total bytes transferred

Pipelined method is consistently least efficient (as indicated on previous slide)

[Figure: Bar chart of communication efficiency, axis 0.8 to 1]

Data groupings of pipelined method require many packets to be sent that are < RIO max of 256 bytes
Packets that are not even multiples of a 32B word must also be “padded” with dummy data
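The effect of small and padded packets on efficiency can be sketched as follows. The 256-byte maximum and the 32-byte padding unit come from the slide; the 16-byte per-packet overhead is a placeholder assumption rather than a RapidIO specification value, and padding is assumed to apply to the payload.

```python
def packet_efficiency(payload_bytes, header_bytes=16, pad_unit=32, max_payload=256):
    """Payload bytes divided by total bytes sent for one message.

    header_bytes is a hypothetical per-packet overhead; pad_unit and
    max_payload follow the figures quoted on the slide.
    """
    total_sent = 0
    remaining = payload_bytes
    while remaining > 0:
        chunk = min(remaining, max_payload)
        padded = -(-chunk // pad_unit) * pad_unit  # round up to the padding unit
        total_sent += padded + header_bytes
        remaining -= chunk
    return payload_bytes / total_sent

print(packet_efficiency(256))  # full-size packet
print(packet_efficiency(40))   # small packet that also needs padding
```

Under these assumptions a 40-byte transfer is padded to 64 bytes and reaches only half the efficiency of the payload it carries, while full-size packets pay only the fixed overhead.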


RIO Fabric Considerations

Cut-through routing
  Provided reduced packet latencies
  Did not improve performance of overall application
    CPI completion latency remains same
    GMTI is bandwidth-intensive but not sensitive to latency of individual packets
Flow-control method
  Transmitter-controlled flow control did not provide improvements over receiver-controlled baseline method

[Figure: Bar chart of average packet delay (ns), axis 0 to 7000, for 8-, 12-, 16-, and 24-node systems under pipelined, straightforward, and staggered partitioning, comparing Baseline and Cut-through]

Tx-controlled flow control does eliminate packet transmission attempts when receiver buffers are unable to accept them (at no performance cost over Rx-controlled flow control)
  Could save power in some systems
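Why cut-through routing lowers per-packet latency can be seen in a first-order textbook model. This is not the presentation's switch model: it ignores queuing, switch processing time, and propagation delay, and the 272-byte packet and 16-byte header are hypothetical sizes.

```python
def store_and_forward_ns(packet_bytes, switches, link_gbps):
    """Each switch receives the whole packet before forwarding it."""
    serialization_ns = packet_bytes * 8 / link_gbps  # 1 Gb/s = 1 bit per ns
    return (switches + 1) * serialization_ns

def cut_through_ns(packet_bytes, header_bytes, switches, link_gbps):
    """Each switch forwards as soon as the header has arrived."""
    serialization_ns = packet_bytes * 8 / link_gbps
    header_ns = header_bytes * 8 / link_gbps
    return serialization_ns + switches * header_ns

# Hypothetical 272-byte packet crossing 2 switches on 8 Gb/s links
print(store_and_forward_ns(272, 2, 8.0))
print(cut_through_ns(272, 16, 2, 8.0))
```

The saving applies to each packet's latency, not to the rate at which bytes move through the fabric, which is consistent with the observation that a bandwidth-bound application sees no change in CPI completion latency.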


System Throughput

24-node systems needed to meet real-time deadline (256 msec CPI, red line)
As modeled, pipelined method performs worst
  True benefit of pipelining comes from stringing together smaller, cheaper specialized processing elements
  If this can be done in implementation, cost/performance benefits can be gained
Staggered method can sometimes leave processors idle, reducing throughput
Straightforward method also has best CPI latency, since all PEs work on each CPI

[Figure: Bar chart of throughput (CPIs / second), axis 0 to 4.5, with the real-time deadline marked by a red line]
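The real-time deadline translates into a minimum throughput in CPIs per second. A minimal sketch of that check; the function names and the two measured values are illustrative, not results from the presentation.

```python
def required_cpis_per_second(cpi_ms):
    """A system keeps up in real time if it completes one CPI per CPI period."""
    return 1000.0 / cpi_ms

def meets_deadline(measured_cpis_per_second, cpi_ms=256):
    return measured_cpis_per_second >= required_cpis_per_second(cpi_ms)

print(required_cpis_per_second(256))
print(meets_deadline(4.1), meets_deadline(3.2))  # hypothetical measurements
```

For a 256 ms CPI the threshold is just under 4 CPIs per second.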


Recent Results, Additional Experiments

Using much larger input data set sizes and more efficient system layouts
  24 Gbps vs. 4.6 Gbps input data set used here
  Still using 28-node system
Result latency comparison
  Interval from input data arrival to reporting of results
  Recall that deadline to meet is 256 ms
    This deadline is extended to a multiple of 256 ms for staggered and pipelined methods
  Some communication-computation overlap is acceptable assuming DMA
Free switch-memory histograms
  Shows percent of time switch spent with different amounts of free memory
  Reveals congestion or confirms efficient routing
  Spikes or bumps in low free memory brackets imply contention for a particular port
  Histograms generated for every switch in the system
Future research
  Our RapidIO research is an on-going effort
  Upcoming studies will consider different logical layers
    Memory-mapped logical layer already being modeled
  In addition to GMTI, Synthetic Aperture Radar (SAR) to be simulated and studied in RapidIO-based systems

[Figure: Free switch memory histogram. Frequency (0 to 0.5) versus free memory (bytes), 0 to 13107.2]

[Figure: Result latency comparison. Latency (ms), 0 to 1536, versus number of ranges (32000 to 64000) for Straightforward, 5 boards; Staggered, 5 boards; Pipelined, 6 boards; Pipelined, 7 boards]
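A free switch-memory histogram of the kind described above is a time-weighted histogram: each bracket accumulates the time the switch spent with that much free memory. The sketch below shows one way to compute it from a trace of (time, free bytes) samples; the trace, bracket count, and function name are hypothetical.

```python
def free_memory_histogram(samples, end_time, total_memory, brackets):
    """Fraction of time spent in each free-memory bracket.

    samples: (time, free_bytes) pairs sorted by time; each value is
    assumed to hold until the next sample (or end_time for the last one).
    """
    width = total_memory / brackets
    time_in_bracket = [0.0] * brackets
    next_times = [t for t, _ in samples[1:]] + [end_time]
    for (t, free), t_next in zip(samples, next_times):
        index = min(int(free // width), brackets - 1)  # clamp a completely free switch
        time_in_bracket[index] += t_next - t
    total_time = end_time - samples[0][0]
    return [x / total_time for x in time_in_bracket]

# Hypothetical trace for a switch with 10 KB (10240 bytes) of central memory
trace = [(0, 10000), (6, 1000), (8, 10000)]
histogram = free_memory_histogram(trace, end_time=10, total_memory=10240, brackets=4)
print(histogram)
```

In this made-up trace the switch spends 20% of the time in the lowest free-memory bracket, the kind of bump that would point to contention for a port.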


Conclusions

RapidIO provides feasible path to flight for space-based radar
  Throughput capability and interconnect scalability of RapidIO provide sufficient infrastructure for compute-intensive applications
  Future work to focus on additional SBR variants (e.g. Synthetic Aperture Radar, SAR) and experimental RIO analysis
Developed suite of simulation models and mechanisms for evaluation of RapidIO designs for space-based radar and other applications
Flexibility in system design using RapidIO interconnect allows range of system topologies to support various algorithm partitionings
  Straightforward method provides lowest completion latencies; pipelined method suffers from some communication inefficiencies
  Recent work shows systems scalable to more nodes and larger data cube sizes with greater processing/network requirements
GMTI result latency does not benefit from cut-through routing or from selection of either Rx- or Tx-controlled flow control
  Other applications may benefit more from these features of RapidIO
  Flow control method may offer other benefits, such as lower power consumption


Acknowledgements

This research was funded in part by Honeywell Defense and Space Electronic Systems (DSES), Clearwater, FL.

Thanks are also extended to MLDesign Technologies in Palo Alto, CA for use of their MLD software tools.