22 september 2005 rapidio-based space system architectures for synthetic aperture radar and ground...

22 September 2005

RapidIO-based Space System Architectures for Synthetic Aperture Radarand Ground Moving Target Indicator

David Bueno, Chris Conger, Adam Leko,

Ian Troxel, and Alan D. George

HCS Research Laboratory

College of Engineering

University of Florida

22 September 2005 2

Outline

I. Project Overview

II. BackgroundI. RapidIO (RIO)

II. Synthetic Aperture Radar (SAR)

III. Ground Moving Target Indicator (GMTI)

III. Model Library Overview

IV. RapidIO Experimental TestbedI. Overview

II. Validation results

V. Simulation Experiments and Results

VI. Conclusions

22 September 2005 3

Project Overview

Simulative analysis of real-time Space-Based Radar (SBR) systems using RapidIO interconnection networks RapidIO is a high-performance, switched interconnect for

embedded systems Experimental validation of simulation models using a

RapidIO hardware testbed

Image courtesy [6]

Sensitivity analysis of GMTI and SAR to RIO network and algorithm parameters Uses discrete-event simulation of RapidIO

network, processing elements, and SBR algorithms

Examine considerations for designing RIO-based systems capable of both SAR and GMTI

22 September 2005 4

Background: RapidIO Three-layered, embedded system interconnect architecture

Logical layer, transport layer, physical layer Point-to-point, packet-switched interconnect Peak single-link throughput ranging from 2 to 64 Gb/s Focus on 16-bit parallel LVDS RIO design for space systems

Image courtesy [7]

22 September 2005 5

Background: SAR SAR used to take broad-range, high-resolution snap-shots of

surface features from satellites, even at night or inclement weather

Data set is 2D image or matrix, with range and pulse dimension Typical data size is 2GB, each matrix element a 64-bit complex integer Due to large data set size, image processed iteratively in chunks

Figure below illustrates each stage of algorithm Color denotes partitioning (see legend to right of picture) Blue lines/blocks represent communication events

Processors write data back to global memory for repartitioning after processing completes in each subtask, if necessary

Sensor Endpoints

Global Memory

Pulse FFT

Auto-focus

Range Pulse Comp.

Polar Refor-matting

1st half 2D FFT (range)

2nd half 2D FFT

(pulse)

Magni-

tuding

Range Dimension

Pulse Dimension

Range/Pulse Blocks

22 September 2005 6

Background: GMTI

GMTI used to track moving targets on ground Incoming data organized as 3-D matrix (data cube)

Data reorganized between stages for processing efficiency Real-time processing deadline for each cube defined as Coherent

Processing Interval (CPI) Data set size for GMTI much smaller than SAR, however time

constraints allow much less time for processing Unlike SAR, entire data set may be stored in processor memories No communication with global memory during processing, inter-

processor communication between subtasks (corner turns) Previous work in [1] examined various partitionings of GMTI over

RIO-based systems

PulseCom-

pression

DopplerProcessing

Space-Time

AdaptiveProcessing

(STAP)

ConstantFalse AlarmRate

(CFAR)

Receive Cube

Send Results

Corner TurnPartitioned along range dimension

Partitioned along pulse dimension

22 September 2005 7

Model Library Overview

Modeling library created using Mission Level Designer (MLD), a commercial discrete-event simulation modeling tool C++-based, block-level, hierarchical modeling tool

Our RIO-SBR model library includes: Compute node with RIO endpoint

IO and message passing RIO logical layers

RIO parallel physical layer Script-based processor model

RIO central-memory switch Global memory (GM) board External data source See [1] for more details on model

library

Model ofFour-Processor Board

22 September 2005 8

RapidIO Testbed and Model Validation

2-node RapidIO testbed constructed Xilinx Virtex-II Pro FPGAs Xilinx RapidIO Core

250MHz link clock, 8-bit parallel Block diagram of single endpoint

depicted in upper figure Layer interface signals brought out

for logic analyzer visibility Timing probes inserted into

simulation model in equivalent positions, used to validate timing

Using same signals described above, transaction may be viewed across both endpoints

Equivalent system constructed in MLD for validation

Logical I/OLayer

BufferDesign

8-bit Parallel LVDS

Physical Layer

User-space

LVDS

PhysicalLink

LVDS

These signals broughtout for timing

Fine-Tuning Single Packet Latency Within a Single Endpoint

22 September 2005 9

Calibration Experiments

Calibration experiments designed to match transaction latencies Single-packet experiments

measure internal endpoint latencies 32B, 64B, 128B, and 256B NREAD, NWRITE, NWRITE_R

Multi-packet experiments calibrate realistic transaction sizes, endpoint operation overhead 1KB, 4KB, 16KB, 64KB, 256KB,

1MB, and 4MB transactions NREAD, SWRITE, NWRITE_R

Figure depicts SWRITE trans.

User-LOGLOG-BUFFBUFF-PHY

Link

t = 0

PHYBUF

FLOG UserLink

TESTBED

SIM

User LOGBUF

FPHY

Node 0 Node 1

Node 0 Node 1

LOG-User

BUFF-LOGPHY-BUFF

Data Source

replies

A B C

AB

C

22 September 2005 10

Validation Results: Latency Single-packet latency results shown below for read and response-

less write transactions Simulation models match hardware to less than 5% error in all cases NWRITE_R not shown, but results are consistent with those shown

Maximum absolute difference between simulated and measured time is less than 40ns Difference has two components, model error (up to 32ns) and

measurement error (4-8ns)

NREAD Latency

0

500

1000

1500

2000

2500

3000

32 64 128 256

Packet Size (B)

La

ten

cy

(n

s)

SIM

EXP

NWRITE Latency

0

500

1000

1500

2000

2500

3000

32 64 128 256

Packet Size (B)

Late

ncy

(ns)

SIM

EXP


Validation Results: Throughput Throughput results calculated based on measured latencies and

amount of data transferred Maximum actual throughput of 3.25Gbps for write transactions,

3.14Gbps for read transactions 4Gbps ideal throughput (250MHz DDR × 8 bits) 3.76Gbps ideal effective throughput, considering header/CRC overhead

Simulation models match hardware well (< 5% error) for small, medium, and large transactions

SWRITE Throughput

0

0.5

1

1.5

2

2.5

3

3.5

256 1k 4k 16k 64k 256k 1M 4M

Transaction Size (B)

Th

rou

gh

pu

t (G

bp

s)

SIM

EXP

NREAD Throughput

0

0.5

1

1.5

2

2.5

3

3.5

256 1k 4k 16k 64k 256k 1M 4M


Th

rou

gh

pu

t (G

bp

s)

SIM

EXP


Validation Results: Error Analysis

Single-packet latency error Simulated packets of smaller-than-max size experience varying

levels of error, depending upon transaction type and size All transaction types simulate most accurately for max-sized packets

Throughput error Endpoints driven by user-space state machine, which contributes

overhead between packet transfers not currently modeled Error in simulation levels out for larger transaction sizes

In all cases, simulation models match hardware to within 4%

Throughput Error

-4

-2

0

2

4

6

256 1k 4k 16k 64k 256k 1M 4M


Err

or

(%)

NREAD

NWRITE

Single-Packet Latency Error

-4-202468

101214

32 64 128 256Transaction Size (B)

Err

or

(%)

NREAD

NWRITE

NWRITE_R


Simulation System Design Constraints 16-bit parallel 250MHz DDR RapidIO links (1 GB/s) Systems composed of processor boards interconnected by RIO

backplane 4 processors per board, 8 Floating-Point Units (FPUs) per processor

Baseline SAR algorithm parameters: Chunk-based algorithm performed out of global memory 16k ranges, 16k pulses, 16s CPI (~2GB) Requires less processing power and network throughput, more

memory (global memory required) Baseline GMTI algorithm parameters:

“Straightforward” partitioning used for lowest latency Divides each data cube up across all processing elements

Data cube: 64k ranges, 256 pulses, 6 beams, 256ms CPI (~750 MB) Requires more processing power and network throughput, less

memory (global memory not required)


7-Board System

Backplane and System Models

System architecture mainly dictated by computational and network requirements of GMTI and its small CPI (256 ms) Requires ~3GB/s from source to sink to meet real-time deadlines

4-Switch Non-blocking Backplane

Backplane-to-Board 0, 1, 2, 3 Connections

Backplane-to-Board 4, 5, 6, and Data Source/GM Connections


Experiments: Overview

SAR experiments use four active processor boards Remaining boards available for redundancy purposes, etc. Chunk size varied for each system/algorithm configuration

Baseline GMTI partitioning uses seven active processor boards Explicit inter-processor communication for data redistribution

(rather than global memory) Average CPI completion latency metric of choice to evaluate

each algorithm/configuration SAR completion time must be less than CPI to allow algorithm to

be performed in real-time Double-buffering allows “straightforward” GMTI completion

deadline to be relaxed to 2x CPI


SAR: RIO Logical IO Optimization (1)

Chunk size per-processor = system-level chunk size/number of processors in system 4 boards × 4 processors per board = 16

processors in system

No Synch: “free for all” access to 4 GM ports by 16 processors

Level 1: access to each global memory (GM) port controlled by read token

Level 2: access to GM controlled by write token and read token

Level 3: access to GM controlled by single read + write token

SAR: Synchronization Effects

11

11.5

12

12.5

13

13.5

14

14.5

15

256KB 512KB 1MB 2MB 4MB 8MB 16MB

System-Level Chunk Size

Ave

rag

e L

aten

cy (

s)

Synch Level 3

Synch Level 2

Synch Level 1

No Synch


SAR: RIO Logical IO Optimization (2)

As a general rule, contention in network increases as chunk sizes increase, causing slower CPI completion times Model does not account for processing inefficiencies that may

result from using chunks that are “too small” “Medium-sized” chunks likely provide good compromise

No synch case relies heavily on RIO flow control which bogs down network as many processors contend for access to global memory

Synch level 1 (read token) most simple and effective method of synchronizing access to global memory Define as baseline for Logical IO SAR

Synch level 3 adds too much synchronization, completely removing contention but serializing all memory accesses in the process Trend of decreasing performance with increasing chunk size is

reversed in this case


SAR: Double-Buffering

Double-buffered version of SAR allows processors to process “current” chunk while reading “next” chunk Requires 2x-3x more on-

board memory for buffering each chunk

If system also going to perform GMTI in a “straightforward” fashion (not out of GM), processors will already have significantly more than enough memory for SAR double-buffering

SAR: Double-Buffering

11

11.5

12

12.5

13

13.5

64KB 128KB 256KB 512KB 1MB 2MB 4MB 8MB 16MBSystem-Level Chunk Size

Ave

rag

e L

aten

cy (

s)

Double-BufferedLogical IO Baseline

Significantly increases performance for small chunk sizes, but increases contention at larger sizes Synchronized version of 2x-

buffered alg. removes benefits


SAR: RIO Clock Rate

Compares 125 MHz RIO system with 250 MHz baseline Examine both systems with 4 GM ports and with 8 GM ports

Uses 4 processor boards and 2 GM boards 2 backplane slots free

SAR: RapidIO DDR Clock Rate Variation

11

12

13

14

15

16

17


System-Level Chunk Size

Ave

rage

Lat

ency

(s)

125 MHz 8 GM125 MHz 4 GM250 MHz 8 GM250 MHz 4 GM Baseline

125 MHz systems significantly slower but still well inside 16 sec deadline

Doubled number of GM ports significantly increases system performance, especially for 125 MHz system Additional 4 GM ports

help to provide processors with lots of data for computation


GMTI: Scalability

Purpose of experiment to stretch data cube sizes beyond baseline and explore systems with 5, 6, and 7 active processor boards

GMTI: System Scalability

150

250

350

450

550

650

750

32000 40000 48000 56000 64000 72000 80000Number of Ranges

Ave

rage

Lat

ency

(ms)

5-Board System

6-Board System

Baseline 7-Board System

Double-Buffered Deadline

7-board system scales almost up to 80k ranges with full double-buffering “Double-buffering” implies

storage of “current” cube while receiving “next” cube

All systems able to handle 64k ranges baseline cube size 6-board system fails on 80k

ranges cube, 5-board system fails on 72k and 80k ranges cubes


GMTI: Global Memory-based GMTI

Experiment examines GMTI performed out of global memory in “chunks” similar to SAR Performance much worse than baseline “straightforward” GMTI

due to redundant transfer of data to/from global memory for each chunk

GMTI: Chunk-based Processing

100150200250300350400450500550


System-Level Chunk SizeA

vera

ge

Lat

ency

(m

s)Chunk-based GMTI 32k RangesChunk-based GMTI 64k RangesStraightforward GMTI 32k RangesStraightforward GMTI 64k RangesChunk-based DeadlineStraightforward Double-Buffered Deadline

Problems compounded by inability to double-buffer entire data cube when using chunks (creates strict 256ms deadline)

Advantage is that individual processing elements need much less local memory ~1-2 chunks vs. entire 1/N of

data cube (N = # of processors)


Conclusions (1)

Developed and validated suite of RapidIO models Minimal error experienced in simulation vs. testbed results Sources of small error being investigated with help from Xilinx

due to lack of visibility with internals of RapidIO core Future work will integrate RIO switches into testbed

Used models to study performance of variations of GMTI and SAR algorithms on RapidIO-based system Results emphasize importance of carefully scheduling

communication rather than letting RapidIO network be solely responsible for managing contention

Double-buffering of chunks provides mechanism for decreasing SAR CPI completion latencies (or chunk-based GMTI latencies)

Double-buffering of entire data cube for “straightforward” GMTI enables handling of larger cube sizes with fewer processing resources


Conclusions (2)

Several important considerations for building systems capable of both GMTI and SAR Wide disparity in CPI completion deadlines SAR memory requirements much higher than GMTI GMTI processing and network requirements higher than SAR

Systems capable of both GMTI and SAR must be built to “greatest common denominator” Can sometimes be wasteful of system resources when

performing one algorithm or the other Compromise may be obtained by performing GMTI in a “chunk-

based” manner similar to SAR Evens out usage of system resources at expense of GMTI CPI

completion latencies and data cube-size capabilities


Bibliography

[1] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, “Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar,” Proc. High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, Sep. 28-30, 2004.

[2] C. Cho, “Performance Analysis for Synthetic Aperture Radar Target Classification,” MSEE Thesis, Mass. Institute of Technology, Feb. 2001.

[3] P. Meisl, M. Ito, and I. Cumming, “Parallel Synthetic Aperture Radar Processing on Workstation Networks,” Proc. 10th International Parallel Processing Symposium, Apr. 15-29, 1996, pp. 716-723.

[4] S. Plimpton, G. Mastin, and D. Ghiglia, “Synthetic Aperture Radar Image Processing on Parallel Supercomputers,” Proc. 1991 ACM/IEEE Conf. on Supercomputing, Albuquerque, NM, 1991, pp. 446-452.

[5] D. Bueno, A. Leko, C. Conger, I. Troxel, and A. George, “Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications,” Proc. 29th IEEE Conf. on Local Computer Networks (LCN) via IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, Nov. 16-18, 2004.

[6] http://www.afa.org/magazine/aug2002/0802radar.asp[7] G. Shippen, “RapidIO Technical Deep Dive 1: Architecture & Protocol,”

Motorola Smart Network Developers Forum, 2003.


Acknowledgements

We wish to thank Honeywell Defense and Space Electronic Systems (DSES) in Clearwater, FL for support of this research.

We also extend thanks to Xilinx for their generous donation of hardware and IP cores, as well as MLDesign Technologies for their donation of simulation software.

22 september 2005 rapidio-based space system architectures for synthetic aperture radar and ground...

Documents

rio network

rio endpointio

riobased systems capable

entire data set

processing elements

d matrix data cubedata

large data set size

groundincoming data