Floating-Point Reuse in an FPGA Implementation of a
Ray-Triangle Intersection Algorithm
Craig [email protected]
June 27, 2006
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Adrian Javelo, UCLA
Craig Ulmer, Sandia National Laboratories/CA
Ray-Triangle Intersection Algorithm
• Möller and Trumbore algorithm (1997)
– Computes the (t, u, v) intersection point
• Modified to remove division (one possible form is sketched below)
– 24 Adds
– 26 Multiplies
– 4 Compares
– 15 Delays
– 17 Inputs
– 2 Outputs (+4 bits)
• Goal: Build for a V2P20
– Catch: Do in 32b Floating-Point
– Assume: 5 adds, 6 multiplies, 4 compares
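Where the slide gives only operation counts, the following is a minimal software sketch of one common way to remove the division from Möller-Trumbore: keep t, u, v scaled by the determinant and compare against det-scaled bounds. The function names and the det > 0 assumption are ours; this is not necessarily the exact formulation behind the counts above.

```python
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def intersect_scaled(orig, d, v0, v1, v2):
    """Moller-Trumbore with t, u, v kept scaled by det: no divide needed."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)            # assume det > 0 (front-facing, non-degenerate)
    tv = sub(orig, v0)
    u = dot(tv, p)              # really u*det
    q = cross(tv, e1)
    v = dot(d, q)               # really v*det
    t = dot(e2, q)              # really t*det
    hit = u >= 0 and v >= 0 and (u + v) <= det and t >= 0
    return hit, t, u, v, det    # caller divides by det only if needed

# Ray from (0,0,-1) along +z hits this triangle at t = t_scaled/det = 1:
print(intersect_scaled((0, 0, -1), (0, 0, 1),
                       (-1, -1, 0), (-1, 2, 0), (2, -1, 0)))
```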
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Floating-Point and FPGAs
• Floating-point has long been a weakness for FPGAs
• Recent high-quality FP libraries
– SNL: Keith Underwood & K. Scott Hemmert
– USC, ENS Lyon, Nallatech, SRC, Xilinx
• FP units still challenging to work with
– Deeply pipelined
– Require sizable resources
Single-Precision Function   Stages   Max in V2P20
Add                           10          14
Multiply                      11          18
Multiply (no denormals)        6          22
Divide                        31           4
Implementing a Computational Kernel
• Desirable approach: full pipeline
– One FP unit per operation
– Issue a new iteration every cycle
• Problems
– Rapidly run out of chip space
– Input bandwidth
– Low utilization on “one-time” ops
• Need to consider techniques for reusing units
Our Approach: Recycling Architecture
• Build a wrapper around an array of FP units (behavioral sketch below)
– Apply traditional compiler techniques
– Customize hardware data path
[Figure: recycling architecture block diagram with control, input selection, the FP unit array, and intermediate buffering between inputs and outputs]
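As a rough behavioral model of the wrapper, the sketch below (our toy construction, not the actual VHDL, and ignoring pipeline latency) runs a static control program: each cycle, every active unit's operands are selected from external inputs or buffered intermediates, and the result is written back to the intermediate buffer.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def run(program, inputs):
    buffers = dict(inputs)                 # intermediates, keyed by name
    for step in program:                   # one control word per cycle
        results = {dst: OPS[op](buffers[a], buffers[b])
                   for (op, a, b, dst) in step}
        buffers.update(results)            # write back after the step
    return buffers

# Two units busy in step 0, one in step 1:
out = run([[("*", "x", "y", "t0"), ("+", "x", "y", "t1")],
           [("+", "t0", "t1", "res")]],
          {"x": 2.0, "y": 3.0})
print(out["res"])                          # 11.0
```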
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Operation Scheduling
• Sequence execution on FP array
• Extract the Data Flow Graph (DFG)
– Wide and shallow
– Need more parallelism
• Loop unrolling / strip mining (sketched below)
– Pad FP units out to latency P
– Work on P iterations at a time
– Sequentially issue a strip of P iterations
– Thus: ignore FP latency in scheduling
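A small illustration of why strip mining lets the scheduler ignore FP latency (names and values are ours): with a strip of P independent iterations in flight, two dependent steps of the same iteration enter the units exactly P cycles apart, so any pipeline latency up to P is hidden.

```python
P = 14                               # strip size, padded to deepest FP latency

def issue_order(num_steps, P):
    """Yield (step, iteration) pairs in issue order: one whole strip per step."""
    for step in range(num_steps):
        for it in range(P):
            yield (step, it)

order = list(issue_order(num_steps=3, P=P))
# Step 1 of iteration 0 issues exactly P cycles after step 0 of iteration 0:
assert order.index((1, 0)) - order.index((0, 0)) == P
```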
Step-by-Step Scheduling
[Figure: single-strip schedule; rows are issue steps, columns show the add, multiply, and compare (<0) operations executing at each step]
Single Strip utilization (adds / multiplies):
– One Strip: 40% / 36%
– Back-to-Back: 53% / 48%
Step-by-Step Scheduling
[Figure: single-strip and double-strip schedules side by side; rows are issue steps, columns show the add, multiply, and compare (<0) operations at each step]
Single Strip utilization (adds / multiplies):
– One Strip: 40% / 36%
– Back-to-Back: 53% / 48%
Double Strip utilization (adds / multiplies):
– One Strip: 64% / 57%
– Back-to-Back: 80% / 72%
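The percentages are consistent with utilization = ops / (units × steps), using the 24 adds and 26 multiplies per iteration and the assumed 5 adders and 6 multipliers. The step counts below (a 12-step schedule that can restart every 9 steps back-to-back) are our inference from the figures, not numbers stated on the slide.

```python
adds, muls = 24, 26                        # ops per iteration (algorithm slide)
adders, multipliers = 5, 6                 # FP units (goal slide)
for label, steps in (("One Strip", 12), ("Back-to-Back", 9)):
    print(label,
          round(adds / (adders * steps), 2),        # adder utilization
          round(muls / (multipliers * steps), 2))   # multiplier utilization
# One Strip 0.4 0.36, Back-to-Back 0.53 0.48 (matches the slide)
```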
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Mapping Operations to Units
• Assign each operation in the schedule to a specific unit
– Assignments affect the input selection unit’s hardware
• Two strategies: First-Come-First-Serve and a Heuristic (toy model below)
[Figure: datapath of two adders and two multipliers fed by the Input Selection Unit, with Intermediate Buffering between output and input]
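The toy model below (ours, not the authors' tool) shows how assignment changes mux cost: the set of distinct sources that ever feed a unit determines the width of that unit's input mux, so a placement-aware heuristic can avoid piling new sources onto one unit the way First-Come-First-Serve does.

```python
def map_ops(schedule, num_units, heuristic=False):
    """schedule: list of steps; each step is a list of (src_a, src_b) pairs."""
    sources = [set() for _ in range(num_units)]    # distinct sources per unit
    for step in schedule:
        free = list(range(num_units))
        for a, b in step:
            if heuristic:
                # prefer a free unit that has already seen these sources
                unit = min(free, key=lambda u: len(sources[u] | {a, b}))
            else:
                unit = free[0]                     # first-come-first-serve
            free.remove(unit)
            sources[unit] |= {a, b}
    return [len(s) for s in sources]               # mux inputs needed per unit

sched = [[("x", "y")], [("x", "y")], [("a", "b")]]
print(map_ops(sched, 2))                   # FCFS:      [4, 0] (one wide mux)
print(map_ops(sched, 2, heuristic=True))   # heuristic: [2, 2] (narrower muxes)
```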
Mapping Effects
[Figure: the same schedule mapped First-Come-First-Serve vs. with the Heuristic, and the multiplexers required per unit input (MUX3–MUX7) under each mapping]
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Buffering Intermediate Values
• Necessary for holding values between stages
– Input vs. output buffering
– Block RAM (BRAM) vs. registers
• Focus on output buffering with registers
– A “Delay Pipe” houses a strip of P values
[Figure: buffering options, a dual-port BRAM vs. a register delay pipe of P registers]
Two Strategies
• Independently-Writable Delay Blocks: minimize the number of buffers (40 memories, 40 MUXs)
[Figure: each unit output writes into its own independently-writable Z⁻¹ delay blocks]
• Chaining Delay Blocks: minimize control logic (81 memories, 0 MUXs); modeled below
[Figure: unit outputs feed fixed chains of Z⁻¹ delay blocks]
• Chaining: 6% faster, 19% smaller, and 400% faster to build!
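A minimal model of the register delay pipe (our sketch): a P-deep shift register returns each value exactly P cycles after it was written, and chaining two pipes gives a fixed 2P-cycle delay with no muxes or write addressing, which is the appeal of the chaining strategy.

```python
from collections import deque

class DelayPipe:
    """P-deep shift register: tick() returns the value written P cycles ago."""
    def __init__(self, P):
        self.regs = deque([0] * P, maxlen=P)

    def tick(self, value):
        out = self.regs[0]           # oldest value, written P cycles ago
        self.regs.append(value)      # maxlen drops regs[0] automatically
        return out

# Chaining: feed one pipe into the next for a fixed 2*P-cycle delay.
p1, p2 = DelayPipe(3), DelayPipe(3)
print([p2.tick(p1.tick(x)) for x in range(10)])
# [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]: each x re-emerges 2*P = 6 cycles later
```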
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Performance
• Implemented:
– Single-strip
– Double-strip
– Full-Pipeline (on a V2P50)
                                Single-strip   Double-strip   Full Pipeline
V2P20 Area                          70%            79%            199%
Clock Rate                        155 MHz        148 MHz        142 MHz
GFLOPS                              0.9            1.2            7.1
Input Bandwidth (Bytes/clock)       7.6           11.3             68
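The GFLOPS figures are consistent with ops-per-cycle × clock rate; the arithmetic below is our reconstruction using the op counts and utilizations from earlier slides, not numbers stated in the talk.

```python
full_pipeline = 50 * 142e6                    # 24 adds + 26 muls every cycle
single_strip = (5 * 0.53 + 6 * 0.48) * 155e6  # units x back-to-back utilization
print(round(full_pipeline / 1e9, 1))          # 7.1 GFLOPS
print(round(single_strip / 1e9, 1))           # ~0.9 GFLOPS
```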
Ongoing Work: Automation
• Natural fit for automation
– Built our own tools
– DFG analysis tools
– VHDL generation
• Experiment (toy estimate below)
– No strip mining
– Change # of FP units
– Change # of iterations
– Find # of clocks for 128 iterations
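For a feel of the experiment's shape, the toy estimator below (ours; it ignores the dependence stalls that matter without strip mining, so treat it only as a lower bound) counts cycles to push 128 iterations of a roughly 50-op kernel through a shared unit array.

```python
import math

def cycles(iterations=128, ops_per_iter=50, num_units=4, latency=14):
    issue = math.ceil(iterations * ops_per_iter / num_units)  # issue slots
    return issue + latency            # plus draining the last pipeline

for units in (1, 2, 4, 8, 16):
    print(units, cycles(num_units=units))   # more units, fewer cycles
```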
[Figure: cycles for 128 operations as the number of units (1–16) and operations per block vary in powers of two; y-axis 0–10,000 cycles]
Concluding Remarks
• Reusing FP units enables FPGAs to process larger kernels
– Apply traditional scheduling tricks to increase utilization
– Algorithm shape affects performance
– Simplicity wins
• Simple DFG tools go a long way
– Easy to adjust parameters and generate hardware
– Focus on kernels instead of complete systems