Floating-Point Reuse in an FPGA Implementation of a
Ray-Triangle Intersection Algorithm
Craig [email protected]
June 27, 2006
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Adrian Javelo, UCLA
Craig Ulmer, Sandia National Laboratories/CA
Ray-Triangle Intersection Algorithm
• Möller and Trumbore algorithm (1997)
– Computes the (t, u, v) intersection point
• Modified to remove division (one possible form is sketched below)
– 24 Adds
– 26 Multiplies
– 4 Compares
– 15 Delays
– 17 Inputs
– 2 Outputs (+4 bits)
• Goal: Build for a V2P20
– Catch: Do in 32b Floating-Point
– Assume: 5 adds, 6 multiplies, 4 compares
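Where the slide gives only operation counts, the following is a minimal software sketch of one common way to remove the division from Möller-Trumbore: keep t, u, v scaled by the determinant and compare against det-scaled bounds. The function names and the det > 0 assumption are ours; this is not necessarily the exact formulation behind the counts above.

```python
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def intersect_scaled(orig, d, v0, v1, v2):
    """Moller-Trumbore with t, u, v kept scaled by det: no divide needed."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)            # assume det > 0 (front-facing, non-degenerate)
    tv = sub(orig, v0)
    u = dot(tv, p)              # really u*det
    q = cross(tv, e1)
    v = dot(d, q)               # really v*det
    t = dot(e2, q)              # really t*det
    hit = u >= 0 and v >= 0 and (u + v) <= det and t >= 0
    return hit, t, u, v, det    # caller divides by det only if needed

# Ray from (0,0,-1) along +z hits this triangle at t = t_scaled/det = 1:
print(intersect_scaled((0, 0, -1), (0, 0, 1),
                       (-1, -1, 0), (-1, 2, 0), (2, -1, 0)))
```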
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Floating-Point and FPGAs
• Floating-point has long been a weakness for FPGAs
• Recent high-quality FP libraries
– SNL: Keith Underwood & K. Scott Hemmert
– USC, ENS Lyon, Nallatech, SRC, Xilinx
• FP units still challenging to work with
– Deeply pipelined
– Require sizable resources
Single-Precision Function   Stages   Max in V2P20
Add                           10          14
Multiply                      11          18
Multiply (no denormals)        6          22
Divide                        31           4
Implementing a Computational Kernel
• Desirable approach: full pipeline
– One FP unit per operation
– Issue a new iteration every cycle
• Problems
– Rapidly run out of chip space
– Input bandwidth
– Low utilization on “one-time” ops
• Need to consider techniques for reusing units
Our Approach: Recycling Architecture
• Build a wrapper around an array of FP units (behavioral sketch below)
– Apply traditional compiler techniques
– Customize hardware data path
[Figure: recycling architecture block diagram with control, input selection, the FP unit array, and intermediate buffering between inputs and outputs]
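As a rough behavioral model of the wrapper, the sketch below (our toy construction, not the actual VHDL, and ignoring pipeline latency) runs a static control program: each cycle, every active unit's operands are selected from external inputs or buffered intermediates, and the result is written back to the intermediate buffer.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def run(program, inputs):
    buffers = dict(inputs)                 # intermediates, keyed by name
    for step in program:                   # one control word per cycle
        results = {dst: OPS[op](buffers[a], buffers[b])
                   for (op, a, b, dst) in step}
        buffers.update(results)            # write back after the step
    return buffers

# Two units busy in step 0, one in step 1:
out = run([[("*", "x", "y", "t0"), ("+", "x", "y", "t1")],
           [("+", "t0", "t1", "res")]],
          {"x": 2.0, "y": 3.0})
print(out["res"])                          # 11.0
```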
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Operation Scheduling
• Sequence execution on FP array
• Extract the Data Flow Graph (DFG)
– Wide and shallow
– Need more parallelism
• Loop unrolling / strip mining (sketched below)
– Pad FP units out to latency P
– Work on P iterations at a time
– Sequentially issue a strip of P iterations
– Thus: ignore FP latency in scheduling
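A small illustration of why strip mining lets the scheduler ignore FP latency (names and values are ours): with a strip of P independent iterations in flight, two dependent steps of the same iteration enter the units exactly P cycles apart, so any pipeline latency up to P is hidden.

```python
P = 14                               # strip size, padded to deepest FP latency

def issue_order(num_steps, P):
    """Yield (step, iteration) pairs in issue order: one whole strip per step."""
    for step in range(num_steps):
        for it in range(P):
            yield (step, it)

order = list(issue_order(num_steps=3, P=P))
# Step 1 of iteration 0 issues exactly P cycles after step 0 of iteration 0:
assert order.index((1, 0)) - order.index((0, 0)) == P
```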
Step-by-Step Scheduling
[Figure: single-strip schedule; rows are issue steps, columns show the add, multiply, and compare (<0) operations executing at each step]
Single Strip utilization (adds / multiplies):
– One Strip: 40% / 36%
– Back-to-Back: 53% / 48%
Step-by-Step Scheduling
[Figure: single-strip and double-strip schedules side by side; rows are issue steps, columns show the add, multiply, and compare (<0) operations at each step]
Single Strip utilization (adds / multiplies):
– One Strip: 40% / 36%
– Back-to-Back: 53% / 48%
Double Strip utilization (adds / multiplies):
– One Strip: 64% / 57%
– Back-to-Back: 80% / 72%
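The percentages are consistent with utilization = ops / (units × steps), using the 24 adds and 26 multiplies per iteration and the assumed 5 adders and 6 multipliers. The step counts below (a 12-step schedule that can restart every 9 steps back-to-back) are our inference from the figures, not numbers stated on the slide.

```python
adds, muls = 24, 26                        # ops per iteration (algorithm slide)
adders, multipliers = 5, 6                 # FP units (goal slide)
for label, steps in (("One Strip", 12), ("Back-to-Back", 9)):
    print(label,
          round(adds / (adders * steps), 2),        # adder utilization
          round(muls / (multipliers * steps), 2))   # multiplier utilization
# One Strip 0.4 0.36, Back-to-Back 0.53 0.48 (matches the slide)
```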
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Mapping Operations to Units
• Assign each operation in the schedule to a specific unit
– Assignments affect the input selection unit’s hardware
• Two strategies: First-Come-First-Serve and a Heuristic (toy model below)
[Figure: datapath of two adders and two multipliers fed by the Input Selection Unit, with Intermediate Buffering between output and input]
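The toy model below (ours, not the authors' tool) shows how assignment changes mux cost: the set of distinct sources that ever feed a unit determines the width of that unit's input mux, so a placement-aware heuristic can avoid piling new sources onto one unit the way First-Come-First-Serve does.

```python
def map_ops(schedule, num_units, heuristic=False):
    """schedule: list of steps; each step is a list of (src_a, src_b) pairs."""
    sources = [set() for _ in range(num_units)]    # distinct sources per unit
    for step in schedule:
        free = list(range(num_units))
        for a, b in step:
            if heuristic:
                # prefer a free unit that has already seen these sources
                unit = min(free, key=lambda u: len(sources[u] | {a, b}))
            else:
                unit = free[0]                     # first-come-first-serve
            free.remove(unit)
            sources[unit] |= {a, b}
    return [len(s) for s in sources]               # mux inputs needed per unit

sched = [[("x", "y")], [("x", "y")], [("a", "b")]]
print(map_ops(sched, 2))                   # FCFS:      [4, 0] (one wide mux)
print(map_ops(sched, 2, heuristic=True))   # heuristic: [2, 2] (narrower muxes)
```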
Mapping Effects
[Figure: the same schedule mapped First-Come-First-Serve vs. with the Heuristic, and the multiplexers required per unit input (MUX3–MUX7) under each mapping]
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Buffering Intermediate Values
• Necessary for holding values between stages
– Input vs. output buffering
– Block RAM (BRAM) vs. registers
• Focus on output buffering with registers
– A “Delay Pipe” houses a strip of P values
[Figure: buffering options, a dual-port BRAM vs. a register delay pipe of P registers]
Two Strategies
• Independently-Writable Delay Blocks: minimize the number of buffers (40 memories, 40 MUXs)
[Figure: each unit output writes into its own independently-writable Z⁻¹ delay blocks]
• Chaining Delay Blocks: minimize control logic (81 memories, 0 MUXs); modeled below
[Figure: unit outputs feed fixed chains of Z⁻¹ delay blocks]
• Chaining: 6% faster, 19% smaller, and 400% faster to build!
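A minimal model of the register delay pipe (our sketch): a P-deep shift register returns each value exactly P cycles after it was written, and chaining two pipes gives a fixed 2P-cycle delay with no muxes or write addressing, which is the appeal of the chaining strategy.

```python
from collections import deque

class DelayPipe:
    """P-deep shift register: tick() returns the value written P cycles ago."""
    def __init__(self, P):
        self.regs = deque([0] * P, maxlen=P)

    def tick(self, value):
        out = self.regs[0]           # oldest value, written P cycles ago
        self.regs.append(value)      # maxlen drops regs[0] automatically
        return out

# Chaining: feed one pipe into the next for a fixed 2*P-cycle delay.
p1, p2 = DelayPipe(3), DelayPipe(3)
print([p2.tick(p1.tick(x)) for x in range(10)])
# [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]: each x re-emerges 2*P = 6 cycles later
```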
Outline
• Overview
– Reusing floating-point hardware
• Adapting the Algorithm
– Operation Scheduling
– Mapping Operations to Units
– Intermediate Data Values
• Performance Observations
– Ongoing Work: Automation
• Summary
Performance
• Implemented:
– Single-strip
– Double-strip
– Full-Pipeline (on a V2P50)
                                Single-strip   Double-strip   Full Pipeline
V2P20 Area                          70%            79%            199%
Clock Rate                        155 MHz        148 MHz        142 MHz
GFLOPS                              0.9            1.2            7.1
Input Bandwidth (Bytes/clock)       7.6           11.3             68
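The GFLOPS figures are consistent with ops-per-cycle × clock rate; the arithmetic below is our reconstruction using the op counts and utilizations from earlier slides, not numbers stated in the talk.

```python
full_pipeline = 50 * 142e6                    # 24 adds + 26 muls every cycle
single_strip = (5 * 0.53 + 6 * 0.48) * 155e6  # units x back-to-back utilization
print(round(full_pipeline / 1e9, 1))          # 7.1 GFLOPS
print(round(single_strip / 1e9, 1))           # ~0.9 GFLOPS
```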
Ongoing Work: Automation
• Natural fit for automation
– Built our own tools
– DFG analysis tools
– VHDL generation
• Experiment (toy estimate below)
– No strip mining
– Change # of FP units
– Change # of iterations
– Find # of clocks for 128 iterations
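For a feel of the experiment's shape, the toy estimator below (ours; it ignores the dependence stalls that matter without strip mining, so treat it only as a lower bound) counts cycles to push 128 iterations of a roughly 50-op kernel through a shared unit array.

```python
import math

def cycles(iterations=128, ops_per_iter=50, num_units=4, latency=14):
    issue = math.ceil(iterations * ops_per_iter / num_units)  # issue slots
    return issue + latency            # plus draining the last pipeline

for units in (1, 2, 4, 8, 16):
    print(units, cycles(num_units=units))   # more units, fewer cycles
```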
[Figure: cycles for 128 operations as the number of units (1–16) and operations per block vary in powers of two; y-axis 0–10,000 cycles]
Concluding Remarks
• Reusing FP units enables FPGAs to process larger kernels
– Apply traditional scheduling tricks to increase utilization
– Algorithm shape affects performance
– Simplicity wins
• Simple DFG tools go a long way
– Easy to adjust parameters and generate hardware
– Focus on kernels instead of complete systems