A parallel 'for' loop memory template for a high level synthesis compiler
DESCRIPTION
We propose a parametrized memory template for applications with parallel 'for' loops. The template's parameters reflect important trade-offs made during system design. The template is incorporated in our high level synthesis (HLS) compiler, where the template's parameters are adjusted to the application. The template fits parallel 'for' loops with no loop dependencies and sequential bodies. We found two alternative template implementations using our compiler. In the future, we will develop templates for other types of 'for' loops. These will be added to the compiler, which will then identify the template that works best for the application it is compiling. Once a template is selected, the compiler will use design space exploration to select the best combination of template parameters for the targeted hardware and application.
TRANSCRIPT
A parallel for loop memory template for a high level synthesis compiler
Euromicro Conference on Digital System Design
Lille, France, 02/09/2010
Craig Moore, Wim Meeus, Harald Devos, and Dirk Stroobandt
30/06/2010 Craig Moore, DSD 02/09/2010 2
Outline
● High Level Synthesis
● Hardware Development
● External Memory
● Burst memory transfers
● Parallel For Loops
● Memory Template Overview
● Small Example
● Future Work
● Conclusions
High Level Synthesis (HLS): Missing Pieces
Memory Templates as Tools
● HDL Programmers have:
  ● Toolkit of memory designs
  ● Use the right tool for the job
  ● Manually adapt their designs
● HLS Compilers should:
  ● Have a toolkit of templates
  ● Adapt the template to the app
  ● Evaluate each template
  ● Suggest the best template
Basic Steps for any Algorithm
1) Read values from memory
2) Process each value
3) Store output in memory
for (int i = start; i < end; i++) {
    b[i] = func(a[i]);
}
Implement on Hardware
External Memory for FPGAs
● A bottleneck
● Sequential in nature
● Number of values returned each cycle depends on bus width
● Each memory request requires a handshake
Adapting to the Bottleneck
● Stream values from memory
● Pre-fetch values
● Read/Write more than one value each clock cycle
● Store values locally to mask latency
● Reduce number of requests
Burst Transfers
● Burst of consecutive memory operations
Read Transfer
  Start Address: 3
  Transfer: 4
[Figure: memory diagram with cells labelled 0 to 6]
Write Transfer
  Start Address: 2
  Transfer: 5
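The read and write bursts above can be mimicked in software. The following is a minimal C sketch, not the hardware from the talk. It assumes that "Transfer" is the number of consecutive words moved, that memory has seven word-sized cells (0 to 6), and that each request costs one handshake whatever its length. The names `Memory`, `burst_read` and `burst_write` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_WORDS 7  /* assumed: cells 0..6 */

/* Toy external memory that counts request handshakes. */
typedef struct {
    int cells[MEM_WORDS];
    unsigned handshakes;  /* one per request, regardless of burst length */
} Memory;

/* Read `count` consecutive words starting at `start` with one request. */
static void burst_read(Memory *mem, size_t start, size_t count, int *out) {
    assert(start + count <= MEM_WORDS);
    mem->handshakes++;
    for (size_t k = 0; k < count; k++) {
        out[k] = mem->cells[start + k];
    }
}

/* Write `count` consecutive words starting at `start` with one request. */
static void burst_write(Memory *mem, size_t start, size_t count, const int *in) {
    assert(start + count <= MEM_WORDS);
    mem->handshakes++;
    for (size_t k = 0; k < count; k++) {
        mem->cells[start + k] = in[k];
    }
}
```

Under these assumptions, `burst_read(&mem, 3, 4, buf)` fetches cells 3 to 6 for a single handshake, where four single-word reads would pay four.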
Parallel for Loop
● Each iteration is run in parallel
● No loop dependencies
  ● Loop Transformations to remove them
Example with Dependencies
for i = 1 to 4
{
    a(i) = a(i) + 1
    b(i) = a(i - 1) + a(i + 1)
}
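To make the dependency concrete: iteration i reads a(i - 1), which iteration i - 1 has already incremented, and a(i + 1), which iteration i + 1 has not yet touched. The C sketch below shows one possible loop transformation that removes both; it is an illustration, not necessarily the transformation used by the compiler. The array length and the function names are assumptions.

```c
#include <assert.h>
#include <string.h>

#define LEN 6  /* indices 0..5; the loop itself touches i = 1..4 */

/* Loop as written on the slide: iteration i reads a[i - 1], which
   iteration i - 1 has already incremented, so iterations cannot be reordered. */
static void with_dependencies(int a[LEN], int b[LEN]) {
    for (int i = 1; i <= 4; i++) {
        a[i] = a[i] + 1;
        b[i] = a[i - 1] + a[i + 1];
    }
}

/* One possible rewrite: compute b from the original a only, then update a.
   Each of the two loops is free of loop-carried dependencies. */
static void without_dependencies(int a[LEN], int b[LEN]) {
    for (int i = 1; i <= 4; i++) {
        int already_incremented = (i > 1) ? 1 : 0;  /* a[0] is never updated */
        b[i] = a[i - 1] + already_incremented + a[i + 1];
    }
    for (int i = 1; i <= 4; i++) {
        a[i] = a[i] + 1;
    }
}

/* Check that both versions agree on a sample input. */
static void check_equivalence(void) {
    int a1[LEN] = {5, 1, 2, 3, 4, 9}, b1[LEN] = {0};
    int a2[LEN], b2[LEN] = {0};
    memcpy(a2, a1, sizeof a1);
    with_dependencies(a1, b1);
    without_dependencies(a2, b2);
    assert(memcmp(a1, a2, sizeof a1) == 0);
    assert(memcmp(b1, b2, sizeof b1) == 0);
}
```

After the rewrite, the iterations of each loop can run in any order, which is the property the template relies on.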
Template Overview
Requests read bursts and controls execution of data paths; waits for the output buffer if it is full
Non-pipelined loop bodies executing in parallel.
Manual Design
With enough values, performs write bursts.
Starts and stops execution
Controls access to memory, grants permission based on request (output buffer priority)
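The memory-access rule described above (output buffer priority, write bursts once enough values are buffered) can be sketched in C. This models only the decision logic under stated assumptions: the names `Grant`, `output_wants_burst` and `arbitrate` are hypothetical, and "enough values" is assumed to mean at least one full burst.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { GRANT_NONE, GRANT_INPUT, GRANT_OUTPUT } Grant;

/* The output buffer asks for the bus once it holds a full burst (assumption). */
static bool output_wants_burst(size_t values_buffered, size_t burst_length) {
    assert(burst_length > 0);
    return values_buffered >= burst_length;
}

/* Fixed priority: a pending write burst wins over a pending read burst. */
static Grant arbitrate(bool input_requests, bool output_requests) {
    if (output_requests) return GRANT_OUTPUT;
    if (input_requests) return GRANT_INPUT;
    return GRANT_NONE;
}
```

Giving the output side priority drains the output buffer first, so the data paths are less likely to stall on a full buffer.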
Byte-Enable Signal
● Multiple values for each memory transaction
● Tells which bytes to replace and preserve
[Figure legend: Ignore, Enable]
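A byte-enable signal can be illustrated with a small C sketch. It assumes a 32-bit bus with four byte lanes and one enable bit per lane (bit k enables byte k); the slides do not fix these details, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_BYTES 4  /* assumed 32-bit memory bus */

/* Build a mask with one bit per byte lane: lanes [first, first + count) set. */
static uint8_t byte_enable_mask(unsigned first, unsigned count) {
    assert(first + count <= BUS_BYTES);
    return (uint8_t)(((1u << count) - 1u) << first);
}

/* Merge `data` into `old`: enabled bytes are replaced, the others preserved. */
static uint32_t masked_write(uint32_t old, uint32_t data, uint8_t enable) {
    uint32_t result = old;
    for (unsigned lane = 0; lane < BUS_BYTES; lane++) {
        if (enable & (1u << lane)) {
            uint32_t lane_bits = 0xFFu << (8 * lane);
            result = (result & ~lane_bits) | (data & lane_bits);
        }
    }
    return result;
}
```

For example, enabling lanes 1 and 2 replaces only the two middle bytes of the stored word, so a burst that starts or ends inside a bus word leaves neighbouring values untouched.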
Parametrized Template
Parameters
● Memory Bus Width = M
● Word Width = W
● Max Words = A = M / W
● Input FIFOs = X = C_x * A
● Iterations = Output FIFOs = N = C_N * X
● Burst Length
● Input FIFO Length
● Iteration Length
● Output FIFO Length
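The derived parameters follow directly from the formulas above. The C sketch below computes them. It assumes that M is a whole multiple of W and that C_x and C_N are positive integer multipliers; the struct and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned bus_width;   /* M, in bits */
    unsigned word_width;  /* W, in bits */
    unsigned c_x;         /* C_x, input FIFO multiplier (assumed integer) */
    unsigned c_n;         /* C_N, iteration multiplier (assumed integer) */
} TemplateParams;

typedef struct {
    unsigned max_words;    /* A = M / W */
    unsigned input_fifos;  /* X = C_x * A */
    unsigned iterations;   /* N = C_N * X, also the number of output FIFOs */
} DerivedParams;

static DerivedParams derive(TemplateParams p) {
    assert(p.word_width > 0 && p.bus_width % p.word_width == 0);
    assert(p.c_x > 0 && p.c_n > 0);
    DerivedParams d;
    d.max_words = p.bus_width / p.word_width;
    d.input_fifos = p.c_x * d.max_words;
    d.iterations = p.c_n * d.input_fifos;
    return d;
}
```

With M = 64, W = 16, C_x = 1 and C_N = 2, this gives A = 4 words per bus transfer, X = 4 input FIFOs and N = 8 parallel iterations.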
Example – Reading Values
[Figure legend: Values in Memory, Values to be read, Byte enabled, Byte disabled, Values processed]
Example – Processing Values
Example – Writing Values
Future Work
● More templates for other parallel for loops
  ● Pipelined loop body
  ● Data reuse
● Compiler identifies parallel for loop
  ● No keywords
  ● Check for loop dependencies, and do loop transformations if required
● Compiler suggests best memory template
  ● Chosen based on performance estimate
  ● Design space exploration using templates
Conclusions
● HLS Tools don't create memory designs
● Manual memory designs can take days/weeks/months to complete
● Parametrized memory template designs are generated in seconds
● Easy to perform design space exploration using different parameter values and/or templates
Thank You!
Questions?
[email protected]
www.elis.ugent.be/~cmoore
Wim Meeus*, Harald Devos‡, and Dirk Stroobandt*
*{wim.meeus, dirk.stroobandt}@elis.ugent.be, ‡[email protected]