A Graph Based Algorithm for Data Path Optimization in Custom Processors
J. Trajkovic, M. Reshadi, B. Gorjiara, D. Gajski
Center for Embedded Computer Systems
University of California, Irvine
Copyright 2006, CECS
Outline
• Introduction
• Design Methodology
• Initial Allocation
• Architecture Wizard
  • Critical Path Extraction
  • Spill Algorithm
• Results
• Conclusion
Introduction (1 of 2)
• Complexity of SoC rising
• Short time to market
• Need for processors specialized for different application domains
• General purpose processors
  • Often slow and power hungry
• Full HW design
  • Expensive and rigid for debugging and feature extension
• Custom processor
  • Adapt the data path to a given application
Need for automatic generation of application specific architectures
Introduction (2 of 2)
• Previous work in High Level Synthesis
  • Integer linear programming [Landwehr et al.]
  • Force driven scheduling [Paulin and Knight]
  • Finding minimal cliques [Tseng and Siewiorek]
  • Branch-and-bound [Marwedel]
• Proposed methodology separates the allocation from scheduling and binding
[Figure: flow diagram with blocks Source code (C), Compilation, Schedule, Allocation, Datapath]
Design Methodology
• Define application’s maximum requirements
  • ALAP schedule
• Initial Allocation chooses from Component DB (CDB)
  • Select as many units as needed for ALAP
• Architecture Wizard (AW) analyzes component utilization
  • Based on the schedule and profiling data
• Optimized Architecture
  • Using the design constraints
[Figure: design flow. Blocks: Source Code (C) or CW Generation, Profiler, Component DB, Constraints, Initial Allocation (Phase I) with ALAP, Max Configuration (XML), Architecture Wizard (Phase II), Optimized Architecture (XML), Report (HTML). Sample data path: Mem, RF, MUL, ALU, AG, CMem, PC, buses B1 to B3, CW bits, Status.]
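A minimal sketch of how an ALAP schedule can expose the maximum requirements. It assumes unit-latency operations and a small hypothetical data-flow graph; the names `alap_schedule`, `max_requirements`, and the operation labels are illustrative and do not come from the tool described in the slides.

```python
from collections import Counter, defaultdict

def alap_schedule(succs):
    """Map each op to its latest possible cycle (unit latency assumed).

    succs: {op: [ops that consume its result]}; every op must be a key.
    """
    depth = {}  # depth[op] = number of ops on the longest chain starting at op

    def chain(op):
        if op not in depth:
            depth[op] = 1 + max((chain(s) for s in succs[op]), default=0)
        return depth[op]

    total = max(chain(op) for op in succs)  # schedule length in cycles
    return {op: total - depth[op] for op in succs}

def max_requirements(schedule, op_type):
    """Largest number of same-type ops that share one cycle."""
    per_cycle = defaultdict(Counter)
    for op, cycle in schedule.items():
        per_cycle[cycle][op_type[op]] += 1
    need = Counter()
    for counts in per_cycle.values():
        for kind, n in counts.items():
            need[kind] = max(need[kind], n)
    return dict(need)

# Hypothetical data-flow graph: three multiplies feeding a chain of adds.
succs = {"m1": ["a1"], "m2": ["a1"], "a1": ["a2"],
         "a2": ["a3"], "m3": ["a3"], "a3": []}
op_type = {op: ("mul" if op.startswith("m") else "add") for op in succs}

schedule = alap_schedule(succs)
need = max_requirements(schedule, op_type)
```

In this example `m3` is pushed to cycle 2, just before its consumer, so two multipliers and one adder are enough to execute the ALAP schedule.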
Initial Allocation and Component Selection
• Define max requirement
  • Based on the statistics for operators and data transfer
• Finding “the best fit” in CDB for given requirements
  • Storage (RF and Memory)
    – Min difference in number of ports
  • Functional units:
    – The most general unit executing given operation
  • Buses:
    • Source buses:
      – N, if N is even
      – (N+1), if N is odd
      – Where N = # RF output ports
    • Destination buses = # RF input ports
[Figure: data path template with RF, FUs, MUXes, Source Buses, Destination Buses, and Memory Interface (To Memory). Inset of the design flow: Component DB, Initial Allocation (Phase I) with ALAP, Max Configuration (XML).]
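The bus rules and the storage "best fit" can be sketched as below. The bus counts follow the slide directly. The storage selection is an assumption: it reads "min difference in number of ports" as the smallest absolute difference in input plus output ports, and the CDB entries are hypothetical.

```python
def source_bus_count(rf_out_ports):
    """N if N is even, N + 1 if N is odd (N = number of RF output ports)."""
    return rf_out_ports if rf_out_ports % 2 == 0 else rf_out_ports + 1

def destination_bus_count(rf_in_ports):
    """One destination bus per RF input port."""
    return rf_in_ports

def best_fit_storage(candidates, needed_in, needed_out):
    """Pick the entry whose port counts differ least from the requirement.

    Assumption: difference = |in - needed_in| + |out - needed_out|.
    """
    return min(
        candidates,
        key=lambda c: abs(c["in"] - needed_in) + abs(c["out"] - needed_out),
    )

# Hypothetical component database entries for register files.
cdb_register_files = [
    {"name": "RF_1w2r", "in": 1, "out": 2},
    {"name": "RF_2w3r", "in": 2, "out": 3},
    {"name": "RF_2w4r", "in": 2, "out": 4},
]

rf = best_fit_storage(cdb_register_files, needed_in=2, needed_out=3)
buses = (source_bus_count(rf["out"]), destination_bus_count(rf["in"]))
```

With three RF output ports the odd count is rounded up, giving four source buses and two destination buses.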
Architecture Wizard - Overview
• Goal of Phase II
  • Reducing number of used resources
  • Under performance and utilization constraints
• Inputs:
  • Schedule for the Max Configuration
  • Execution frequencies (Profiler)
  • Utilization and performance constraints (Designer)
  • Component Database (CDB)
• Outputs:
  • Architecture Net-List
  • Report
[Figure: design flow around the Architecture Wizard (Phase II): Source Code (C) or CW Generation, Profiler, Constraints, Component DB, Max Configuration (XML), Optimized Architecture (XML), Report (HTML).]
Architecture Wizard: Tool Flow
[Figure: tool flow with blocks Histogram Creation, Check Constraints (OR), Critical Path Extraction, Flatten Histogram for CP, Estimate Overhead and Utilization, Net List Creation, Output Generation, Allocation.]
• Histograms for
  • A functional unit type
  • Group of in/out ports of a storage unit
• For the basic blocks (BB) in the critical path, for each histogram
  • Vary number of units
  • Estimate execution and utilization
• Allocate data path
  • When constraints satisfied
  • Use the same heuristics as for the initial allocation
Critical Path Extraction
• Critical Path:
  • A sequence of BBs from start to end that contributes the most to the execution time
1. Start with the graph of the application
2. Create a directed acyclic graph
3. Create the dual graph: for each edge ex, create a node Ex; for each node By, create (inputs × outputs) number of edges
4. Transform to the shortest path problem
  • Compute weights as 1/wi or Wmax - wi
5. Find the shortest path
[Figure, panel 1: application graph of basic blocks B1 l:20 f:65, B2 l:8 f:11, B3 l:10 f:54, B5 l:4 f:11, B6 l:200 f:4, B7 l:2 f:50, B8 l:3 f:50, with edges e1, e2, e4, e5, e6, e8, e9, e11.]
[Figure, panel 2: the same blocks as a directed acyclic graph, with e8' and e11' in place of e8 and e11.]
[Figure, panel 3: dual graph with nodes Estart, E1, E2, E4, E5, E6, E8', E11', Eend and edges labeled with basic blocks.]
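A minimal sketch of steps 3 to 5, under stated assumptions: the input is already a DAG, a block's weight is w = l × f (reading l as block length and f as execution frequency), and the Wmax - w transform is used. The block values borrow numbers from the figure, but the edges are hypothetical, and `critical_path` with its virtual start and end edges is an illustrative construction rather than the tool's code.

```python
import heapq

def critical_path(blocks, edges, entry, exit_):
    """blocks: {name: (l, f)}; edges: list of (src, dst) forming a DAG."""
    weight = {b: l * f for b, (l, f) in blocks.items()}
    w_max = max(weight.values())

    # Virtual edges stand in for the Estart / Eend nodes of the dual graph.
    start, end = ("<start>", entry), (exit_, "<end>")
    all_edges = [start] + list(edges) + [end]

    out_of = {}
    for e in all_edges:
        out_of.setdefault(e[0], []).append(e)

    # Dual graph: every edge becomes a node; a block B connects each of its
    # in-edges to each of its out-edges (inputs x outputs dual edges), and
    # each dual edge costs Wmax - w(B).
    dual = {e: [(o, e[1], w_max - weight[e[1]]) for o in out_of.get(e[1], [])]
            for e in all_edges if e != end}
    dual[end] = []

    # Dijkstra over the dual graph (all costs are non-negative).
    dist, prev, heap = {start: 0}, {}, [(0, start)]
    while heap:
        d, e = heapq.heappop(heap)
        if d > dist.get(e, float("inf")):
            continue
        for nxt, block, cost in dual[e]:
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                prev[nxt] = (e, block)
                heapq.heappush(heap, (d + cost, nxt))

    # Walk back from the end, collecting the block on each dual edge.
    path, e = [], end
    while e != start:
        e, block = prev[e]
        path.append(block)
    return path[::-1]

# Block values from the figure; the topology below is hypothetical.
blocks = {"B1": (20, 65), "B2": (8, 11), "B3": (10, 54), "B8": (3, 50)}
edges = [("B1", "B2"), ("B1", "B3"), ("B2", "B8"), ("B3", "B8")]
path = critical_path(blocks, edges, "B1", "B8")
```

Here both candidate paths contain the same number of blocks, so the shortest path under Wmax - w is also the heaviest one. When paths differ in block count, the transform biases the result toward shorter paths, so it behaves as a heuristic.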
“Spill” - Flattening Algorithm
• Utilization profile for each
  • FU type and in/out port of storage unit
  • Type and number of instances of other components is unchanged
• For chosen number of FUs
  • Estimate extra cycles (Δ) by postponing operations into empty slots
  • Maximize component utilization
    – Utilization = Σ Used FUs / (chosen # × Exec. Time)
• Compute global Δ and utilization
  • Per block estimation
  • Execution frequencies
[Figure: two histograms of the number of instances of FUi type (1 to 5) over time, with the chosen # of units marked. Legend: FU in use in current cycle; estimated use of FU; available FU not in use.]
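The flattening estimate can be sketched as follows. This is a simplified model, not the tool's implementation: it carries overflowing operations into later cycles without checking data dependencies, and the names `flatten` and `global_overhead` and the sample histograms are hypothetical.

```python
def flatten(histogram, chosen):
    """Estimate the effect of keeping only `chosen` units of one FU type.

    histogram[t] = units of this type busy in cycle t of the original schedule.
    Returns (extra_cycles, utilization).
    """
    carry = 0  # operations postponed so far and still waiting for a slot
    for used in histogram:
        demand = used + carry
        carry = max(0, demand - chosen)  # what does not fit spills onward
    extra = -(-carry // chosen)  # ceiling division: cycles to drain the rest
    exec_time = len(histogram) + extra
    # Utilization = sum of used FUs / (chosen # * execution time)
    utilization = sum(histogram) / (chosen * exec_time)
    return extra, utilization

def global_overhead(blocks, chosen):
    """Relative extra time over all blocks, weighted by execution frequency.

    blocks: list of (histogram, frequency) pairs, one per basic block.
    """
    base = sum(len(h) * f for h, f in blocks)
    extra = sum(flatten(h, chosen)[0] * f for h, f in blocks)
    return extra / base

profile = [3, 1, 0, 2, 3]          # hypothetical per-cycle usage
extra, util = flatten(profile, chosen=2)
delta = global_overhead([(profile, 10), ([1, 1], 100)], chosen=2)
```

For this profile, two units add one extra cycle to a five-cycle block (Δ = 20%) at 75% utilization, while three units would add none but lower utilization to 60%.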
Results
• Application: bdist2 (MPEG2 encoder), OnesCounter, Sort (bubble sort), dct32 (MP3)
• Δ= 20%, Utilization = 75%
Bench         FUs       Buses     Tri-State   Δ [%]   Avg. Iter.   T [s]
              MC   R    MC   R    MC    R
bdist2         6   4     6   5    40   19     32.1    2.8          0.05
OnesCounter    3   2     6   5    34   17     11.9    1.4          0.05
Sort           4   3     6   5    36   18      0.6    2.9          0.06
dct32          6   4     6   5    40   19      1.4    1.4          0.48
Conclusion
• Automatic generation of data path
  • Separate allocation from scheduling and binding
  • Initial Allocation – creates dense architecture
  • Architecture Wizard – refines architecture for given constraints
• Future work and issues
  • Reduce area
    – Reduce complexity of FU
    – Further reduce interconnect
  • Features
    – Pipelining, chaining, forwarding, special function units
Thank You!