A Graph Based Algorithm for Data Path Optimization in Custom Processors

J. Trajkovic, M. Reshadi, B. Gorjiara, D. Gajski

Center for Embedded Computer Systems

University of California, Irvine

Copyright 2006, CECS

Outline

• Introduction
• Design Methodology
• Initial Allocation
• Architecture Wizard
  • Critical Path Extraction
  • Spill Algorithm
• Results
• Conclusion


Introduction (1 of 2)

• Complexity of SoC rising
• Short time to market
• Need for processors specialized for different application domains
• General purpose processors
  • Often slow and power hungry
• Full HW design
  • Expensive and rigid for debugging and feature extension
• Custom processor
  • Adapt the data path to a given application

Need for automatic generation of application specific architectures


Introduction (2 of 2)

• Previous work in High Level Synthesis
  • Integer linear programming [Landwehr et al.]
  • Force driven scheduling [Paulin and Knight]
  • Finding minimal cliques [Tseng and Siewiorek]
  • Branch-and-bound [Marwedel]
• Proposed methodology separates the allocation from scheduling and binding

[Figure: design flow. Labels: Source code (C), Compilation, Schedule, Allocation, Datapath]


Design Methodology

• Define application's maximum requirements
  • ALAP schedule
• Initial Allocation chooses from Component DB (CDB)
  • Select as many units as needed for ALAP
• Architecture Wizard (AW) analyzes component utilization
  • Based on the schedule and profiling data
• Optimized Architecture
  • Using the design constraints
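
The first two bullets can be illustrated with a minimal sketch. It assumes unit-latency operations and a data flow graph given as plain dictionaries; the names `alap`, `max_requirements`, `ops`, `succs` and `latest` are hypothetical and are not taken from the tool itself. Each operation is placed as late as its successors allow, and the peak number of same-type operations in any step is read off as the maximum requirement.

```python
from collections import Counter

def alap(ops, succs, latest):
    """ALAP step for every operation (unit latency assumed).

    ops:    dict op name -> operation type (e.g. "mul", "add")
    succs:  dict op name -> list of dependent ops (must form a DAG)
    latest: last available step, e.g. the length of the longest chain
    """
    step = {}

    def visit(op):
        if op not in step:
            # An op must finish one step before its earliest-placed successor;
            # ops without successors go into the last step.
            step[op] = min((visit(s) - 1 for s in succs.get(op, [])),
                           default=latest)
        return step[op]

    for op in ops:
        visit(op)
    return step

def max_requirements(ops, step):
    """Peak number of ops of each type scheduled in the same step."""
    per_step = Counter((step[op], ops[op]) for op in ops)
    need = {}
    for (_, kind), n in per_step.items():
        need[kind] = max(need.get(kind, 0), n)
    return need

ops = {"a": "mul", "b": "mul", "c": "add", "d": "add"}
succs = {"a": ["c"], "b": ["c"], "c": ["d"]}
step = alap(ops, succs, latest=3)
need = max_requirements(ops, step)
```

In this toy graph both multiplications land in the same step, so the sketch reports two multipliers and one adder as the upper bound that the initial allocation would then try to match in the CDB.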

[Figure: design flow and data path template. Flow labels: Source Code (C), Profiler, Component DB, Constraints, Initial Allocation (Phase I), ALAP, Max Configuration (XML), Architecture Wizard (Phase II), CW Generation, Optimized Architecture (XML), Report (HTML). Data path labels: PC, CMem, CW bits, Status, RF, ALU, MUL, AG, Mem, buses B1, B2, B3]


Initial Allocation and Component Selection

• Define max requirement
  • Based on the statistics for operators and data transfer
• Finding "the best fit" in CDB for given requirements
  • Storage (RF and Memory)
    – Min difference in number of ports
  • Functional units:
    – The most general unit executing given operation
  • Buses:
    – Source buses: N if N is even, (N+1) if N is odd, where N = # RF output ports
    – Destination buses = # RF input ports
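
The bus and storage rules above are simple enough to write down directly. The sketch below is only an illustration: the CDB is modeled as a list of dictionaries with a hypothetical `ports` field, and the function names are not part of the original tool.

```python
def source_buses(rf_out_ports):
    """N if N is even, N + 1 if N is odd (N = number of RF output ports)."""
    return rf_out_ports + (rf_out_ports % 2)

def destination_buses(rf_in_ports):
    """One destination bus per RF input port."""
    return rf_in_ports

def best_fit_storage(cdb, ports_needed):
    """Pick the storage component whose port count differs least from the need.

    Assumption: each CDB entry is a dict with a "ports" key; ties go to the
    first entry in the list.
    """
    return min(cdb, key=lambda c: abs(c["ports"] - ports_needed))

cdb = [{"name": "rf2", "ports": 2},
       {"name": "rf4", "ports": 4},
       {"name": "rf8", "ports": 8}]
chosen_rf = best_fit_storage(cdb, ports_needed=5)
```

Whether the real selection also requires the component to have at least the requested number of ports is not stated on the slide; the sketch uses the plain absolute difference.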

[Figure: data path template produced by Initial Allocation (Phase I) from the ALAP schedule and the Component DB. Labels: RF, FU, MUX, Source Buses, Destination Buses, Memory Interface, To Memory]


Architecture Wizard - Overview

• Goal of Phase II
  • Reducing number of used resources
  • Under performance and utilization constraints
• Inputs:
  • Schedule for the Max Configuration
  • Execution frequencies (Profiler)
  • Utilization and performance constraints (Designer)
  • Component Data Base (CDB)
• Outputs:
  • Architecture Net-List
  • Report

[Figure: Phase II portion of the design flow. Labels: Source Code (C), Profiler, Component DB, Constraints, Architecture Wizard (Phase II), CW Generation, Optimized Architecture (XML), Report (HTML)]


Architecture Wizard: Tool Flow

[Figure: tool flow chart. Stages: Histogram Creation, Critical Path Extraction, Flatten Histogram for CP, Estimate Overhead and Utilization, Check Constraints, Allocation, Net List Creation, Output Generation]

• Histograms for
  • A functional unit type
  • Group of in/out ports of a storage unit
• For the basic blocks (BB) in the critical path, for each histogram
  • Vary number of units
  • Estimate execution and utilization
• Allocate data path
  • When constraints satisfied
  • Use the same heuristics as for the initial allocation


Critical Path Extraction

• Critical Path:
  • A sequence of BBs from start to end that contributes the most to the execution time

1. Start with the graph of the application
2. Create directed acyclic graph
3. Create dual graph: for each edge ex, create a node Ex; for each node By, create (inputs × outputs) number of edges
4. Transform to the shortest path problem
   • Compute weights as 1/wi or Wmax - wi
5. Find the shortest path
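
Steps 3 to 5 can be sketched as follows, starting from a graph that is already acyclic. This is a minimal illustration, not the tool's implementation: it assumes each basic block carries a single weight (for example its length times its execution frequency), it uses the Wmax - wi variant of the weight transformation, and it uses Dijkstra's algorithm for the shortest path. The names `critical_path`, `blocks` and `edges` are hypothetical.

```python
import heapq
from itertools import count

def critical_path(blocks, edges):
    """blocks: dict block name -> weight; edges: list of (src, dst) in a DAG."""
    # Step 3: every edge e_x becomes a dual node E_x; blocks without
    # predecessors hang off "Estart", blocks without successors lead to "Eend".
    ins = {b: [] for b in blocks}
    outs = {b: [] for b in blocks}
    for i, (src, dst) in enumerate(edges, start=1):
        outs[src].append(f"E{i}")
        ins[dst].append(f"E{i}")
    dual = {}  # dual node -> list of (next dual node, block on that dual edge)
    for b in blocks:
        for s in ins[b] or ["Estart"]:
            for d in outs[b] or ["Eend"]:  # (inputs x outputs) dual edges
                dual.setdefault(s, []).append((d, b))

    # Step 4: heavy blocks become cheap edges, so costs are non-negative.
    w_max = max(blocks.values())

    # Step 5: Dijkstra from Estart to Eend; the counter breaks cost ties.
    tie = count()
    heap = [(0, next(tie), "Estart", [])]
    done = set()
    while heap:
        cost, _, node, path = heapq.heappop(heap)
        if node == "Eend":
            return path
        if node in done:
            continue
        done.add(node)
        for nxt, b in dual.get(node, []):
            heapq.heappush(
                heap, (cost + w_max - blocks[b], next(tie), nxt, path + [b]))
    return []

blocks = {"A": 10, "B": 50, "C": 5, "D": 20}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
path = critical_path(blocks, edges)
```

For this diamond-shaped example the path through the heavier block B is returned. Moving the block weights onto the dual edges is what lets a standard edge-weighted shortest path routine be applied.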

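[Figure: three panels matching steps 1 to 3. Panel 1: graph of the application with basic blocks annotated with l and f values: B1 (l:20, f:65), B2 (l:8, f:11), B3 (l:10, f:54), B5 (l:4, f:11), B6 (l:200, f:4), B7 (l:2, f:50), B8 (l:3, f:50), and edges e1, e2, e4, e5, e6, e8, e9, e11. Panel 2: the same graph as a directed acyclic graph, with e8 and e11 replaced by e8' and e11'. Panel 3: the dual graph with nodes Estart, E1, E2, E4, E5, E6, E8', E11', Eend and edges labeled with basic blocks]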


“Spill” - Flattening Algorithm

• Utilization profile for each
  • FU type and in/out port of storage unit
  • Type and number of instances of other components is unchanged
• For chosen number of FUs
  • Estimate extra cycles (Δ) by postponing operations into empty slots
  • Maximize component utilization
    – Utilization = ΣUsed FUs / (chosen # × Exec. Time)
• Compute global Δ and utilization
  • Per block estimation
  • Execution frequencies
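
A minimal sketch of the flattening estimate, under simplifying assumptions that are not stated on the slide: operations that exceed the chosen number of units in a cycle are simply carried into the next cycle, data dependencies are ignored, and the global figures are frequency-weighted sums over the blocks. The names `flatten` and `global_estimate` are hypothetical.

```python
def flatten(histogram, chosen):
    """Estimate extra cycles and utilization for one basic block.

    histogram: number of FUs of one type used in each cycle of the schedule
    chosen:    number of units of that type kept in the data path
    """
    carry = 0
    for used in histogram:
        # Operations that do not fit spill into the next cycle's free slots.
        carry = max(0, used + carry - chosen)
    extra = -(-carry // chosen)  # cycles appended for the leftover (ceiling)
    cycles = len(histogram) + extra
    utilization = sum(histogram) / (chosen * cycles)
    return extra, utilization

def global_estimate(blocks, chosen):
    """blocks: list of (histogram, execution frequency) pairs."""
    base = extra = used = 0
    for histogram, freq in blocks:
        d, _ = flatten(histogram, chosen)
        base += freq * len(histogram)
        extra += freq * d
        used += freq * sum(histogram)
    delta = extra / base                              # relative overhead
    utilization = used / (chosen * (base + extra))
    return delta, utilization

extra, util = flatten([3, 1, 0, 4, 2], chosen=2)
delta, g_util = global_estimate([([3, 1, 0, 4, 2], 10), ([1, 1], 5)], chosen=2)
```

In the single-block example the five-cycle histogram needs one extra cycle with two units, and ten operations spread over twelve unit-slots give a utilization of about 83%.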

[Figure: two utilization histograms for FU type i (number of instances versus time), with the chosen number of units marked. Legend: FU in use in current cycle; estimated use of FU; available FU not in use]


Results

• Applications: bdist2 (MPEG2 encoder), OnesCounter, Sort (bubble sort), dct32 (MP3)

• Δ = 20%, Utilization = 75%

Bench         FUs       Buses     Tri-State   Δ [%]   Avg. Iter.   T [s]
              MC   R    MC   R    MC   R
bdist2        6    4    6    5    40   19     32.1    2.8          0.05
OnesCounter   3    2    6    5    34   17     11.9    1.4          0.05
Sort          4    3    6    5    36   18     0.6     2.9          0.06
dct32         6    4    6    5    40   19     1.4     1.4          0.48


Conclusion

• Automatic generation of data path
  • Separate allocation from scheduling and binding
  • Initial Allocation – creates dense architecture
  • Architecture Wizard – refines architecture for given constraints
• Future work and issues
  • Reduce area
    – Reduce complexity of FU
    – Further reduce interconnect
  • Features
    – Pipelining, chaining, forwarding, special function units


Thank You!
