Lower Power Embedded Architecture Design
SungKyunKwan Univ., VADA Lab.
Prof. Jun-Dong Cho, SungKyunKwan University, August 1999
http://vada.skku.ac.kr
Contents
• Embedded Systems
• Design and Optimization of ASIP (Application Specific Instruction Processor)
• Hardware and Software Codesign
• Reconfigurable Processors
  – Ultra-Low-Power Domain-Specific Multimedia Processors
  – Reconfiguration for Power Saving in Real-Time Motion Estimation
  – Kernel Scheduling in Reconfigurable Computing
Low Power MPU
Levels for Low Power Design
• System: hardware-software partitioning, power down
• Algorithm: complexity, concurrency, locality, regularity, data representation
• Architecture: parallelism, pipelining, signal correlations, instruction set selection, data representation
• Circuit/Logic: sizing, logic style, logic design
• Technology: threshold reduction, scaling, advanced packaging, SOI

Possible Power Savings at Different Design Levels

  Level of Abstraction   Expected Saving
  Algorithm              10 - 100 times
  Architecture           10 - 90%
  Logic Level            20 - 40%
  Layout Level           10 - 30%
  Device Level           10 - 30%
Present- Day Digital Systems
• Current systems are complex and heterogeneous; they contain many different types of components:
  – Programmable and re-configurable processors
  – Application-specific integrated circuits (ASICs)
  – Application-specific instruction processors (ASIPs)
  – Read-Only Memory (ROM) and RAM
  – I/O devices and circuitry
• Typically designed from a (large) software specification
• These heterogeneous systems are called embedded systems
Embedded System Characteristics
• Limited user programmability
  – Completely transparent to the user, e.g., automotive engine control
  – Limited user interface, e.g., intelligent telephones
  – Programmable through an application-specific language, e.g., PostScript printer
• Real-time response; no batch processing
Embedded Systems: Products - 1
Computer Related: personal digital assistant, printer, disc drive, multimedia subsystem, graphics subsystem, graphics terminal
Communications: cellular phone, video phone, fax, modems, PBX
Consumer Electronics: HDTV, CD player, video game, video tape recorder, programmable TV, camera, music system
Embedded Systems: Products - 2
• Control Systems
  – Automotive: engine, ignition, brake system
  – Manufacturing process control: robotics
  – Remote control: satellite control, spacecraft control
  – Other mechanical control: elevator control
• Office Equipment
  – smart copier, printer, smart typewriter, calculator
  – point-of-sale equipment, credit-card validator, UPC code reader, cash register
• Medical Applications: instruments (EKG, EEG), scanning, imaging
Problem Domain Shift
Embedded System Trends - I
• Microcomponents grow in importance in the IC industry due to their reusability: DSPs, µPs, µCs
• More embedded systems will require ASICs
  – From 20-70% in 1992 to 60-70% in 1996

Moral of the story: µPs are joining with high-speed, highly complex ASICs in embedded systems
Embedded System Trends - II
• Embedded systems will require more application software
  – Average moves from 16-64k lines in 1992 to 64k-512k lines in 1996
  – Requires migration from assembler to C/C++, implying a requirement for automatic compilation
  – From 40-70% of programmers versus ASIC designers in 1992 to 60-90% in 1996

Moral of the story: the increase in code size and code complexity is causing a migration from assembly coding to C/C++
Embedded Software Optimization
• Code size becomes an important objective
  – Software will eventually become a part of the chip: need to generate the best possible code; can afford longer compilation time
• Need not only traditional optimization techniques, but also new application-domain-specific optimizations (e.g., for DSP and microcontroller architectures)
Implementing Digital Systems
What is an ASIP?
• Application-Specific Instruction Processor
• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  – Greater design cost (processor + compiler)
  + Higher performance and lower power than commercial cores; more flexibility than an ASIC
ASIP Design
• Given a set of applications, determine the architecture of the ASIP (i.e., the configuration of functional units in the datapaths and the instruction set)
• To accurately evaluate the performance of the processor on a given application, one must compile the application program onto the processor datapath and simulate the object code
• However, the architecture of the processor is a design parameter!
ASIP Design Flow
Required Compiler Optimizations
• Machine-independent optimizations
  – Parallelizing transformations (lots of them!)
  – Common subexpression elimination, strength reduction, code motion
• Machine-dependent optimizations
  – Loop unrolling and software pipelining
  – Static allocation (non-recursive procedure calls)
  – Storage layout (arrays, scalars)
  – Optimization of mode-setting instructions
  – Instruction selection, scheduling, and register allocation
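The machine-independent transformations listed above can be illustrated with a small hand-optimized sketch (illustrative Python; the function names and inputs are invented for the example):

```python
# Hypothetical illustration of three machine-independent optimizations.
# Both functions compute the same result; the second applies common
# subexpression elimination, loop-invariant code motion, and strength
# reduction by hand.

def before(a, n, k):
    out = []
    for i in range(n):
        # k * 4 is loop-invariant, a[i] * (k * 4) is computed twice,
        # and i * 2 is a multiplication that can become repeated addition
        out.append(a[i] * (k * 4) + a[i] * (k * 4) + i * 2)
    return out

def after(a, n, k):
    out = []
    k4 = k * 4            # loop-invariant code motion
    two_i = 0             # strength reduction: i * 2 -> running sum
    for i in range(n):
        t = a[i] * k4     # common subexpression computed once
        out.append(t + t + two_i)
        two_i += 2
    return out
```

The two versions are equivalent by construction; the second simply does less work per iteration, which matters for both speed and energy.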
Parallelizing Transformation
Split- Node DAG
Split-Node DAG - 2
• The split-node DAG represents:
  – All the legal assignments of basic-block nodes to functional units
  – The data transfers implied by pairs of assignments of basic-block nodes connected by an edge
• The goal is to find parallelism in the basic block that can be exploited by the target architecture
  – Implies grouping operator nodes and data-transfer nodes into VLIW instructions
  – Constraints may disallow certain groupings
Split- Node DAG - 3
Parallelism Matrix
Common Subexpression Elimination
Constant Propagation and Folding
Dead Code Elimination
Loop Invariant Code Motion
Array Access Strength Reduction
Features of DSP Architectures
• DSPs have irregular data-paths
• Instruction-set architecture tailored for DSP applications
  – Limited addressing capability
  – Autoincrement/decrement
  – Bit-reversed addressing
  – Zero-overhead loops
• Some degree of parallelism, e.g., the Motorola 56K's parallel moves
Example: TMS320C25 DSP
Storage Assignment
Total 28 instructions
Alternative Assignment
Total 24 instructions
Simple Offset Assignment
• Assumptions in Simple Offset Assignment (SOA):
  – Variables reside in memory and are accessed via a single address register
  – One-to-one mapping of variables to memory locations
  – A schedule for the basic block is given
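A minimal sketch of the classic greedy SOA heuristic (maximum-weighted path covering over the access graph), consistent with the formulation above; the access sequence and variable names are illustrative:

```python
from collections import Counter

def soa_greedy(access_seq):
    # Access graph: edge weight = how often two variables are accessed
    # consecutively (those adjacencies can use free autoincrement/decrement
    # if the variables end up next to each other in memory)
    w = Counter()
    for a, b in zip(access_seq, access_seq[1:]):
        if a != b:
            w[frozenset((a, b))] += 1

    parent = {v: v for v in set(access_seq)}
    def find(v):                      # union-find, used to reject cycles
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    deg = Counter()
    chosen = []
    # Greedy maximum-weighted path cover: take the heaviest edges first,
    # skipping any edge that would give a node degree > 2 or close a cycle
    for e, wt in sorted(w.items(), key=lambda kv: -kv[1]):
        a, b = tuple(e)
        if deg[a] < 2 and deg[b] < 2 and find(a) != find(b):
            chosen.append(e)
            deg[a] += 1
            deg[b] += 1
            parent[find(a)] = find(b)

    # Every adjacency not covered by the chosen paths costs an explicit
    # address-register load instead of a free autoincrement/decrement
    extra_loads = sum(wt for e, wt in w.items() if e not in chosen)
    return chosen, extra_loads

layout_edges, extra_loads = soa_greedy(['a', 'b', 'a', 'c', 'a', 'd', 'b', 'd'])
print(extra_loads)   # 1 extra address load, with layout c-a-b-d
```

The selected edges form disjoint paths; concatenating the paths gives the memory layout of the variables.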
Access Sequence
Access Graph
Assignment and Access Graph
Maximum Weighted Path Covering
Optimal Disjoint Path Cover
Code Generation and Optimization
• Focus on automatic retargetability and parameterizable optimization methods
  – Instruction selection for configurable functional units
  – Scheduling for multiple functional units
  – Register-bank allocation to minimize data transfers
  – Detailed register allocation with varying load/spill costs
  – Optimization to exploit address-generator features
• Goal: generate high-quality code for any target architecture description

WHEN THAT HAPPENS, CAD IS GOING TO ESSENTIALLY BECOME SOFTWARE COMPILATION!!
The 100 Million Transistor Question
HOW BEST CAN WE USE THEM TO SOLVE OUR COMPUTING PROBLEMS?
(1) Multiprocessor  (2) FPGA  (3) HW & SW codesign  (4) Reconfigurable processor
Answer I: Multiprocessor on a chip
Requirement: an efficient, parallelizing compiler
Problems: Is there enough parallelism in programs? It does not go fast enough for video applications, for instance.
Answer II: Giant FPGA
Requirement: a CAD system for FPGAs
Problems: may work well for bit-level video computations, but in general FPGAs are inefficient.
Answer III: HW/SW Codesign
Mixing Hardware and Software
• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds
  – Problems of verification and test become harder
  – Too many tools, too many interactions, too much heterogeneity
  – Hardware/software partitioning is "AI-complete"!
Hardware/Software Co-design — I. K. Hwang, San Kim, J. D. Cho
• What is co-design?
  – Proposed for the systematic and efficient design of systems that combine hardware and software.
  – Hardware implementation: higher cost, faster execution.
  – Software implementation: lower cost, slower execution.
• Considerations in co-design:
  – Hardware/software partitioning
  – Hardware/software interface
  – Co-simulation
Partitioning
• Performance requirements
  – Some functions are better implemented in hardware
  – Blocks that are used repeatedly
  – Blocks structured for parallel execution
• Modifiability
  – Blocks implemented in software are easy to modify
Continued
• Implementation cost
  – Blocks implemented in hardware can be shared
• Scheduling
  – Schedule the blocks partitioned into HW and SW so that the given constraints are met
  – SW operations must be scheduled sequentially
  – If there are no data or control dependencies, SW and HW blocks can be scheduled concurrently
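The partitioning considerations above can be sketched as a simple greedy heuristic; the cost model (software time, hardware time, area) and all the numbers are illustrative assumptions, not a method from the slides:

```python
# A hypothetical greedy HW/SW partitioner: while the latency constraint
# is violated, move the block with the best time-saved-per-area ratio
# into hardware, within an area budget.

def partition(blocks, deadline, area_budget):
    hw = set()
    latency = sum(b["sw_time"] for b in blocks.values())  # all-SW start
    area = 0
    while latency > deadline:
        cand = [(name, b) for name, b in blocks.items()
                if name not in hw
                and area + b["area"] <= area_budget
                and b["sw_time"] > b["hw_time"]]
        if not cand:
            break                     # constraint cannot be met
        name, b = max(cand, key=lambda nb:
                      (nb[1]["sw_time"] - nb[1]["hw_time"]) / nb[1]["area"])
        hw.add(name)
        area += b["area"]
        latency -= b["sw_time"] - b["hw_time"]
    return hw, latency

example_blocks = {
    "filter":  {"sw_time": 100, "hw_time": 10, "area": 5},
    "control": {"sw_time": 20,  "hw_time": 15, "area": 4},
    "io":      {"sw_time": 30,  "hw_time": 5,  "area": 10},
}
hw_blocks, latency = partition(example_blocks, deadline=60, area_budget=12)
print(hw_blocks, latency)   # the repeated, parallel-friendly block goes to HW
```

Real partitioners also account for interface overhead and scheduling constraints; this sketch only captures the cost/performance trade-off named on the slides.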
Interface
• Why an interface block is needed
  – Data transfer between the hardware and software blocks
  – Only with an efficient interface block can the overhead between HW and SW blocks be reduced
• Interface methods
  – Shared memory
  – FIFO
  – Handshaking protocol
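Of the three interface options, the FIFO can be sketched as a bounded queue decoupling a software producer from a hardware consumer (an illustrative threading model, not real co-simulation):

```python
# A minimal sketch of the FIFO interface option: the "software" producer
# and the "hardware" consumer communicate only through a bounded queue,
# which decouples their execution rates. Names and sizes are illustrative.
import queue
import threading

fifo = queue.Queue(maxsize=4)   # bounded FIFO between SW and HW
results = []

def sw_producer():
    for sample in range(8):
        fifo.put(sample)        # blocks when the FIFO is full
    fifo.put(None)              # end-of-stream marker

def hw_consumer():
    while True:
        sample = fifo.get()     # blocks when the FIFO is empty
        if sample is None:
            break
        results.append(sample * 2)  # stand-in for the HW computation

t1 = threading.Thread(target=sw_producer)
t2 = threading.Thread(target=hw_consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                  # data arrives in order, doubled
```

The blocking put/get calls play the role of the flow-control handshake that a hardware FIFO provides.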
Logical Bus Architecture
• System bus signals: address, data, and control signals. The address space consists of the memory space and the I/O space; the memory space is the memory of the SW component, and the I/O space covers ports within SW and registers in other HW.
• Port signals: specialized signals capable of directly interfacing between the SW and HW components.
• Interrupt signals: raised when SW or HW components have completed an operation, or when an error condition is detected.
Co-simulation
• Why co-simulation is needed
  – By simulating the HW part and the SW part together, the behavior of the assembled system can be predicted
  – System performance can be estimated, so the system can be redesigned to meet the specified requirements before synthesis
  – The characteristics of each sub-block can be estimated for HW/SW partitioning
• Co-simulation tools
  – Ptolemy
  – COSSAP
  – POLIS
Cossap
• SW/HW co-simulation tool
  – SW: C code (Generic C)
  – HW: VHDL
• Data-flow-style simulation
  – the system is built as a block diagram
• Simulation report
  – outputs are displayed as waveforms
  – the speed of the system is estimated
[Figure 13. Overall flow of hardware/software co-design: the system specification and an analysis of constraints and requirements feed hardware/software partitioning, which produces a hardware description and a software description; these drive hardware synthesis and configuration, interface synthesis, and software generation and parameterization, yielding configuration modules, hardware components, the HW/SW interface, and software modules; HW/SW integration and cosimulation produce the integrated system, followed by system evaluation and design verification.]
Partitioning Example: CDMA Searcher
• Software-based typical design
[Figure: the PN-code generator feeds synchronous accumulation stages (real and imaginary) and energy calculation stages (real and imaginary) into comparison/selection stages; an asynchronous accumulation stage feeds a final comparison/selection stage.]
Continued
• HW/SW co-designed diagram
[Figure: PN-code generation followed by synchronous accumulators (a SW version and two HW versions), energy estimation (SW and HW), comparators (SW) and comparators with precomputation (HW), and asynchronous accumulators (SW and HW); each implementation choice is evaluated by a cost function (speed, area, power) on the way to the GOAL.]
Answer IV: Re-configurable Processor
• Configurable datapaths (e.g., splittable ALUs, complex operations)
• Configurable interconnect (e.g., nearest neighbor, k buses)
• SIMD processor, many functional units, preferably VLIW, possibly superscalar
ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS
• Arthur Abnous and Jan Rabaey
• Programmability requires a generalized computation, storage, and communication system that can be used to implement different kinds of algorithms
• Domain-specific processors achieve higher levels of energy efficiency than general-purpose programmable devices while maintaining the flexibility to handle a variety of algorithms within the domain
Flexibility vs. Energy-Efficiency
• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal-processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications, which share common features, on dedicated, optimized processing elements with minimum energy overhead.
Application Domains
• CELP-Based Speech Coding: LPC analysis and synthesis; codebook search; lag computation
• DCT-Based Video Compression and Decompression: DCT and inverse DCT; motion estimation and compensation; Huffman coding and decoding
• Baseband Processing for Digital Radios: demodulation, channel equalization; timing recovery, error correction
The Re-configurable Terminal
Low-Power Multimedia Processing
• Hybrid, re-configurable architecture
  – application-specific; parallelism; pipelining; locality; minimum control overhead; zero power when idle
• Task scheduling and miscellaneous functions on an embedded core processor (low speed, minimum functionality)
• Standardized communication protocols reduce the design cycle and enable high-level support
• Extensive set of low-power circuit techniques
  – reduced swing, variable voltages and frequency, self-timing, locally generated clocks
Arithmetic Energy Profile: VSELP Speech Coder
Lag computation + basic vector filtering + codebook search = 76% of total time
Hybrid Architecture Template
The dominant, energy-intensive computational kernels of a given domain of algorithms are implemented as a set of independent, concurrent threads of computation on the satellite processors.
The Proposed Architecture, Arthur Abnous and Jan Rabaey, UC Berkeley
Energy Efficiency + Domain-Specific Programmability
Control Processor
• The main task of the control processor is to configure the satellite processors and the communication networks, and to manage the overall control flow of a given signal-processing algorithm
• It uses the available satellite processors and the re-configurable interconnect to compose, in hardware, the data flow graph corresponding to a given kernel of computation
Overlay operation
• The control processor configures the network and co-processors
• Co-processors operate in a distributed "data-driven" mode
• At completion, control returns to the core processor for the next reconfiguration
Satellite Processors
Elements of Energy-Efficiency
VSELP Synthesis Filter Mapped onto Satellite Processors
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
Example Mappings of VSELP Kernel
Communication Network
Communication Network
• The communication network is reconfigured by the control processor to implement the arcs of the data flow graph corresponding to the dominant kernel being implemented as a hardware processor
• Each arc in the data flow graph is assigned a dedicated link in the communication network
• To minimize energy consumption in the communication network:
  1) Reduce the voltage swing
  2) Cluster satellite processors with local networks
  3) Segment buses into smaller, less capacitive sections with break switches
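A first-order sketch of why techniques 1) and 3) save energy, using E = C·Vdd·Vswing per bus transition; the capacitance and voltage values are illustrative assumptions, not measurements from the paper:

```python
# First-order energy model for a bus transition: the driver moves
# C_bus * Vswing of charge from a Vdd supply, so E = C * Vdd * Vswing.
# All numbers below are invented for illustration.

def bus_energy(c_bus_pf, vdd, vswing):
    return c_bus_pf * 1e-12 * vdd * vswing   # joules per transition

full_swing = bus_energy(2.0, 1.5, 1.5)   # 2 pF bus, full 1.5 V swing
reduced    = bus_energy(2.0, 1.5, 0.4)   # same bus, swing reduced to 0.4 V
segmented  = bus_energy(0.5, 1.5, 0.4)   # break switches isolate 0.5 pF

print(reduced / full_swing)      # savings from reduced swing alone
print(segmented / full_swing)    # reduced swing plus bus segmentation
```

Swing reduction scales energy linearly with Vswing; segmentation multiplies that saving by shrinking the capacitance each transfer must drive.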
Distributed Data-Driven Control
Power-Variable Performance
Programmable Logic Modules
Low Power Circuit Techniques
• Reduced-swing interconnect (communication network, memories, programmable logic modules)
• On-chip DC-DC conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)
Design Methodology
Case Studies
• Voice coder for cellular
• Video decoder
• Baseband radio modem
• Security - encryption processor
Result
• The most energy-efficient CELP-based speech algorithm
  – dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)
  – requires 23.4 MOPS
• Proposed VSELP speech coder
  – 0.6 µm CMOS
  – dissipates under 5 mW
Implementation of Handshaking
Single-Wire, Two-Phase Asynchronous Handshaking Protocol
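A behavioral sketch of a two-phase (transition-signaled) handshake, where each toggle of the request wire announces new data and each toggle of the acknowledge wire confirms consumption; this models the protocol idea only, not the circuit on the slide:

```python
# Two-phase handshake: information is carried by *transitions*, not
# levels, so both the 0->1 and 1->0 toggles of the request wire are
# valid "data ready" events. This is an illustrative software model.

class TwoPhase:
    def __init__(self):
        self.req = 0      # request wire level
        self.ack = 0      # acknowledge wire level
        self.data = None

    def send(self, value):           # producer side
        assert self.req == self.ack  # previous transfer was acknowledged
        self.data = value
        self.req ^= 1                # toggle = "new data valid" event

    def receive(self):               # consumer side
        assert self.req != self.ack  # a request is pending
        value = self.data
        self.ack ^= 1                # toggle = "data taken" event
        return value

ch = TwoPhase()
ch.send(7)
print(ch.receive())   # 7
ch.send(9)            # the return-to-zero phase also carries data
print(ch.receive())   # 9
```

Because no wire has to return to a rest level between transfers, a two-phase protocol halves the signaling transitions of a four-phase one, which is why it suits low-power, globally asynchronous timing.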
Locally Synchronous, Globally Asynchronous Timing Scheme
Architecture for vector dot product
[Figure: test-chip architecture for the vector dot product — a configuration bus (strobe, address, data) connects two memories with address generators (AddGen), a MAC unit, two input ports (IP1, IP2), and one output port (OP1) over a 6-bus network; control signals include network reset, satellite reset, slow mode, and auto-ack mode.]
• 0.6 µm CMOS process
• Supply voltage: 1.5 V
• Power estimation tool: PowerMill
• 1 MAC, 2 SRAMs, 2 address generators, 2 external input ports, 1 external output port
• All data and address values are 16 bits.
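A behavioral sketch of the dot-product datapath above (two memories, two auto-incrementing address generators, one MAC); the memory contents and vector length are illustrative:

```python
# Behavioral model of the dot-product satellite: each address generator
# steps through its memory once per cycle while the MAC accumulates.

def address_generator(start, step, count):
    addr = start
    for _ in range(count):      # models the AddGen stepping each cycle
        yield addr
        addr += step

def dot_product(mem_a, mem_b, n):
    acc = 0                     # MAC accumulator
    gen_a = address_generator(0, 1, n)
    gen_b = address_generator(0, 1, n)
    for addr_a, addr_b in zip(gen_a, gen_b):
        acc += mem_a[addr_a] * mem_b[addr_b]   # one MAC per cycle
    return acc

print(dot_product([1, 2, 3], [4, 5, 6], 3))   # 32
```

Once the control processor has written the configuration (start addresses, strides, vector length), the structure runs autonomously, one MAC per cycle, with no instruction fetch.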
IIR Mapping
IIR Comparison
FFT Mapping
FFT Comparison
Result

FIR results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40          6        14
  # of multipliers               0.5         1            1          5         1
  Throughput (cycles/tap)         17         1            1        0.2         1
  Energy/tap (J)               37.4n      1.3n         600p       2.2n      205p

IIR results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40        2.1        14
  # of multipliers               0.5         1            1          9         2
  Throughput (cycles/IIR)        114        20           13          1         8
  Energy/IIR (J)                277n     19.1n         9.5n       103n      1.9n

FFT results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40          -        14
  # of multipliers               0.5         1            1          -         4
  Throughput (cycles/stage)      766       152           76          -         8
  Energy/stage (J)             1870n      131n        49.3n          -     13.3n

StrongARM: microprocessor [2]; TMS320C2xx: DSP chip [3,4,5,6]; TMS320LC54x: DSP chip [7,8,12]; XC4003A: FPGA chip [9,10]
Conclusions
• The StrongARM has the worst performance of all because it takes many instructions and cycles to execute a kernel in a highly sequential manner.
  – The lack of a single-cycle multiplier exacerbates this problem.
  – The other architectures have more internal parallelism, which gives them superior performance.
• Pleiades (a re-configurable architecture) does much better on the energy scale than the TI DSPs.
  – DSPs are general-purpose, and instruction execution involves a great deal of overhead.
  – Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.
• Pleiades outperforms the other processors by a large margin owing to its ability to exploit higher levels of parallelism by creating a dedicated parallel structure from its computational resources and flexible interconnect.
Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMass
Motion Estimation
Block Matching Algorithm
Configurable H/W Paradigms
Why Hardware for Motion Estimation?
• The most computationally demanding part of video encoding
• Example: CCIR 601 format
  – 720 × 576 pixels
  – 16 × 16 macroblock (n = 16)
  – 32 × 32 search area (p = 8)
  – 25 Hz frame rate (f_frame = 25)
  – 9 giga-operations/sec needed for the Full Search Block Matching Algorithm
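The 9 GOPS figure can be checked directly; the assumption of 3 operations per pixel (subtract, absolute value, accumulate) for the SAD is ours:

```python
# Operation count for full-search block matching on CCIR 601 video.

width, height = 720, 576
n, p = 16, 8              # macroblock size, maximum displacement
f_frame = 25              # frames per second

macroblocks = (width // n) * (height // n)   # 45 * 36 = 1620 per frame
positions = (2 * p + 1) ** 2                 # 17 * 17 = 289 candidate vectors
ops_per_position = n * n * 3                 # SAD: 3 ops per pixel (assumed)

ops_per_sec = macroblocks * positions * ops_per_position * f_frame
print(ops_per_sec / 1e9)   # ~9.0 giga-operations per second
```

The result lands at about 9 × 10⁹ operations per second, matching the slide's figure.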
Why Reconfiguration in Motion Estimation?
• Adjusting the search area at frame-rate according to the changing characteristics of video sequences
• Reducing Power Consumption by avoiding unnecessary computation
Motion Vector Distributions
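The adaptation idea can be sketched as follows: shrink the search range p when recent motion vectors are small, since full-search computation grows as (2p+1)²; the update rule and safety margin here are illustrative assumptions, not the method from the talk:

```python
# Frame-rate adaptation of the search range p to observed motion.

def adapt_p(recent_mvs, p_max=8, margin=1):
    # Cover the largest recent displacement plus a small safety margin
    largest = max(max(abs(dx), abs(dy)) for dx, dy in recent_mvs)
    return min(p_max, largest + margin)

def relative_cost(p, p_max=8):
    # Full search evaluates (2p+1)^2 candidate vectors per macroblock
    return (2 * p + 1) ** 2 / (2 * p_max + 1) ** 2

p = adapt_p([(1, 0), (2, 1), (0, 2)])   # small observed motion -> p = 3
print(p, relative_cost(p))              # ~17% of the full-search SADs
```

With p = 3 instead of 8, only 49 of 289 candidate positions are evaluated, an ~83% reduction in SAD computations, which is the power-saving opportunity the slide points at.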
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995
Re-configurable Architecture for ME
Power Estimation in Reconfigurable Architecture
Power vs Search area
FPGA Implementation
• Number of CLBs (2-bit RBMAD)
  – R block: 4 CLBs
  – AD block: 23 CLBs
• Reconfiguration time (2-bit RBMAD)
  – 5 µs for increasing p by one
• Power: computation, I/O, configuration
Resource Reuse in FPGAs
Conclusion
• By adjusting the search area according to the changing characteristics of a picture, power can be saved. Further power savings can be achieved by utilizing the freed-up resources for local memory.
• Extension of the adaptive search-space method to software implementation
  – Varying p still reduces computation and hence power
  – Resource reuse may also be applicable in a S/W implementation by freeing up cache space and compute power for more power-efficient use of memory
Future Works
• Reconfiguration to support more sophisticated motion estimation algorithms (intelligent search, object-based, ...)
• More detailed performance studies over a wider range of video sequences
• Generalization of this concept to other algorithms and architectures (not just video)
• Modification to FPGA architectures to support the use of logic and configuration cells as local memory
Kernel Scheduling in Reconfigurable Computing
• R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, and M. Fernandez, Design, Automation and Test in Europe (DATE '99), Munich, Germany, March 1999

The Partition: partitioning finds subsets of kernels that may be scheduled (executed) independently of the other kernels (partitioning of the application DFG).

The Scheduling: after partitioning, scheduling is performed in detail within each partition (scheduling within a given partition).
The Major Criteria
[Figure: a) the MPEG kernel sequence ME, MC, DCT, Q, IQ, IDCT, IMC, with the granularity of computation (blocks per kernel) per frame and the number of contexts each kernel needs; b) a possible schedule of an image frame, running each kernel over all 396 blocks; c) an alternative schedule, processing 6 blocks at a time, repeated 396 times.]
• Context reloading: minimize
• Data reuse: maximize
• Overlapping of computation and data movement: maximize
Load Minimization
[Figure: two extreme execution sequences for a generic application over N iterations — case (a): C1, Kernel 1, C2, Kernel 2, ..., Cn, Kernel n, repeated N times; case (b): each context Ci loaded once, after which kernel i executes N times.]
• Case (a)
  – Each kernel is executed only once before the execution of the next kernel.
  – The context for each kernel has to be loaded as many times as the total number of iterations (time wasted).
• Case (b)
  – Each kernel is executed as many times as the total number of iterations before executing the next kernel.
  – The context for each kernel is loaded only once (time saved).
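The two cases can be compared with a simple timing sketch (context-load time C and execution time t per kernel, over N iterations; all numbers are illustrative):

```python
# Total-time model for the two extreme execution orders.

def case_a(kernels, n_iter):
    # context reloaded on every iteration for every kernel
    return n_iter * sum(c + t for c, t in kernels)

def case_b(kernels, n_iter):
    # each context loaded once; the kernel then runs all n_iter times
    return sum(c + n_iter * t for c, t in kernels)

kernels = [(50, 10), (80, 5), (30, 20)]   # (context load, execution) per kernel
print(case_a(kernels, 100))   # 19500: load time dominates
print(case_b(kernels, 100))   # 3660: contexts amortized over all iterations
```

When context-load time is comparable to or larger than kernel-execution time, case (b) wins by roughly the factor N on the loading component, which is exactly the load-minimization criterion above.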
Scheduling
[Figure: execution model representation — a possible schedule for the partition {k1, k2, k3}, showing context loadings (Ci), data loadings (Di), result readings (Ri), kernel computations (Ki), possible overlaps of computation and context loading (kci, with kc1 = 0), and idle time. Notation: a superscript i denotes the event in iteration i.]
Algorithm
[Figure: some steps of an exploration sequence over kernels Ki, Kj, Km, Kp connected by edges 1-4, with BC = TRUE at each step — a) LEE = ∅; b) LEE = {1}; c) LEE = {1, 2}; d) LEE = {1, 3}; e) LEE = {1, 3, 4}; f) LEE = {1, 4}.]
Experimental Results
Experimental data for MPEG (LB(SS) = 4894 cc; NNS = not necessary to schedule):

  Iteration     Cover                                   LEE                    LB(Pi)    Exploration time (after scheduling)
  1             {ME, MC, ..., IMC}                      ∅                      5110 cc   NNS
  2             {ME} {MC, ..., IMC}                     {1}                    4941 cc   4941
  15            {ME} {MC, DCT, Q} {IQ, IDCT, IMC}       {1, 2, 5}              5080 cc   NNS
  30            {ME, ..., Q} {IQ, ..., IMC}             {2, 5}                 5036 cc   NNS
  Not explored  {ME} {MC} {DCT} {Q} {IQ} {IDCT} {IMC}   {1, 2, 3, 4, 5, 6, 7}  5806 cc   -

[Figure: ordering of edges for MPEG — the kernel chain ME, MC, DCT, Q, IQ, IDCT, IMC with edges numbered 1, 3, 4, 5, 6, 7 along the chain, plus an additional edge 2.]
References
[1] A. Abnous and J. Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors," Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.
[2] Digital Semiconductor, SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.
[3] TMS320C5x General-Purpose Application User's Guide, Literature Number SPRU164, TI, 1997.
[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.
[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.
[6] ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, 'C54x Software Support Files, TI.
[7] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.
[8] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.
[9] E. Kusse, personal communication, 1996.
[10] J. Rabaey et al., "Fast Prototyping of Data Path Intensive Architecture," IEEE Design & Test of Computers, Vol. 8, No. 2, pp. 40-51, 1991.
[11] J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.
[12] A. Fischman and P. Rowland, Designing Low-Power Applications with the TMS320LC54x, Technical Application Report SPRA281, TI, 1997.
[13] D. D. Gajski, N. D. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[14] D. A. Buell, J. M. Arnold, and W. J. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, Los Alamitos, California.
[15] J. Babb, R. Tessier, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agarwal, "Logic Emulation with Virtual Wires," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.
[16] M. Vasilko and D. Ait-Boudaoud, "Architectural Synthesis Techniques for Dynamically Reconfigurable Logic," Field-Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of the 6th Int. Workshop on Field-Programmable Logic and Applications (FPL '96), Darmstadt, Germany, Sept. 23-25, 1996.
[17] P. Lysaght, G. McGregor, and J. Stockwood, "Configuration Controller Synthesis for Dynamically Reconfigurable Systems," IEE Colloquium on Hardware-Software Co-Synthesis for Reconfigurable Systems, 1996.
[18] M. Vasilko and D. Ait-Boudaoud, "Scheduling for Dynamically Reconfigurable FPGAs," Proceedings of the International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.
[19] D. Smith and D. Bhatia, "RACE: Reconfigurable and Adaptive Computing Environment," FPL '96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/~dal.
[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.
[21] Xilinx XABEL Reference Manual.