Lower Power Embedded Architecture Design
SungKyunKwan Univ., VADA Lab.
Prof. Jun-Dong Cho, SungKyunKwan University, August 1999
http://vada.skku.ac.kr
Contents
• Embedded Systems
• Design and Optimization of ASIP (Application Specific Instruction Processor)
• Hardware and Software Codesign
• Reconfigurable Processors
  – Ultra-Low-Power Domain-Specific Multimedia Processors
  – Reconfiguration for Power Saving in Real-Time Motion Estimation
  – Kernel Scheduling in Reconfigurable Computing
Low Power MPU
Levels for Low Power Design
• System: hardware-software partitioning, power down
• Algorithm: complexity, concurrency, locality, regularity, data representation
• Architecture: parallelism, pipelining, signal correlations, instruction set selection, data representation
• Circuit/Logic: sizing, logic style, logic design
• Technology: threshold reduction, scaling, advanced packaging, SOI

Possible Power Savings at Different Design Levels

  Level of Abstraction   Expected Saving
  Algorithm              10 - 100 times
  Architecture           10 - 90%
  Logic Level            20 - 40%
  Layout Level           10 - 30%
  Device Level           10 - 30%
Present- Day Digital Systems
• Current systems are complex and heterogeneous; they contain many different types of components:
  – Programmable and re-configurable processors
  – Application-specific integrated circuits (ASICs)
  – Application-specific instruction processors (ASIPs)
  – Read-Only Memory (ROM) and RAM
  – I/O devices and circuitry
• Typically designed from a (large) software specification
• These heterogeneous systems are called embedded systems
Embedded System Characteristics
• Limited user programmability
  – Completely transparent to the user, e.g., automotive engine control
  – Limited user interface, e.g., intelligent telephones
  – Programmable through an application-specific language, e.g., PostScript printer
• Real-time response; no batch processing
Embedded Systems: Products - 1
Computer Related: personal digital assistant, printer, disc drive, multimedia subsystem, graphics subsystem, graphics terminal
Communications: cellular phone, video phone, fax, modems, PBX
Consumer Electronics: HDTV, CD player, video game, video tape recorder, programmable TV, camera, music system
Embedded Systems: Products - 2
• Control Systems
  – Automotive: engine, ignition, brake system
  – Manufacturing process control: robotics
  – Remote control: satellite control, spacecraft control
  – Other mechanical control: elevator control
• Office Equipment
  – smart copier, printer, smart typewriter, calculator
  – point-of-sale equipment, credit-card validator, UPC code reader, cash register
• Medical Applications: instruments (EKG, EEG), scanning, imaging
Problem Domain Shift
Embedded System Trends - I
• Microcomponents grow in importance in the IC industry due to their reusability: DSPs, µPs, µCs
• More embedded systems will require ASICs
  – From 20-70% in 1992 to 60-70% in 1996

Moral of the story: µPs are joining with high-speed, highly complex ASICs in embedded systems
Embedded System Trends - II
• Embedded systems will require more application software
  – Average moves from 16-64k lines in 1992 to 64k-512k lines in 1996
  – Requires migration from assembler to C/C++, implying a requirement for automatic compilation
  – From 40-70% of programmers versus ASIC designers in 1992 to 60-90% in 1996

Moral of the story: the increase in code size and code complexity is causing a migration from assembly coding to C/C++
Embedded Software Optimization
• Code size becomes an important objective
  – Software will eventually become a part of the chip: need to generate the best possible code; can afford longer compilation time
• Need not only traditional optimization techniques, but also new application-domain-specific optimizations (e.g., for DSP and microcontroller architectures)
Implementing Digital Systems
What is an ASIP?
• Application-Specific Instruction Processor
• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  – Greater design cost (processor + compiler)
  + Higher performance and lower power than commercial cores; more flexibility than an ASIC
ASIP Design
• Given a set of applications, determine the architecture of the ASIP (i.e., the configuration of functional units in the datapaths and the instruction set)
• To accurately evaluate the performance of the processor on a given application, one must compile the application program onto the processor datapath and simulate the object code
• However, the architecture of the processor is a design parameter!
ASIP Design Flow
Required Compiler Optimizations
• Machine-independent optimizations
  – Parallelizing transformations (lots of them!)
  – Common subexpression elimination, strength reduction, code motion
• Machine-dependent optimizations
  – Loop unrolling and software pipelining
  – Static allocation (non-recursive procedure calls)
  – Storage layout (arrays, scalars)
  – Optimization of mode-setting instructions
  – Instruction selection, scheduling, and register allocation
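The machine-independent transformations listed above can be illustrated with a small hand-optimized sketch (illustrative Python; the function names and inputs are invented for the example):

```python
# Hypothetical illustration of three machine-independent optimizations.
# Both functions compute the same result; the second applies common
# subexpression elimination, loop-invariant code motion, and strength
# reduction by hand.

def before(a, n, k):
    out = []
    for i in range(n):
        # k * 4 is loop-invariant, a[i] * (k * 4) is computed twice,
        # and i * 2 is a multiplication that can become repeated addition
        out.append(a[i] * (k * 4) + a[i] * (k * 4) + i * 2)
    return out

def after(a, n, k):
    out = []
    k4 = k * 4            # loop-invariant code motion
    two_i = 0             # strength reduction: i * 2 -> running sum
    for i in range(n):
        t = a[i] * k4     # common subexpression computed once
        out.append(t + t + two_i)
        two_i += 2
    return out
```

The two versions are equivalent by construction; the second simply does less work per iteration, which matters for both speed and energy.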
Parallelizing Transformation
Split- Node DAG
Split-Node DAG - 2
• The split-node DAG represents:
  – All the legal assignments of basic-block nodes to functional units
  – The data transfers implied by pairs of assignments of basic-block nodes connected by an edge
• The goal is to find parallelism in the basic block that can be exploited by the target architecture
  – Implies grouping operator nodes and data-transfer nodes into VLIW instructions
  – Constraints may disallow certain groupings
Split- Node DAG - 3
Parallelism Matrix
Common Subexpression Elimination
Constant Propagation and Folding
Dead Code Elimination
Loop Invariant Code Motion
Array Access Strength Reduction
Features of DSP Architectures
• DSPs have irregular data-paths
• Instruction-set architecture tailored for DSP applications
  – Limited addressing capability
  – Autoincrement/decrement
  – Bit-reversed addressing
  – Zero-overhead loops
• Some degree of parallelism, e.g., the Motorola 56K's parallel moves
Example: TMS320C25 DSP
Storage Assignment
Total 28 instructions
Alternative Assignment
Total 24 instructions
Simple Offset Assignment
• Assumptions in Simple Offset Assignment (SOA):
  – Variables reside in memory and are accessed via a single address register
  – One-to-one mapping of variables to memory locations
  – A schedule for the basic block is given
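A minimal sketch of the classic greedy SOA heuristic (maximum-weighted path covering over the access graph), consistent with the formulation above; the access sequence and variable names are illustrative:

```python
from collections import Counter

def soa_greedy(access_seq):
    # Access graph: edge weight = how often two variables are accessed
    # consecutively (those adjacencies can use free autoincrement/decrement
    # if the variables end up next to each other in memory)
    w = Counter()
    for a, b in zip(access_seq, access_seq[1:]):
        if a != b:
            w[frozenset((a, b))] += 1

    parent = {v: v for v in set(access_seq)}
    def find(v):                      # union-find, used to reject cycles
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    deg = Counter()
    chosen = []
    # Greedy maximum-weighted path cover: take the heaviest edges first,
    # skipping any edge that would give a node degree > 2 or close a cycle
    for e, wt in sorted(w.items(), key=lambda kv: -kv[1]):
        a, b = tuple(e)
        if deg[a] < 2 and deg[b] < 2 and find(a) != find(b):
            chosen.append(e)
            deg[a] += 1
            deg[b] += 1
            parent[find(a)] = find(b)

    # Every adjacency not covered by the chosen paths costs an explicit
    # address-register load instead of a free autoincrement/decrement
    extra_loads = sum(wt for e, wt in w.items() if e not in chosen)
    return chosen, extra_loads

layout_edges, extra_loads = soa_greedy(['a', 'b', 'a', 'c', 'a', 'd', 'b', 'd'])
print(extra_loads)   # 1 extra address load, with layout c-a-b-d
```

The selected edges form disjoint paths; concatenating the paths gives the memory layout of the variables.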
Access Sequence
Access Graph
Assignment and Access Graph
Maximum Weighted Path Covering
Optimal Disjoint Path Cover
Code Generation and Optimization
• Focus on automatic retargetability and parameterizable optimization methods
  – Instruction selection for configurable functional units
  – Scheduling for multiple functional units
  – Register-bank allocation to minimize data transfers
  – Detailed register allocation with varying load/spill costs
  – Optimization to exploit address-generator features
• Goal: generate high-quality code for any target architecture description

WHEN THAT HAPPENS, CAD IS GOING TO ESSENTIALLY BECOME SOFTWARE COMPILATION!!
The 100 Million Transistor Question
HOW BEST CAN WE USE THEM TO SOLVE OUR COMPUTING PROBLEMS?
(1) Multiprocessor  (2) FPGA  (3) HW & SW codesign  (4) Reconfigurable processor
Answer I: Multiprocessor on a chip
Requirement: an efficient, parallelizing compiler
Problems: Is there enough parallelism in programs? It does not go fast enough for video applications, for instance.
Answer II: Giant FPGA
Requirement: a CAD system for FPGAs
Problems: may work well for bit-level video computations, but in general FPGAs are inefficient.
Answer III: HW/SW Codesign
Mixing Hardware and Software
• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds
  – Problems of verification and test become harder
  – Too many tools, too many interactions, too much heterogeneity
  – Hardware/software partitioning is "AI-complete"!
Hardware/Software Co-design — I. K. Hwang, San Kim, J. D. Cho
• What is co-design?
  – Proposed for the systematic and efficient design of systems that combine hardware and software.
  – Hardware implementation: higher cost, faster execution.
  – Software implementation: lower cost, slower execution.
• Considerations in co-design:
  – Hardware/software partitioning
  – Hardware/software interface
  – Co-simulation
Partitioning
• Performance requirements
  – Some functions are better implemented in hardware
  – Blocks that are used repeatedly
  – Blocks structured for parallel execution
• Modifiability
  – Blocks implemented in software are easy to modify
Continued
• Implementation cost
  – Blocks implemented in hardware can be shared
• Scheduling
  – Schedule the blocks partitioned into HW and SW so that the given constraints are met
  – SW operations must be scheduled sequentially
  – If there are no data or control dependencies, SW and HW blocks can be scheduled concurrently
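The partitioning considerations above can be sketched as a simple greedy heuristic; the cost model (software time, hardware time, area) and all the numbers are illustrative assumptions, not a method from the slides:

```python
# A hypothetical greedy HW/SW partitioner: while the latency constraint
# is violated, move the block with the best time-saved-per-area ratio
# into hardware, within an area budget.

def partition(blocks, deadline, area_budget):
    hw = set()
    latency = sum(b["sw_time"] for b in blocks.values())  # all-SW start
    area = 0
    while latency > deadline:
        cand = [(name, b) for name, b in blocks.items()
                if name not in hw
                and area + b["area"] <= area_budget
                and b["sw_time"] > b["hw_time"]]
        if not cand:
            break                     # constraint cannot be met
        name, b = max(cand, key=lambda nb:
                      (nb[1]["sw_time"] - nb[1]["hw_time"]) / nb[1]["area"])
        hw.add(name)
        area += b["area"]
        latency -= b["sw_time"] - b["hw_time"]
    return hw, latency

example_blocks = {
    "filter":  {"sw_time": 100, "hw_time": 10, "area": 5},
    "control": {"sw_time": 20,  "hw_time": 15, "area": 4},
    "io":      {"sw_time": 30,  "hw_time": 5,  "area": 10},
}
hw_blocks, latency = partition(example_blocks, deadline=60, area_budget=12)
print(hw_blocks, latency)   # the repeated, parallel-friendly block goes to HW
```

Real partitioners also account for interface overhead and scheduling constraints; this sketch only captures the cost/performance trade-off named on the slides.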
Interface
• Why an interface block is needed
  – Data transfer between the hardware and software blocks
  – Only with an efficient interface block can the overhead between HW and SW blocks be reduced
• Interface methods
  – Shared memory
  – FIFO
  – Handshaking protocol
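Of the three interface options, the FIFO can be sketched as a bounded queue decoupling a software producer from a hardware consumer (an illustrative threading model, not real co-simulation):

```python
# A minimal sketch of the FIFO interface option: the "software" producer
# and the "hardware" consumer communicate only through a bounded queue,
# which decouples their execution rates. Names and sizes are illustrative.
import queue
import threading

fifo = queue.Queue(maxsize=4)   # bounded FIFO between SW and HW
results = []

def sw_producer():
    for sample in range(8):
        fifo.put(sample)        # blocks when the FIFO is full
    fifo.put(None)              # end-of-stream marker

def hw_consumer():
    while True:
        sample = fifo.get()     # blocks when the FIFO is empty
        if sample is None:
            break
        results.append(sample * 2)  # stand-in for the HW computation

t1 = threading.Thread(target=sw_producer)
t2 = threading.Thread(target=hw_consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                  # data arrives in order, doubled
```

The blocking put/get calls play the role of the flow-control handshake that a hardware FIFO provides.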
Logical Bus Architecture
• System bus signals: address, data, and control signals. The address space consists of the memory space and the I/O space; the memory space is the memory of the SW component, and the I/O space covers ports within SW and registers in other HW.
• Port signals: specialized signals capable of directly interfacing between the SW and HW components.
• Interrupt signals: raised when SW or HW components have completed an operation, or when an error condition is detected.
Co-simulation
• Why co-simulation is needed
  – By simulating the HW part and the SW part together, the behavior of the assembled system can be predicted
  – System performance can be estimated, so the system can be redesigned to meet the specified requirements before synthesis
  – The characteristics of each sub-block can be estimated for HW/SW partitioning
• Co-simulation tools
  – Ptolemy
  – COSSAP
  – POLIS
Cossap
• SW/HW co-simulation tool
  – SW: C code (Generic C)
  – HW: VHDL
• Data-flow-style simulation
  – the system is built as a block diagram
• Simulation report
  – outputs are displayed as waveforms
  – the speed of the system is estimated
[Figure 13. Overall flow of hardware/software co-design: the system specification and an analysis of constraints and requirements feed hardware/software partitioning, which produces a hardware description and a software description; these drive hardware synthesis and configuration, interface synthesis, and software generation and parameterization, yielding configuration modules, hardware components, the HW/SW interface, and software modules; HW/SW integration and cosimulation produce the integrated system, followed by system evaluation and design verification.]
Partitioning Example: CDMA Searcher
• Software-based typical design
[Figure: the PN-code generator feeds synchronous accumulation stages (real and imaginary) and energy calculation stages (real and imaginary) into comparison/selection stages; an asynchronous accumulation stage feeds a final comparison/selection stage.]
Continued
• HW/SW co-designed diagram
[Figure: PN-code generation followed by synchronous accumulators (a SW version and two HW versions), energy estimation (SW and HW), comparators (SW) and comparators with precomputation (HW), and asynchronous accumulators (SW and HW); each implementation choice is evaluated by a cost function (speed, area, power) on the way to the GOAL.]
Answer IV: Re-configurable Processor
• Configurable datapaths (e.g., splittable ALUs, complex operations)
• Configurable interconnect (e.g., nearest neighbor, k buses)
• SIMD processor, many functional units, preferably VLIW, possibly superscalar
ULTRA-LOW-POWER DOMAIN-SPECIFIC MULTIMEDIA PROCESSORS
• Arthur Abnous and Jan Rabaey
• Programmability requires a generalized computation, storage, and communication system that can be used to implement different kinds of algorithms
• Domain-specific processors achieve higher levels of energy efficiency than general-purpose programmable devices while maintaining the flexibility to handle a variety of algorithms within the domain
Flexibility vs. Energy-Efficiency
• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal-processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications, which share common features, on dedicated, optimized processing elements with minimum energy overhead.
Application Domains
• CELP-Based Speech Coding: LPC analysis and synthesis; codebook search; lag computation
• DCT-Based Video Compression and Decompression: DCT and inverse DCT; motion estimation and compensation; Huffman coding and decoding
• Baseband Processing for Digital Radios: demodulation, channel equalization; timing recovery, error correction
The Re-configurable Terminal
Low-Power Multimedia Processing
• Hybrid, re-configurable architecture
  – application-specific; parallelism; pipelining; locality; minimum control overhead; zero power when idle
• Task scheduling and miscellaneous functions on an embedded core processor (low speed, minimum functionality)
• Standardized communication protocols reduce the design cycle and enable high-level support
• Extensive set of low-power circuit techniques
  – reduced swing, variable voltages and frequency, self-timing, locally generated clocks
Arithmetic Energy Profile: VSELP Speech Coder
Lag computation + basic vector filtering + codebook search = 76% of total time
Hybrid Architecture Template
The dominant, energy-intensive computational kernels of a given domain of algorithms are implemented as a set of independent, concurrent threads of computation on the satellite processors.
The Proposed Architecture, Arthur Abnous and Jan Rabaey, UC Berkeley
Energy Efficiency + Domain-Specific Programmability
Control Processor
• The main task of the control processor is to configure the satellite processors and the communication networks, and to manage the overall control flow of a given signal-processing algorithm
• It uses the available satellite processors and the re-configurable interconnect to compose, in hardware, the data flow graph corresponding to a given kernel of computation
Overlay operation
• The control processor configures the network and co-processors
• Co-processors operate in a distributed "data-driven" mode
• At completion, control returns to the core processor for the next reconfiguration
Satellite Processors
Elements of Energy-Efficiency
VSELP Synthesis Filter Mapped onto Satellite Processors
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
Example Mappings of VSELP Kernel
Communication Network
Communication Network
• The communication network is reconfigured by the control processor to implement the arcs of the data flow graph corresponding to the dominant kernel being implemented as a hardware processor
• Each arc in the data flow graph is assigned a dedicated link in the communication network
• To minimize energy consumption in the communication network:
  1) Reduce the voltage swing
  2) Cluster satellite processors with local networks
  3) Segment buses into smaller, less capacitive sections with break switches
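A first-order sketch of why techniques 1) and 3) save energy, using E = C·Vdd·Vswing per bus transition; the capacitance and voltage values are illustrative assumptions, not measurements from the paper:

```python
# First-order energy model for a bus transition: the driver moves
# C_bus * Vswing of charge from a Vdd supply, so E = C * Vdd * Vswing.
# All numbers below are invented for illustration.

def bus_energy(c_bus_pf, vdd, vswing):
    return c_bus_pf * 1e-12 * vdd * vswing   # joules per transition

full_swing = bus_energy(2.0, 1.5, 1.5)   # 2 pF bus, full 1.5 V swing
reduced    = bus_energy(2.0, 1.5, 0.4)   # same bus, swing reduced to 0.4 V
segmented  = bus_energy(0.5, 1.5, 0.4)   # break switches isolate 0.5 pF

print(reduced / full_swing)      # savings from reduced swing alone
print(segmented / full_swing)    # reduced swing plus bus segmentation
```

Swing reduction scales energy linearly with Vswing; segmentation multiplies that saving by shrinking the capacitance each transfer must drive.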
Distributed Data-Driven Control
Power-Variable Performance
Programmable Logic Modules
Low Power Circuit Techniques
• Reduced-swing interconnect (communication network, memories, programmable logic modules)
• On-chip DC-DC conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)
Design Methodology
Case Studies
• Voice coder for cellular
• Video decoder
• Baseband radio modem
• Security - encryption processor
Result
• The most energy-efficient CELP-based speech algorithm
  – dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)
  – requires 23.4 MOPS
• Proposed VSELP speech coder
  – 0.6 µm CMOS
  – dissipates under 5 mW
Implementation of Handshaking
Single-Wire, Two-Phase Asynchronous Handshaking Protocol
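A behavioral sketch of a two-phase (transition-signaled) handshake, where each toggle of the request wire announces new data and each toggle of the acknowledge wire confirms consumption; this models the protocol idea only, not the circuit on the slide:

```python
# Two-phase handshake: information is carried by *transitions*, not
# levels, so both the 0->1 and 1->0 toggles of the request wire are
# valid "data ready" events. This is an illustrative software model.

class TwoPhase:
    def __init__(self):
        self.req = 0      # request wire level
        self.ack = 0      # acknowledge wire level
        self.data = None

    def send(self, value):           # producer side
        assert self.req == self.ack  # previous transfer was acknowledged
        self.data = value
        self.req ^= 1                # toggle = "new data valid" event

    def receive(self):               # consumer side
        assert self.req != self.ack  # a request is pending
        value = self.data
        self.ack ^= 1                # toggle = "data taken" event
        return value

ch = TwoPhase()
ch.send(7)
print(ch.receive())   # 7
ch.send(9)            # the return-to-zero phase also carries data
print(ch.receive())   # 9
```

Because no wire has to return to a rest level between transfers, a two-phase protocol halves the signaling transitions of a four-phase one, which is why it suits low-power, globally asynchronous timing.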
Locally Synchronous, Globally Asynchronous Timing Scheme
Architecture for vector dot product
[Figure: test-chip architecture for the vector dot product — a configuration bus (strobe, address, data) connects two memories with address generators (AddGen), a MAC unit, two input ports (IP1, IP2), and one output port (OP1) over a 6-bus network; control signals include network reset, satellite reset, slow mode, and auto-ack mode.]
• 0.6 µm CMOS process
• Supply voltage: 1.5 V
• Power estimation tool: PowerMill
• 1 MAC, 2 SRAMs, 2 address generators, 2 external input ports, 1 external output port
• All data and address values are 16 bits.
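A behavioral sketch of the dot-product datapath above (two memories, two auto-incrementing address generators, one MAC); the memory contents and vector length are illustrative:

```python
# Behavioral model of the dot-product satellite: each address generator
# steps through its memory once per cycle while the MAC accumulates.

def address_generator(start, step, count):
    addr = start
    for _ in range(count):      # models the AddGen stepping each cycle
        yield addr
        addr += step

def dot_product(mem_a, mem_b, n):
    acc = 0                     # MAC accumulator
    gen_a = address_generator(0, 1, n)
    gen_b = address_generator(0, 1, n)
    for addr_a, addr_b in zip(gen_a, gen_b):
        acc += mem_a[addr_a] * mem_b[addr_b]   # one MAC per cycle
    return acc

print(dot_product([1, 2, 3], [4, 5, 6], 3))   # 32
```

Once the control processor has written the configuration (start addresses, strides, vector length), the structure runs autonomously, one MAC per cycle, with no instruction fetch.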
IIR Mapping
IIR Comparison
FFT Mapping
FFT Comparison
Result

FIR results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40          6        14
  # of multipliers               0.5         1            1          5         1
  Throughput (cycles/tap)         17         1            1        0.2         1
  Energy/tap (J)               37.4n      1.3n         600p       2.2n      205p

IIR results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40        2.1        14
  # of multipliers               0.5         1            1          9         2
  Throughput (cycles/IIR)        114        20           13          1         8
  Energy/IIR (J)                277n     19.1n         9.5n       103n      1.9n

FFT results:
                             StrongARM  TMS320C2xx  TMS320LC54x  XC4003A  Pleiades
  Frequency (MHz)                169        20           40          -        14
  # of multipliers               0.5         1            1          -         4
  Throughput (cycles/stage)      766       152           76          -         8
  Energy/stage (J)             1870n      131n        49.3n          -     13.3n

StrongARM: microprocessor [2]; TMS320C2xx: DSP chip [3,4,5,6]; TMS320LC54x: DSP chip [7,8,12]; XC4003A: FPGA chip [9,10]
Conclusions
• The StrongARM has the worst performance of all because it takes many instructions and cycles to execute a kernel in a highly sequential manner.
  – The lack of a single-cycle multiplier exacerbates this problem.
  – The other architectures have more internal parallelism, which gives them superior performance.
• Pleiades (a re-configurable architecture) does much better on the energy scale than the TI DSPs.
  – DSPs are general-purpose, and instruction execution involves a great deal of overhead.
  – Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.
• Pleiades outperforms the other processors by a large margin owing to its ability to exploit higher levels of parallelism by creating a dedicated parallel structure from its computational resources and flexible interconnect.
Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMass
Motion Estimation
Block Matching Algorithm
Configurable H/W Paradigms
Why Hardware for Motion Estimation?
• The most computationally demanding part of video encoding
• Example: CCIR 601 format
  – 720 × 576 pixels
  – 16 × 16 macroblock (n = 16)
  – 32 × 32 search area (p = 8)
  – 25 Hz frame rate (f_frame = 25)
  – 9 giga-operations/sec needed for the Full Search Block Matching Algorithm
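The 9 GOPS figure can be checked directly; the assumption of 3 operations per pixel (subtract, absolute value, accumulate) for the SAD is ours:

```python
# Operation count for full-search block matching on CCIR 601 video.

width, height = 720, 576
n, p = 16, 8              # macroblock size, maximum displacement
f_frame = 25              # frames per second

macroblocks = (width // n) * (height // n)   # 45 * 36 = 1620 per frame
positions = (2 * p + 1) ** 2                 # 17 * 17 = 289 candidate vectors
ops_per_position = n * n * 3                 # SAD: 3 ops per pixel (assumed)

ops_per_sec = macroblocks * positions * ops_per_position * f_frame
print(ops_per_sec / 1e9)   # ~9.0 giga-operations per second
```

The result lands at about 9 × 10⁹ operations per second, matching the slide's figure.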
Why Reconfiguration in Motion Estimation?
• Adjusting the search area at frame-rate according to the changing characteristics of video sequences
• Reducing Power Consumption by avoiding unnecessary computation
Motion Vector Distributions
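The adaptation idea can be sketched as follows: shrink the search range p when recent motion vectors are small, since full-search computation grows as (2p+1)²; the update rule and safety margin here are illustrative assumptions, not the method from the talk:

```python
# Frame-rate adaptation of the search range p to observed motion.

def adapt_p(recent_mvs, p_max=8, margin=1):
    # Cover the largest recent displacement plus a small safety margin
    largest = max(max(abs(dx), abs(dy)) for dx, dy in recent_mvs)
    return min(p_max, largest + margin)

def relative_cost(p, p_max=8):
    # Full search evaluates (2p+1)^2 candidate vectors per macroblock
    return (2 * p + 1) ** 2 / (2 * p_max + 1) ** 2

p = adapt_p([(1, 0), (2, 1), (0, 2)])   # small observed motion -> p = 3
print(p, relative_cost(p))              # ~17% of the full-search SADs
```

With p = 3 instead of 8, only 49 of 289 candidate positions are evaluated, an ~83% reduction in SAD computations, which is the power-saving opportunity the slide points at.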
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995
Re-configurable Architecture for ME
Power Estimation in Reconfigurable Architecture
Power vs Search area
FPGA Implementation
• Number of CLBs (2-bit RBMAD)
  – R block: 4 CLBs
  – AD block: 23 CLBs
• Reconfiguration time (2-bit RBMAD)
  – 5 µs for increasing p by one
• Power: computation, I/O, configuration
Resource Reuse in FPGAs
Conclusion
• By adjusting the search area according to the changing characteristics of a picture, power can be saved. Further power savings can be achieved by utilizing the freed-up resources for local memory.
• Extension of the adaptive search-space method to software implementation
  – Varying p still reduces computation and hence power
  – Resource reuse may also be applicable in a S/W implementation by freeing up cache space and compute power for more power-efficient use of memory
Future Works
• Reconfiguration to support more sophisticated motion estimation algorithms (intelligent search, object-based, ...)
• More detailed performance studies over a wider range of video sequences
• Generalization of this concept to other algorithms and architectures (not just video)
• Modification to FPGA architectures to support the use of logic and configuration cells as local memory
Kernel Scheduling in Reconfigurable Computing
• R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, and M. Fernandez, Design, Automation and Test in Europe (DATE '99), Munich, Germany, March 1999

The Partition: partitioning finds subsets of kernels that may be scheduled (executed) independently of the other kernels (partitioning of the application DFG).

The Scheduling: after partitioning, scheduling is performed in detail within each partition (scheduling within a given partition).
The Major Criteria
[Figure: a) the MPEG kernel sequence ME, MC, DCT, Q, IQ, IDCT, IMC, with the granularity of computation (blocks per kernel) per frame and the number of contexts each kernel needs; b) a possible schedule of an image frame, running each kernel over all 396 blocks; c) an alternative schedule, processing 6 blocks at a time, repeated 396 times.]
• Context reloading: minimize
• Data reuse: maximize
• Overlapping of computation and data movement: maximize
Load Minimization
[Figure: two extreme execution sequences for a generic application over N iterations — case (a): C1, Kernel 1, C2, Kernel 2, ..., Cn, Kernel n, repeated N times; case (b): each context Ci loaded once, after which kernel i executes N times.]
• Case (a)
  – Each kernel is executed only once before the execution of the next kernel.
  – The context for each kernel has to be loaded as many times as the total number of iterations (time wasted).
• Case (b)
  – Each kernel is executed as many times as the total number of iterations before executing the next kernel.
  – The context for each kernel is loaded only once (time saved).
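The two cases can be compared with a simple timing sketch (context-load time C and execution time t per kernel, over N iterations; all numbers are illustrative):

```python
# Total-time model for the two extreme execution orders.

def case_a(kernels, n_iter):
    # context reloaded on every iteration for every kernel
    return n_iter * sum(c + t for c, t in kernels)

def case_b(kernels, n_iter):
    # each context loaded once; the kernel then runs all n_iter times
    return sum(c + n_iter * t for c, t in kernels)

kernels = [(50, 10), (80, 5), (30, 20)]   # (context load, execution) per kernel
print(case_a(kernels, 100))   # 19500: load time dominates
print(case_b(kernels, 100))   # 3660: contexts amortized over all iterations
```

When context-load time is comparable to or larger than kernel-execution time, case (b) wins by roughly the factor N on the loading component, which is exactly the load-minimization criterion above.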
Scheduling
[Figure: execution model representation — a possible schedule for the partition {k1, k2, k3}, showing context loadings (Ci), data loadings (Di), result readings (Ri), kernel computations (Ki), possible overlaps of computation and context loading (kci, with kc1 = 0), and idle time. Notation: a superscript i denotes the event in iteration i.]
Algorithm
[Figure: some steps of an exploration sequence over kernels Ki, Kj, Km, Kp connected by edges 1-4, with BC = TRUE at each step — a) LEE = ∅; b) LEE = {1}; c) LEE = {1, 2}; d) LEE = {1, 3}; e) LEE = {1, 3, 4}; f) LEE = {1, 4}.]
Experimental Results
Experimental data for MPEG (LB(SS) = 4894 cc; NNS = not necessary to schedule):

  Iteration     Cover                                   LEE                    LB(Pi)    Exploration time (after scheduling)
  1             {ME, MC, ..., IMC}                      ∅                      5110 cc   NNS
  2             {ME} {MC, ..., IMC}                     {1}                    4941 cc   4941
  15            {ME} {MC, DCT, Q} {IQ, IDCT, IMC}       {1, 2, 5}              5080 cc   NNS
  30            {ME, ..., Q} {IQ, ..., IMC}             {2, 5}                 5036 cc   NNS
  Not explored  {ME} {MC} {DCT} {Q} {IQ} {IDCT} {IMC}   {1, 2, 3, 4, 5, 6, 7}  5806 cc   -

[Figure: ordering of edges for MPEG — the kernel chain ME, MC, DCT, Q, IQ, IDCT, IMC with edges numbered 1, 3, 4, 5, 6, 7 along the chain, plus an additional edge 2.]
References
[1] A. Abnous and J. Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors," Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.
[2] Digital Semiconductor, SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.
[3] TMS320C5x General-Purpose Application User's Guide, Literature Number SPRU164, TI, 1997.
[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.
[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.
[6] ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, 'C54x Software Support Files, TI.
[7] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.
[8] C. Turner, Calculation of TMS320LC54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.
[9] E. Kusse, personal communication, 1996.
[10] J. Rabaey et al., "Fast Prototyping of Data Path Intensive Architecture," IEEE Design & Test of Computers, Vol. 8, No. 2, pp. 40-51, 1991.
[11] J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor," IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.
[12] A. Fischman and P. Rowland, Designing Low-Power Applications with the TMS320LC54x, Technical Application Report SPRA281, TI, 1997.
[13] D. D. Gajski, N. D. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[14] D. A. Buell, J. M. Arnold, and W. J. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, Los Alamitos, California.
[15] J. Babb, R. Tessier, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agarwal, "Logic Emulation with Virtual Wires," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 6, June 1997.
[16] M. Vasilko and D. Ait-Boudaoud, "Architectural Synthesis Techniques for Dynamically Reconfigurable Logic," Field-Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of the 6th Int. Workshop on Field-Programmable Logic and Applications (FPL '96), Darmstadt, Germany, Sept. 23-25, 1996.
[17] P. Lysaght, G. McGregor, and J. Stockwood, "Configuration Controller Synthesis for Dynamically Reconfigurable Systems," IEE Colloquium on Hardware-Software Co-Synthesis for Reconfigurable Systems, 1996.
[18] M. Vasilko and D. Ait-Boudaoud, "Scheduling for Dynamically Reconfigurable FPGAs," Proceedings of the International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.
[19] D. Smith and D. Bhatia, "RACE: Reconfigurable and Adaptive Computing Environment," FPL '96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/~dal.
[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.
[21] Xilinx XABEL Reference Manual.