
Functional modeling style for efficient SW code generation of video codec applications

Sang-Il Han 1)2), Soo-Ik Chae 1), Ahmed A. Jerraya 2)

1) SD Group, Seoul National Univ., Korea
2) SLS Group, TIMA laboratory, France

02/01/06 2

Summary

Challenges in video codec system design
- Complex
- High performance within a short design time
- Classical RTL design methodology is too slow

Requirements in video codec system design
- System-level design methodology with a functional model
- Explicit parallelism and conditionals in the functional model

Existing modeling styles
- Data-driven models lack explicit conditionals (e.g. SDF lacks conditionals)
- Event-driven models lack explicit parallelism (e.g. SM lacks a distributed implementation)

Proposed modeling style
- Abstract clock synchronous model (ACSM): an extension of the clocked synchronous model (CSM) for RTL
- Both parallelism and conditionals, with an abstract global clock


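The scope of this work

- System-level design methodology
- Scope: modeling style to generate efficient SW code

[Figure: system-level design flow with the elements Target Application, Target Architecture, Modeling, ACSM Model, Mapping, Mixed Algo./Arch. Model, Automatic Generation, Implementation of System, and Evaluation. A "Scope" marker highlights the part of the flow covered by this work.]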


Contents
- Introduction
- Proposed modeling style
- Comparison with existing modeling styles
- Experiment
- Conclusion and perspective


Target application

- Macroblock-based video codecs: MPEG-2, H.263, MPEG-4, H.264, and WMV9
- A frame = W x H macroblocks

[Figure: Block diagram of an H.264 decoder. Blocks: Frame/Slice VLD, Macroblock VLD, Inverse Scan / Quantization, Inverse Transform, adder, Deblocking Filter, Decoded Frame Store, Current Frame Store, Multiple Previous Frame Store, Spatial Compensation Process, Motion Compensation Process, and a Switch controlled by Intra/Inter MB. Signals: Video BitStream, Spatial Prediction Modes, Motion Vectors. Motion compensation exploits temporal redundancy; spatial compensation exploits spatial redundancy.]


Challenges and Requirements (1)

1) Computation complexity
- Example: ≈ 10 Gops/sec for 4VGA decoding
- Requirement: explicit parallelism
  - Parallel and pipelined execution of computation

2) Communication complexity
- Example: ≈ 500 MB/sec external memory bandwidth for 4VGA decoding
- Requirement: explicit and predictable communication
  - Parallel execution of computation and communication
  - Efficient communication scheme, e.g. burst transfer


Challenges and Requirements (2)

3) Specification complexity
- Example: the current MB depends on the previous, upper, and upper-previous MBs
- Requirement: higher-level synchronization
  - Function-level and thread-level synchronization

4) Control complexity
- Example: 4x4 intra, 16x16 intra, 4x4 inter, 8x8 inter, … MB prediction modes
- Requirement: explicit data-dependent computation and communication
  - Sophisticated control operations, e.g. memory management


Contribution

Proposed modeling style for video codecs: abstract clock synchronous model (ACSM)
- An extension of the clocked synchronous model for RTL
- Coarser clock: abstract clock

How ACSM addresses each challenge
1) Partially ordered dependency at function level: addresses computation complexity
2) Explicit communication channels with fixed buffer size: addresses communication complexity
3) Coarser clock granularity than the physical clock: addresses specification complexity
4) Explicit conditionals through the global abstract clock: addresses control complexity


Proposed modeling style: abstract clock synchronous model

Structure
- Network of state-less functions
- Memory elements
- Abstract clock

Hypothesis: lock step execution model
- In every abstract clock cycle, new values propagate through the network

[Figure: an ACSM network of functions F0, F1, F2, F3 and a Delay element, driven by the abstract clock.]


RTL model vs ACSM

                  RTL model                         ACSM
Components        Network of combinational gates    Network of state-less functions
Storage           Memory elements (delay)           Memory elements (delay)
Loops             No loops in combinational logic   No loops in the network of functions
Synchronization   Lock step model                   Lock step model
Synch. interval   Clock                             Abstract clock

[Figure: an RTL model (operators *, /, +, -, a Delay element, clk) next to an ACSM (functions F0, F1, F2, F3, a Delay element, abstract clock).]

Difference: granularity of the clock and of the components


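Basic components

(a) Module: a function F0 with input ports i0 ... in and output ports o0 ... om
(b) Delay: Z^-K
(c) Arc: a connection between blocks (e.g. F0 to F1)
(d) If-action subsystem (IAS): an If/else block selects among subsystems IAS1 ... IASi; a Merge block joins their outputs
(e) For-iterator subsystem (FIS): a Demux (D) and a Mux (M) surround the iterated function, e.g. F0 iterated over 0..3

[Figure: the five components above; arcs are drawn as either data or control.]

Rule on execution: a block fires when one token is present on every input port. The Merge block is the exception.

Rule on absent token: absent tokens are defined with respect to the global abstract clock.
- token = (tag, value)
- absent token = (tag, empty value)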
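The token and firing rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Token`, `fire`, and `merge` are hypothetical, the tag is assumed to be the abstract clock cycle, and two behaviours are assumptions that the slides do not spell out: a module that receives an absent token emits an absent token, and Merge forwards whichever input is present.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass(frozen=True)
class Token:
    tag: int                      # abstract clock cycle (assumed meaning of "tag")
    value: Optional[Any] = None   # None models the "empty value" of an absent token

    @property
    def absent(self) -> bool:
        return self.value is None


def fire(func: Callable[..., Any], inputs: Sequence[Token]) -> Token:
    """Fire a state-less module once one token is present on every input port."""
    if not inputs:
        raise ValueError("a module needs at least one input token")
    tag = inputs[0].tag
    if any(t.tag != tag for t in inputs):
        raise ValueError("all input tokens must carry the same tag")
    # Assumption: an absent input makes the output absent for this cycle.
    if any(t.absent for t in inputs):
        return Token(tag)
    return Token(tag, func(*(t.value for t in inputs)))


def merge(a: Token, b: Token) -> Token:
    """Merge is the exception to the firing rule: forward the present input."""
    if a.tag != b.tag:
        raise ValueError("merge inputs must carry the same tag")
    return b if a.absent else a
```

Because every token carries a tag even when its value is empty, each block sees exactly one token per port per abstract clock cycle, which is what keeps the network in lock step.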


Abstract clock for video codecs

Hierarchical iterations in video decoder and encoder
- Frame-level index: too coarse to express parallelism
- Sub-macroblock-level index: irregular delay depth
- Macroblock-level index: proper granularity and regular delay depth

Abstract clock for video codecs: macroblock index

[Figure: QCIF image. (a) Decoding order of macroblocks; (b) decoding order of 4x4 blocks, numbered 0 to 15 within each macroblock. Both panels mark the Upper MB and the Prev MB of the current macroblock. Annotated delay depths: 6 and 166.]


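Implementations of RTL model and ACSM

[Figure: an RTL model (operators +, /, *, -, a Delay element, clk) and its implementation with add, div, mul, and sub units and a register driven by clk; an ACSM (functions F0, F1, F2, F3, a Delay element, abstract clock) and its implementation with IP1, IP3, F0, F2, and an SRAM under the abstract clock. Example component of an RTL implementation: a 16-bit carry-lookahead adder. Example component of an ACSM implementation: a RISC processor.]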


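A simplified ACSM of an H.264 decoder

[Figure: simplified ACSM of the H.264 decoder, from Video BitStream to Video Out. Modules: Fr. VLD, MB VLD, 8x8 IQ, 8x8 IT, 8x8 DF, 8x8 SC, 8x8 MC, 4x4 MC, 4x4 IF, 8x8 IF, and an adder. Delays: Z^-1 and Z^-W. Stores: Current Frame, Previous Frames. Control: Intra/Inter MB, 16 MVs, 4 MC types. Subsystems: IAS1, IAS2, IAS3 (If/else) and FIS1 with Demux/Mux blocks iterating over 0..3.]

In the full model
- 83 modules
- 24 delays
- 286 arcs
- 43 IASs
- 5 FISs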
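The Z^-1 and Z^-W delays in the figure can be read as the way a macroblock-clocked model reaches its previous and upper neighbours. The following Python sketch illustrates that reading under stated assumptions: the `Delay` class is hypothetical, macroblocks are assumed to be processed in raster order, and W is taken as the frame width in macroblocks (11 for a 176-pixel-wide QCIF frame with 16x16 macroblocks).

```python
from collections import deque


class Delay:
    """Z^-K delay: returns the value written K abstract clock cycles earlier."""

    def __init__(self, depth, initial=None):
        # The buffer always holds exactly `depth` values; maxlen drops the oldest.
        self.buf = deque([initial] * depth, maxlen=depth)

    def step(self, value):
        out = self.buf[0]       # oldest stored value
        self.buf.append(value)  # store the value of the current cycle
        return out


W = 176 // 16            # QCIF width in macroblocks (assumption: 16x16 MBs)
prev_mb = Delay(1)       # Z^-1: macroblock decoded in the previous cycle
upper_mb = Delay(W)      # Z^-W: macroblock one row above in raster order

trace = []
for mb_index in range(2 * W):   # one abstract clock cycle per macroblock
    trace.append((mb_index, prev_mb.step(mb_index), upper_mb.step(mb_index)))
```

In this sketch `trace[11]` is `(11, 10, 0)`: macroblock 11 sees macroblock 10 through Z^-1 and macroblock 0 through Z^-W. At the left edge of a row the Z^-1 output is the last macroblock of the row above rather than a spatial neighbour, so a real model would have to mask it.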


Synchronous Dataflow (SDF)

[Figure: (a) ACSM and (b) SDF versions of a graph with F1, F2, F3, delays Z^-j and Z^-k, and an If/else choice between F2 and F3. The SDF version selects the result with an Fmux block.]

- SDF: concurrent actors communicate through one-way FIFO channels
- Rule on execution: an actor fires when a sufficient, non-zero number of tokens is present on every input port
- Rule on absent token: no concept of an absent token
- Difficulty: no explicit conditionals (fails requirement 4)
  - Redundant computation: one of F2 and F3
  - Redundant communication: one of (Z^-j ⇒ F2) and (Z^-k ⇒ F3)
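The redundant computation can be made concrete with a small Python sketch. It is an illustration under assumptions, not the models used in the experiment: `sdf_step`, `acsm_step`, and `run` are hypothetical names, and F2 and F3 are stand-in functions. The SDF version computes both branches and selects one in a mux; the ACSM version evaluates only the branch chosen by the If/else.

```python
def sdf_step(cond, x, f2, f3):
    """SDF style: both actors fire every iteration, then a mux picks one result."""
    a = f2(x)
    b = f3(x)
    return a if cond else b


def acsm_step(cond, x, f2, f3):
    """ACSM style: the If-action subsystem fires only the selected branch."""
    return f2(x) if cond else f3(x)


def run(step, conditions):
    """Run one step per condition and count how often each function fires."""
    calls = {"F2": 0, "F3": 0}

    def f2(x):
        calls["F2"] += 1
        return x + 1      # stand-in computation

    def f3(x):
        calls["F3"] += 1
        return x * 2      # stand-in computation

    outputs = [step(c, i, f2, f3) for i, c in enumerate(conditions)]
    return outputs, calls


conditions = [True, False, True, True]
sdf_out, sdf_calls = run(sdf_step, conditions)
acsm_out, acsm_calls = run(acsm_step, conditions)
```

Both versions produce the same outputs, but over these four iterations the SDF version fires F2 and F3 four times each, while the ACSM version fires F2 three times and F3 once.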


Boolean Dataflow (BDF)

[Figure: (a) ACSM and (b) BDF versions of a graph with F1, F2, F3, F4, delays Z^-j and Z^-k, and an If/else choice between F2 and F3. The BDF version routes tokens with SWITCH and SELECT blocks that have T and F branches.]

- BDF: concurrent actors communicate through one-way FIFO channels
- Rule on execution: an actor fires when a sufficient, non-zero number of tokens is present on every input port, except for the SWITCH and SELECT blocks
- Rule on absent token: absent tokens without a global clock
- Difficulty: buffer overflow problem (fails requirement 4)
  - Tokens accumulate on (Z^-j ⇒ F2) and (Z^-k ⇒ F3)
  - Solving this requires many additional switches: 138 switches for 83 modules in the case of the H.264 decoder
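One possible reading of the accumulation problem is sketched below in Python. The assumptions are mine, not the slides': each delay delivers one token per iteration to the FIFO in front of its consumer, and only the branch selected by the Boolean control consumes a token, so the FIFO of the branch that is not taken grows. The function `simulate` is a hypothetical name.

```python
from collections import deque


def simulate(conditions):
    """Return the FIFO lengths in front of F2 and F3 after all iterations."""
    fifo_f2 = deque()   # channel Z^-j => F2
    fifo_f3 = deque()   # channel Z^-k => F3
    for i, cond in enumerate(conditions):
        # Assumption: both delays produce one token in every iteration.
        fifo_f2.append(i)
        fifo_f3.append(i)
        # Only the selected branch fires and consumes one token.
        if cond:
            fifo_f2.popleft()
        else:
            fifo_f3.popleft()
    return len(fifo_f2), len(fifo_f3)


left_f2, left_f3 = simulate([True, True, True, False])
```

With three iterations taking the F2 branch and one taking the F3 branch, one token is left in front of F2 and three in front of F3. Without a global clock nothing discards the leftover tokens, which is why extra switches are needed on those channels.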


Other modeling styles

Kahn Process Network (KPN)
- Difficulty 1: task-level coarse granularity (fails requirement 1)
- Difficulty 2: computation, communication, and control are mixed (fails requirements 2 and 4)

Synchronous Dataflow with embedded control (SDF+eCtrl)
- Difficulty 1: coarser granularity (fails requirement 1)
- Difficulty 2: still redundant communication (fails requirement 4)

Synchronous model (SM)
- Rule on execution: a block fires when a present token is on one or more input ports
- Rule on absent token: absent tokens with a virtual clock
- Difficulty: distributed implementation (fails requirement 1)


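Summary of modeling styles

Models compared: KPN, SDF, BDF, SDF+eCtrl, SM, RTL, ACSM

Data-driven models
- Clock: no global clock
- Analysis: no absent token (or absent token without a clock)

Event-driven models (SM)
- Clock: virtual clocks
- Analysis: unrestricted absent token; distributed system

Clock-driven models (RTL, ACSM)
- Clock: physical clock (RTL), abstract clock (ACSM)
- Analysis: restricted absent token with a global clock; execution ≈ data-driven, absent token ≈ event-driven

[Table: check marks (√) per model for the requirements Explicit fine-grain parallelism, Explicit communication, Higher level synchronization, and Explicit data-dependent op., plus an "Explicit conditionals" analysis entry. The column alignment of these entries is not recoverable from the transcript.]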


Experiment environment

Target application
- H.264 baseline decoder
- Foreman sequence, QCIF format, with QP=28 and IntraPeriod=5
- All search modes are supported

Target architecture
- Tensilica Xtensa single processor
- DMA + memory I/F

Mapping
- Frames are stored in global memory
- The image fetch process is executed on the DMA
- All other processes are executed on the Xtensa

[Figure: target architecture. GlobalMem0 and Proc0 (Xtensa, local memory, DMA/memory I/F) are connected through an interconnection, with In and Out ports.]


Experiment

[Figure: experiment flow. H.264 decoder C code -> Modeling -> Simulink models -> SW code by RTW -> Profile with Xtensa ISS -> cycle and memory access counts. Simulink models: SDF (4x4 w/o IAS), ACSM (4x4), Opt. ACSM (4x4, 8x8). Also compared: hand-written code. Target: Xtensa single processor.]

Results relative to SDF = 100

                         SDF     ACSM    Opt. ACSM   Hand-written code
Execution cycles         100     71.1    68.1        40.0
External memory access   100     78.1    54.0        54.0

- Execution cycle breakdown: Top control, Luma MC, Chroma MC, Luma DF, Chroma DF, VLD, IT, Rec, Etc
- External memory access breakdown: Read Luma, Read Chroma, Write Luma, Write Chroma


Experiment result: comparison

Compared to SDF
- Reduces execution time by 32%, due to the redundant computation of SDF
- Reduces external memory access by 46%, due to the redundant communication of SDF

Compared to SDF+eCtrl
- Similar execution time, but SDF+eCtrl limits the design space
- Reduces external memory access by 46%, due to the redundant communication of SDF+eCtrl

Compared to BDF
- BDF suffers buffer overflow, or requires many switches (138 switches for 83 modules)
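The 32% and 46% figures follow directly from the normalized chart values in the experiment (SDF = 100, optimized ACSM = 68.1 execution cycles and 54.0 external memory accesses). A minimal Python check, with the dictionary layout and the function name `reduction_percent` chosen only for illustration:

```python
# Normalized results read from the experiment charts (SDF = 100).
execution_cycles = {"SDF": 100.0, "ACSM": 71.1, "Opt. ACSM": 68.1, "Hand-written code": 40.0}
memory_access = {"SDF": 100.0, "ACSM": 78.1, "Opt. ACSM": 54.0, "Hand-written code": 54.0}


def reduction_percent(values, baseline, candidate):
    """Relative reduction of `candidate` with respect to `baseline`, in percent."""
    return 100.0 * (values[baseline] - values[candidate]) / values[baseline]


time_saving = reduction_percent(execution_cycles, "SDF", "Opt. ACSM")
memory_saving = reduction_percent(memory_access, "SDF", "Opt. ACSM")
```

Here `time_saving` is about 31.9, which rounds to the reported 32%, and `memory_saving` is 46. The same function shows that the optimized ACSM matches the hand-written code in external memory access while still needing more execution cycles.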


Contents
- Introduction
- Contribution
- Basics of target application and architecture
- Proposed modeling style
- Comparison with existing modeling styles
- Experiment
- Conclusion and perspective


Conclusion and perspective

Conclusion
- ACSM is suitable for video codec applications
  - A high-level RTL model
  - Satisfies the requirements of a functional model
- Efficient SW code generation from Simulink with RTW
  - 32% less execution time than SDF
  - 46% less external memory access than SDF

Perspective
- SW generation tool for distributed implementation


Thank you!!!

Questions?