
MPEG2 Video Encoding on Imagine

November 16, 2000

Scott Rixner

Scott Rixner Imagine Architecture 2

Programming Imagine

Architecture features
– Data bandwidth management
– Data-parallel clusters
– Parallel-subword operations

Stream programming model
– Natural data streams of application
– Computation kernels perform “functions”

Challenge is to think in terms of streams instead of traditional C-style sequential code

Application Development (1)

Compose stream and kernel diagram
– Identify natural streams in the application
– Understand data-parallelism and how to map it to the clusters
– Stream-oriented algorithmic choices

Write kernel code
– C-like syntax
– idebug enables quick non-performance, functional debugging
– iscd/schedviz enables C-level performance tuning

Application Development (2)

Write stream code
– First cut: simple mapping of stream/kernel diagram
– idebug enables quick functional testing
– Second cut: convert to macrocode (soon to be obsolete)
– isim yields cycle-accurate simulation

Performance tuning
– schedviz allows quick kernel tuning
– appviz shows where application run-time is going

MPEG2 Encoding

Color Conversion (RGB → YCbCr)
Motion Estimation
Discrete Cosine Transform
Quantization
Run-level Encoding
Variable-length Coding

IDCTQ/Correlation for Reference Frame
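As an illustration of the run-level encoding stage listed above, here is a minimal scalar sketch. It is not the Imagine kernel: the names `RunLevel` and `run_level_encode` are hypothetical, and the input is assumed to be one block of quantized coefficients already placed in scan order.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One (run, level) pair: `run` zero coefficients followed by a nonzero `level`.
// Hypothetical type name, chosen for this sketch only.
using RunLevel = std::pair<int, std::int16_t>;

// Run-level encode one block of quantized coefficients (assumed to be in scan order).
// Trailing zeros produce no pair; a real encoder would signal end-of-block instead.
std::vector<RunLevel> run_level_encode(const std::vector<std::int16_t>& coeffs) {
    std::vector<RunLevel> out;
    int run = 0;  // zeros seen since the last nonzero coefficient
    for (std::int16_t c : coeffs) {
        if (c == 0) {
            ++run;
        } else {
            out.emplace_back(run, c);
            run = 0;
        }
    }
    return out;
}
```

After quantization most high-frequency coefficients are zero, so a block usually collapses to a few pairs, which is what makes the following variable-length coding stage effective.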

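Streams and Kernels

[Figure: stream and kernel diagram for the encoder. Kernels: RGB → YCrCb, BlockSAD (past), BlockSAD (future), BestMatch, DCTQ, RLE, VLC, IQDCT, Correlate. Data moves between the kernels through SRF memory, with Load and Save/Load operations to Memory or Network and a Peripheral. A Rate Control Feedback path is shown. Paths are labeled by frame type: I; P; P,B; ALL.]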

Imagine Programming Environment

[Figure: tool flow. StreamC is compiled by the C++ compiler for the Host; KernelC goes through the kernel scheduler to produce microcode for Imagine. The stream scheduler is shown alongside. Regions are labeled Compile-time and Run-time.]

    StereoDepthExtraction(…) {
      // Load Input Images
      ...
      // Run Kernels
      convolve7x7(RawImage, ConvImage);
      convolve3x3(ConvImage, Conv2Image);
      ...
      // Store Output
    }

    Convolve7x7(…) {
      ...
      while (!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12)
            + (p34 + p56);
        ...
      }
    }

Imagine Programming Tools

[Figure: tool flow with visualization. Compile-time: StreamC goes through the C++ compiler to application.exe (Host); KernelC goes through the kernel scheduler to microcode (Imagine). Run-time: streamscheduler.dll and a profile. Visualization: kernel.viz and application.viz are viewed in the schedule visualizer.]

KernelC

    loop_stream(datain) pipeline(1) {
      datain >> color1 >> color2 >> color3 >> color4;

      // c = 0.299R || 0.114B
      c1 = hi(mulrnd(RB_SCALE, shift(a1, 1)));
      c2 = hi(mulrnd(RB_SCALE, shift(a2, 1)));
      c3 = hi(mulrnd(RB_SCALE, shift(a3, 1)));
      c4 = hi(mulrnd(RB_SCALE, shift(a4, 1)));
      …
      Yout << hi(mulrnd(Ymadj, shift(temp0, 1))) + Yaadj;
      Yout << hi(mulrnd(Ymadj, shift(temp1, 1))) + Yaadj;

      first = hi(mulrnd((a1a3 - (z1 + z3)), C_SCALE)) + one_two_eight;
      second = hi(mulrnd((a2a4 - (z2 + z4)), C_SCALE)) + one_two_eight;

      first = commucperm(perm_a, first);
      second = commucperm(perm_b, second);

      CrCbout << select(low, first, second);
    }
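For readers unfamiliar with the arithmetic behind the color-conversion kernel, the sketch below shows a scalar fixed-point RGB to YCbCr conversion. It is an assumption-laden illustration, not a translation of the KernelC above: it uses the full-range BT.601 coefficients (the 0.299 and 0.114 weights match the kernel comment), whereas the values of `RB_SCALE`, `Ymadj`, `Yaadj`, and `C_SCALE` are not given on the slide and are not reproduced. The names `YCbCr`, `clamp8`, and `rgb_to_ycbcr` are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical result type for this sketch.
struct YCbCr { std::uint8_t y, cb, cr; };

// Clamp an intermediate result to the 8-bit range.
static std::uint8_t clamp8(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Full-range BT.601 conversion (an assumption; the slide does not state its constants).
// Coefficients are scaled by 2^16; adding 1 << 15 before the shift rounds to nearest.
// Relies on arithmetic right shift of negative values (guaranteed since C++20).
YCbCr rgb_to_ycbcr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const int y  = ( 19595 * r + 38470 * g +  7471 * b + 32768) >> 16;          // 0.299, 0.587, 0.114
    const int cb = ((-11058 * r - 21710 * g + 32768 * b + 32768) >> 16) + 128;  // -0.1687, -0.3313, 0.5
    const int cr = (( 32768 * r - 27439 * g -  5329 * b + 32768) >> 16) + 128;  // 0.5, -0.4187, -0.0813
    return { clamp8(y), clamp8(cb), clamp8(cr) };
}
```

The Imagine kernel performs the same kind of multiply-round-add sequence on packed 16-bit subwords, which is why `mulrnd` and `hi` dominate the KernelC listing.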

7x7 Convolution Kernel

[Figure: kernel schedule visualization for the 7x7 convolution kernel, shown as two schedule listings. Columns are functional units (ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0 to INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, VAL0), grouped as ALUs, Comm/SP, and Streams; rows are cycles. Pipeline Stage 0, Pipeline Stage 1, and Pipeline Stage 2 are marked. Scheduled operations include IMULRND16, IADDS16, SHUFFLE, SELECT, NSELECT, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, SPCREAD_WT, SPCWRITE, GEN_CISTATE, GEN_CCEND, COND_IN_D, CHK_ANY, and LOOP.]

StreamC

    for (row=0; row<NROWS; row++) {
      // update quantization factor for rate control
      quantizerScale = newQuantizerScale;

      // setup streams for this row
      ...

      // Perform I-Frame encoding
      convert(InputRow, &YRow, &CbCrRow);
      dct(YRow, dctIconstants, quantizerScale, &DCTYRow);
      dct(CbCrRow, dctIconstants, quantizerScale, &DCTCbCrRow);
      rle(DCTYRow, DCTCbCrRow, rleConstants, &RunLevelsRow);
      vlc(RunLevelsRow, &bitStream, &newQuantizerScale);

      // Store generated bit stream
      ...

      // Generate reference image for subsequent P or B frames
      idct(DCTYRow, idctIconstants, quantizerScale, &RefYRow);
      idct(DCTCbCrRow, idctIconstants, quantizerScale, &RefCbCrRow);

      // Store reference rows
      ...
    }

Macrocode

    for (int row = 0; row < mb_height; row++)
    {
      for (int col = 0; col < mb_width; col += iNumBlocks)
      {
        rts.write_ucr(1, image_size_param);
        rts.write_ucr(2, idxparams);
        rts.vect_op(idxgen, 0, 1, iframe.colorIndices);
        rts.vect_load(false, iframe.imageBuffer[even],
                      iframe.colorIndices,
                      memInputFrame, msg);
        rts.vect_op(icolor, 1, 2, "icolor conversion",
                    iframe.imageBuffer[odd],
                    iframe.blkY1dct, iframe.blkCrCb1dct);

        rts.write_ucr(1, quantizer_scale);
        rts.vect_op(dct, 2, 1, "Y dct",
                    iframe.blkY1dct, dctIntraConsts,
                    iframe.blkY2rle);
        rts.write_ucr(1, quantizer_scale);
        rts.vect_op(dct, 2, 1, "CrCb dct",
                    iframe.blkCrCb1dct, dctIntraConsts,
                    iframe.blkCrCb2rle);

        rts.write_ucr(1, 0);
        rts.write_ucr(2, quant_scale);
        rts.vect_op(rle, 4, 1, "RLE",
                    iframe.blkY2rle, iframe.blkCrCb2rle,
                    rle_consts, zeroLength,
                    UP(iframe.blkRunLevels[odd]));
        rts.vect_store(false, iframe.blkRunLevels[odd], memOutputFrame, msg);

        rts.write_ucr(1, iquantizer_scale);
        rts.vect_op(idct, 2, 1, "Y idct",
                    iframe.blkY2rle, idctIntraConsts,
                    iframe.blkY3);
        rts.write_ucr(1, iquantizer_scale);
        rts.vect_op(idct, 2, 1, "CrCb idct",
                    iframe.blkCrCb2rle, idctIntraConsts,
                    iframe.blkCrCb3);

        rts.write_ucr(1, 0);
        rts.vect_op(correlate, 4, 2, "correlate",
                    iframe.blkY3, iframe.blkCrCb3,
                    iframe.dummy_blkYMVref, iframe.dummy_blkCrCbMVref,
                    iframe.blkYref[odd], iframe.blkCrCbref[odd]);
        rts.vect_store(false, iframe.blkYref[odd], memNewRefY, msg);
        rts.vect_store(false, iframe.blkCrCbref[odd], memNewRefCrCb, msg);
      }
    }

Stereo Depth Extractor

[Figure: two execution timelines with lanes for the clusters, Mem_0, and Mem_1, plotted against cycle count.

Convolutions (cycles 21,300 to 23,600), repeating LOAD, UNPACK, CONV 7x7, CONV 3x3, STORE:
– Load original packed row
– Unpack (8 bit → 16 bit)
– 7x7 Convolve
– 3x3 Convolve
– Store convolved row

Disparity Search (cycles 501,300 to 503,300), repeating Load, BlockSAD, Store:
– Load convolved rows
– Calculate BlockSADs at different disparities
– Store best disparity values]

Tools

idebug (functional simulator)
– Built on top of Visual Studio (any C++ compiler)

iscd (kernel scheduler)
– Generates optimized VLIW assembly from C-like code

isim (cycle-accurate simulator)
– Simulates current Imagine architecture (configurable)

schedviz (schedule/application visualizer)
– Interactive visualization of resource utilization

stream scheduler (run-time stream manager)

idebug

Macros and libraries

Enable Imagine StreamC/KernelC to be directly compiled by a C++ compiler

Enables the use of any C++ debugger to debug Imagine code

Can add arbitrary C++ code into the StreamC/KernelC for debugging
– Function stubs
– printf’s, etc.

Imagine Debugging

[Figure: debugging flow. At compile time, StreamC and KernelC are compiled together with the idebug extensions by the C++ compiler into application.exe. At run time, application.exe uses idebug.dll and runs under a debugger.]

IDebug

iscd

Optimizing VLIW scheduler

Compiles KernelC

Currently supports
– copy propagation & dead code elimination
– software pipelining
– loop unrolling
– schedule randomization
– inline functions (no function calls)

Configurable target architecture

isim

Similar application performance to RTL

~4M cycles per hour (>1000 cycles per second)

Configurable
– Machine description file (same file as for iscd)
– # clusters, ALU mix/connection, memory system, etc.

Interactive command prompt
– Debugging
– Performance monitoring/reporting
– Memory/file comparison

schedviz

Interactive schedule visualizer

Visual Basic

Shows resource utilization
– Operation scheduling
– Communication scheduling

Enables source-level performance optimization
– Never look at assembly code!

Also view application execution
– Cluster, memory, network utilization

Stream Scheduler (1)

Converts StreamC functions into Imagine operations

Allocates:
• operation issue slots
• stream-level registers
• stream register file (SRF)
• memory

Determines dependencies between operations

Stream Scheduler (2)

SRF allocation is critical
– requires usage information
– requires foreknowledge
– too costly to perform at run time

Stream scheduler is profile based
– run once with simple allocation
– collect usage information
– perform good allocation
– run repeatedly with good allocation
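To make the need for usage information concrete, the sketch below shows one way a profile could drive SRF allocation. It is not the Imagine stream scheduler's algorithm, which the slides do not describe; it is a simple first-fit allocator assumed for illustration. The `StreamUse` record and the `allocate_srf` function are hypothetical names, and the profile is assumed to record each stream's first use, last use, and length.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Usage record assumed to be collected during the profiling run (hypothetical).
struct StreamUse {
    int first_op;       // index of the first operation that touches the stream
    int last_op;        // index of the last operation that touches the stream
    std::size_t words;  // stream length in SRF words
};

// First-fit SRF allocation: streams whose lifetimes overlap get disjoint regions,
// streams whose lifetimes do not overlap may share space.
// Returns one offset per stream, or an empty vector if some stream does not fit.
std::vector<std::size_t> allocate_srf(const std::vector<StreamUse>& uses,
                                      std::size_t srf_words) {
    std::vector<std::size_t> order(uses.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return uses[a].first_op < uses[b].first_op;
    });

    std::vector<std::size_t> offset(uses.size(), 0);
    std::vector<std::size_t> placed;
    for (std::size_t s : order) {
        // Regions held by already-placed streams that are live at the same time as s.
        std::vector<std::pair<std::size_t, std::size_t>> busy;
        for (std::size_t p : placed) {
            if (uses[p].first_op <= uses[s].last_op && uses[s].first_op <= uses[p].last_op)
                busy.emplace_back(offset[p], offset[p] + uses[p].words);
        }
        std::sort(busy.begin(), busy.end());

        std::size_t at = 0;  // lowest candidate offset
        for (const auto& b : busy) {
            if (at + uses[s].words <= b.first) break;  // the gap before b is large enough
            at = std::max(at, b.second);
        }
        if (at + uses[s].words > srf_words) return {};
        offset[s] = at;
        placed.push_back(s);
    }
    return offset;
}
```

The allocator needs every stream's lifetime and length before it places anything, which is the foreknowledge the slide refers to: the first run with a simple allocation exists to collect exactly this information.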

Handling Large Streams

Strip mining

Double buffering

[Figure: Input and Output streams around the kernel sequence K1, K2, K3, which is shown twice.]
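The two techniques named on this slide can be sketched in ordinary sequential C++. This is an illustration under assumptions, not Imagine code: `Strip`, `Kernel`, and `run_strip_mined` are hypothetical names, the strip length stands in for whatever fits in the SRF, and the overlap of loading with computation that double buffering provides on hardware is only indicated by the alternating buffers.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Strip = std::vector<int>;                     // one strip of a long stream
using Kernel = std::function<Strip(const Strip&)>;  // a kernel maps a strip to a strip

// Strip mining: cut a long input stream into strips of at most `strip_len`
// elements and run the whole kernel chain on each strip in turn.
// Double buffering: keep two strip buffers so the next strip can be loaded
// while the current one is processed (sequential here; overlapped on hardware).
std::vector<int> run_strip_mined(const std::vector<int>& input,
                                 std::size_t strip_len,
                                 const std::vector<Kernel>& kernels) {
    std::vector<int> output;
    if (input.empty() || strip_len == 0) return output;
    output.reserve(input.size());

    Strip buf[2];
    auto load = [&](std::size_t start, Strip& dst) {
        const std::size_t end = std::min(start + strip_len, input.size());
        dst.assign(input.begin() + static_cast<std::ptrdiff_t>(start),
                   input.begin() + static_cast<std::ptrdiff_t>(end));
    };

    load(0, buf[0]);  // prime the first buffer
    std::size_t i = 0;
    for (std::size_t start = 0; start < input.size(); start += strip_len, ++i) {
        // Load the next strip into the other buffer before computing on this one.
        if (start + strip_len < input.size())
            load(start + strip_len, buf[(i + 1) % 2]);

        Strip s = buf[i % 2];
        for (const Kernel& k : kernels) s = k(s);  // e.g. K1, K2, K3
        output.insert(output.end(), s.begin(), s.end());
    }
    return output;
}
```

Intermediate results between kernels never leave the strip, mirroring the way intermediate streams stay in the SRF rather than going back to memory.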

Stream Algorithms: Blocksearch

[Figure: a row from the current image is compared against a search region that spans Row 0, Row 1, and Row 2 of the reference image. The blocksearch kernel takes reference row 0, reference row 1, reference row 2, and the current row as input streams and produces motion vectors.]
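A plain scalar version of block matching by sum of absolute differences (SAD) may help in reading the figure. It is a sketch under assumptions rather than the Imagine blocksearch kernel: the `Image`, `MotionVector`, `block_sad`, and `block_search` names are hypothetical, the search is exhaustive over a square window, and images are 8-bit luminance in row-major order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical 8-bit luminance image, row-major.
struct Image {
    int width, height;
    std::vector<std::uint8_t> pix;
    std::uint8_t at(int x, int y) const {
        return pix[static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
                   + static_cast<std::size_t>(x)];
    }
};

struct MotionVector { int dx, dy; unsigned sad; };

// Sum of absolute differences between the block at (cx, cy) in `cur`
// and the block at (rx, ry) in `ref`.
unsigned block_sad(const Image& cur, int cx, int cy,
                   const Image& ref, int rx, int ry, int block) {
    unsigned sad = 0;
    for (int y = 0; y < block; ++y)
        for (int x = 0; x < block; ++x)
            sad += static_cast<unsigned>(
                std::abs(int(cur.at(cx + x, cy + y)) - int(ref.at(rx + x, ry + y))));
    return sad;
}

// Exhaustive search over offsets in [-range, range] in both directions;
// returns the offset with the smallest SAD (first one found on ties).
MotionVector block_search(const Image& cur, int cx, int cy,
                          const Image& ref, int block, int range) {
    MotionVector best{0, 0, std::numeric_limits<unsigned>::max()};
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            const int rx = cx + dx, ry = cy + dy;
            if (rx < 0 || ry < 0 || rx + block > ref.width || ry + block > ref.height)
                continue;  // candidate block falls outside the reference image
            const unsigned sad = block_sad(cur, cx, cy, ref, rx, ry, block);
            if (sad < best.sad) best = {dx, dy, sad};
        }
    }
    return best;
}
```

The inner loop is almost entirely 8-bit subtracts, absolute values, and adds, which fits well with parallel-subword operations on data-parallel clusters.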

MPEG2 Characteristics

Operations
– 56% 8-bit ADD/SUB

Little locality
– 1.47 accesses per word of global data

Computationally intense
– 155 operations per global data reference

Performance & Power

Raw Performance
– 360x288, 24-bit: 350 fps
– 720x486, 24-bit: 104 fps

Clusters provide high arithmetic bandwidth
– 27.6 GOPS on blocksearch kernel
– 17.9 GOPS overall

SRF provides necessary data locality, bandwidth
– Only temporary data in off-chip memory are reference frames
– 2.4 GB/s required, 32 GB/s available

Power Efficiency: 10.7 GOPS/W
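The two raw performance figures can be compared by converting them to pixels per second. The small helper below (`pixel_rate` is a hypothetical name) does only that arithmetic; the observation that the two rates nearly coincide is a reading of the slide's numbers, not a claim made by the slide.

```cpp
#include <cstdint>

// Pixels encoded per second for a given frame size and frame rate.
constexpr std::int64_t pixel_rate(std::int64_t width, std::int64_t height,
                                  std::int64_t fps) {
    return width * height * fps;
}

// 360x288 at 350 fps and 720x486 at 104 fps, from the slide.
static_assert(pixel_rate(360, 288, 350) == 36288000, "about 36.3 Mpixels/s");
static_assert(pixel_rate(720, 486, 104) == 36391680, "about 36.4 Mpixels/s");
```

Both configurations come to roughly 36 million pixels per second, which suggests that encoding time scales with pixel count rather than with frame count.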

Bandwidth Hierarchy

[Figure: four SDRAM banks feed the Stream Register File, which feeds the ALU clusters. The bandwidth labels are 2 GB/s, 32 GB/s, and 544 GB/s, increasing from memory toward the ALUs.]

Stream Recirculation

[Figure: stream flow for the encoder across three levels: Memory (or I/O), Stream Register File, and Arithmetic Clusters.
Kernels: Color Convert, DCT (twice), IDCT (twice), Run-Level Encoding, Variable Length Coding.
Streams: Input Image, RGB Pixels, Luminance Pixels, Chrominance Pixels, Transformed Luminance, Transformed Chrominance, Luminance Reference, Chrominance Reference, RLE Stream, Bitstream, Encoded Bitstream, Reference Luminance Image, Reference Chrominance Image.]

Data Referenced: 835KB 4.8MB 38.6MB

MPEG Bandwidth

                        Unconstrained   Constrained   SRF Bandwidth
                        Bandwidth       Bandwidth     Use
    Memory (GB/s)            0.88           0.88          0.86
    Global RF (GB/s)         5.23           5.20          5.08
    Local RF (GB/s)        167.47         166.53          0.86
    Performance (GOPS)      10.89          10.83         10.58

MPEG Execution

[Figure: execution timeline over cycles 6,000 to 30,000 with lanes for CLUSTERS, MEMORY STREAM 0, and MEMORY STREAM 1, covering batch i-1, batch i, and batch i+1.
Cluster kernels: Color Conversion, Y DCT, CrCb DCT, Run-Level Encoding, Y IDCT, CrCb IDCT.
Memory operations: Load Input Blocks, Store Run-Level Encoded Blocks, Store Y Reference, Store CrCb Reference.]

Challenges

VLC (Huffman Coding)
– Difficult and inefficient to implement on clusters (SIMD on 32-bit data)
– Instead, send RLE data over network to FPGA
– Could add special-purpose Huffman coding stream unit

Rate Control
– Difficult because multiple macroblocks encoded in parallel
– Must perform on a coarser granularity (impact on picture quality?)
– For smaller image sizes, can simply re-encode a group of macroblocks at a higher quantization level if necessary in real-time
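The re-encoding idea in the last rate-control bullet can be sketched as a retry loop. Everything here is an assumption made for illustration: `EncodeGroup` stands in for whatever encodes one group of macroblocks and reports its size in bits, `encode_within_budget` is a hypothetical name, and the default upper bound of 31 reflects the 5-bit quantizer scale code of MPEG-2 rather than anything stated on the slide.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical encoder callback: encodes one group of macroblocks at the given
// quantizer scale and returns the number of bits produced.
using EncodeGroup = std::function<std::size_t(int quantizer_scale)>;

// Re-encode the group at progressively coarser quantization until it fits the
// bit budget or the maximum scale is reached. Returns the scale finally used.
int encode_within_budget(const EncodeGroup& encode, std::size_t bit_budget,
                         int start_scale, int max_scale = 31) {
    int scale = start_scale;
    while (encode(scale) > bit_budget && scale < max_scale)
        ++scale;  // a coarser quantizer yields fewer bits
    return scale;
}
```

Because the adjustment applies to a whole group of macroblocks at once, this is rate control at the coarser granularity the slide mentions; each retry costs a full re-encode of the group, which is why the slide limits it to smaller image sizes.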

Imagine Programming

Think in terms of streams

Range of software tools
– Compilers
– Visualizers
– Simulators

Achieve new levels of performance
– Less programming effort
– Greater power efficiency

If-Statement Example

    if (case) { f(x); }
    else { g(x); }

    if (case) { strA << x; }
    else { strB << x; }

[Figure: PE0 to PE3 run under shared control and hold case values 0, 1, 0, 1. Should the PEs execute f( ) or g( )? In the second panel, PE0 to PE3 are paired with SRF0 to SRF3 under shared control, with the same case values.]

Conditional Streams

– Data streams that are accessed conditionally based on a local case value
– Results in an arbitrary expansion or compression of stream in space and time

[Figure: input_stream holds elements A through P striped across SRF0 to SRF3 (lanes A E I M, B F J N, C G K O, and D H L P) and is read conditionally by PE0 to PE3 over iterations 0 to 6.]

kernel (sel), by iteration:

           0 1 2 3 4 5 6
    PE0    1 0 1 0 0 1 1
    PE1    0 1 1 0 1 1 1
    PE2    1 1 0 0 1 1 0
    PE3    0 1 1 0 0 1 0

kernel (data), elements received by each PE:

    PE0    A F K O
    PE1    C G I L P
    PE2    B D J M
    PE3    E H N
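The behaviour in the conditional-stream figure can be reproduced with a few lines of sequential code. This is a functional model only, with the hypothetical name `conditional_input`; it says nothing about how Imagine implements the routing in hardware. On each iteration, the PEs whose select bit is set take the next unread stream elements in PE order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// sel[pe][iter] is nonzero when processing element `pe` requests an input on
// iteration `iter`. Returns, for each PE, the sequence of elements it received.
std::vector<std::string> conditional_input(const std::string& stream,
                                           const std::vector<std::vector<int>>& sel) {
    std::vector<std::string> received(sel.size());
    std::size_t next = 0;  // index of the next unread stream element
    const std::size_t iters = sel.empty() ? 0 : sel[0].size();
    for (std::size_t it = 0; it < iters; ++it)
        for (std::size_t pe = 0; pe < sel.size(); ++pe)
            if (sel[pe][it] && next < stream.size())
                received[pe] += stream[next++];
    return received;
}
```

Called with the stream "ABCDEFGHIJKLMNOP" and the four select rows from the figure, the model hands out AFKO, CGILP, BDJM, and EHN to PE0 through PE3, matching the data shown for the kernel.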