MPEG2 Video Encoding on Imagine, November 16, 2000, Scott Rixner
Scott Rixner Imagine Architecture 2
Programming Imagine
Architecture features
– Data bandwidth management
– Data-parallel clusters
– Parallel-subword operations
Stream programming model
– Natural data streams of application
– Computation kernels perform “functions”
Challenge is to think in terms of streams instead of traditional C-style sequential code
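To make the stream and kernel idea concrete, here is a minimal sketch in plain C++ rather than StreamC/KernelC. The names `Stream`, `scale_kernel`, `offset_kernel`, and `run_pipeline` are hypothetical and chosen only for illustration; the point is that each kernel consumes a whole stream and produces a whole stream, and the application is a chain of kernels.

```cpp
#include <vector>

// Hypothetical stand-in for a stream: an ordered sequence of records.
using Stream = std::vector<int>;

// Hypothetical kernel: every element is processed independently, so on a
// data-parallel machine the loop body could run on all clusters at once.
Stream scale_kernel(const Stream& in, int factor) {
    Stream out;
    out.reserve(in.size());
    for (int x : in) out.push_back(x * factor);
    return out;
}

// A second hypothetical kernel with the same stream-in, stream-out shape.
Stream offset_kernel(const Stream& in, int offset) {
    Stream out;
    out.reserve(in.size());
    for (int x : in) out.push_back(x + offset);
    return out;
}

// Stream-level program: kernels connected by intermediate streams.
Stream run_pipeline(const Stream& input) {
    Stream scaled = scale_kernel(input, 2);
    return offset_kernel(scaled, 1);
}
```

Calling `run_pipeline` on the stream 1, 2, 3 yields 3, 5, 7. The intermediate stream `scaled` plays the role of data that would stay on chip between kernels.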
Application Development (1)
Compose stream and kernel diagram
– Identify natural streams in the application
– Understand data-parallelism and how to map it to the clusters
– Stream-oriented algorithmic choices
Write kernel code
– C-like syntax
– idebug enables quick functional (non-performance) debugging
– iscd/schedviz enables C-level performance tuning
Application Development (2)
Write stream code
– First cut: simple mapping of stream/kernel diagram
– idebug enables quick functional testing
– Second cut: convert to macrocode (soon to be obsolete)
– isim yields cycle-accurate simulation
Performance tuning
– schedviz allows quick kernel tuning
– appviz shows where application run-time is going
MPEG2 Encoding
Color Conversion (RGB → YCbCr)
Motion Estimation
Discrete Cosine Transform
Quantization
Run-level Encoding
Variable-length Coding
IDCTQ/Correlation for Reference Frame
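As an illustration of the run-level encoding stage listed above, here is a minimal C++ sketch. It assumes the usual meaning of the term: each nonzero coefficient is emitted as a pair of (number of zeros preceding it, coefficient value). The function name `run_level_encode` is hypothetical, and this is not the Imagine `rle` kernel.

```cpp
#include <utility>
#include <vector>

// Encodes a coefficient sequence as (run, level) pairs, where run is the
// count of zeros preceding each nonzero level. Trailing zeros produce no
// pair; a real encoder would signal them with an end-of-block code.
std::vector<std::pair<int, int>> run_level_encode(const std::vector<int>& coeffs) {
    std::vector<std::pair<int, int>> pairs;
    int run = 0;
    for (int c : coeffs) {
        if (c == 0) {
            ++run;                       // extend the current run of zeros
        } else {
            pairs.emplace_back(run, c);  // emit (zeros skipped, value)
            run = 0;
        }
    }
    return pairs;
}
```

For the input 5, 0, 0, -2, 1, 0, 0, 0 the result is (0, 5), (2, -2), (0, 1). Quantized DCT blocks are mostly zeros, which is why this representation is compact.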
Streams and Kernels
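[Figure: stream and kernel diagram of the encoder. Kernels: RGB → YCrCb, BlockSAD (past), BlockSAD (future), BestMatch, DCTQ, RLE, VLC, IQDCT, Correlate. Memory operations: Load, Memory Save/Load. Streams are labeled by frame type (I; P; P, B; ALL). A Rate Control Feedback path is shown. Legend: SRF, Memory, Memory or Network Peripheral.]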
Imagine Programming Environment
[Figure: programming environment. StreamC → C++ compiler; KernelC → kernel scheduler → microcode; stream scheduler. Labels: Host, Imagine, Compile-time, Run-time.]
StereoDepthExtraction(…) {
  // Load Input Images
  ...
  // Run Kernels
  convolve7x7 (RawImage,ConvImage);
  convolve3x3 (ConvImage,Conv2Image);
  ...
  // Store Output
}
convolve7x7(…){
  ...
  while(!In.empty()) {
    ...
    p0 = k0 * in10;
    p12 = k21 * in32;
    p34 = k43 * in54;
    p56 = k65 * in76;
    sum = (p0 + p12)
        + (p34 + p56);
    ...
  }
}
Imagine Programming Tools
[Figure: programming tools. StreamC → C++ compiler → application.exe; KernelC → kernel scheduler → microcode; streamscheduler.dll and profile; kernel.viz and application.viz → schedule visualizer. Labels: Host, Imagine, Compile-time, Run-time, Visualization.]
KernelC
loop_stream(datain) pipeline(1) {
  datain >> color1 >> color2 >> color3 >> color4;
  // c = 0.299R || 0.114B
  c1 = hi(mulrnd(RB_SCALE, shift(a1, 1)));
  c2 = hi(mulrnd(RB_SCALE, shift(a2, 1)));
  c3 = hi(mulrnd(RB_SCALE, shift(a3, 1)));
  c4 = hi(mulrnd(RB_SCALE, shift(a4, 1)));
  …
  Yout << hi(mulrnd(Ymadj, shift(temp0, 1)))+Yaadj;
  Yout << hi(mulrnd(Ymadj, shift(temp1, 1)))+Yaadj;
  first = hi(mulrnd((a1a3 - (z1 + z3)), C_SCALE)) + one_two_eight;
  second = hi(mulrnd((a2a4 - (z2 + z4)), C_SCALE)) + one_two_eight;
  first = commucperm(perm_a, first);
  second = commucperm(perm_b, second);
  CrCbout << select(low, first, second);
}
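The comment in the kernel above refers to the 0.299 (red) and 0.114 (blue) luminance weights. A scalar C++ sketch of the same fixed-point idea follows. It assumes the standard ITU-R BT.601 green weight of 0.587, which the slide does not show, and full-range 8-bit output; it does not reproduce the slide's `Ymadj`/`Yaadj` scaling or its packed subword arithmetic. The function name `luma_from_rgb` is hypothetical.

```cpp
#include <cstdint>

// Assumption: full-range 8-bit RGB in, full-range 8-bit luminance out.
// Weights are 0.299, 0.587, 0.114 scaled by 256 and rounded to integers
// (77 + 150 + 29 = 256), so white maps to 255.
std::uint8_t luma_from_rgb(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const int weighted = 77 * r + 150 * g + 29 * b;
    // Adding 128 before the shift rounds to nearest instead of truncating.
    return static_cast<std::uint8_t>((weighted + 128) >> 8);
}
```

Integer multiplies followed by a rounding shift are the scalar analogue of the `mulrnd` and `hi` operations in the KernelC listing.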
7x7 Convolution Kernel
[Figure: two schedule views of the 7x7 convolution kernel. Columns are functional units: ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0, INP1, INP2, INP3, OUT0, OUT1, SP_0, SP_0, COM0, MC_0, JUK0, VAL0, grouped as ALUs, Comm/SP, and Streams. Scheduled operations include IMULRND16, IADDS16, SHUFFLE, SELECT, NSELECT, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, and LOOP. The second view is divided into Pipeline Stage 0, Pipeline Stage 1, and Pipeline Stage 2.]
StreamC
for (row=0; row<NROWS; row++) {
  // update quantization factor for rate control
  quantizerScale = newQuantizerScale;
  // setup streams for this row...
  // Perform I-Frame encoding
  convert(InputRow, &YRow, &CbCrRow);
  dct(YRow, dctIconstants, quantizerScale, &DCTYRow);
  dct(CbCrRow, dctIconstants, quantizerScale, &DCTCbCrRow);
  rle(DCTYRow, DCTCbCrRow, rleConstants, &RunLevelsRow);
  vlc(RunLevelsRow, &bitStream, &newQuantizerScale);
  // Store generated bit stream...
  // Generate reference image for subsequent P or B frames
  idct(DCTYRow, idctIconstants, quantizerScale, &RefYRow);
  idct(DCTCbCrRow, idctIconstants, quantizerScale, &RefCbCrRow);
  // Store reference rows...
}
Macrocode
for (int row = 0; row < mb_height; row++)
{
for (int col = 0; col < mb_width; col += iNumBlocks)
{
rts.write_ucr(1, image_size_param);
rts.write_ucr(2, idxparams);
rts.vect_op(idxgen, 0, 1, iframe.colorIndices);
rts.vect_load(false, iframe.imageBuffer[even],
iframe.colorIndices,
memInputFrame, msg);
rts.vect_op(icolor, 1, 2, "icolor conversion",
iframe.imageBuffer[odd],
iframe.blkY1dct, iframe.blkCrCb1dct);
rts.write_ucr(1, quantizer_scale);
rts.vect_op(dct, 2, 1, "Y dct",
iframe.blkY1dct, dctIntraConsts,
iframe.blkY2rle);
rts.write_ucr(1, quantizer_scale);
rts.vect_op(dct, 2, 1, "CrCb dct",
iframe.blkCrCb1dct, dctIntraConsts,
iframe.blkCrCb2rle);
rts.write_ucr(1, 0);
rts.write_ucr(2, quant_scale);
rts.vect_op(rle, 4, 1, "RLE",
iframe.blkY2rle, iframe.blkCrCb2rle,
rle_consts, zeroLength,
UP(iframe.blkRunLevels[odd]));
rts.vect_store(false, iframe.blkRunLevels[odd],
memOutputFrame, msg);
rts.write_ucr(1, iquantizer_scale);
rts.vect_op(idct, 2, 1, "Y idct",
iframe.blkY2rle, idctIntraConsts,
iframe.blkY3);
rts.write_ucr(1, iquantizer_scale);
rts.vect_op(idct, 2, 1, "CrCb idct",
iframe.blkCrCb2rle, idctIntraConsts,
iframe.blkCrCb3);
rts.write_ucr(1, 0);
rts.vect_op(correlate, 4, 2, "correlate",
iframe.blkY3, iframe.blkCrCb3,
iframe.dummy_blkYMVref, iframe.dummy_blkCrCbMVref,
iframe.blkYref[odd], iframe.blkCrCbref[odd]);
rts.vect_store(false, iframe.blkYref[odd], memNewRefY, msg);
rts.vect_store(false, iframe.blkCrCbref[odd], memNewRefCrCb, msg);
}
}
Stereo Depth Extractor
[Figure: two execution timelines, each with columns for the clusters and two memory streams (Mem_0, Mem_1).]
Convolutions (axis ticks 21300 to 23600), repeated per row:
– Load original packed row
– Unpack (8-bit → 16-bit)
– 7x7 Convolve
– 3x3 Convolve
– Store convolved row
Disparity Search (axis ticks 501300 to 503300):
– Load convolved rows
– Calculate BlockSADs at different disparities
– Store best disparity values
Tools
idebug (functional simulator)
– Built on top of Visual Studio (any C++ compiler)
iscd (kernel scheduler)
– Generates optimized VLIW assembly from C-like code
isim (cycle-accurate simulator)
– Simulates current Imagine architecture (configurable)
schedviz (schedule/application visualizer)
– Interactive visualization of resource utilization
stream scheduler (run-time stream manager)
idebug
Macros and libraries
Enable Imagine StreamC/KernelC to be directly compiled by a C++ compiler
Enables the use of any C++ debugger to debug Imagine code
Can add arbitrary C++ code into the StreamC/KernelC for debugging
– Function stubs
– printf’s, etc.
Imagine Debugging
[Figure: debugging flow. StreamC and KernelC, with the idebug extensions → C++ compiler → application.exe; application.exe runs with idebug.dll under a debugger. Labels: Compile-time, Run-time.]
iscd
Optimizing VLIW scheduler
Compiles KernelC
Currently supports
– copy propagation & dead code elimination
– software pipelining
– loop unrolling
– schedule randomization
– inline functions (no function calls)
Configurable target architecture
isim
Similar application performance to RTL
~4M cycles per hour (>1000 cycles per second)
Configurable
– Machine description file (same file as for iscd)
– # clusters, ALU mix/connection, memory system, etc.
Interactive command prompt
– Debugging
– Performance monitoring/reporting
– Memory/file comparison
schedviz
Interactive schedule visualizer
Visual Basic
Shows resource utilization
– Operation scheduling
– Communication scheduling
Enables source-level performance optimization
– Never look at assembly code!
Also view application execution
– Cluster, memory, network utilization
Stream Scheduler (1)
Converts StreamC functions into Imagine operations
Allocates:
• operation issue slots
• stream-level registers
• stream register file (SRF)
• memory
Determines dependencies between operations
Stream Scheduler (2)
SRF allocation is critical
– requires usage information
– requires foreknowledge
– too costly to perform at run time
Stream scheduler is profile based
– run once with simple allocation
– collect usage information
– perform good allocation
– run repeatedly with good allocation
Handling Large Streams
Strip mining
Double buffering
[Figure: Input and Output streams with the kernel sequence K1, K2, K3 shown twice.]
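The two techniques named on this slide can be sketched in plain C++ as follows. This is a sequential model under stated assumptions, not Imagine code: `k1`, `k2`, and `k3` are hypothetical placeholder kernels, `strip_len` stands in for however much of the stream fits on chip, and the load that would overlap with computation on real hardware simply runs before it here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-element kernels standing in for K1, K2, K3.
inline int k1(int x) { return x + 1; }
inline int k2(int x) { return x * 2; }
inline int k3(int x) { return x - 3; }

// Processes a long input in strips of at most strip_len elements, using two
// buffers: while one strip is being computed, the next is loaded into the
// other buffer.
std::vector<int> strip_mine(const std::vector<int>& input, std::size_t strip_len) {
    std::vector<int> output;
    output.reserve(input.size());
    std::vector<int> buffers[2];
    std::size_t next = 0;  // index of the first element not yet loaded
    auto load = [&](std::vector<int>& buf) {
        const std::size_t n = std::min(strip_len, input.size() - next);
        const auto first = input.begin() + static_cast<std::ptrdiff_t>(next);
        buf.assign(first, first + static_cast<std::ptrdiff_t>(n));
        next += n;
    };
    int fill = 0;
    load(buffers[fill]);               // prime the first buffer
    while (!buffers[fill].empty()) {
        const int compute = fill;
        fill = 1 - fill;
        load(buffers[fill]);           // on hardware this overlaps the kernels
        for (int x : buffers[compute]) output.push_back(k3(k2(k1(x))));
    }
    return output;
}
```

With these placeholder kernels, `strip_mine({1, 2, 3, 4, 5}, 2)` returns 1, 3, 5, 7, 9, the same result as processing the whole stream at once; only the working-set size changes.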
Stream Algorithms: Blocksearch
[Figure: a row from the current image is compared against rows 0, 1, and 2 of the reference image within a search region. The blocksearch kernel takes the current row and reference rows 0, 1, and 2 as input streams and produces motion vectors.]
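For readers unfamiliar with block matching, the following C++ sketch shows an exhaustive search by sum of absolute differences (SAD), the measure the BlockSAD kernels are named after. It is a scalar illustration under assumptions: the `Image` type, the `MotionVector` struct, the function names, and the square search window of `range` pixels are all hypothetical, and the stream layout of the Imagine blocksearch kernel is not modeled.

```cpp
#include <cstdlib>
#include <limits>
#include <vector>

using Image = std::vector<std::vector<int>>;  // [row][col] luminance values

struct MotionVector { int dx; int dy; int sad; };

// SAD between the n x n block at (bx, by) in cur and the block displaced by
// (dx, dy) in ref. The caller guarantees both blocks are in bounds.
int block_sad(const Image& cur, const Image& ref,
              int bx, int by, int dx, int dy, int n) {
    int sad = 0;
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            sad += std::abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]);
    return sad;
}

// Tries every displacement within +/- range and keeps the lowest SAD.
MotionVector block_search(const Image& cur, const Image& ref,
                          int bx, int by, int n, int range) {
    const int h = static_cast<int>(ref.size());
    const int w = static_cast<int>(ref[0].size());
    MotionVector best{0, 0, std::numeric_limits<int>::max()};
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            // Skip candidates that fall outside the reference image.
            if (by + dy < 0 || bx + dx < 0 || by + dy + n > h || bx + dx + n > w)
                continue;
            const int sad = block_sad(cur, ref, bx, by, dx, dy, n);
            if (sad < best.sad) best = MotionVector{dx, dy, sad};
        }
    }
    return best;
}
```

Every candidate displacement is independent of the others, which is what makes this search a natural fit for data-parallel clusters.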
MPEG2 Characteristics
Operations
– 56% 8-bit ADD/SUB
Little locality
– 1.47 accesses per word of global data
Computationally intense
– 155 operations per global data reference
Performance & Power
Raw Performance
– 360x288, 24-bit: 350 fps
– 720x486, 24-bit: 104 fps
Clusters provide high arithmetic bandwidth
– 27.6 GOPS on blocksearch kernel
– 17.9 GOPS overall
SRF provides necessary data locality, bandwidth
– Only temporary data in off-chip memory are reference frames
– 2.4 GB/s required, 32 GB/s available
Power Efficiency: 10.7 GOPS/W
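A quick arithmetic check of the figures above, written as a small C++ sketch. The helper names are hypothetical. The implied-power calculation assumes the 10.7 GOPS/W figure refers to the same workload as the 17.9 GOPS overall rate, which the slide does not state.

```cpp
// Pixels encoded per second for a given frame size and frame rate.
long long pixels_per_second(long long width, long long height, long long fps) {
    return width * height * fps;
}

// Power in watts implied by a sustained rate and an efficiency figure.
// Assumption: both numbers describe the same workload.
double implied_watts(double gops, double gops_per_watt) {
    return gops / gops_per_watt;
}
```

The two frame sizes work out to 36,288,000 and 36,391,680 pixels per second, so both resolutions correspond to roughly the same pixel throughput of about 36 Mpixels/s. Under the stated assumption, 17.9 GOPS at 10.7 GOPS/W implies roughly 1.67 W.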
Bandwidth Hierarchy
[Figure: bandwidth hierarchy. SDRAM (four banks) to Stream Register File: 2GB/s. Stream Register File to ALU clusters: 32GB/s. Within the ALU clusters: 544GB/s.]
Stream Recirculation
[Figure: encoder dataflow across three levels: Memory (or I/O), Stream Register File, Arithmetic Clusters. Kernels: Color Convert, DCT (two instances), IDCT (two instances), Run-Level Encoding, Variable Length Coding. Streams: Input Image, RGB Pixels, Luminance Pixels, Chrominance Pixels, Transformed Luminance, Transformed Chrominance, Luminance Reference, Chrominance Reference, RLE Stream, Bitstream, Encoded Bitstream, Reference Luminance Image, Reference Chrominance Image.]
Data Referenced: 835KB (memory), 4.8MB (SRF), 38.6MB (clusters)
MPEG Bandwidth
                     Unconstrained   Constrained      Bandwidth
                     Bandwidth       SRF Bandwidth    Use
Memory (GB/s)        0.88            0.88             0.86
Global RF (GB/s)     5.23            5.20             5.08
Local RF (GB/s)      167.47          166.53           0.86
Performance (GOPS)   10.89           10.83            10.58
MPEG Execution
[Figure: execution timeline, axis ticks 6,000 to 30,000, with columns for CLUSTERS, MEMORY STREAM 0, and MEMORY STREAM 1, covering batches i-1, i, and i+1. Cluster kernels: Color Conversion, Y DCT, CrCb DCT, Run-Level Encoding, Y IDCT, CrCb IDCT. Memory operations: Load Input Blocks, Store Run-Level Encoded Blocks, Store Y Reference, Store CrCb Reference.]
Challenges
VLC (Huffman Coding)
– Difficult and inefficient to implement on clusters (SIMD on 32-bit data)
– Instead, send RLE data over network to FPGA
– Could add special-purpose Huffman coding stream unit
Rate Control
– Difficult because multiple macroblocks encoded in parallel
– Must perform on a coarser granularity (impact on picture quality?)
– For smaller image sizes, can simply re-encode a group of macroblocks at a higher quantization level if necessary in real-time
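The re-encoding idea in the last bullet can be sketched as a simple loop that raises the quantization level until a group fits its budget. Everything here is a hypothetical illustration: the function names are invented, and the cost model (counting coefficients that stay nonzero after integer division by the quantizer scale) is a stand-in for the real encoded bit count, which the slides do not describe.

```cpp
#include <vector>

// Hypothetical cost model: number of coefficients that remain nonzero after
// division by the quantizer scale q. Requires q >= 1.
int nonzero_after_quant(const std::vector<int>& coeffs, int q) {
    int count = 0;
    for (int c : coeffs)
        if (c / q != 0) ++count;
    return count;
}

// Re-encodes (here: re-costs) the group at increasing quantizer scales until
// the cost fits the budget or max_q is reached. Requires start_q >= 1.
int choose_quantizer(const std::vector<int>& coeffs,
                     int start_q, int max_q, int budget) {
    int q = start_q;
    while (q < max_q && nonzero_after_quant(coeffs, q) > budget) ++q;
    return q;
}
```

For coefficients 40, 20, 10, 5, 2 and a budget of two nonzero values, the loop starting at scale 1 stops at scale 11, the first scale at which the coefficient 10 quantizes to zero. Each extra iteration costs a full re-encode of the group, which is why the slide limits this approach to smaller image sizes.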
Imagine Programming
Think in terms of streams
Range of software tools
– Compilers
– Visualizers
– Simulators
Achieve new levels of performance
– Less programming effort
– Greater power efficiency
If-Statement Example
if (case) { f(x); } else { g(x); }
[Figure: PE0 to PE3 under shared control hold case values 0, 1, 0, 1. Should the PEs execute f( ) or g( )?]
if (case) { strA << x; } else { strB << x; }
[Figure: PE0 to PE3, attached to SRF0 to SRF3, under shared control with case values 0, 1, 0, 1.]
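The second form of the if-statement, where each element is appended to one of two output streams, can be modeled in plain C++ as below. The `Routed` struct and `route_by_case` function are hypothetical names; the sketch only shows the data movement, not how the hardware implements it.

```cpp
#include <cstddef>
#include <vector>

// The two output streams of the conditional: strA for true cases, strB for
// false cases.
struct Routed {
    std::vector<int> strA;
    std::vector<int> strB;
};

// Appends each element to strA or strB according to its case value, keeping
// the original order within each output stream.
Routed route_by_case(const std::vector<int>& xs, const std::vector<bool>& cases) {
    Routed out;
    for (std::size_t i = 0; i < xs.size() && i < cases.size(); ++i) {
        if (cases[i]) out.strA.push_back(xs[i]);
        else out.strB.push_back(xs[i]);
    }
    return out;
}
```

With the case values 0, 1, 0, 1 from the figure and inputs 10, 11, 12, 13, `strA` receives 11, 13 and `strB` receives 10, 12. A later kernel can then apply f to every element of `strA` and g to every element of `strB` with no per-element branching.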
Conditional Streams
– Data streams that are accessed conditionally based on a local case value
– Results in an arbitrary expansion or compression of stream in space and time
input_stream (elements A to P, striped across SRF0 to SRF3, which feed PE0 to PE3)
SRF0: A E I M
SRF1: B F J N
SRF2: C G K O
SRF3: D H L P
kernel (sel), by iteration
       0 1 2 3 4 5 6
PE0    1 0 1 0 0 1 1
PE1    0 1 1 0 1 1 1
PE2    1 1 0 0 1 1 0
PE3    0 1 1 0 0 1 0
kernel (data), by iteration ("-" marks no access)
       0 1 2 3 4 5 6
PE0    A - F - - K O
PE1    - C G - I L P
PE2    B D - - J M -
PE3    - E H - - N -
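The behavior shown in the figure data can be reproduced with a short C++ model: on each iteration, the PEs whose select value is 1 receive the next unread stream elements, handed out in PE order. The function name `conditional_input` and the use of '-' for "no access" are choices made for this sketch; it models the observable result only, not the hardware mechanism.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// sel[pe][iter] is 1 when that PE reads from the stream on that iteration.
// Returns one string per PE with the element received on each iteration,
// or '-' where the PE did not access the stream.
std::vector<std::string> conditional_input(
        const std::string& stream,
        const std::vector<std::vector<int>>& sel) {
    const std::size_t pes = sel.size();
    const std::size_t iters = pes ? sel[0].size() : 0;
    std::vector<std::string> received(pes, std::string(iters, '-'));
    std::size_t next = 0;  // next unread stream element
    for (std::size_t it = 0; it < iters; ++it)
        for (std::size_t pe = 0; pe < pes; ++pe)
            if (sel[pe][it] && next < stream.size())
                received[pe][it] = stream[next++];
    return received;
}
```

Feeding it the stream "ABCDEFGHIJKLMNOP" and the four select rows from the slide gives "A-F--KO", "-CG-ILP", "BD--JM-", and "-EH--N-" for PE0 to PE3, matching the elements listed for each PE in the figure.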