1 © 2013 The MathWorks, Inc.
Accelerating System Simulations
김용정 (General Manager), Senior Applications Engineer
2
Why simulation acceleration?
From algorithm exploration to system design
– Size and complexity of models increase
– Time needed for a single simulation increases
– Number of test cases increases
– Test cases become larger
Need to reduce
– simulation time during design
– simulation time for large scale testing during prototyping
3
MATLAB is quite fast
Optimized and widely-used libraries
– BLAS: Basic Linear Algebra Subprograms (multithreaded)
– LAPACK: Linear Algebra Package
JIT (Just In Time) Acceleration
– On-the-fly multithreaded code generation for increased speed
Built-in support for vector and matrix operations
4
Application
LTE Physical Downlink Control Channel (PDCCH)
5
Workflow
Start with a baseline algorithm
Profile it to establish a performance yardstick
Apply the following optimizations:
– Better MATLAB serial programming techniques
– Using System objects
– MATLAB to C code generation (MEX)
– Parallel Computing
– GPU-optimized System objects
– Rapid Accelerator mode of simulation in Simulink
6
Simulation acceleration options in MATLAB
User’s Code
– Better MATLAB code
– System objects
– MATLAB to C
– Parallel Computing
– GPU processing
7
Profiling MATLAB algorithms
Profiler summarizes MATLAB code execution
– total time spent within each function
– which lines of code use the most processing time
Helps identify algorithm bottlenecks
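A minimal sketch of a profiling session, assuming nothing beyond base MATLAB. The slow loop is a made-up workload chosen only so that the Profiler has something to measure; it is not taken from the presentation.

```matlab
% Profile a deliberately slow loop that grows an array on every iteration.
profile on
y = [];
for n = 1:20000
    y = [y; n^2]; %#ok<AGROW>  dynamic resizing, done on purpose here
end
profile off

% Collect the timing data as a structure for inspection in code.
stats = profile('info');
fprintf('Profiler recorded %d function entries.\n', numel(stats.FunctionTable));

% profile viewer   % uncomment to open the interactive per-line report
```

The per-line view in the report is what points at statements such as the array growth above as candidates for optimization.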
8
Effective MATLAB programming techniques
Pre-allocation
– Initialize an array using its final size
– Helps avoid dynamically resizing arrays in a loop
Vectorization
– Convert code from using scalar loops to using matrix/vector operations
– Helps MATLAB leverage processor-optimized libraries for vector processing
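Example of pre-allocation
% Before: y grows inside the loop
y = [];
for n = 1:LEN/Tx
    G = [u(idx1(n)) u(idx2(n)); ...
        -conj(u(idx2(n))) conj(u(idx1(n)))];
    y = [y; G];
end
% After: y is pre-allocated and filled with vectorized assignments
y = complex(zeros(LEN,Tx));
y(idx1,1) = u(idx1);
y(idx1,2) = u(idx2);
y(idx2,1) = -conj(u(idx2));
y(idx2,2) = conj(u(idx1));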
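The two snippets above can be checked against each other with a short script. This is a sketch under stated assumptions: the slide does not define `u`, `idx1`, `idx2`, `LEN`, or `Tx`, so the values below (two transmit antennas, odd and even sample indices, random complex input) are illustrative choices only.

```matlab
% Assumed setup (not given on the slide).
Tx   = 2;                                 % number of transmit antennas
LEN  = 2000;                              % number of input symbols (even)
u    = complex(randn(LEN,1), randn(LEN,1));
idx1 = 1:Tx:LEN;                          % first symbol of each pair
idx2 = 2:Tx:LEN;                          % second symbol of each pair

% Loop version: y grows by two rows per iteration.
tic
yLoop = [];
for n = 1:LEN/Tx
    G = [u(idx1(n)) u(idx2(n)); ...
        -conj(u(idx2(n))) conj(u(idx1(n)))];
    yLoop = [yLoop; G]; %#ok<AGROW>
end
tLoop = toc;

% Pre-allocated, vectorized version.
tic
yVec = complex(zeros(LEN,Tx));
yVec(idx1,1) = u(idx1);
yVec(idx1,2) = u(idx2);
yVec(idx2,1) = -conj(u(idx2));
yVec(idx2,2) = conj(u(idx1));
tVec = toc;

sameResult = isequal(yLoop, yVec);
fprintf('loop: %.4f s, vectorized: %.4f s, identical: %d\n', ...
    tLoop, tVec, sameResult);
```

Timings from `tic`/`toc` vary from run to run, so only the equality check is meant as a firm result.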
9
Using System objects of DSP & Communications System Toolboxes
System objects facilitate stream processing
They can accelerate simulation because they
– Decouple declaration from the execution of the algorithms
– Reduce overhead of parameter handling in the loop
– Are mostly implemented as MATLAB executables (MEX)
Example of System objects
function s = Alamouti_DecoderS(u,H) %#codegen
% STBC Combiner
persistent hTDDec
if isempty(hTDDec)
    hTDDec = comm.OSTBCCombiner(...
        'NumTransmitAntennas',2,'NumReceiveAntennas',2);
end
s = step(hTDDec, u, H);
10
MATLAB to C code generation
MATLAB Coder
Automatically generate a MEX function
Call the generated MEX file within the testbench
Verify that the numerical results are the same
Compare the speed of the baseline function and the generated MEX function
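The four steps could look roughly like the sketch below. It assumes that the `Alamouti_DecoderS` function from the System objects example is saved as `Alamouti_DecoderS.m` on the path, and that MATLAB Coder and Communications System Toolbox are installed. The input sizes (100 samples, 2 receive antennas) are illustrative assumptions, not values from the presentation.

```matlab
% Example inputs; they fix the size and type of each argument for codegen.
u = complex(randn(100,2),   randn(100,2));    % samples x Rx antennas
H = complex(randn(100,2,2), randn(100,2,2));  % samples x Tx x Rx estimates

% Step 1: generate Alamouti_DecoderS_mex (requires MATLAB Coder).
codegen Alamouti_DecoderS -args {u,H}

% Steps 2 and 3: call both versions and compare the results.
sRef    = Alamouti_DecoderS(u, H);
sMex    = Alamouti_DecoderS_mex(u, H);
maxDiff = max(abs(sRef(:) - sMex(:)));

% Step 4: compare execution times.
tRef = timeit(@() Alamouti_DecoderS(u, H));
tMex = timeit(@() Alamouti_DecoderS_mex(u, H));
fprintf('max difference %.3g, MATLAB %.3g s, MEX %.3g s\n', ...
    maxDiff, tRef, tMex);
```

Whether the MEX version is faster depends on the function; System objects are often MEX-based already, so the gain may be small for a wrapper like this one.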
11
Parallel Simulation Runs
[Figure: Tasks 1 to 4 plotted against time, run one after another and distributed across four workers; toolboxes and blocksets]
>> Demo
12
Summary
Use matlabpool to open the available workers
No modification of the algorithm
Use a parfor loop instead of a for loop
Parallel computation or simulation leads to further acceleration
More cores = more speed
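As a sketch of the `parfor` pattern, the script below runs independent Monte Carlo points in parallel. The BPSK-over-AWGN workload and all variable names are illustrative assumptions, not the PDCCH model from the demo. Note that newer MATLAB releases open the worker pool with `parpool` rather than `matlabpool`; without Parallel Computing Toolbox the loop still runs, just serially.

```matlab
EbNo_dB = 0:2:8;              % simulation points, one loop iteration each
numBits = 1e5;                % bits per point
ber     = zeros(size(EbNo_dB));

% Iterations are independent, so "for" can be replaced by "parfor".
parfor k = 1:numel(EbNo_dB)
    noiseVar = 1 / (2 * 10^(EbNo_dB(k)/10));   % unit-energy BPSK symbols
    bits = rand(numBits,1) > 0.5;              % random data bits
    tx   = 1 - 2*bits;                         % bit 0 -> +1, bit 1 -> -1
    rx   = tx + sqrt(noiseVar) * randn(numBits,1);
    ber(k) = mean((rx < 0) ~= bits);           % hard decision and error count
end

disp([EbNo_dB; ber]);
```

The loop body is unchanged from what a serial `for` version would contain, which is the point made on the slide.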
14
What is a Graphics Processing Unit (GPU)?
Originally for graphics acceleration, now also used for scientific calculations
Massively parallel array of integer and floating point processors
– Typically hundreds of processors per card
– GPU cores complement CPU cores
Dedicated high-speed memory
15
Why would you want to use a GPU?
Speed up execution of computationally intensive simulations
For example:
– Performance: A\b with Double Precision
16
Options for Targeting GPUs
1) Use GPU with MATLAB built-in functions
2) Execute MATLAB functions elementwise on the GPU
3) Create kernels from existing CUDA code and PTX files
[Figure labels: Ease of Use, Greater Control]
17
Data Transfer between MATLAB and GPU
% Push data from CPU to GPU memory
Agpu = gpuArray(A)
% Bring results from GPU memory back to CPU
B = gather(Bgpu)
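A small sketch of the transfer pattern applied to `A\b`. The matrix size and the GPU availability check are assumptions made for illustration; the GPU branch needs Parallel Computing Toolbox and a supported device, and is skipped otherwise.

```matlab
n = 500;
A = rand(n) + n*eye(n);       % well-conditioned test matrix (assumed data)
b = rand(n,1);

xCpu = A \ b;                 % reference solution on the CPU

% Run the GPU branch only if the toolbox function and a device exist.
hasGpu = exist('gpuDeviceCount') ~= 0 && gpuDeviceCount > 0; %#ok<EXIST>
if hasGpu
    Agpu = gpuArray(A);               % push data to GPU memory
    bgpu = gpuArray(b);
    xGpu = gather(Agpu \ bgpu);       % solve on the GPU, bring result back
    maxDiff = max(abs(xGpu - xCpu));
else
    maxDiff = 0;                      % nothing to compare without a GPU
end
fprintf('GPU used: %d, max difference: %.3g\n', hasGpu, maxDiff);
```

Keeping intermediate results on the GPU and calling `gather` once at the end limits the number of transfers.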
18
GPU Processing with Communications System Toolbox
Alternative implementations of many System objects take advantage of GPU processing
Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU
Easy-to-use syntax
Dramatically accelerate simulations
GPU System objects
comm.gpu.TurboDecoder
comm.gpu.ViterbiDecoder
comm.gpu.LDPCDecoder
comm.gpu.PSKDemodulator
comm.gpu.AWGNChannel
19
Example: Turbo Coding
Impressive coding gain
High computational complexity
Bit-error rate performance as a function of number of iterations
= comm.TurboDecoder(...
'NumIterations', numIter,...
20
Acceleration with GPU System objects
Version             Elapsed time   Acceleration
CPU                 8 hours        1.0
1 GPU               40 minutes     12.0
Cluster of 4 GPUs   11 minutes     43.0
Same numerical results
CPU: = comm.TurboDecoder(... 'NumIterations', N,...
GPU: = comm.gpu.TurboDecoder(... 'NumIterations', N,...
CPU: = comm.AWGNChannel(...
GPU: = comm.gpu.AWGNChannel(...
21
Key Operations in Turbo Coding Function
CPU and GPU Version 1 (CPU listing first)
% CPU listing
% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.TurboDecoder(...
'TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
ber = zeros(3,1); %initialize BER output
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
data = randn(blkLength, 1)>0.5;
% Encode random data bits
yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
modout = 1-2*yEnc;
rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
llrData = (-2/noiseVar).*rData;
% Turbo Decode
decData = step(hTDec, llrData);
% Calculate errors
ber = step(hBER, data, decData);
end
% GPU Version 1 listing
% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.gpu.TurboDecoder(...
'TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
ber = zeros(3,1); %initialize BER output
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
data = randn(blkLength, 1)>0.5;
% Encode random data bits
yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
modout = 1-2*yEnc;
rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
llrData = (-2/noiseVar).*rData;
% Turbo Decode
decData = step(hTDec, llrData);
% Calculate errors
ber = step(hBER, data, decData);
end
22
Profile results in Turbo Coding Function
CPU and GPU Version 1 (CPU listing first); the number at the start of a line is the time the Profiler reports for that line
% CPU listing
% Turbo Encoder
<0.01 hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
<0.01 hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
<0.01 hBER = comm.ErrorRate;
% Turbo Decoder
<0.01 hTDec = comm.TurboDecoder(...
'TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
<0.01 ber = zeros(3,1); %initialize BER output
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.30 data = randn(blkLength, 1)>0.5;
% Encode random data bits
2.33 yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
0.05 modout = 1-2*yEnc;
1.50 rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
0.03 llrData = (-2/noiseVar).*rData;
% Turbo Decode
330.54 decData = step(hTDec, llrData);
% Calculate errors
0.17 ber = step(hBER, data, decData);
end
% GPU Version 1 listing
% Turbo Encoder
<0.01 hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
<0.01 hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
<0.01 hBER = comm.ErrorRate;
% Turbo Decoder
0.02 hTDec = comm.gpu.TurboDecoder(...
'TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
<0.01 ber = zeros(3,1); %initialize BER output
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.28 data = randn(blkLength, 1)>0.5;
% Encode random data bits
2.38 yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
0.05 modout = 1-2*yEnc;
1.45 rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
0.04 llrData = (-2/noiseVar).*rData;
% Turbo Decode
98.18 decData = step(hTDec, llrData);
% Calculate errors
0.17 ber = step(hBER, data, decData);
end
23
Key Operations in Turbo Coding Function
CPU and GPU Version 2 (CPU listing first)
% CPU listing
% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
data = randn(blkLength, 1)>0.5;
% Encode random data bits
yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
modout = 1-2*yEnc;
rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
llrData = (-2/noiseVar).*rData;
% Turbo Decode
decData = step(hTDec, llrData);
% Calculate errors
ber = step(hBER, data, decData);
end
% GPU Version 2 listing
% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder - setup for Multi-frame or Multi-user processing
numFrames = 30;
hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations',numIter,...
'NumFrames',numFrames);
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
data = randn(numFrames*blkLength, 1)>0.5;
% Encode random data bits
yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
%Modulate, Add noise to real bipolar data
modout = 1-2*yEnc;
rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
llrData = (-2/noiseVar).*rData;
% Turbo Decode
decData = step(hTDec, llrData);
% Calculate errors
ber=step(hBER, data, gather(decData));
end
24
Profile results in Turbo Coding Function
CPU and GPU Version 2 (CPU listing first); the number at the start of a line is the time the Profiler reports for that line
% CPU listing
% Turbo Encoder
<0.01 hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
<0.01 hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
<0.01 hBER = comm.ErrorRate;
% Turbo Decoder
<0.01 hTDec = comm.TurboDecoder(...
'TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.30 data = randn(blkLength, 1)>0.5;
% Encode random data bits
2.33 yEnc = step(hTEnc, data);
%Modulate, Add noise to real bipolar data
0.05 modout = 1-2*yEnc;
1.50 rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
0.03 llrData = (-2/noiseVar).*rData;
% Turbo Decode
330.54 decData = step(hTDec, llrData);
% Calculate errors
0.17 ber = step(hBER, data, decData);
end
% GPU Version 2 listing
% Turbo Encoder
<0.01 hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
'InterleaverIndices', intrlvrIndices);
% AWG Noise
0.03 hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
<0.01 hBER = comm.ErrorRate;
% Turbo Decoder - setup for Multi-frame or Multi-user processing
0.01 numFrames = 30;
0.01 hTDec = comm.gpu.TurboDecoder('TrellisStructure',...
poly2trellis(4, [13 15], 13),'InterleaverIndices', intrlvrIndices,...
'NumIterations',numIter, 'NumFrames',numFrames);
%% Processing loop
% hBER returns [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.22 data = randn(numFrames*blkLength, 1)>0.5;
% Encode random data bits
2.45 yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
%Modulate, Add noise to real bipolar data
0.02 modout = 1-2*yEnc;
0.31 rData = step(hAWGN, modout);
% Convert to log-likelihood ratios for decoding
0.01 llrData = (-2/noiseVar).*rData;
% Turbo Decode
20.89 decData = step(hTDec, llrData);
% Calculate errors
0.09 ber=step(hBER, data, gather(decData));
end
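The decoder-step times in the profile listings can be turned into speedup factors with a few lines of arithmetic. This sketch assumes that the CPU and GPU runs processed the same workload, which the slides imply but do not state; the variable names are illustrative.

```matlab
% Time reported for the "Turbo Decode" line in each profile listing.
tCpu   = 330.54;   % comm.TurboDecoder
tGpuV1 = 98.18;    % comm.gpu.TurboDecoder, one frame per call
tGpuV2 = 20.89;    % comm.gpu.TurboDecoder, 30 frames per call

speedupV1 = tCpu / tGpuV1;   % roughly 3.4
speedupV2 = tCpu / tGpuV2;   % roughly 15.8

fprintf('GPU Version 1: %.1fx, GPU Version 2: %.1fx\n', speedupV1, speedupV2);
```

Most of the gain in Version 2 comes from batching frames, so each call gives the GPU more work per transfer.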
25
Things to note when targeting GPU
Minimize data transfer between CPU and GPU.
Using a GPU only makes sense if the data size is large.
Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g., FFT).
Use arrayfun to explicitly specify elementwise operations.
26
Summary
Acceleration methodologies in MATLAB & Simulink, with Technology / Product
1. Best Practices in Programming
• Vectorization & pre-allocation
• Environment tools (e.g., Profiler, Code Analyzer)
Technology / Product: MATLAB, Toolboxes, System Toolboxes
2. Better Algorithms
• Ideal environment for algorithm exploration
• Rich set of functionality (e.g., System objects)
Technology / Product: MATLAB, Toolboxes, System Toolboxes
3. More Processors or Cores
• High-level parallel constructs (e.g., parfor, matlabpool)
• Utilize clusters, clouds, and grids
Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server
4. Refactoring the Implementation
• Compiled code (MEX)
• GPUs, FPGA-in-the-Loop
Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox
27
Q & A
Thank You