
  • MATLAB on BioHPC

    1

    [web] portal.biohpc.swmed.edu

    [email] [email protected]

    Updated for 2016-09-21

  • What is MATLAB

    2

    • High-level language and development environment for:
      - Algorithm and application development
      - Data analysis
      - Mathematical modeling

    • Extensive math, engineering, and plotting functionality

    • Add-on products for image and video processing, communications, signal processing, and more.

    Powerful & easy to learn!

  • Ways of MATLAB Parallel Computing

    3

    • Built-in multithreading (implicit parallelism)
      *automatically handled by MATLAB

      - Use vector operations in place of for-loops: +, -, *, /, SORT, CEIL, LOG, FFT, IFFT, SVD, …

    Example: element-wise (dot) product

        for i = 1 : size(A, 1)
            for j = 1 : size(A, 2)
                C(i, j) = A(i, j) * B(i, j);
            end
        end

    vs.

        C = A .* B;

    If A and B are both 1000-by-1000 matrices, the MATLAB built-in element-wise product is 50~100 times faster than the for-loop.

    * Testing environment: MATLAB R2015a @ BioHPC compute node

  • Ways of MATLAB Parallel Computing

    4

    • Parallel Computing Toolbox
      *Accelerate MATLAB code with very few code changes; jobs run on your workstation, thin client, or a reserved compute node

      - Parallel for-loop (parfor)
      - spmd (Single Program Multiple Data)
      - pmode (interactive environment for spmd development)
      - Parallel computing with GPU

    • MATLAB Distributed Computing Server (MDCS)
      *Submit MATLAB jobs directly to the BioHPC cluster

      - MATLAB job scheduler integrated with SLURM
      - A single job can use multiple nodes

  • Parallel Computing Toolbox – Parallel for-Loops (parfor)

    5

    • Allows several MATLAB workers to execute individual loop iterations simultaneously.

    • The only difference in a parfor loop is the keyword parfor instead of for. When the loop begins, it opens a parallel pool of MATLAB sessions called workers for executing the iterations in parallel.

        for ...
        end

        parfor ...
        end

  • 6

    Parallel Computing Toolbox – Parallel for-Loops (parfor)

    Example 1: Estimating an Integration

        F = ∫_a^b 0.2 sin(2x) dx

    The integral is estimated by calculating the total area under the F(x) curve:

        F ≈ Σ_{i=1}^{n} height_i × width

    We can calculate the subareas in any order!
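The area sum above can be sketched outside MATLAB as well. The following Python snippet is illustrative only: it estimates the same integral with a Riemann sum, using the antiderivative -0.1 cos(2x) purely as a correctness check.

```python
import math

def riemann_estimate(n, a, b):
    """Riemann-sum estimate of the integral of 0.2*sin(2x) over [a, b]."""
    q = 0.0
    w = (b - a) / n  # width of each subarea
    for i in range(1, n + 1):
        x = ((n - i) * a + (i - 1) * b) / (n - 1)  # sample point in [a, b]
        q += w * 0.2 * math.sin(2 * x)             # height * width
    return q

estimate = riemann_estimate(100000, 0.13, 1.53)
# antiderivative of 0.2*sin(2x) is -0.1*cos(2x)
exact = 0.1 * (math.cos(2 * 0.13) - math.cos(2 * 1.53))
```

Because each subarea contributes independently to the sum, the iterations can be computed in any order, which is what makes the loop parallelizable.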

  • 7

    Parallel Computing Toolbox – Parallel for-Loops (parfor)

    Example 1: Estimating an Integration

    quad_fun.m (serial version):

        function q = quad_fun (n, a, b)
            q = 0.0;
            w = (b - a) / n;
            for i = 1 : n
                x = ((n - i) * a + (i - 1) * b) / (n - 1);
                fx = 0.2 * sin(2 * x);
                q = q + w * fx;
            end
        end

    quad_fun_parfor.m (parfor version):

        function q = quad_fun_parfor (n, a, b)
            q = 0.0;
            w = (b - a) / n;
            parfor i = 1 : n
                x = ((n - i) * a + (i - 1) * b) / (n - 1);
                fx = 0.2 * sin(2 * x);
                q = q + w * fx;
            end
        end

    12 workers, speedup = t1/t2 = 6.14x

  • 8

    Parallel Computing Toolbox – Parallel for-Loops (parfor)

    Q: How many workers can I use for a MATLAB parallel pool?

        Node type          | Default | Maximum
        128GB, 384GB, GPU  | 12      | 16*
        256GB              | 12      | 24*
        Workstation        | 6       | 6
        Thin client        | 2       | 2

    * Use parpool('local', numOfWorkers) to specify how many workers you want (or matlabpool('local', numOfWorkers) if you are using MATLAB R2013a).

    2013a/2013b: max number of workers = 12
    2014a/2014b/2015a/2015b: max number of workers = number of physical cores

  • 9

    Parallel Computing Toolbox – Parallel for-Loops (parfor)

    Q: How many workers can I use for a MATLAB parallel pool?

    2016a:
    • max number of workers = 512
    • default number of workers = number of physical cores
    • choose smartly

        % load local profile configuration
        myCluster = parcluster('local');
        % increase the number of workers
        myCluster.NumWorkers = 32;

        % open matlab pool
        parpool(myCluster);

        % parfor loop
        parfor ...
        end

        % close matlab pool
        delete(gcp('nocreate'));

    (Chart: runtime vs. number of workers on 128GB/384GB/GPU nodes)

  • Limitations of parfor

    10

    No nested parfor loops:

        parfor i = 1 : 10        % not allowed
            parfor j = 1 : 10
            end
        end

        parfor i = 1 : 10        % OK
            for j = 1 : 10
            end
        end

    * Because a worker cannot open a parallel pool.

    Loop iterations must be independent:

        parfor i = 2 : 10        % not allowed
            A(i) = 0.5 * ( A(i-1) + A(i+1) );
        end

    The loop variable must consist of increasing integers:

        parfor x = 0 : 0.1 : 1   % not allowed
        end

        xAll = 0 : 0.1 : 1;      % OK
        parfor ii = 1 : length(xAll)
            x = xAll(ii);
        end
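Why dependent iterations are rejected can be seen with a small experiment (illustrative Python, not MATLAB): a sequential sweep of A(i) = 0.5*(A(i-1) + A(i+1)) reads freshly updated neighbors, while independent parallel iterations would each read the original values, so the two schedules produce different results.

```python
A = [float(i * i) for i in range(12)]  # any nonlinear data shows the effect

# sequential sweep: iteration i sees the value iteration i-1 just wrote
seq = A[:]
for i in range(1, 11):
    seq[i] = 0.5 * (seq[i - 1] + seq[i + 1])

# "parallel" schedule: every iteration reads the original (old) values
par = A[:]
snapshot = A[:]
for i in range(1, 11):
    par[i] = 0.5 * (snapshot[i - 1] + snapshot[i + 1])
```

Since the answer depends on execution order, MATLAB cannot safely split such a loop across workers, and parfor refuses it at parse time.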

  • Parallel Computing Toolbox – spmd and Distributed Array

    11

    spmd: Single Program Multiple Data

    • single program -- identical code runs on multiple workers.
    • multiple data -- each worker can have different, unique data for that code.

    • numlabs -- number of workers.
    • labindex -- each worker has a unique identifier between 1 and numlabs.
    • point-to-point communication between workers: labSend, labReceive, labSendReceive

    • Ideal for: 1) programs that take a long time to execute; 2) programs operating on large data sets.

        spmd
            labdata = load(['datafile_' num2str(labindex) '.ascii']);  % multiple data
            result = MyFunction(labdata);                              % single program
        end

  • 12

    Parallel Computing Toolbox – spmd and Distributed Array

    Example 1: Estimating an Integration

    quad_fun.m:

        function q = quad_fun (n, a, b)
            q = 0.0;
            w = (b - a) / n;
            for i = 1 : n
                x = ((n - i) * a + (i - 1) * b) / (n - 1);
                fx = 0.2 * sin(2 * x);
                q = q + w * fx;
            end
        end

    quad_fun_spmd.m (spmd version):

        a = 0.13; b = 1.53; n = 120000000;
        spmd
            aa = (labindex - 1) / numlabs * (b - a) + a;
            bb = labindex / numlabs * (b - a) + a;
            nn = round(n / numlabs);
            q_part = quad_fun(nn, aa, bb);
        end
        q = sum([q_part{:}]);

    12 workers, speedup = t1/t3 = 8.26x

    Only 12 communications between client and workers are needed to update q, whereas parfor needs 120,000,000 communications between client and workers.
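The domain decomposition used above can be sketched in plain Python (illustrative; here the "workers" run serially, whereas in MATLAB each labindex executes on its own worker and only the partial sums travel back to the client):

```python
import math

def quad_fun(n, a, b):
    """Riemann-sum estimate of the integral of 0.2*sin(2x) over [a, b]."""
    q, w = 0.0, (b - a) / n
    for i in range(1, n + 1):
        x = ((n - i) * a + (i - 1) * b) / (n - 1)
        q += w * 0.2 * math.sin(2 * x)
    return q

def quad_spmd(n, a, b, numlabs=4):
    """Each 'worker' (labindex = 1..numlabs) integrates its own subinterval,
    mirroring the aa/bb/nn arithmetic of quad_fun_spmd.m."""
    parts = []
    for labindex in range(1, numlabs + 1):
        aa = (labindex - 1) / numlabs * (b - a) + a
        bb = labindex / numlabs * (b - a) + a
        nn = round(n / numlabs)
        parts.append(quad_fun(nn, aa, bb))
    return sum(parts)  # one scalar per worker crosses the client boundary
```

The per-worker result is a single number, which is why the spmd version needs only numlabs communications.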

  • 13

    Parallel Computing Toolbox – spmd and Distributed Array

    distributed array -- data distributed from the client and readily accessed on the client. Data is always distributed along the last dimension, as evenly as possible among the workers.

    codistributed array -- data distributed within spmd; the user defines which dimension to distribute the data along. Arrays are created directly (locally) on the workers.

    Matrix multiply A*B:

    distributed array:

        parpool('local');
        A = rand(3000);
        B = rand(3000);
        a = distributed(A);
        b = distributed(B);
        c = a * b;   % runs on workers, c is distributed
        delete(gcp);

    codistributed array:

        parpool('local');
        A = rand(3000);
        B = rand(3000);
        spmd
            u = codistributed(A, codistributor1d(1));  % by row, 1st dim
            v = codistributed(B, codistributor1d(2));  % by column, 2nd dim
            w = u * v;
        end
        delete(gcp);
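The idea behind distributing one operand by rows can be sketched outside MATLAB (pure Python, illustrative): each "worker" multiplies its row block of A by the full B independently, and stacking the blocks reproduces the complete product.

```python
def matmul(A, B):
    """Plain matrix product on lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matmul_rowsplit(A, B, nworkers):
    """Distribute A by rows (like codistributor1d(1)): each 'worker'
    multiplies its row block by B, then the blocks are concatenated."""
    step = (len(A) + nworkers - 1) // nworkers
    blocks = [matmul(A[r:r + step], B) for r in range(0, len(A), step)]
    return [row for block in blocks for row in block]
```

No communication between workers is needed during the multiply itself, because every row of the result depends on only one row block of A.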

  • 14

    Parallel Computing Toolbox – pmode: learn parallel programming interactively

    (Screenshot: the pmode interactive window, with a command entry box)

  • 15

    Parallel Computing Toolbox – spmd and Distributed Array

    Example 2: Ax = b

    linearSolver.m (serial version):

        n = 10000;
        M = rand(n);
        X = ones(n, 1);
        A = M + M';
        b = A * X;
        u = A \ b;

    linearSolverSpmd.m (spmd version):

        n = 10000;
        M = rand(n);
        X = ones(n, 1);
        spmd
            m = codistributed(M, codistributor('1d', 2));
            x = codistributed(X, codistributor('1d', 1));
            A = m + m';
            b = A * x;
            utmp = A \ b;
        end
        u1 = gather(utmp);

  • 16

    Parallel Computing Toolbox – spmd and Distributed Array

    Factors that reduce the speedup of parfor and spmd:

    • The computation inside the loop is simple.
    • Memory limitations (parallel code creates more data than the serial code).
    • Transferring and converting data is time-consuming.
    • Unbalanced computational load.
    • Synchronization.

  • Parallel Computing Toolbox – spmd and Distributed Array

    17

    Example 3: Contrast Enhancement

    contrast_enhance.m:

        function y = contrast_enhance (x)
            x = double (x);
            n = size (x, 1);
            x_average = sum ( sum ( x(:,:) ) ) / n / n;
            s = 3.0;   % the contrast s should be greater than 1
            y = (1.0 - s) * x_average + s * x( (n+1)/2, (n+1)/2 );
        end

    contrast_serial.m:

        x = imread('surfsup.tif');
        y = nlfilter (x, [3,3], @contrast_enhance);
        y = uint8(y);

    Image size: 480 * 640, uint8. Running time is 14.46 s.

    (example provided by John Burkardt @ FSU)
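For readers without the Image Processing Toolbox, the per-window formula can be sketched in plain Python (illustrative only; in MATLAB, nlfilter supplies such a window for every pixel of the image):

```python
def contrast_enhance(window, s=3.0):
    """Enhanced value for the center pixel of an odd-sized square window:
    (1 - s) * window_average + s * center, with contrast s > 1."""
    n = len(window)
    avg = sum(sum(row) for row in window) / (n * n)
    return (1.0 - s) * avg + s * window[n // 2][n // 2]
```

With s > 1 the center pixel is pushed away from the local average, which is what increases local contrast; a perfectly uniform window is left unchanged.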

  • 18

    Parallel Computing Toolbox – spmd and Distributed Array

    Example 3: Contrast Enhancement

    contrast_spmd.m (spmd version):

        x = imread('surfsup.tif');
        xd = distributed(x);
        spmd
            xl = getLocalPart(xd);
            xl = nlfilter(xl, [3,3], @contrast_enhance);
        end
        y = [xl{:}];

    12 workers, running time is 2.01 s, speedup = 7.19x

    nlfilter performs a general sliding-neighborhood operation with zero padding at the edges. When the image is divided by columns among the workers, artificial internal boundaries are created!

    Solution: set up communication between the workers.

  • Parallel Computing Toolbox – spmd and Distributed Array

    Example 3: Contrast Enhancement

    Exchange ghost boundary columns with the previous and next workers:

        column = labSendReceive ( previous, next, xl(:,1) );
        column = labSendReceive ( next, previous, xl(:,end) );

    For a 3*3 filtering window, you need 1 column of ghost boundary on each side.
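The boundary exchange amounts to simple data movement, sketched here in Python (illustrative; in MATLAB the columns travel between workers via labSendReceive rather than through shared memory):

```python
def add_ghost_columns(chunks):
    """chunks[k] is worker k's list of image columns.  Pad each chunk with
    its neighbors' boundary columns, mirroring the two labSendReceive
    calls; the first and last workers receive only one ghost column."""
    padded = []
    for k, cols in enumerate(chunks):
        left = [chunks[k - 1][-1]] if k > 0 else []               # from previous
        right = [chunks[k + 1][0]] if k < len(chunks) - 1 else []  # from next
        padded.append(left + cols + right)
    return padded
```

After filtering, each worker simply drops the ghost columns it received, so the reassembled image has no artificial internal boundaries.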

  • 19-20

    Parallel Computing Toolbox – spmd and Distributed Array

    Example 3: Contrast Enhancement

    contrast_MPI.m (MPI version):

        x = imread('surfsup.tif');
        xd = distributed(x);
        spmd
            xl = getLocalPart ( xd );

            % find previous & next by labindex
            if ( labindex ~= 1 )
                previous = labindex - 1;
            else
                previous = numlabs;
            end
            if ( labindex ~= numlabs )
                next = labindex + 1;
            else
                next = 1;
            end

            % attach ghost boundaries
            column = labSendReceive ( previous, next, xl(:,1) );
            if ( labindex < numlabs )
                xl = [ xl, column ];
            end
            column = labSendReceive ( next, previous, xl(:,end) );
            if ( 1 < labindex )
                xl = [ column, xl ];
            end

            xl = nlfilter ( xl, [3,3], @contrast_enhance );

            % remove ghost boundaries
            if ( 1 < labindex )
                xl = xl(:,2:end);
            end
            if ( labindex < numlabs )
                xl = xl(:,1:end-1);
            end

            xl = uint8 ( xl );
        end
        y = [ xl{:} ];

    12 workers, running time 2.00 s, speedup = 7.23x

  • 21

    Parallel Computing Toolbox – GPU

    Parallel Computing Toolbox enables MATLAB to use your computer's graphics processing unit (GPU) for matrix operations. For some problems, execution on the GPU is faster than on the CPU.

    How to enable GPU computing in MATLAB on the BioHPC cluster:

    Step 1: reserve a GPU node with remoteGPU.
    Step 2: inside a terminal, type

        export CUDA_VISIBLE_DEVICES="0"

    to enable hardware rendering.

    • You can launch MATLAB directly if you have a GPU card in your workstation.
    • Currently supported versions on BioHPC: 2013a/2013b/2014a/2014b/2015a/2015b

  • Parallel Computing Toolbox – GPU

    22

    Example 2: Ax = b

    linearSolverGPU.m (GPU version):

        n = 10000;
        M = rand(n);
        X = ones(n, 1);
        Mgpu = gpuArray(M);    % copy data from CPU to GPU (~0.3 s)
        Xgpu = gpuArray(X);
        Agpu = Mgpu + Mgpu';
        bgpu = Agpu * Xgpu;
        ugpu = Agpu \ bgpu;
        u = gather(ugpu);      % copy data from GPU to CPU (~0.0002 s)

    Speedup = t1/t3 = 2.54x

  • 23

    Parallel Computing Toolbox – GPU

    Check GPU information (inside MATLAB):

    Q: Is a GPU available?

        if (gpuDeviceCount > 0)

    Q: How to check memory usage?

        currentGPU = gpuDevice;
        currentGPU.AvailableMemory;

    * TotalMemory refers to the GPU accelerator memory, e.g. K40m ~12 GB; K20m ~5 GB

  • Parallel Computing Toolbox – benchmark

    24

    (Benchmark chart; large inputs fail with "Out of memory".)

    * The actual memory consumption is much higher than the input data size.

  • Matlab Distributed Computing Server

    25

  • Matlab Distributed Computing Server: Setup and Test your MATLAB environment

    26

    Home -> ENVIRONMENT -> Parallel -> Manage Cluster Profiles

    Add -> Import Cluster Profiles from file, then select the profile file matching your MATLAB version:

        /project/apps/apps/MATLAB/profile/nucleus_r2016a.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2015b.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2015a.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2014b.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2014a.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2013b.settings
        /project/apps/apps/MATLAB/profile/nucleus_r2013a.settings

  • 27

    Matlab Distributed Computing Server: Setup and Test your MATLAB environment

    You should have two cluster profiles ready for MATLAB parallel computing:

    "local" -- for running jobs on your workstation, thin client, or any single compute node of the BioHPC cluster (uses the Parallel Computing Toolbox)

    "nucleus_r<version>" -- for running jobs on the BioHPC cluster with multiple nodes (uses MDCS)

    * Ignore the "MATLAB Parallel Cloud" profile if you are using 2016a.

    Test your parallel profile.

  • Matlab Distributed Computing Server: Setup SLURM environment

    28

    From the MATLAB command window:

        ClusterInfo.setQueueName('128GB');     % use 128GB partition
        ClusterInfo.setNNode(2);               % request 2 nodes
        ClusterInfo.setWallTime('1:00:00');    % set time limit to 1 hour
        ClusterInfo.setEmailAddress('[email protected]');  % email notification

    You can check your settings with:

        ClusterInfo.getQueueName();
        ClusterInfo.getNNode();
        ClusterInfo.getWallTime();
        ClusterInfo.getEmailAddress();

  • 29

    Matlab Distributed Computing Server: batch processing

    Job submission.

    For scripts:

        job = batch('ascript', 'Pool', 15, 'profile', 'nucleus_r2016a');

    'ascript' -- script name; 15 -- number of workers minus 1 (the pool in addition to the worker running the script); 'nucleus_r2016a' -- configuration file.

    For functions:

        job = batch(@sfun, 1, {50, 4000}, 'Pool', 15, 'profile', 'nucleus_r2016a');

    @sfun -- function name; 1 -- number of variables to be returned by the function; {50, 4000} -- cell array containing the input arguments.

  • 30

    Matlab Distributed Computing Server: batch processing

    Waiting for a batch job to finish:

    • Block the session until the job has finished: wait(job, 'finished');

    • Occasionally check on the state of the job: get(job, 'state');

    • Load results after the job has finished: results = fetchOutputs(job) -or- load(job);

    • Delete the job: delete(job);

    You can also open the job monitor from Home -> Parallel -> Monitor Jobs.

  • 31

    Matlab Distributed Computing Server

    Example 1: Estimating an Integration

    quad_submission.m:

        ClusterInfo.setQueueName('super');
        ClusterInfo.setNNode(2);
        ClusterInfo.setWallTime('00:30:00');

        job = batch(@quad_fun, 1, {120000000, 0.13, 1.53}, 'Pool', 63, 'profile', 'nucleus_r2016a');
        wait(job, 'finished');
        Results = fetchOutputs(job);

  • 32

    Matlab Distributed Computing Server

    Running time is 15 s (for the quad_fun submission on the previous slide).

    If we submit the spmd version of the code with MDCS:

        job = batch('quad_fun_spmd', 'Pool', 63, 'profile', 'nucleus_r2016a');

    Running time is 24 s.

    * Recall that the serial job needs only 5.7881 s; MDCS is not designed for small jobs.

  • 33

    Matlab Distributed Computing Server

    Example 4: Vessel Extraction

    Pipeline: microscopy image (size: 2126 * 5517) -> local normalization, remove salt & pepper noise, threshold -> binary image -> separate vessel groups based on connectivity -> 243 subsets of the binary image -> find the central line of each vessel (fast matching method).

    Serial timings: the preprocessing steps take ~21.3 seconds and ~6.6 seconds; the central-line step takes ~6.5 hours.

  • 34

    Matlab Distributed Computing Server

    Example 4: Vessel Extraction (step 3)

    vesselExtraction_submission.m:

        % get file list
        subImages = dir(['BWimages', '/*.jpg']);
        parfor i = 1 : length(subImages)
            filename = subImages(i).name;
            fastMatching(filename);
        end

    The time needed to process each sub-image varies. Running the job on 4 nodes requires the minimum computational time: t = 784 s, speedup = 28.3x.

  • License limitation of Matlab Distributed Computing Server

    35

    Maximum concurrent licenses = 128 workers, shared by all BioHPC users; jobs may be left pending on MDCS licenses.

    Solution:
    • Single-node jobs -- use the local profile and increase the number of workers (2016a)
    • Multiple-node jobs -- SLURM + GNU parallel
      - ex_04: example_parallel_sbatch.sh

    4 nodes with a pool size of 128 threads: t = 699 s

  • SLURM + GNU parallel

    36

    GNU Parallel is a shell tool for executing jobs in parallel using one or more computers.
    http://www.gnu.org/software/parallel/

    A naive way to parallelize is to run jobs in fixed batches, one per CPU, waiting for the whole batch to finish. GNU Parallel instead spawns a new process as soon as one finishes -- keeping the CPUs active and thus saving time.

    Usage: sbatch -> GNU parallel -> srun -> base script

    * Image retrieved from https://www.biostars.org/p/63816/
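The scheduling difference can be simulated with made-up job durations (illustrative Python): fixed batches pay for the slowest job in every batch, while greedy dispatch, which is how GNU Parallel behaves, hands the next job to whichever slot frees up first.

```python
def run_batches(durations, width):
    """Fixed batches of `width` jobs: each batch lasts as long as its slowest job."""
    total = 0.0
    for i in range(0, len(durations), width):
        total += max(durations[i:i + width])
    return total

def run_greedy(durations, width):
    """Greedy dispatch: each job goes to the slot that frees up first."""
    slots = [0.0] * width
    for d in durations:
        j = slots.index(min(slots))  # first slot to become idle
        slots[j] += d
    return max(slots)

jobs = [0.05, 0.2, 0.05, 0.05, 0.2, 0.05, 0.05, 0.05]  # hypothetical runtimes
```

With two slots, the batch schedule takes 0.5 time units while the greedy schedule takes 0.35, because no CPU sits idle waiting for a batch boundary.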

  • Submit multiple jobs to multiple nodes with srun & GNU parallel

    37

        #!/bin/bash
        #SBATCH --job-name=parallelExample
        #SBATCH --partition=super
        #SBATCH --nodes=4
        #SBATCH --ntasks=128
        #SBATCH --time=1-00:00:00

        CORES_PER_TASK=1
        INPUTS_COMMAND="ls BWimages"
        TASK_SCRIPT="single.sh"

        module add parallel

        SRUN="srun --exclusive -n1 -N1 -c $CORES_PER_TASK"
        PARALLEL="parallel --delay .2 -j $SLURM_NTASKS --joblog task.log"

        eval $INPUTS_COMMAND | $PARALLEL $SRUN sh $TASK_SCRIPT {}

    single.sh:

        #!/bin/bash
        module add matlab/2016a
        matlab -nodisplay -nodesktop -singleCompThread -r "fastMatching('$1'), exit"

    • INPUTS_COMMAND: generates the input file list
    • TASK_SCRIPT: base script, run once per input file

  • Questions and Comments

    38

    For assistance with MATLAB, please contact the BioHPC Group:

    Email: [email protected]

    Submit a help ticket at http://portal.biohpc.swmed.edu

    When did it happen? What time? Cluster or client? What job ID? How did you run it?

    Which MATLAB version? Is the code compatible with different versions?

    Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.