

Enhancing the Experimental "MATLAB on the TeraGrid" Resource

Project Description

The "MATLAB on the TeraGrid" experimental resource has proven to be an important and unique parallel resource on the TeraGrid for computational science and data analysis. It attracted many users new to TeraGrid and encouraged them to scale up their research problems. The resource provided seamless parallel MATLAB computational services to remote Linux, Mac, or Windows desktops (http://www.cac.cornell.edu/matlab) and Science Gateway users (https://hubzero.org/resources/495) with complex analytic and fast simulation requirements.

In a new research collaboration with NVIDIA, Dell, and MathWorks, Cornell is testing the performance of general-purpose GPUs with MATLAB applications. MATLAB GPU computing capabilities include data manipulation on NVIDIA GPUs and the use of multiple GPUs on the desktop via the Parallel Computing Toolbox and on a computer cluster via MATLAB Distributed Computing Server. Testing is occurring on Dell C6100 servers with the C410x PCIe expansion chassis, which supports server connections to NVIDIA Tesla M2070 GPUs. In this poster, we share system configuration information, tips for testing and adapting codes for use with GPUs, and GPU test results for six case studies.

GPU Test Results

Six case studies were initiated to examine the process of adapting existing MATLAB codes to utilize the new GPU capabilities available in MATLAB 2011a. Four of the six case studies yielded speedups of 2 times to 14.7 times over the original MATLAB code. The MATLAB codes selected for analysis included audio signal processing, medical image processing, Monte Carlo methods, and finite element methods.

Each case study involved profiling the code to identify potential candidates for GPU optimization and then utilizing one or more of the three methods offered by MATLAB 2011a to utilize GPU hardware. The amount of effort required varied from one hour to achieve a 4-times speedup to two weeks to vectorize an existing code and develop custom CUDA kernels, resulting in a 13-times speedup.

The results of the case studies demonstrate that MATLAB 2011a provides an excellent framework for helping researchers leverage GPU hardware with a relatively modest amount of effort. Future work will focus on exploring the benefits of utilizing multiple GPUs simultaneously and developing a set of best practices for assisting researchers in making the best use of the new GPU capability of MATLAB.

Cornell System Configuration

The Cornell system configuration comprises multiple servers: a Web server, a Windows HPC Server 2008 head node and compute nodes, a SQL Server, a MyProxy server, and a GridFTP server. These are all connected to DataDirect Networks storage, with 8TB dedicated to this project. A Dell PowerEdge C410x hosts the GPUs, which are connected to Dell PowerEdge C6100s. The GPUs are NVIDIA Tesla M2070s. Authentication and access are through X.509 certificates. Users can seamlessly switch from using their desktop for MATLAB multi-core processes to the cluster using either multi-core or multi-node processing. Currently the software stack includes Windows HPC Server 2008 x64, MATLAB R2011a with the Parallel Computing Toolbox (PCT), CUDA Toolkit, HPC Pack 2008, ActivePerl 5.12.3, Microsoft SDK, Microsoft Visual C++ 2010 SP1 Redistributable Package (x64), and a 3D Video Controller on the GPU compute nodes.

Machine learning and signal analysis techniques may automatically identify species such as warblers from their flight calls. (Image courtesy of the McGill Bird Observatory)

Case Study: Theo Damoulas, a research associate with the NSF-established Institute for Computational Sustainability (ISC) directed by Prof. Carla Gomes, benefited from a 12-times speedup in Dynamic Time Warping (DTW) computation by using a combination of built-in MATLAB GPU functions and CUDA code. DTW is the computationally expensive part of the code, which uses machine learning and signal analysis techniques to automatically identify bird species from their flight calls. Automatic flight call classification is much faster and arguably more accurate than manual classification, and it is the first step in creating large-scale networks of recording stations that can provide a detailed understanding of the migration patterns of individual species.

This project is representative of the research of the ISC, whose aim is to provide solutions for balancing environmental, economic, and societal needs for a sustainable future by bringing computational thinking to sustainability research. The ISC is a joint venture involving scientists from Cornell University, Bowdoin College, the Conservation Fund, Howard University, Oregon State University, and the Pacific Northwest National Laboratory.
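For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the textbook dynamic-programming formulation in plain MATLAB. It is not the project's code: the two sequences are made-up values, the local cost (absolute difference) is an assumption, and all variable names are illustrative. The nested loops fill an (n+1)-by-(m+1) cost table, which hints at why DTW becomes expensive when applied to many pairs of recordings.

```matlab
% Two short example sequences (made-up values, for illustration only).
x = [0 1 2 3 2 1 0];
y = [0 0 1 2 3 2 1 0];

n = numel(x);
m = numel(y);

% D(i+1, j+1) holds the cost of the best alignment of x(1:i) with y(1:j).
% The extra first row and column act as the boundary condition.
D = inf(n + 1, m + 1);
D(1, 1) = 0;

for i = 1:n
    for j = 1:m
        cost = abs(x(i) - y(j));   % assumed local distance measure
        % Extend the cheapest of the three admissible predecessor alignments.
        D(i + 1, j + 1) = cost + min([D(i, j + 1), D(i + 1, j), D(i, j)]);
    end
end

dtwDistance = D(n + 1, m + 1);
fprintf('DTW distance: %g\n', dtwDistance);
```

Each table entry depends on its left, upper, and upper-left neighbors, so a single alignment does not vectorize trivially; parallelism typically comes from processing many sequence pairs, or the entries along one anti-diagonal, at the same time.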

David Lila, Eric Chen, Lucia Walle, Susan Mehringer, Steven Lantz, Steven Clark, Pascal Meunier

GPU Technical Specifications

8x NVIDIA Tesla M2070 GPUs
• All 8 housed in a single Dell C410x PCIe expansion chassis
• Reconfigurable: 1 to 8 GPUs can be mapped to any of the servers
• 6GB RAM per GPU

2x Dell C6100 = 8 servers in total, each with:
• 2x Intel 5620 Westmere processors = 8 cores per server
• 24GB RAM
• 1x 250GB hard drive
• Gigabit Ethernet

GPU Peak Rates

8x NVIDIA Tesla M2070 GPUs
• Single precision total: 8 Tflop/s
• Double precision total: 4 Tflop/s

64x Intel 5620 Westmere cores
• Clock rate = 2.4 GHz
• SSE4 multiply-add = 8 flop/core/cycle for SP, or 4 for DP
• Single precision total: 1.2 Tflop/s
• Double precision total: 0.6 Tflop/s

Full System
• Single precision total: 9.2 Tflop/s
• Double precision total: 4.6 Tflop/s
• Nearly equivalent to a 512-core CPU-based system
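The CPU and full-system totals above follow from simple arithmetic, reproduced here as a short MATLAB check. The CPU figures are taken from the list above; the per-GPU peaks of 1 Tflop/s (single precision) and 0.5 Tflop/s (double precision) are assumptions obtained by dividing the quoted GPU totals by 8, and all variable names are illustrative.

```matlab
% CPU side: values quoted in the peak-rate list.
nCores         = 64;     % 8 servers x 8 cores per server
clockGHz       = 2.4;    % clock rate in GHz
flopPerCycleSP = 8;      % SSE4 multiply-add, single precision
flopPerCycleDP = 4;      % SSE4 multiply-add, double precision

cpuSP = nCores * clockGHz * flopPerCycleSP / 1000;   % Tflop/s
cpuDP = nCores * clockGHz * flopPerCycleDP / 1000;   % Tflop/s

% GPU side: per-GPU peaks are ASSUMED (quoted totals divided by 8).
nGPUs    = 8;
gpuSPper = 1.0;          % Tflop/s per GPU, assumed
gpuDPper = 0.5;          % Tflop/s per GPU, assumed

gpuSP = nGPUs * gpuSPper;
gpuDP = nGPUs * gpuDPper;

% Full system.
totalSP = cpuSP + gpuSP;
totalDP = cpuDP + gpuDP;

% Number of CPU cores of this type needed to match the full-system DP peak.
equivCores = totalDP * 1000 / (clockGHz * flopPerCycleDP);

fprintf('CPU:   %.1f SP, %.1f DP Tflop/s\n', cpuSP, cpuDP);
fprintf('Total: %.1f SP, %.1f DP Tflop/s\n', totalSP, totalDP);
fprintf('DP-equivalent CPU cores: %.0f\n', equivCores);
```

With these rounded inputs the core-equivalent comes out somewhat below 512, which is consistent with the "nearly equivalent" wording.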

NVIDIA Tesla GPUs are being used to design the computer-aided diagnosis of breast cancer cells. (Image courtesy of Constantin Friedman, MD and Victor Brodsky, MD, Weill Cornell Medical College)

Case Study: Researchers from Weill Cornell Medical Center, University of Michigan Health System, and Rutgers Laboratory for Computational Imaging and Bioinformatics are currently using the NVIDIA GPUs and MATLAB to accelerate and improve the diagnosis of cancer cells using template matching. Using MATLAB's built-in GPU functions, the researchers experienced a 14.7-times speedup in code processing time (from 86.9 seconds to 5.9 seconds). That's a significant improvement for pathologists who would like to process many large-scale images each day. By comparison, MATLAB code running on GPUs performed 4.8 times faster than code that was implemented in C++ without GPUs. And, because MATLAB is optimized for use with GPUs, users can take advantage of the GPUs' compute power without needing to learn another programming language or leave the MATLAB environment.

MATLAB -> MATLAB + GPU

MATLAB now offers 3 methods for utilizing an NVIDIA GPU to boost the performance of MATLAB code. The following outlines methods utilized to identify MATLAB code candidates that would be well suited for GPU optimization and the steps involved in enabling GPU functionality:

1. Profile code
2. Optimize code
3. Utilize GPU functions

1. Profile code
MATLAB provides a built-in profile command that creates a visual representation of the bottlenecks in MATLAB code.

2. Optimize code
Before utilizing GPU functions it is best to vectorize code bottlenecks. The provided GPU functions work best when code has already been optimized.

3. Utilize GPU functions
There are three methods for using a GPU with MATLAB:
• Built-in GPUArray methods
• ArrayFun
• Executing CUDA kernels

Built-in GPUArray method: simple demo of FFT of 100 million random numbers on CPU vs. GPU
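The three steps above can be sketched in one short script. This is an illustration under stated assumptions, not the code from the case studies: the toy computation, the problem size (1e6 elements rather than the demo's 100 million, so that it fits on modest hardware), and all variable names are made up here. The GPU portion uses gpuArray, gather, and arrayfun from the Parallel Computing Toolbox and is skipped when no supported GPU is found; exact behavior in R2011a may differ from current releases.

```matlab
N = 1e6;                          % assumed size; the poster's demo used 100 million
x = rand(N, 1, 'single');

% Step 1: profile the original loop-based code to locate the bottleneck.
profile on
yLoop = zeros(N, 1, 'single');
for k = 1:N
    yLoop(k) = x(k)^2 + 2*x(k);   % toy stand-in for a real bottleneck
end
profile off
% profile viewer                  % uncomment to open the interactive report

% Step 2: vectorize the bottleneck before involving the GPU.
yVec = x.^2 + 2*x;

% Step 3: use GPU functions, if a supported GPU is available.
hasGPU = false;
try
    d = gpuDevice;                % errors without the toolbox or a usable GPU
    hasGPU = true;
catch
    hasGPU = false;
end

% Reference timing on the CPU.
tic;
Ycpu = fft(x);
tCPU = toc;
fprintf('CPU FFT: %.4f s\n', tCPU);

if hasGPU
    % Method 1: built-in functions operating on GPU arrays.
    xg = gpuArray(x);             % copy the data to GPU memory
    tic;
    Yg = fft(xg);                 % runs on the GPU
    Ygpu = gather(Yg);            % copy the result back (included in the timing)
    tGPU = toc;
    fprintf('GPU FFT (including transfer back): %.4f s\n', tGPU);

    % Method 2: arrayfun applies an element-wise function on the GPU.
    yg = arrayfun(@(v) v.^2 + 2*v, xg);
    yArr = gather(yg);

    % Method 3 (not shown): parallel.gpu.CUDAKernel loads a compiled .ptx
    % file, which requires a separately written and compiled CUDA source.
else
    fprintf('No supported GPU found; skipping the GPU steps.\n');
end
```

Timing the transfer back with gather keeps the comparison honest for small problems, where moving data to and from the GPU can outweigh the computation itself.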

 

[Figure: screenshots annotated "Bottleneck!", with code panels labeled "Original" and "Vectorized"]

Research Project Title                                    Built-in  ArrayFun  CUDA  Speed-up
Spatially-Invariant Vector Quantization (SIVQ)            Yes       Yes       No    14.7x
Nirfast                                                   Yes       No        Yes   13x
Automated Flight Call Classification                      Yes       No        Yes   12x
Array Process of Ambient Noise for Geophysical Inversion  Yes       No        No    2x
White Matter Tracts                                       No        No        No    0x
Electron Trajectory Simulation in Hall-Effect Thrusters   No        No        No    0x

[Figure: system diagram. Labels: GridFTP Server; MyProxy Server; Web Server; SQL Server; Compute Nodes; NVIDIA Tesla M2070s; Head Node; Network Interconnect; GPU Nodes attached to Dell C410x; DDN Storage]