
  • November 2018

    High Performance Computing @ AUB

    American University of Beirut

    GradEx Workshop - Mher Kazandjian

  • How is this talk structured?

    • History of computing
    • Scientific computing workflows
    • Computer architecture overview
    • Do's and don'ts
    • Demos and walkthroughs

  • Goals

    • Demonstrate how you (as users) can benefit from AUB's HPC facilities
    • Attract users, because:
      • we want to boost scientific computing research
      • we want to help you
      • we have capacity

    This presentation is based on actual feedback and use cases collected from users over the past year

  • History of computing

    Alan Turing 1912-1954

  • Growth over time

    12 orders of magnitude since 1960

  • Growth over time

    ~12 orders of magnitude since 1960

    If you had $1000 in 1970, hardware that costs the same today can do ~10^12 times more calculations

  • What is HPC used for today?

    ● Solving scientific problems
    ● Data mining and deep learning
    ● Military research and security
    ● Cloud computing
    ● Blockchain (cryptocurrency)

  • What is HPC used for today?

    ● https://blog.openai.com/ai-and-compute/

  • Multicore CPUs hit the market in ~2005


    Growth over time

    Users at home started benefiting from parallelism

    Prior to that, applications that scaled well were restricted to mainframes / datacenters and HPC clusters

  • HPC @ AUB in 2006

    8 compute nodes
    Specs per node:
    - 4 cores
    - 8 GB RAM

    ~ 80 GFlops

  • HPC is all about scalability

    • The high-speed network is the "most" important component

  • But what is scalability?

    Performance improvement as the number of cores (resources) increases, for the same problem size - hard scalability (also known as strong scaling)

  • But what is scalability?

    This is a CPU under a microscope

  • But what is scalability?

    Prog.exe (1 copy) → 2 sec

    Serial runtime = T_serial

  • But what is scalability?

    Prog.exe (2 copies) → 1 sec

    Parallel runtime = T_parallel

  • But what is scalability?

    Prog.exe (4 copies) → 0.5 sec

    Parallel runtime = T_parallel

  • But what is scalability?

    Prog.exe (4 copies) → 0.5 sec

    Very nice!! But this is usually never the case.
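
    In the usual notation (a short summary of the T_serial / T_parallel picture above, with p the number of cores):

        S(p) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(p)},
        \qquad
        E(p) = \frac{S(p)}{p}

    For the example above, T_serial = 2 sec and T_parallel(4) = 0.5 sec, so S(4) = 4 and E(4) = 1, i.e. the ideal (and rarely achieved) case.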

  • First demo – First scalability diagram

  • But what is scalability?

    Repeat the same process across multiple processors (one copy of Prog.exe per processor)

  • But what is scalability?

    Wait!
    - How do these processors talk to each other?
    - How much data needs to be transferred for a certain task?
    - How fast do the processes communicate with each other?
    - How often should the processes communicate with each other?

  • At the single chip level

    Through the cache memory of the CPU

    Typical latency ~ ns (or less)
    Typical bandwidth > 150 GB/s

  • At the single chip level

    Through the RAM

    Random Access Memory (aka RAM)

    Typical latency ~ a few to tens of ns
    Typical bandwidth ~ 10 to 50 GB/s (sometimes more)

    https://ark.intel.com/#@Processors

  • Second demo: bandwidth and some lingo

    - An array is just a bunch of bytes
    - Bandwidth is the speed with which information is transferred
    - A float (double precision) is 8 bytes
    - An array of one million elements is 1000 x 1000 x 8 bytes = 8 MB
    - If I measure the time to initialize this array, I can measure how fast the CPU can access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
    - bandwidth = size of array / time to initialize it

    Intel i7-6700HQ - https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz-
    - Advertised bandwidth = 34 GB/s
    - Measured bandwidth (single-thread quickie) = 22.8 GB/s
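
    As a rough illustration only (this is not the demo code from the talk, which is not included in this transcript), a minimal C sketch of the measurement described above - allocate one million doubles, time a loop that sets every element to zero, and divide the size by the elapsed time:

        /* bandwidth_sketch.c - hedged illustration of the estimate above.
           Compile: gcc -O2 bandwidth_sketch.c -o bandwidth_sketch          */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int main(void)
        {
            const size_t n = 1000 * 1000;             /* one million elements */
            const size_t nbytes = n * sizeof(double); /* 8 MB                 */
            double *arr = malloc(nbytes);
            if (arr == NULL)
                return 1;

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < n; i++)   /* visit every memory address once */
                arr[i] = 0.0;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("bandwidth ~ %.1f GB/s (check value: %g)\n",
                   nbytes / dt / 1e9, arr[n / 2]);

            free(arr);
            return 0;
        }

    A single-threaded quickie like this also pays page-fault overhead on the first touch of each page, so it will typically report well below the advertised figure, consistent with the 22.8 GB/s vs 34 GB/s gap above.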

  • At the single motherboard level

    Through the RAM

    Random Access Memory (aka RAM)

    Typical latency ~ a few to tens of ns
    Typical bandwidth ~ 10 to 100 GB/s (sometimes more)

    Through QPI (QuickPath Interconnect):
    - typical latency for small data ~ ns
    - typical bandwidth ~ 100 GB/s

    TIP: server = node = compute node = NUMA node

  • Second demo: bandwidth, multi-threaded

    - https://github.com/jeffhammond/STREAM
    - https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI

    - 2 x sockets, expected bandwidth ~102 GB/s
    - measured ~75 GB/s
    - on a completely idle node ~95 GB/s is possible

    Another benchmark: 2-socket Intel Xeon server

    https://github.com/jeffhammond/STREAM

  • At the cluster level (multiple nodes)

    Through the network (ethernet)

    Typical latency ~ 10 to 100 microseconds
    Typical bandwidth ~ 100 MB/s to a few 100 MB/s

  • At the cluster level (multiple nodes)

    Through the network (InfiniBand - high-speed network)

    Typical latency ~ a few microseconds down to < 1 microsecond
    Typical bandwidth > 3 GB/s

    Benefits over ethernet:
    - Remote Direct Memory Access (RDMA)
    - higher bandwidth
    - much lower latency

    https://en.wikipedia.org/wiki/InfiniBand

  • What hardware do we have at AUB?

    - Arza:
      - 256-core, 1 TB RAM IBM cluster
      - production simulations, benchmarking
      - http://website.aub.edu.lb/it/hpc/Pages/home.aspx

    - vLabs:
      - see Vassili's slide
      - very flexible, easy to manage, Windows support

    - Public cloud:
      - infinite resources - limited by $$$
      - two pilot projects being tested - will be open soon for testing

    http://website.aub.edu.lb/it/hpc/Pages/home.aspx

  • Parallelization libraries / software

    SMP parallelism:
    - OpenMP
    - CUDA
    - Matlab
    - Spark (recently deployed and tested)

    Distributed parallelism (cluster-wide):
    - MPI
    - Spark
    - MPI + OpenMP (hybrid)
    - MPI + CUDA
    - MPI + CUDA + OpenMP
    - Spark + CUDA (not tested - any volunteers?)

  • Linux/Unix culture

    > 99% of HPC clusters worldwide use some kind of Linux / Unix

    - Clicking your way to install software is easy for you (on windows or mac), but a nightmare for power users.

    - Linux is:
      - open-source
      - free
      - secure (at least much more secure than Windows et al.)
      - no need for an antivirus that slows down your system
      - respects your privacy
      - huge community support in scientific computing
      - 99.8% of all HPC systems worldwide since 1996 are non-Windows machines
        https://github.com/mherkazandjian/top500parser

  • Software stack on the HPC cluster

    - Matlab
    - C, Java, C++, Fortran
    - Python 2 and Python 3 - Jupyter notebooks
    - Tensorflow (deep learning)
    - Scala
    - Spark
    - R - R Studio, R server (new)

  • Cluster usage: Demo

    - The scheduler: resource manager
      - bjobs
      - bqueues
      - bhosts
      - lsload

    - important places:
      - /gpfs1/my_username
      - /gpfs1/apps/sw

    - basic linux knowledge

    - sample job script

  • Cluster usage: Documentation

    https://hpc-aub-users-guide.readthedocs.io/en/latest/
    https://github.com/hpcaubuserguide/hpcaub_userguide

    The guide is for you:
    - we want you to contribute to it directly
    - please send us pull requests

  • Cluster usage: Job scripts

    https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

  • Cluster usage: Job scripts

    https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html

    In the user guide, there are samples and templates for many use cases:
    - we will help you write your own if your use case is not covered
    - this is 90% of the getting-started task
    - recent success story:
      - Spark server job template


  • How to benefit from the HPC hardware?

    - Run many serial jobs that do not need to communicate
      - aka embarrassingly parallel jobs (nothing embarrassing about it though, as long as you get your job done)
      - e.g.
        - train several neural networks with different numbers of layers
        - do a parameter sweep for a certain model

      ./my_prog.exe --param 1 &
      ./my_prog.exe --param 2 &
      ./my_prog.exe --param 3 &

      These would execute simultaneously

    - difficulty: very easy

  • How to benefit from the HPC hardware?

    - Run many serial jobs that do not need to communicate - Demo

  • How to benefit from the HPC hardware?

    - Run an SMP parallel program (i.e. on one node, using threads)
      - e.g. Matlab, C/C++/Python/Java

    Difficulty: very easy to medium (problem dependent)

  • How to benefit from the HPC hardware?

    - Run an SMP parallel program (i.e. on one node, using threads) - C (see the sketch below)
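
    The C code shown on this slide is not part of the transcript; as a hedged stand-in, a minimal OpenMP sketch of a threaded loop on a single node:

        /* smp_sketch.c - hedged OpenMP illustration, not the slide's code.
           Compile: gcc -O2 -fopenmp smp_sketch.c -o smp_sketch             */
        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            const long n = 100000000;   /* 1e8 terms */
            double sum = 0.0;

            /* split the iterations across the node's threads and
               combine the per-thread partial sums at the end      */
            #pragma omp parallel for reduction(+:sum)
            for (long i = 0; i < n; i++)
                sum += 1.0 / (i + 1.0);

            printf("threads available: %d, sum = %.6f\n",
                   omp_get_max_threads(), sum);
            return 0;
        }

    The thread count can be controlled with the OMP_NUM_THREADS environment variable, which is how single-node scalability is usually probed.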

  • How to benefit from the HPC hardware?

    - Run an SMP parallel program (i.e. on one node, using threads)
      - Demo: Matlab parfor

  • How to benefit from the HPC hardware?

    - Run a hybrid MPI + OpenMP parallel job
      - Demo: Gauß (astrophysics N-body code) scalability diagram - single node

    [diagram: one MPI process spawning several OpenMP threads]

  • How to benefit from the HPC hardware?

    - Hybrid MPI + OpenMP parallel job
      - Gauß (astrophysics N-body code) scalability diagram - single node

    [diagram: one MPI process spawning several OpenMP threads; a code sketch follows]
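
    The Gauß source is not reproduced here; as a hedged illustration of the layout in the diagram (one MPI rank per node, several OpenMP threads per rank), a minimal hybrid C example:

        /* hybrid_sketch.c - hedged MPI + OpenMP illustration, not the Gauß code.
           Compile: mpicc -O2 -fopenmp hybrid_sketch.c -o hybrid_sketch
           Run:     mpirun -np <ranks> ./hybrid_sketch                           */
        #include <stdio.h>
        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, nranks;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            /* each MPI rank takes a contiguous chunk of the work and
               processes it with its own OpenMP threads                */
            const long n = 100000000;
            long chunk = n / nranks;
            long start = rank * chunk;
            long end = (rank == nranks - 1) ? n : start + chunk;
            double local = 0.0;

            #pragma omp parallel for reduction(+:local)
            for (long i = start; i < end; i++)
                local += 1.0 / (i + 1.0);

            /* the per-rank results are combined over the interconnect */
            double total = 0.0;
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("%d ranks x %d threads, total = %.6f\n",
                       nranks, omp_get_max_threads(), total);

            MPI_Finalize();
            return 0;
        }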

  • How to benefit from the HPC hardware?

    - Run a deep learning job
      - Demo: Tensorflow

  • How to benefit from the HPC hardware?

    - Jupyter notebooks (connect through a web interface)
    - R server (connect through a web interface)
    - Spark (full cluster configuration - up to 1 TB RAM usage)
    - Map-Reduce

  • How to benefit from the HPC hardware?

    - Optimal performance benefits

    - Go low level - C/Fortran/C++
      - need to have a good design
      - good understanding of the architecture

    - Currently the only customers running such codes:
      - Chemistry department research group
      - Physics department
      - Computer Science research group getting involved too

  • Workflows: best practices

    - prototype on your machine: laptop, terminal/workstation at your department

    - when you think your job could benefit from HPC resources:
      - talk to us (we can help you assess your program better)
      - prepare a clean prototype

    - we will provide you with pilot project access

    - tune / parallelize your application [ we can help you with that if needed ]

    - run production jobs

    - if you need specific hardware that is not available on campus:
      - go to the cloud
      - ideal for testing / benchmarking your code/app on the latest and greatest hardware

  • Containers

    - Run on top of the kernel

    - Zero overhead

    - Portable

    - Currently used on the cluster:
      + deep learning containers
      + R Studio server (new)

    - We can help you produce a custom container tailored to your problem
      + you can create your own container too (no need for admin rights)

    - Pros:
      + reproducibility and portability

    - Cons:
      + you must be a geek to set up a container and willing to put in the effort (lots of help available online though)

  • Thank you for attending!