HECToR: What is it used for and how can you use it?
Adrian Jackson, [email protected]
Introduction to HPC
Architectures and Parallel Programming
2
Why HPC?
• Scientific simulation and modelling drive the need for greater computing power.
• Single systems could not be made that had enough resource for the simulations needed.
• Runtime of months on a single processor not uncommon
• Parallel programs often start out as serial programs
• Making a faster single chip is difficult due to both physical limitations and cost.
• Theoretical
• Physical limitations to size and speed of a single chip
• Capacitance increases with complexity
• Speed of light, size of atoms, dissipation of heat
• Voltage reduction vs clock speed for power requirements
• Voltages become too small for “digital” differences
• Practical
• Developing new chips is incredibly expensive
• Must make maximum use of existing technology
• Adding more memory to a single chip is expensive and leads to complexity.
• Solution: parallel computing – divide up the work among numerous linked systems.
3
Types of HPC systems
• Shared-memory: OpenMP
• Multiple processors share a single memory space
• Simple to program for many problems
• Scaling is problematic
• Distributed memory: MPI
• Each processing unit has its own memory space
• Excellent scaling properties
• Can be more complex to program due to explicit communications
• Accelerators (GPU, Intel MIC)
• Specialist processing units attached to main CPU
• Can be difficult to extract good performance
• (Conceptually similar to old vector architectures.)
4
[Diagrams: a distributed-memory architecture (processors, each with local memory, linked by an interconnect) and a shared-memory architecture (processors sharing a single memory over a bus).]
Cray T3D
• First national parallel computing service
• EPCC: 1994-1999
• 512 x 150 MHz EV5 Alpha processors
• 76 GFlop/s peak
• 32 GB memory
• Distributed memory system
• Supplemented with Cray T3E
• 344 x 450 MHz EV56 Alpha processors
• 310 GFlop/s peak
• 40 GB memory
5
• Computer Services for Academic Research (CSAR)
• Manchester: 1998-2006
• Shared memory systems
• Number of concurrent systems
• SGI Altix 3700
• 512 Itanium 2 processors
• 1 Terabyte of memory
• 2.7 TFlop/s peak
• SGI Origin 3800
• 512 MIPS R12000 processors
• 512 GBytes of memory
• 400 GFlop/s peak
6
HPCx
• Cluster of shared memory nodes
• The University of Edinburgh (EPCC), STFC (Daresbury), and IBM: 2002-2010
• Phase 1:
• 80 nodes, 1280 processors
• 16 x 1.3 GHz Power4
• 6.7 TFlop/s peak
• 1280 GB memory
• Phase 2:
• 50 nodes, 1600 processors
• 32 x 1.7 GHz Power4+
• 10.8 TFlop/s peak
• 1.6 TB memory
• Phase 3:
• 160 nodes, 2560 cores
• 8 x Dual core 1.6 GHz Power5
• 15.36 TFlop/s peak
• 5.12 TB memory
7
Distributed Shared Memory (clusters)
• Dominant architecture is a hybrid of these two approaches: Distributed Shared Memory.
• A consequence of most HPC systems being built from commodity hardware, where the trend is to multicore processors.
• Each Shared memory block is known as a node.
• Usually 16-64 processors per node.
• Nodes can also contain accelerators.
• The majority of users try to exploit these systems in the same way as a purely distributed machine
• As the number of cores per node increases this becomes increasingly inefficient…
• …and programming for these machines becomes increasingly complex.
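One common approach is hybrid programming: one MPI process per shared-memory node with OpenMP threads across its cores. The following is a minimal sketch of that idea (an illustration of my own, not taken from the slides):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, provided, nthreads = 1;

    /* Hybrid model: MPI between nodes, OpenMP threads within a node. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    printf("MPI rank %d running %d OpenMP threads\n", rank, nthreads);
    MPI_Finalize();
    return 0;
}

It would typically be launched with one process per node and OMP_NUM_THREADS set to the number of cores in a node.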
8
Differences from Cloud computing
• Performance
• Clouds usually use virtual machines which add an extra layer of software.
• In cloud you often share hardware resource with other users – HPC access is usually exclusive.
• Tight-coupling
• HPC parallel programming usually assumes that the separate processes are tightly coupled
• Requires a low-latency, high-bandwidth communication system between tasks
• Cloud usually does not have this
• Programming models
• HPC uses high-level compiled languages with extensive optimisation.
• Cloud is often based on interpreted/JIT-compiled languages.
9
Mode of access
• Usually use SSH to log in to an interactive node (or login node)
• Used for editing and compiling code (usually in a terminal)
• Sometimes more powerful data-analysis/post-processing nodes provided
• Simulations usually run through a batch submission
system
• Write a job submission script and send it to a queue
• When there is enough space on the machine the job runs
• Once it is finished you log in to look at your output
• Transferring large amounts of data (terabytes) can be
challenging!
10
Batch system
• Batch systems equate to job schedulers
• Most HPC systems have 2 parts
• Front-end: Login nodes for users, compilation resource, filesystem, small numbers of
processors
• Back-end: Production system, large numbers of processors
• Important for any HPC system
• Software for managing and launching executables on back-end hardware
• Used to separate jobs from hardware
• HPC jobs require dedicated machine access
• Often back-end systems do not allow direct logon
• No public IP addresses, not running the same level of OS
• Even if they do, need a system for running jobs across large numbers of
processors/machines/frames/etc…
• Used to protect hardware, secure systems, optimise machine usage, enforce usage policies
• Very configurable by administrator/operator
• Some choice for user
11
Parallel Programming
12
Shared-memory programming
• Most common is data-parallel using threads
• Each thread (usually running on a separate core) performs operation on part of larger data structure (e.g. an array).
• Communication through shared data
• Single-Instruction, Multiple Data (SIMD)
• Also the model for accelerators (with hundreds of cores rather than tens).
• OpenMP is the most common example
• Add directives to your code which tell the compiler to parallelise a particular region (see the sketch below).
• Problems scaling to large thread counts (> 16).
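A minimal sketch of this directive-based approach in C (the loop and arrays are illustrative, not from the slides; assumes an OpenMP-capable compiler, e.g. built with -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    double sum = 0.0;

    /* The directive asks the compiler to split the loop iterations
       across threads; each thread works on part of the arrays. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i] + 1.0;
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}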
13
Message-passing programming
• Can be both task- and data-parallel
• Parallel tasks communicate by sending messages with both the sender and receiver involved (two-sided communication).
• Often used to express high-level, task parallelism
• Standardised as the MPI library which is available on all HPC resources
• Excellent scalability for many programs
• Dominant parallel programming paradigm
• Largest current machines seem to be pushing the limit of MPI scalability
• O(100,000) cores
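A minimal sketch of two-sided message passing with the MPI library (an illustrative neighbour exchange of my own, not from the slides; compile with an MPI wrapper such as mpicc and launch through the batch system's job launcher):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double send, recv = -1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send = (double)rank;

    /* Two-sided: the sender posts MPI_Send and the receiver posts MPI_Recv. */
    if (rank > 0)
        MPI_Send(&send, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
    if (rank < size - 1)
        MPI_Recv(&recv, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("rank %d of %d received %f\n", rank, size, recv);
    MPI_Finalize();
    return 0;
}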
14
Single-sided communications
• Communication between parallel tasks where only one task is involved
• Either put data directly into remote memory (of other task)…
• …or get data directly from remote memory.
• Can provide higher performance and better scaling than
MPI
• Allows for more flexible overlap of computation and
communication
• Programming can be complex as each task must be aware that its data can change without its knowledge.
• Examples include SHMEM library and PGAS languages
• Lots of current research in this area
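The slides name SHMEM and PGAS languages; as an illustration of the same put/get idea, below is a minimal sketch using MPI's one-sided (RMA) interface, in which only the origin task drives the transfer:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0;          /* Memory exposed to remote access */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each task exposes one double for remote put/get operations. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        double value = 42.0;
        /* Only rank 0 is involved: it puts data directly into rank 1's memory. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* Synchronisation is still needed before the
                                    target can safely read the new value.    */

    if (rank == 1)
        printf("rank 1 sees %f without posting a receive\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}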
15
HECToR
UK National Supercomputing Facility
16
HECToR
17
HECToR
• Current national service: 2007 –
• Phase 1:
• Cray XT4
• 1416 nodes, 11,328 cores
• 4 x 2.8 GHz dual-core Opterons
• 63.4 TFlop/s peak
• 34 TB memory
• Phase 2a:
• 1416 nodes, 22,656 cores
• 4 x 2.3 GHz quad-core Opterons
• 208 TFlop/s peak
• 45.3 TB memory
• Phase 2b:
• 1856 nodes, 44,544 cores
• 2 x 2.1 GHz 12-core Opterons
• 373 TFlop/s peak
• 58 TB memory
Node building block
• Node architecture:
• 2x 16-core 2.3 GHz Opterons
• Really 4x 8-core
• 32GB DDR3 RAM
• 85.3GB/s (3.6GB/s/core)
• Gemini interconnect:
• 1 Gemini per 2 compute nodes
• Interconnect connected directly through HyperTransport
• Hardware support for:
• MPI
• Single-sided communications
• Direct memory access
• Latency ~1-1.5μs;
• Bandwidth 5GB/s
19
Architecture: Cray XE6 Supercomputer
[Diagram: compute nodes, login nodes, Lustre OSS and MDS servers, NFS server and boot/SDB node on a 1 GigE backbone; 10 GigE links to backup and archive servers; Lustre high-performance parallel filesystem behind an Infiniband switch.]
• 2816 compute nodes = 90,112 cores
• 90 TB memory; 1 PB disk; 1 PB archive
20
GPGPU Testbed
• 4-node GPGPU testbed
• Quadband Infiniband interconnect
• 32 GB memory per node
• Quad-core Intel Xeon 2.4 GHz
• Range of libraries, compilers, and GPGPU-optimised packages (NAMD, LAMMPS, AMBER)
GPGPUs:
• 4x NVidia Fermi C2050 (3 GB memory)
• 4x NVidia Fermi C2050 (3 GB memory)
• 2x NVidia Fermi C2050 (3 GB memory), 2x NVidia Fermi C2070 (6 GB memory)
• 2x AMD FireStream 9270
21
• Modelling dinosaur gaits (Dr Bill Sellers, University of Manchester)
• Fractal-based models of turbulent flows (Christos Vassilicos & Sylvain Laizet, Imperial College)
• Dye-sensitised solar cells (F. Schiffmann and J. VandeVondele, University of Zurich)
22
What is it used for?
23
Simulation software
24
25
How do I get access?
• Pump-priming access (Class 2a)
• Lightweight procedure for small amounts of time
• Peer-reviewed access (Class 1a)
• Apply for time as part of a standard RCUK grant application
• Large amounts of CPU time only (Class 1b)
• Lightweight application for large amounts of time-limited compute time
• Join one of the science consortia
• Materials chemistry, UK turbulence
• Collaborate with someone who already has access
• If you have any questions about access mail ‘[email protected]’
26
HPC Future Look
It's all about more parallelism…
27
ARCHER
• Next UK National Supercomputer
• Cray XC30
• 3-4 times the throughput of HECToR
• 7/8 standard memory compute nodes
• 1/8 high memory compute nodes (twice the size)
28
• Likely to be 12-core Intel Ivy Bridge processors (2.6 or 2.7 GHz)
• Cray Aries high-performance interconnect
• Best practice guide will soon be available
• http://www.archer.ac.uk
What will future systems look like?

                   2012          2015              2018
System Perf.       20 PFlops     100-200 PFlops    1 EFlops
Memory             1 PB          5 PB              10 PB
Node Perf.         200 GFlops    400 GFlops        1-10 TFlops
Concurrency        32            O(100)            O(1000)
Interconnect BW    40 GB/s       100 GB/s          200-400 GB/s
Nodes              100,000       500,000           O(Million)
I/O                2 TB/s        10 TB/s           20 TB/s
MTTI               Days          Days              O(1 Day)
Power              10 MW         10 MW             20 MW
29
What does this mean for applications?
• The future of HPC (as for everyone else):
• Lots of cores per node (CPU + co-processor)
• Little memory per core
• Lots of compute power per network interface
• The balance of compute to communication power and of compute to memory will both be radically different to now
• Must exploit parallelism at all levels
• Must exploit memory hierarchy efficiently
30
Final thoughts
• Parallel programming is key to exploiting both current and
future computer architectures
• Even non-HPC applications will have to become parallel
• If you are a programmer of any sort you should probably learn
about parallel programming and algorithms
• In HPC – message passing, shared-memory programming and vectorisation
• For cloud/web – concurrency, asynchronous operations and shared-memory programming
31
Training opportunities
• HECToR Training (free to academics):
• http://www.hector.ac.uk/cse/training/
• EPCC PRACE Advanced Training Centre (free to
academics):
• http://www.epcc.ed.ac.uk/training-education/course-programme/
• EPCC M.Sc. in HPC
• http://www.epcc.ed.ac.uk/msc/
32