HECToR: What is it used for and how can you use it?
Adrian Jackson, [email protected]
Introduction to HPC
Architectures and Parallel Programming
2
Why HPC?
• Scientific simulation and modelling drive the need for greater computing power.
• Single systems could not be made that had enough resource for the simulations needed.
• Runtime of months on a single processor not uncommon
• Parallel programs often start out as serial programs
• Making a faster single chip is difficult due to both physical limitations and cost.
• Theoretical
• Physical limitations to size and speed of a single chip
• Capacitance increases with complexity
• Speed of light, size of atoms, dissipation of heat
• Voltage reduction vs clock speed for power requirements
• Voltages become too small for “digital” differences
• Practical
• Developing new chips is incredibly expensive
• Must make maximum use of existing technology
• Adding more memory to a single chip is expensive and leads to complexity.
• Solution: parallel computing – divide up the work among numerous linked systems.
3
Types of HPC systems
• Shared-memory: OpenMP
• Multiple processors share a single memory space
• Simple to program for many problems
• Scaling is problematic
• Distributed memory: MPI
• Each processing unit has its own memory space
• Excellent scaling properties
• Can be more complex to program due to explicit communications
• Accelerators (GPU, Intel MIC)
• Specialist processing units attached to main CPU
• Can be difficult to extract good performance
• (Conceptually similar to old vector architectures.)
4
[Diagrams: a distributed-memory architecture (processors, each with local memory, linked by an interconnect) and a shared-memory architecture (processors sharing a single memory over a bus).]
Cray T3D
• First national parallel computing service
• EPCC: 1994-1999
• 512 x 150 MHz EV5 Alpha processors
• 76 GFlop/s peak
• 32 GB memory
• Distributed memory system
• Supplemented with Cray T3E
• 344 x 450 MHz EV56 Alpha processors
• 310 GFlop/s peak
• 40 GB memory
5
• Computer Services for Academic Research (CSAR)
• Manchester: 1998-2006
• Shared memory systems
• Number of concurrent systems
• SGI Altix 3700
• 512 Itanium 2 processors
• 1 Terabyte of memory
• 2.7 TFlop/s peak
• SGI Origin 3800
• 512 MIPS R12000 processors
• 512 GBytes of memory
• 400 GFlop/s peak
6
HPCx
• Cluster of shared memory nodes
• The University of Edinburgh (EPCC), STFC (Daresbury), and IBM: 2002-2010
• Phase 1:
• 80 nodes, 1280 processors
• 16 x 1.3 GHz Power4
• 6.7 TFlop/s peak
• 1280 GB memory
• Phase 2:
• 50 nodes, 1600 processors
• 32 x 1.7 GHz Power4+
• 10.8 TFlop/s peak
• 1.6 TB memory
• Phase 3:
• 160 nodes, 2560 cores
• 8 x Dual core 1.6 GHz Power5
• 15.36 TFlop/s peak
• 5.12 TB memory
7
Distributed Shared Memory (clusters)
• Dominant architecture is a hybrid of these two approaches: Distributed Shared Memory.
• A consequence of most HPC systems being built from commodity hardware, where the trend is to multicore processors.
• Each Shared memory block is known as a node.
• Usually 16-64 processors per node.
• Nodes can also contain accelerators.
• The majority of users try to exploit these systems in the same way as a purely distributed machine
• As the number of cores per node increases this becomes increasingly inefficient…
• …and programming for these machines becomes increasingly complex.
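One common approach is hybrid programming: one MPI process per shared-memory node with OpenMP threads across its cores. The following is a minimal sketch of that idea (an illustration of my own, not taken from the slides):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, provided, nthreads = 1;

    /* Hybrid model: MPI between nodes, OpenMP threads within a node. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    printf("MPI rank %d running %d OpenMP threads\n", rank, nthreads);
    MPI_Finalize();
    return 0;
}

It would typically be launched with one process per node and OMP_NUM_THREADS set to the number of cores in a node.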
8
Differences from Cloud computing
• Performance
• Clouds usually use virtual machines which add an extra layer of software.
• In cloud you often share hardware resource with other users – HPC access is usually exclusive.
• Tight-coupling
• HPC parallel programming usually assumes that the separate processes are tightly coupled
• Requires a low-latency, high-bandwidth communication system between tasks
• Cloud usually does not have this
• Programming models
• HPC uses high-level compiled languages with extensive optimisation.
• Cloud is often based on interpreted/JIT-compiled languages.
9
Mode of access
• Usually use SSH to log in to an interactive node (or login node)
• Used for editing and compiling code (usually in a terminal)
• Sometimes more powerful data-analysis/post-processing nodes provided
• Simulations usually run through a batch submission
system
• Write a job submission script and send it to a queue
• When there is enough space on the machine the job runs
• Once it is finished you log in to look at your output
• Transferring large amounts of data (terabytes) can be
challenging!
10
Batch system
• Batch systems equate to job schedulers
• Most HPC systems have 2 parts
• Front-end: Login nodes for users, compilation resource, filesystem, small numbers of
processors
• Back-end: Production system, large numbers of processors
• Important for any HPC system
• Software for managing and launching executables on back-end hardware
• Used to separate jobs from hardware
• HPC jobs require dedicated machine access
• Often back-end systems do not allow direct logon
• No public IP addresses, not running the same level of OS
• Even if they do, need a system for running jobs across large numbers of
processors/machines/frames/etc…
• Used to protect hardware, secure systems, optimise machine usage, enforce usage policies
• Very configurable by administrator/operator
• Some choice for user
11
Parallel Programming
12
Shared-memory programming
• Most common is data-parallel using threads
• Each thread (usually running on a separate core) performs operation on part of larger data structure (e.g. an array).
• Communication through shared data
• Single-Instruction, Multiple Data (SIMD)
• Also the model for accelerators (with hundreds of cores rather than tens).
• OpenMP is the most common example
• Add directives to your code which tell the compiler to parallelise a particular region (see the sketch below).
• Problems scaling to large thread counts (> 16).
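A minimal sketch of this directive-based approach in C (the loop and arrays are illustrative, not from the slides; assumes an OpenMP-capable compiler, e.g. built with -fopenmp):

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    double sum = 0.0;

    /* The directive asks the compiler to split the loop iterations
       across threads; each thread works on part of the arrays. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i] + 1.0;
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}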
13
Message-passing programming
• Can be both task- and data-parallel
• Parallel tasks communicate by sending messages with both the sender and receiver involved (two-sided communication).
• Often used to express high-level, task parallelism
• Standardised as the MPI library which is available on all HPC resources
• Excellent scalability for many programs
• Dominant parallel programming paradigm
• Largest current machines seem to be pushing the limit of MPI scalability
• O(100,000) cores
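A minimal sketch of two-sided message passing with the MPI library (an illustrative neighbour exchange of my own, not from the slides; compile with an MPI wrapper such as mpicc and launch through the batch system's job launcher):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double send, recv = -1.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send = (double)rank;

    /* Two-sided: the sender posts MPI_Send and the receiver posts MPI_Recv. */
    if (rank > 0)
        MPI_Send(&send, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
    if (rank < size - 1)
        MPI_Recv(&recv, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("rank %d of %d received %f\n", rank, size, recv);
    MPI_Finalize();
    return 0;
}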
14
Single-sided communications
• Communication between parallel tasks where only one task is involved
• Either put data directly into remote memory (of other task)…
• …or get data directly from remote memory.
• Can provide higher performance and better scaling than
MPI
• Allows for more flexible overlap of computation and
communication
• Programming can be complex as each task must be aware that its data can change without its knowledge.
• Examples include SHMEM library and PGAS languages
• Lots of current research in this area
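The slides name SHMEM and PGAS languages; as an illustration of the same put/get idea, below is a minimal sketch using MPI's one-sided (RMA) interface, in which only the origin task drives the transfer:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local = 0.0;          /* Memory exposed to remote access */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each task exposes one double for remote put/get operations. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        double value = 42.0;
        /* Only rank 0 is involved: it puts data directly into rank 1's memory. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* Synchronisation is still needed before the
                                    target can safely read the new value.    */

    if (rank == 1)
        printf("rank 1 sees %f without posting a receive\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}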
15
HECToR
UK National Supercomputing Facility
16
HECToR
17
HECToR
• Current national service: 2007 –
• Phase 1:
• Cray XT4
• 1416 nodes, 11,328 cores
• 4 x 2.8 GHz dual-core Opterons
• 63.4 TFlop/s peak
• 34 TB memory
• Phase 2a:
• 1416 nodes, 22,656 cores
• 4 x 2.3 GHz quad-core Opterons
• 208 TFlop/s peak
• 45.3 TB memory
• Phase 2b:
• 1856 nodes, 44,544 cores
• 2 x 2.1 GHz 12-core Opterons
• 373 TFlop/s peak
• 58 TB memory
Node building block
• Node architecture:
• 2x 16-core 2.3 GHz Opterons
• Really 4x 8-core
• 32GB DDR3 RAM
• 85.3GB/s (3.6GB/s/core)
• Gemini interconnect:
• 1 Gemini per 2 compute nodes
• Interconnect connected directly through HyperTransport
• Hardware support for:
• MPI
• Single-sided communications
• Direct memory access
• Latency ~1-1.5μs;
• Bandwidth 5GB/s
19
Architecture: Cray XE6 Supercomputer
[Diagram: compute nodes, login nodes, Lustre OSS and MDS servers, NFS server and boot/SDB node on a 1 GigE backbone; 10 GigE links to backup and archive servers; Lustre high-performance parallel filesystem behind an Infiniband switch.]
• 2816 compute nodes = 90,112 cores
• 90 TB memory; 1 PB disk; 1 PB archive
20
GPGPU Testbed
• 4-node GPGPU testbed
• Quadband Infiniband interconnect
• 32 GB memory per node
• Quad-core Intel Xeon 2.4 GHz
• Range of libraries, compilers, and GPGPU-optimised packages (NAMD, LAMMPS, AMBER)
GPGPUs:
• 4x NVidia Fermi C2050 (3 GB memory)
• 4x NVidia Fermi C2050 (3 GB memory)
• 2x NVidia Fermi C2050 (3 GB memory), 2x NVidia Fermi C2070 (6 GB memory)
• 2x AMD FireStream 9270
21
• Modelling dinosaur gaits (Dr Bill Sellers, University of Manchester)
• Fractal-based models of turbulent flows (Christos Vassilicos & Sylvain Laizet, Imperial College)
• Dye-sensitised solar cells (F. Schiffmann and J. VandeVondele, University of Zurich)
22
What is it used for?
23
Simulation software
24
25
How do I get access?
• Pump-priming access (Class 2a)
• Lightweight procedure for small amounts of time
• Peer-reviewed access (Class 1a)
• Apply for time as part of a standard RCUK grant application
• Large amounts of CPU time only (Class 1b)
• Lightweight application for large amounts of time-limited compute time
• Join one of the science consortia
• Materials chemistry, UK turbulence
• Collaborate with someone who already has access
• If you have any questions about access mail ‘[email protected]’
26
HPC Future Look
It's all about more parallelism…
27
ARCHER
• Next UK National Supercomputer
• Cray XC30
• 3-4 times the throughput of HECToR
• 7/8 standard memory compute nodes
• 1/8 high memory compute nodes (twice the size)
28
• Likely to be 12-core Intel Ivy Bridge processors (2.6 or 2.7 GHz)
• Cray Aries high-performance interconnect
• Best practice guide will soon be available
• http://www.archer.ac.uk
What will future systems look like?

                   2012          2015              2018
System Perf.       20 PFlops     100-200 PFlops    1 EFlops
Memory             1 PB          5 PB              10 PB
Node Perf.         200 GFlops    400 GFlops        1-10 TFlops
Concurrency        32            O(100)            O(1000)
Interconnect BW    40 GB/s       100 GB/s          200-400 GB/s
Nodes              100,000       500,000           O(Million)
I/O                2 TB/s        10 TB/s           20 TB/s
MTTI               Days          Days              O(1 Day)
Power              10 MW         10 MW             20 MW
29
What does this mean for applications?
• The future of HPC (as for everyone else):
• Lots of cores per node (CPU + co-processor)
• Little memory per core
• Lots of compute power per network interface
• The balance of compute to communication power and of compute to memory will both be radically different to now
• Must exploit parallelism at all levels
• Must exploit memory hierarchy efficiently
30
Final thoughts
• Parallel programming is key to exploiting both current and
future computer architectures
• Even non-HPC applications will have to become parallel
• If you are a programmer of any sort you should probably learn
about parallel programming and algorithms
• In HPC – message passing, shared-memory programming and vectorisation
• For cloud/web – concurrency, asynchronous operations and shared-memory programming
31
Training opportunities
• HECToR Training (free to academics):
• http://www.hector.ac.uk/cse/training/
• EPCC PRACE Advanced Training Centre (free to
academics):
• http://www.epcc.ed.ac.uk/training-education/course-programme/
• EPCC M.Sc. in HPC
• http://www.epcc.ed.ac.uk/msc/
32