
Page 1:

HECToR: What is it used for and how can you use it?

Adrian Jackson, EPCC
[email protected]

Page 2:

Introduction to HPC

Architectures and Parallel Programming

Page 3:

Why HPC?

• Scientific simulation and modelling drive the need for greater computing power.

• Single systems could not be made with enough resource for the simulations needed
  • Runtimes of months on a single processor are not uncommon
  • Parallel programs often start out as serial programs

• Making a faster single chip is difficult due to both physical limitations and cost
  • Theoretical
    • Physical limitations to the size and speed of a single chip
    • Capacitance increases with complexity
    • Speed of light, size of atoms, dissipation of heat
    • Voltage reduction vs clock speed for power requirements
    • Voltages become too small for “digital” differences
  • Practical
    • Developing new chips is incredibly expensive
    • Must make maximum use of existing technology

• Adding more memory to a single chip is expensive and leads to complexity

• Solution: parallel computing – divide up the work among numerous linked systems

Page 4:

Types of HPC systems

• Shared-memory: OpenMP
  • Multiple processors share a single memory space
  • Simple to program for many problems
  • Scaling is problematic

• Distributed memory: MPI
  • Each processing unit has its own memory space
  • Excellent scaling properties
  • Can be more complex to program due to explicit communications

• Accelerators (GPU, Intel MIC)
  • Specialist processing units attached to the main CPU
  • Can be difficult to extract good performance
  • (Conceptually similar to old vector architectures.)


[Diagrams: a distributed-memory system (processor-memory pairs linked by an interconnect) and a shared-memory system (processors connected to a single memory over a bus).]

Page 5:

Cray T3D

• First national parallel computing service

• EPCC: 1994-1999

• 512 x 150 MHz EV5 Alpha processors

• 76 GFlop/s peak

• 32 GB memory

• Distributed memory system

• Supplemented with Cray T3E

• 344 x 450 MHz EV56 Alpha processors

• 310 GFlop/s peak

• 40 GB memory

Page 6:

• Computer Services for Academic Research

• Manchester: 1998-2006

• Shared-memory systems
  • A number of concurrent systems

• SGI Altix 3700
  • 512 Itanium 2 processors
  • 1 TB of memory
  • 2.7 TFlop/s peak

• SGI Origin 3800
  • 512 MIPS R12000 processors
  • 512 GB of memory
  • 400 GFlop/s peak

Page 7:

• Cluster of shared memory nodes

• The University of Edinburgh (EPCC), STFC (Daresbury), and IBM: 2002-2010

• Phase 1:
  • 80 nodes, 1280 processors
  • 16 x 1.3 GHz Power4 processors per node
  • 6.7 TFlop/s peak
  • 1280 GB memory

• Phase 2:
  • 50 nodes, 1600 processors
  • 32 x 1.7 GHz Power4+ processors per node
  • 10.8 TFlop/s peak
  • 1.6 TB memory

• Phase 3:
  • 160 nodes, 2560 cores
  • 8 x dual-core 1.6 GHz Power5 processors per node
  • 15.36 TFlop/s peak
  • 5.12 TB memory

Page 8:

Distributed Shared Memory (clusters)

• The dominant architecture is a hybrid of these two approaches: distributed shared memory
  • Driven by most HPC systems being built from commodity hardware and the trend to multicore processors

• Each shared-memory block is known as a node
  • Usually 16-64 processors per node
  • Nodes can also contain accelerators

• The majority of users try to exploit these systems in the same way as a purely distributed-memory machine
  • As the number of cores per node increases this becomes increasingly inefficient…
  • …and programming for these machines becomes increasingly complex.

Page 9:

Differences from Cloud computing

• Performance
  • Clouds usually use virtual machines, which add an extra layer of software
  • In the cloud you often share hardware resources with other users – HPC access is usually exclusive

• Tight coupling
  • HPC parallel programming usually assumes that the separate processes are tightly coupled
  • Requires a low-latency, high-bandwidth communication system between tasks
  • Clouds usually do not have this

• Programming models
  • HPC uses high-level compiled languages with extensive optimisation
  • Cloud is often based on interpreted or JIT-compiled languages

Page 10:

Mode of access

• Usually use SSH to log into an interactive node (or login node)

• Used for editing and compiling code (usually in a terminal)

• Sometimes more powerful data-analysis/post-processing nodes provided

• Simulations usually run through a batch submission system

• Write a job submission script and send it to a queue

• When there is enough space on the machine the job runs

• Once it is finished you log in to look at your output

• Transferring large amounts of data (terabytes) can be challenging!

Page 11:

Batch system

• Batch systems equate to job schedulers

• Most HPC systems have two parts
  • Front-end: login nodes for users, compilation resource, filesystem, small numbers of processors
  • Back-end: production system, large numbers of processors

• Important for any HPC system
  • Software for managing and launching executables on the back-end hardware
  • Used to separate jobs from hardware

• HPC jobs require dedicated machine access
  • Often back-end systems do not allow direct logon
    • No public IP addresses, not running the same level of OS
  • Even if they do, a system is needed for running jobs across large numbers of processors/machines/frames/etc.

• Used to protect hardware, secure systems, optimise machine usage, enforce usage policies
  • Very configurable by the administrator/operator
  • Some choice for the user

Page 12:

Parallel Programming

Page 13:

Shared-memory programming

• Most common is data-parallel programming using threads

• Each thread (usually running on a separate core) performs operation on part of larger data structure (e.g. an array).

• Communication through shared data

• Single-Instruction, Multiple Data (SIMD)

• Also the model for accelerators (with hundreds of cores rather than tens).

• OpenMP is the most common example
  • Add directives to your code which tell the compiler to parallelise a particular region (see the sketch below)

• Problems scaling to large thread counts (> 16).
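
A minimal sketch of this directive-based approach (illustrative only, not taken from the slides): a C loop parallelised with a single OpenMP directive. It would typically be compiled with an OpenMP-enabled compiler (e.g. cc -fopenmp), with OMP_NUM_THREADS controlling the thread count.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void)
    {
        /* The directive asks the compiler to split the loop iterations
           across threads; the arrays are shared between the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i] + 1.0;
        }

        printf("Ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }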

Page 14:

Message-passing programming

• Can be both task- and data-parallel

• Parallel tasks communicate by sending messages, with both the sender and receiver involved (two-sided communication); see the sketch below

• Often used to express high-level, task parallelism

• Standardised as the MPI library which is available on all HPC resources

• Excellent scalability for many programs

• Dominant parallel programming paradigm

• The largest current machines seem to be pushing the limits of MPI scalability
  • O(100,000) cores
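
An illustrative sketch of two-sided message passing with the MPI library (not taken from the slides): rank 0 sends a value to every other rank, and each receiver posts a matching receive. It would typically be compiled with the system's MPI wrapper (e.g. mpicc) and launched with the site's job launcher (e.g. mpiexec, or aprun on a Cray system).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Sender side of the two-sided exchange. */
            for (int dest = 1; dest < size; dest++) {
                double work = 42.0;
                MPI_Send(&work, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            }
        } else {
            /* Receiver side: a matching receive must be posted. */
            double work;
            MPI_Recv(&work, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank %d of %d received %.1f\n", rank, size, work);
        }

        MPI_Finalize();
        return 0;
    }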

Page 15:

Single-sided communications

• Communication between parallel tasks where only one task is involved
  • Either put data directly into the remote memory of the other task…
  • …or get data directly from remote memory

• Can provide higher performance and better scaling than two-sided MPI

• Allows for more flexible overlap of computation and communication

• Programming can be complex as each task must be aware that its data can change without it knowing

• Examples include the SHMEM library and PGAS languages; see the sketch below

• Lots of current research in this area
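
The slide cites SHMEM and PGAS as examples; purely as an illustration (not from the slides), the same put-style idea can be sketched with MPI's one-sided (RMA) interface: rank 0 writes straight into rank 1's memory, and synchronisation comes from fences rather than a matching receive. Run with at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double buf = 0.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2)
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* Expose one double on every rank as a remotely accessible window. */
        MPI_Win_create(&buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 3.14;
            /* Write into rank 1's window; rank 1 makes no matching call. */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("Rank 1 received %.2f without posting a receive\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }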

Page 16:

HECToR

UK National Supercomputing Facility

Page 17:

HECToR

Page 18:

HECToR

• Current national service: 2007 –

• Phase 1: Cray XT4
  • 1416 nodes, 11,328 cores
  • 4 x 2.8 GHz dual-core Opterons per node
  • 63.4 TFlop/s peak
  • 34 TB memory

• Phase 2a:
  • 1416 nodes, 22,656 cores
  • 4 x 2.3 GHz quad-core Opterons per node
  • 208 TFlop/s peak
  • 45.3 TB memory

• Phase 2b:
  • 1856 nodes, 44,544 cores
  • 2 x 2.1 GHz 12-core Opterons per node
  • 373 TFlop/s peak
  • 58 TB memory

Page 19:

Node building block

• Node architecture:

• 2x 16-core 2.3GHz

• Really 4x 8-core

• 32GB DDR3 RAM

• 85.3GB/s (3.6GB/s/core)

• Gemini interconnect:

• 1 Gemini per 2 compute nodes

• Interconnect connected directly through HyperTransport

• Hardware support for:

• MPI

• Single-sided communications

• Direct memory access

• Latency ~1-1.5 μs; bandwidth 5 GB/s

Page 20:

Architecture

Cray XE6 supercomputer

Lustre high-performance, parallel filesystem

[System diagram: compute nodes, login nodes, Lustre OSS, Lustre MDS, NFS server and boot/SDB node on a 1 GigE backbone, with 10 GigE connections through an Infiniband switch to backup and archive servers.]

2816 compute nodes = 90,112 cores

90 TB Memory; 1PB disk; 1PB archive

Page 21:

GPGPU Testbed

• 4-node GPGPU testbed

• Quadband Infiniband interconnect

• 32 GB memory per node

• Quad-core Intel Xeon 2.4GHz

• Range of libraries, compilers, and GPGPU-optimised packages (NAMD, LAMMPS, AMBER)

• GPGPUs (per node):
  • 4x NVidia Fermi C2050 (3 GB memory)
  • 4x NVidia Fermi C2050 (3 GB memory)
  • 2x NVidia Fermi C2050 (3 GB memory) + 2x NVidia Fermi C2070 (6 GB memory)
  • 2x AMD FireStream 9270

Page 22:

Modelling dinosaur gaits (Dr Bill Sellers, University of Manchester)

Fractal-based models of turbulent flows (Christos Vassilicos & Sylvain Laizet, Imperial College)

Dye-sensitised solar cells (F. Schiffmann and J. VandeVondele, University of Zurich)

Page 23:

What is it used for?

Page 24:

Simulation software

Page 25:

Page 26:

How do I get access?

• Pump-priming access (Class 2a)
  • Lightweight procedure for small amounts of time

• Peer-reviewed access (Class 1a)
  • Apply for time as part of a standard RCUK grant application

• Large amounts of CPU-only time (Class 1b)
  • Lightweight application for large amounts of time-limited compute time

• Join one of the science consortia
  • Materials chemistry, UK turbulence

• Collaborate with someone who already has access

• If you have any questions about access, mail [email protected]

Page 27:

HPC Future Look

It's all about more parallelism…

Page 28:

ARCHER

• Next UK National Supercomputer

• Cray XC30

• 3-4 times the throughput of HECToR

• 7/8 standard-memory compute nodes

• 1/8 high-memory compute nodes (twice the size)

• Likely to be 12-core Intel Ivy Bridge processors (2.6 or 2.7 GHz)

• Cray Aries high-performance interconnect

• Best practice guide will soon be available
  • http://www.archer.ac.uk

Page 29:

What will future systems look like?

                 2012        2015            2018
System Perf.     20 PFlops   100-200 PFlops  1 EFlops
Memory           1 PB        5 PB            10 PB
Node Perf.       200 GFlops  400 GFlops      1-10 TFlops
Concurrency      32          O(100)          O(1000)
Interconnect BW  40 GB/s     100 GB/s        200-400 GB/s
Nodes            100,000     500,000         O(Million)
I/O              2 TB/s      10 TB/s         20 TB/s
MTTI             Days        Days            O(1 Day)
Power            10 MW       10 MW           20 MW

Page 30:

What does this mean for applications?

• The future of HPC (as for everyone else):

• Lots of cores per node (CPU + co-processor)

• Little memory per core

• Lots of compute power per network interface

• The balance of compute to communication power, and of compute to memory, will both be radically different from now

• Must exploit parallelism at all levels (e.g. MPI between nodes combined with threads within a node; see the sketch below)

• Must exploit the memory hierarchy efficiently
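
One way to exploit parallelism at all levels on such machines is hybrid programming: MPI between nodes combined with OpenMP threads within each node. The following is only an illustrative sketch under that assumption (not taken from the slides); it would typically be compiled with an MPI wrapper plus the OpenMP flag and launched with one MPI process per node.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int provided, rank;

        /* Request an MPI library that tolerates threaded callers. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One MPI process per node; OpenMP threads fill the node's cores. */
        #pragma omp parallel
        {
            printf("MPI rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }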

Page 31:

Final thoughts

• Parallel programming is key to exploiting both current and future computer architectures

• Even non-HPC applications will have to become parallel

• If you are a programmer of any sort you should probably learn about parallel programming and algorithms
  • In HPC: message passing, shared-memory programming and vectorisation
  • For cloud/web: concurrency, asynchronous operations and shared-memory programming

Page 32:

Training opportunities

• HECToR Training (free to academics):

• http://www.hector.ac.uk/cse/training/

• EPCC PRACE Advanced Training Centre (free to academics):

• http://www.epcc.ed.ac.uk/training-education/course-programme/

• EPCC M.Sc. in HPC

• http://www.epcc.ed.ac.uk/msc/

Page 33:

Any questions?

[email protected]

[email protected]
