
SURFsara Data and Computing Infrastructure Event – Large-scale Computing

An Overview of the SURFsara Compute Services

Walter Lioen <[email protected]>

Group Leader Supercomputing

Compute Services – Portfolio

SURFsara Data and Computing Infrastructure Event – Large-scale Computing, March 13, 2014

Capacity Computing vs. Capability Computing

• Capacity Computing

- E.g. parameter sweep: 1000 single-core runs of 24 hours each

- On a PhD student's workstation / laptop: about 3 years

- On 1000 cores of a grid / cluster: 1 day / weekend

- Grid computing

- Cluster computing

- Cloud computing

- Hadoop

• Capability computing

- Single problem requiring one or more of:

- Many cores (CPU time)

- Much memory (problem size)

- Fast interconnect (communication intensive)

- Fast I/O subsystem (large data sets)

- Supercomputing (general purpose)

- Accelerators (GPGPUs, Intel Xeon Phi)

- (Cluster computing)
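
The parameter-sweep numbers in the capacity computing example can be checked with a few lines of arithmetic. This is only a sketch: it assumes fully independent runs and ignores queueing and scheduling overhead, and the function name `wall_clock_days` is made up for illustration.

```python
# Parameter sweep from the slide: 1000 single-core runs of 24 hours each.
RUNS = 1000
HOURS_PER_RUN = 24


def wall_clock_days(cores: int) -> float:
    """Wall-clock days needed when `cores` runs execute concurrently.

    Assumes independent runs and no queueing or scheduling overhead.
    """
    batches = -(-RUNS // cores)  # ceiling division: rounds of concurrent runs
    return batches * HOURS_PER_RUN / 24


total_core_hours = RUNS * HOURS_PER_RUN

print("total work (core hours):", total_core_hours)
print("years on a single core:", wall_clock_days(1) / 365)
print("days on 1000 cores:", wall_clock_days(1000))
```

On a single core the sweep takes 1000 days, which is where the "about 3 years" on a workstation comes from; on 1000 cores it fits in one day.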


Supercomputing: Cartesius

Phase 1 (production June 2013, total peak performance 271 TFlop/s)

• Direct Liquid Cooled thin node islands

- 360 thin nodes, 2 × 12-core 2.4 GHz Intel Ivy Bridge CPUs/node, 64 GB/node

- 180 thin nodes, 2 × 12-core 2.4 GHz Intel Ivy Bridge CPUs/node, 64 GB/node

• Fat node island

- 32 fat nodes, 4 × 8-core Intel Sandy Bridge CPUs/node, 256 GB/node

• Total

- 13,968 cores, 41.75 TB memory, 2.4 PB disk

- Interconnect: InfiniBand 56 Gbit/s bandwidth, 3 µs latency

- Top 500 November 2013: #184

Phase 1.5 (scheduled production 2014 Q2, total peak performance ~470 TFlop/s)

• Addition of accelerator island

- 66 nodes, 2 Intel Ivy Bridge CPUs/node, 2 NVIDIA Tesla K40 GPGPUs/node

Phase 2 (scheduled production 2014 H2, total peak performance > 1 PFlop/s)

• On-demand addition of thin node islands with latest Intel Haswell CPUs
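
As a sanity check, the Phase 1 memory total can be recomputed from the node counts listed above. The sketch below assumes the slide uses binary units (1 TB = 1024 GB); the dictionary keys are illustrative labels, not official island names.

```python
# (number of nodes, memory per node in GB) as listed for Phase 1.
PHASE1_NODES = {
    "thin_island_a": (360, 64),
    "thin_island_b": (180, 64),
    "fat_island": (32, 256),
}

total_gb = sum(count * gb for count, gb in PHASE1_NODES.values())
total_tb = total_gb / 1024  # assumption: 1 TB = 1024 GB

print(f"total memory: {total_gb} GB = {total_tb} TB")
```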


SURFsara National Supercomputing History


Year  Machine                              Rpeak (GFlop/s)  Power (kW)  GFlop/s per kW
1984  CDC Cyber 205 1-pipe                 0.1              250         0.0004
1988  CDC Cyber 205 2-pipe                 0.2              250         0.0008
1991  Cray Y-MP/4128                       1.33             200         0.0067
1994  Cray C98/4256                        4                300         0.0133
1997  Cray C916/121024                     12               500         0.024
2000  SGI Origin 3800                      1,024            300         3.4
2004  SGI Origin 3800 + Altix 3700         3,200            500         6.4
2007  IBM p575 Power5+                     14,592           375         40
2008  IBM p575 Power6                      62,566           540         116
2009  IBM p575 Power6                      64,973           560         116
2013  Bull bullx B710 (DLC) + R428         270,950          245         1106
2014  + Bull bullx B515 (NVIDIA K40)       >200,000         <60         >3333
2014  Bull bullx complete system           >1,000,000       >520        >1923
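
The last column of the table is simply peak performance divided by power draw. A minimal sketch of that calculation follows, using a few rows copied from the table; the helper name `gflops_per_kw` is an illustrative choice.

```python
# (year, machine, Rpeak in GFlop/s, power in kW), copied from the table.
SYSTEMS = [
    (1984, "CDC Cyber 205 1-pipe", 0.1, 250),
    (1997, "Cray C916/121024", 12, 500),
    (2009, "IBM p575 Power6", 64_973, 560),
    (2013, "Bull bullx B710 (DLC) + R428", 270_950, 245),
]


def gflops_per_kw(rpeak_gflops: float, power_kw: float) -> float:
    """Energy efficiency: peak GFlop/s delivered per kW of power."""
    return rpeak_gflops / power_kw


for year, machine, rpeak, kw in SYSTEMS:
    print(f"{year}  {machine:<30} {gflops_per_kw(rpeak, kw):.4g} GFlop/s per kW")
```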


Moore’s Law (1965)

• The number of transistors on an integrated circuit doubles every 2 years

• Because of faster transistors, the speed doubles every 18 months

• The clock speed stopped doubling a couple of years ago

• Nowadays the number of cores doubles

• Moore noted that if car manufacturers had something like this, cars would get 100,000 miles to the gallon and it would be cheaper to buy a Rolls Royce than park it. (Cars would also be only half an inch long.)
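
The doubling rules above translate into exponential growth: after t years with doubling period T, the growth factor is 2^(t/T). A small sketch, with `growth_factor` as an illustrative helper name:

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` for a quantity that doubles every period."""
    return 2 ** (years / doubling_period_years)


# Transistor count, doubling every 2 years, over one decade.
print(growth_factor(10, 2))

# Speed, doubling every 18 months, over the same decade.
print(growth_factor(10, 1.5))
```

The difference between a 2-year and an 18-month doubling period compounds quickly, which is why the two statements on the slide are kept apart.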


Cluster Computing: LISA

• At present

- 624 heterogeneous Dell nodes, 16 – 24 GB memory

- 3 node types: dual quad-, hexa- or octo-core Intel CPUs, 1.80 – 2.26 GHz

- QLogic 4× DDR InfiniBand → 20 Gbit/s inter-node

- Total: 6,528 cores, 46 TFlop/s, 16 TB memory

- 100 TB disk space

- OS: Debian Linux AMD64

• SURFsara installed the LISA system on behalf of

- The University of Amsterdam (UvA)

- VU University Amsterdam (VU)

- The SURF Foundation


Grid infrastructure and usage

• National Grid infrastructure (NGI_NL):

- Operated in collaboration with Nikhef and RUG-CIT

- Grid compute clusters at Nikhef, RUG-CIT and SURFsara

- Grid Services Cluster at SURFsara

- Life Science Grid (operated by SURFsara)

- Grid Front-End storage at Nikhef and SURFsara (total capacity 5.2 PB)

- Part of European Grid Infrastructure (EGI)

• Grid Usage:

- In 2013 more than 53 million core hours (6050 core years)

- Scientific projects/users: 34 e-infra projects, 226 users
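
The conversion from core hours to core years in the usage figures is a plain division by the number of hours in a year. A quick check, assuming a 365-day year:

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year

core_hours_2013 = 53_000_000  # "more than 53 million core hours"
core_years = core_hours_2013 / HOURS_PER_YEAR

print(f"{core_years:.0f} core years")
```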


Life Science Grid sites

• LUMC – Leiden

• WUR – Wageningen

• NKI – Amsterdam

• AMC – Amsterdam

• Radboud Universiteit – Nijmegen

• ErasmusMC – Rotterdam

• Technische Universiteit Delft

• Rijksuniversiteit Groningen

• Keygene – Wageningen

• Each of these clusters has 128 cores and 40 TB of storage space

• SURFsara: 3000 cores and large tiered storage (disk and tape)

• Nikhef: 3300 cores, RUG-CIT: 1000 cores


HPC Cloud

• You can clone your laptop or server

Create an image and upload it into the cloud

• And expand it …

Largest VM goes up to 32 cores and 256 GB RAM

No overcommitting → HPC Cloud

• Or clone it to multiple copies, create your own cluster

Start multiple worker nodes when needed

• But remember: you are your own system administrator …


HPC Cloud

Current HPC Cloud infrastructure

• 30 HPC Nodes: 32 cores, 256 GB RAM, 1 TB local disk

• 10 "light" User Service Nodes: 8 cores/node, 64 GB RAM, 6 TB local disk

• 1250 cores total

• 400 TB shared storage (iSCSI, NFS, CIFS, CDMI ...)


High memory compute nodes

• Machine

- 2 TB RAM

- 40 cores

- 8 TB disk storage

• Two machines

- 1 as part of the Life Science Grid

- 1 as part of the HPC cloud

Service available Q2 2014

Added value:

• Particularly in the life sciences, many applications still use few cores but scale with RAM (e.g. de novo assembly of genomes)


Hadoop & noSQL

• Hadoop can be used to process large data files in a distributed way. It comes with an easy-to-use programming interface and can handle huge data files in structured, semi-structured or unstructured form.

• Hadoop infrastructure

- 1400 cores (4x capacity of 2012)

- 2 PB of HDFS storage (2x capacity of 2012)

• Usage 2013:

- 35 different users and user groups

- 300 CPU-years of processing

- 160 TB data stored

• New: noSQL service

- Schemaless databases: not everything fits into rows and columns easily

- Faster development, easy to scale
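
To give an impression of the programming model behind Hadoop, here is a word count written in the MapReduce style. This is a single-process, pure-Python illustration and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this sketch. On a real cluster the map and reduce steps would run in parallel on many nodes, close to the data in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big compute"]))
```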


Remote Visualization: RVS / Elvis cluster

• Remote visualization, benefits for users:

- Remote visualization cluster with GPUs next to data in SURFsara datacenter

- Visualization application runs on remote high-end GPU cluster (possibly in parallel)

- User accesses application through remote desktop (VNC)

- No datasets, only pixels are transferred

• Remote Visualization Service:

- 9 render nodes, per node: 2 × 6-core Intel Xeon E5-2620, 2 × NVIDIA GeForce GTX780Ti, 64 GB RAM, 800 GB scratch

- 33 TB home file system

- Software stack maintained by SURFsara: ParaView, VisIt, VTK, VMD, MIPAV, Blender, any other OpenGL application

- Expert visualization support


Accelerated Computing

• GPGPUs

- Cartesius

- RVS / Elvis

• Intel Xeon Phi

- Dell, Intel and SURFsara: Intel Xeon Phi Center of Competence

- In the future, SURFsara will provide Intel Xeon Phi training and expertise

- Mona (part of LISA) currently has 2 Xeon Phis

- In the near future to be expanded to a modest Xeon Phi cluster


PRACE – Partnership for Advanced Computing in Europe

• Goal: 5 – 6 ‘Top 10’ (world) supercomputers in Europe

• PRACE Preparatory Phase, 2008 – 2009: 250 people

• PRACE 1st Implementation Phase, Jul 2010 – Jun 2012: 400 people, 2400 PM (person months)

• PRACE 2nd Implementation Phase, Sep 2011 – Aug 2013: 3000 PM

• PRACE 3rd Implementation Phase, Jul 2012 – Jun 2014: 600 – 650 people, 1500 PM

• SURFsara: 20 – 10 people, 108 – 60 PM


What to use?

• Capability computing: Cartesius

- Large multi-node jobs

- I/O intensive

• Capacity computing: LISA

- Many small jobs / task farming

- Small parallel jobs; however, it can handle 256-node (8-core) jobs

- Not I/O intensive

• Capacity computing: Grid

- Data parallelism (processing)

• Capacity computing: Hadoop

- Data parallelism (I/O, MapReduce)

• Special purpose: GPGPU, Intel Xeon Phi

- Think of relatively small compute-intensive kernels (because of the relatively large programming effort)

• HPC Cloud

- Special OS / software stack requirements

- Expanding / scaling an existing machine
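
The guidance above can be read as a simple decision list. The sketch below encodes it as such; the `Workload` fields, the function name `suggest_service` and in particular the order in which the criteria are checked are assumptions made for illustration, not an official SURFsara selection procedure.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a job; all field names are illustrative."""
    multi_node: bool = False       # large, tightly coupled multi-node job
    io_intensive: bool = False     # needs a fast I/O subsystem
    many_small_jobs: bool = False  # task farming / small parallel jobs
    data_parallel: bool = False    # data-parallel processing
    mapreduce: bool = False        # data-parallel I/O in MapReduce style
    compute_kernel: bool = False   # small compute-intensive kernel
    custom_stack: bool = False     # special OS / software stack needed


def suggest_service(w: Workload) -> str:
    """Return the service suggested on the slide (check order is an assumption)."""
    if w.custom_stack:
        return "HPC Cloud"
    if w.compute_kernel:
        return "GPGPU / Intel Xeon Phi"
    if w.mapreduce:
        return "Hadoop"
    if w.data_parallel:
        return "Grid"
    if w.multi_node or w.io_intensive:
        return "Cartesius"
    if w.many_small_jobs:
        return "LISA"
    return "Ask the HPC Helpdesk"


print(suggest_service(Workload(many_small_jobs=True)))
print(suggest_service(Workload(multi_node=True, io_intensive=True)))
```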


Getting Access

• Cartesius supercomputer:

- Application via NWO: https://www.surfsara.nl/systems/cartesius/account

- Requires project proposal

- Will be peer reviewed

- Reporting required

• LISA compute cluster:

- Your application depends on your affiliation: https://www.surfsara.nl/systems/lisa/account

- If the application goes via NWO, it is similar to the Cartesius application above; otherwise, it only takes one (or two) e-mails

• National e-Infrastructure (among others Grid, Hadoop, HPC Cloud)

- Application via SURFsara: https://e-infra.surfsara.nl

- Technical evaluation

- Reporting required

• PRACE (Partnership for Advanced Computing in Europe):

- http://www.prace-ri.eu

- DECI (Distributed Extreme Computing Initiative) calls (Tier-1, Cartesius-like)

- Preparatory Access continuous call (Tier-0, cut-offs, three access types)

- Project Access calls (Tier-0, twice a year)


HPC Helpdesk

• Monday to Friday, between 9 am and 5 pm

- [email protected]

- 020 – 800 1400

• Questions

• Problems

- Access

- Hardware

- Software

- File restore (after accidental removal by owner …)

• User assistance

- Software configuration / installation

- Porting (limited: hours/days not weeks)

- Debugging (limited)

- Optimization / tuning (limited or via SURFsara/PRACE)

• Software package installation requests

• Access requests (e-mail only)

• Any special requests


Application Support

• Regular user support

- Via HPC Helpdesk

- Typical effort: from a few minutes to a couple of days

• Application enabling for Dutch Compute Challenge Projects

- Potential effort by SURFsara staff: 1 to 6 person months per project

- Request as part of the computing time request through NWO

• Performance improvement of applications

- Typically meant for promising user applications

- Potential effort by SURFsara staff: 3 to 6 person months per project

- Use the request form on the SURFsara web site: https://www.surfsara.nl/support/application-support-for-cartesius-and-lisa

• Support for PRACE applications

- PRACE offers access to European world-class systems

- SURFsara participates in PRACE support in application enabling

• Visualization projects

• User training and workshops

• Please contact SURFsara at [email protected]


SURF Outreach & Support (SOS)

• SURF’s operating companies offer a world-class communication, computing and support infrastructure to facilitate scientific and scholarly research

- data transfer

- compute

- (international) collaboration

- visualization

- storage

• For these services we aim to provide a single point of contact for researchers and ICT staff

• SOS aims to:

- raise awareness among researchers of the possibilities of the SURF e-infrastructure for scientific research

- foster collaboration and enable research projects


What SURF can do for researchers

Activities for researchers and research support officers in 2014

• SURF Roadshow → visit universities and university medical centers

• Integrated services portfolio → description of services and support at www.surf.nl/support4research

• Central information point → direct advice and support via [email protected]


Thank you for listening!
