SURFsara Data and Computing Infrastructure Event – Large-scale Computing
An Overview of the SURFsara Compute Services
Walter Lioen <[email protected]>
Group Leader Supercomputing
Compute Services – Portfolio
SURFsara Data and Computing Infrastructure Event – Large-scale Computing March 13, 2014 2
Capacity Computing vs. Capability Computing
• Capacity Computing
- E.g. parameter sweep: 1000 single core runs of 24 hours each
- PhD’s workstation / laptop: 3 years
- 1000 cores on grid / cluster: 1 day / weekend
- Grid computing
- Cluster computing
- Cloud computing
- Hadoop
• Capability computing
- Single problem requiring one or more of:
- Many cores (CPU time)
- Much memory (problem size)
- Fast interconnect (communication intensive)
- Fast I/O subsystem (large data sets)
- Supercomputing (general purpose)
- Accelerators (GPGPUs, Intel Xeon Phi)
- (Cluster computing)
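The parameter-sweep figures above can be checked with a little arithmetic. The sketch below assumes a perfectly parallel workload with no queue wait or scheduling overhead; the function name `sweep_wall_time_hours` is illustrative and not part of any SURFsara tool.

```python
def sweep_wall_time_hours(runs, hours_per_run, cores):
    """Wall-clock hours for independent single-core runs spread over `cores` cores.

    Assumes ideal task farming: no queue wait, no I/O contention.
    """
    # Each core works through ceil(runs / cores) runs back to back.
    batches = -(-runs // cores)  # ceiling division
    return batches * hours_per_run


workstation = sweep_wall_time_hours(1000, 24, cores=1)
cluster = sweep_wall_time_hours(1000, 24, cores=1000)

print(f"1 core:     {workstation / (24 * 365):.1f} years")
print(f"1000 cores: {cluster / 24:.0f} day(s)")
```

On a single core the sweep takes 24,000 hours, which the slide rounds to 3 years; on 1000 cores it fits in one day.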
Supercomputing: Cartesius
Phase 1 (production June 2013, total peak performance 271 TFlop/s)
• Direct Liquid Cooled thin node islands
- 360 thin nodes, 2 12-core 2.4 GHz Intel Ivy Bridge CPUs/node, 64 GB/node
- 180 thin nodes, 2 12-core 2.4 GHz Intel Ivy Bridge CPUs/node, 64 GB/node
• Fat node island
- 32 fat nodes, 4 8-core Intel Sandy Bridge CPUs/node, 256 GB/node
• Total
- 13,968 cores, 41.75 TB memory, 2.4 PB disk
- Interconnect: InfiniBand 56 Gbit/s bandwidth, 3 µs latency
- Top 500 November 2013: #184
Phase 1.5 (scheduled production 2014 Q2, total peak performance ~ 470 TFlop/s)
• Addition of accelerator island
- 66 nodes, 2 Intel Ivy Bridge CPUs/node, 2 NVIDIA Tesla K40 GPGPUs/node
Phase 2 (scheduled production 2014 H2, total peak performance > 1 PFlop/s)
• On-demand addition of thin node islands with latest Intel Haswell CPUs
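As a rough cross-check of the Phase 1 totals, the memory figure and the thin-node share of the peak performance can be recomputed from the per-node numbers. This is a sketch under one stated assumption that is not on the slide: 8 double-precision floating-point operations per cycle per core (AVX on Ivy Bridge). The fat-node clock speed is not given, so their contribution to the peak is left out.

```python
# Phase 1 node counts and per-node figures as listed on the slide.
thin_nodes = 360 + 180
thin_cores_per_node = 2 * 12
thin_mem_gb = 64
thin_clock_ghz = 2.4
fat_nodes = 32
fat_mem_gb = 256

# Total memory, converted with 1 TB = 1024 GB.
total_mem_gb = thin_nodes * thin_mem_gb + fat_nodes * fat_mem_gb
total_mem_tb = total_mem_gb / 1024

# ASSUMPTION (not on the slide): 8 double-precision flops per cycle per core.
FLOPS_PER_CYCLE = 8
thin_peak_tflops = (
    thin_nodes * thin_cores_per_node * thin_clock_ghz * FLOPS_PER_CYCLE / 1000
)

print(f"Total memory:        {total_mem_tb:.2f} TB")
print(f"Thin-node peak rate: {thin_peak_tflops:.1f} TFlop/s")
```

The memory total reproduces the 41.75 TB on the slide. Under the stated assumption the thin nodes account for roughly 249 of the 271 TFlop/s, with the remainder coming from the fat nodes.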
SURFsara National Supercomputing History
Year  Machine                         Rpeak (GFlop/s)    kW  GFlop/s / kW
1984  CDC Cyber 205 1-pipe                        0.1   250        0.0004
1988  CDC Cyber 205 2-pipe                        0.2   250        0.0008
1991  Cray Y-MP/4128                             1.33   200        0.0067
1994  Cray C98/4256                                 4   300        0.0133
1997  Cray C916/121024                             12   500         0.024
2000  SGI Origin 3800                           1,024   300           3.4
2004  SGI Origin 3800 + Altix 3700              3,200   500           6.4
2007  IBM p575 Power5+                         14,592   375            40
2008  IBM p575 Power6                          62,566   540           116
2009  IBM p575 Power6                          64,973   560           116
2013  Bull bullx B710 (DLC) + R428            270,950   245          1106
2014  + Bull bullx B515 (NVIDIA K40)         >200,000   <60         >3333
2014  Bull bullx complete system           >1,000,000  >520         >1923
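The last column of the history table is simply peak performance divided by power draw. The sketch below recomputes it from the other two columns; the two 2014 rows that are given only as bounds (">" and "<") are left out, and the helper name `gflops_per_kw` is illustrative.

```python
# (year, machine, Rpeak in GFlop/s, power in kW), copied from the table.
# Rows stated only as bounds (">", "<") are omitted.
systems = [
    (1984, "CDC Cyber 205 1-pipe", 0.1, 250),
    (1988, "CDC Cyber 205 2-pipe", 0.2, 250),
    (1991, "Cray Y-MP/4128", 1.33, 200),
    (1994, "Cray C98/4256", 4, 300),
    (1997, "Cray C916/121024", 12, 500),
    (2000, "SGI Origin 3800", 1_024, 300),
    (2004, "SGI Origin 3800 + Altix 3700", 3_200, 500),
    (2007, "IBM p575 Power5+", 14_592, 375),
    (2008, "IBM p575 Power6", 62_566, 540),
    (2009, "IBM p575 Power6", 64_973, 560),
    (2013, "Bull bullx B710 (DLC) + R428", 270_950, 245),
]


def gflops_per_kw(rpeak_gflops, power_kw):
    """Energy efficiency: peak GFlop/s delivered per kW of power."""
    return rpeak_gflops / power_kw


for year, machine, rpeak, kw in systems:
    print(f"{year}  {machine:<30} {gflops_per_kw(rpeak, kw):>10.4f}")

# Efficiency gain between the first and the last fully specified system.
first, last = systems[0], systems[-1]
improvement = gflops_per_kw(last[2], last[3]) / gflops_per_kw(first[2], first[3])
print(f"1984 -> 2013 efficiency gain: about {improvement:,.0f}x")
```

The computed values agree with the table to within rounding (the 2007 row computes to about 39 where the slide shows 40).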
Moore’s Law (1965)
• The number of transistors on an integrated circuit doubles every 2 years
• Because of faster transistors, the speed doubles every 18 months
• The clock speed stopped doubling a couple of years ago
• Nowadays the number of cores doubles
• Moore noted that if car manufacturers had something like this,
cars would get 100,000 miles to the gallon and it would be
cheaper to buy a Rolls Royce than park it.
(Cars would also be only half an inch long.)
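The doubling rules above translate into a growth factor of 2^(t/T) after t years with doubling period T. As a hedged illustration, the sketch below also inverts the formula to estimate the doubling period implied by the first and last Rpeak entries of the supercomputing history (0.1 GFlop/s in 1984, more than 1,000,000 GFlop/s in 2014). Because the 2014 figure is a lower bound, the result is an upper bound on the doubling period. Function names are illustrative.

```python
import math


def growth_factor(years, doubling_period_years):
    """Factor by which a quantity grows if it doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


def doubling_period(years, factor):
    """Doubling period (in years) implied by growing `factor`-fold over `years`."""
    return years / math.log2(factor)


# Ten years of growth under the two rules quoted on the slide.
transistors_10y = growth_factor(10, 2.0)   # doubling every 2 years
speed_10y = growth_factor(10, 1.5)         # doubling every 18 months

# Peak performance of the national supercomputer, 1984 vs. 2014.
observed = doubling_period(2014 - 1984, 1_000_000 / 0.1)

print(f"Transistor count after 10 years: x{transistors_10y:.0f}")
print(f"Speed after 10 years:            x{speed_10y:.0f}")
print(f"Implied Rpeak doubling period:   {observed * 12:.1f} months")
```

The implied doubling period for the listed machines comes out at a little under 16 months, slightly faster than the 18-month rule.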
Cluster Computing: LISA
• At present
- 624 heterogeneous Dell nodes, 16 – 24 GB
- 3 types: dual quad-, hexa- or octo-core Intel CPUs, 1.80 – 2.26 GHz
- QLogic 4× DDR InfiniBand → 20 Gbit/s inter-node
- Total: 6,528 cores, 46 Tflop/s, 16 TB
- 100 TB disk space
- OS: Debian Linux AMD64
• SURFsara installed the LISA system on behalf of
- The University of Amsterdam (UvA)
- The VU University of Amsterdam (VU)
- The SURF Foundation
Grid infrastructure and usage
• National Grid infrastructure (NGI_NL):
- Operated in collaboration with Nikhef and RUG-CIT
- Grid compute clusters at Nikhef, RUG-CIT and SURFsara
- Grid Services Cluster at SURFsara
- Life Science Grid (operated by SURFsara)
- Grid Front-End storage at Nikhef and SURFsara (total capacity 5.2 PB)
- Part of European Grid Infrastructure (EGI)
• Grid Usage:
- In 2013 more than 53 million core hours (6050 core years)
- Scientific projects/users:
- Number of e-infra projects: 34
- Number of users: 226
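The conversion from core hours to core years in the usage figures is a division by the number of hours in a year. A minimal sketch, assuming a 365-day year; the per-user average is only an illustration, since actual usage is not evenly spread over users.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def core_hours_to_core_years(core_hours):
    """Convert core hours to core years (one core busy for a full year)."""
    return core_hours / HOURS_PER_YEAR


grid_core_hours_2013 = 53_000_000
core_years = core_hours_to_core_years(grid_core_hours_2013)

# Illustrative average only: usage is not evenly distributed over the 226 users.
avg_core_hours_per_user = grid_core_hours_2013 / 226

print(f"{core_years:.0f} core years")
print(f"{avg_core_hours_per_user:,.0f} core hours per user on average")
```

A core year count of about 6050 also means that, averaged over 2013, roughly 6050 cores were busy around the clock.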
Life Science Grid sites
• LUMC – Leiden
• WUR – Wageningen
• NKI – Amsterdam
• AMC – Amsterdam
• Radboud Universiteit – Nijmegen
• ErasmusMC – Rotterdam
• Technische Universiteit Delft
• Rijksuniversiteit Groningen
• Keygene – Wageningen
• Each of these clusters has 128 cores and 40 TB of storage space
• SURFsara: 3000 cores and large tiered storage (disk and tape)
• Nikhef: 3300 cores, RUG-CIT: 1000 cores
HPC Cloud
• You can clone your laptop or server
Create an image and upload it into the cloud
• And expand it …
Largest VM goes up to 32 cores and 256 GB RAM
No overcommitting → HPC Cloud
• Or clone it to multiple copies, create your own cluster
Start multiple worker nodes when needed
• But remember: you are your own sys admin …
HPC Cloud
Current HPC Cloud infrastructure
• 30 HPC Nodes: 32 cores, 256 GB RAM, 1 TB local disk
• 10 "light" User Service Nodes: 8 cores/node, 64 GB RAM, 6 TB local disk
• 1250 cores total
• 400 TB shared storage (iSCSI, NFS, CIFS, CDMI ...)
High memory compute nodes
• Machine
- 2 TB RAM
- 40 cores
- 8 TB disk storage
• Two machines
- 1 as part of the Life Science Grid
- 1 as part of the HPC cloud
Service available Q2 2014
Added value:
• Particularly in the life sciences, there are still many applications that use few cores
but scale in RAM (e.g. de novo assembly of genomes)
Hadoop & noSQL
• Hadoop can be used to process large data files in a distributed way. It comes with an easy-to-use programming interface and can handle huge data files in structured, semi-structured or unstructured form.
• Hadoop infrastructure
- 1400 cores (4× capacity of 2012)
- 2 PB of HDFS storage (2× capacity of 2012)
• Usage 2013:
- 35 different users and user groups
- 300 CPU-years of processing
- 160 TB data stored
• New: noSQL service
- Schemaless databases: not everything fits into rows and columns easily
- Faster development, easy to scale
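To give an idea of the programming model behind Hadoop, the sketch below runs a word count through the three MapReduce stages (map, shuffle, reduce) in plain Python on a single machine. It does not use the Hadoop API; all function names are illustrative, and on a real cluster the framework distributes the map and reduce calls over many nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["grid cloud grid", "cloud hadoop grid"])
print(counts)
```

Because each mapper sees only its own lines and each reducer only its own key, both stages can run in parallel on separate parts of a large data set.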
Remote Visualization: RVS / Elvis cluster
• Remote visualization, benefits for users:
- Remote visualization cluster with GPUs next to data in SURFsara datacenter
- Visualization application runs on remote high-end GPU cluster (possibly in
parallel)
- User accesses application through remote desktop (VNC)
- No datasets, only pixels are transferred
• Remote Visualization Service:
- 9 render nodes, per node: 2 x 6-core Intel Xeon E5-2620, 2 x NVIDIA GeForce
GTX780Ti, 64 GB RAM, 800 GB scratch
- 33 TB home file system
- Software stack maintained by SURFsara: ParaView, VisIt, VTK, VMD, MIPAV,
Blender, any other OpenGL application
- Expert visualization support
Accelerated Computing
• GPGPUs
- Cartesius
- RVS / Elvis
• Intel Xeon Phi
- Dell, Intel and SURFsara: Intel Xeon Phi Center of Competence
- In the future, SURFsara will provide Intel Xeon Phi training and expertise
- Mona (part of LISA) currently has 2 Xeon Phis
- In the near future to be expanded to a modest Xeon Phi cluster
PRACE – Partnership for Advanced Computing in Europe
• Goal: 5 – 6 ‘Top 10’ (world) supercomputers in Europe
• PRACE Preparatory Phase, 2008 – 2009: 250 people
• PRACE 1st Implementation Phase, Jul 2010 – Jun 2012: 400 people, 2400 PM
• PRACE 2nd Implementation Phase, Sep 2011 – Aug 2013: 3000 PM
• PRACE 3rd Implementation Phase, Jul 2012 – Jun 2014: 600 – 650 people, 1500 PM
• SURFsara: 20 – 10 people, 108 – 60 PM
What to use?
• Capability computing: Cartesius
- Large multi-node jobs
- I/O intensive
• Capacity computing: LISA
- Many small jobs / task farming
- Small parallel jobs, although it can handle 256-node (8-core) jobs
- Not I/O intensive
• Capacity computing: Grid
- Data parallelism (processing)
• Capacity computing: Hadoop
- Data parallelism (I/O, MapReduce)
• Special purpose: GPGPU, Intel Xeon Phi
- Think of relatively small compute-intensive kernels (because of the relatively large programming effort)
• HPC Cloud
- Special OS / software stack requirements
- Expanding / scaling an existing machine
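The guidance above can be read as a lookup from workload characteristics to a service. The sketch below encodes it that way; the trait strings and the function `suggest_services` are hypothetical and only mirror the bullets on this slide. It is not an official SURFsara selection tool, and real choices usually involve a conversation with the helpdesk.

```python
# Workload trait -> service, taken from the bullets on the slide.
GUIDE = {
    "large multi-node jobs": "Cartesius",
    "I/O intensive": "Cartesius",
    "many small jobs / task farming": "LISA",
    "small parallel jobs": "LISA",
    "data parallelism (processing)": "Grid",
    "data parallelism (I/O, MapReduce)": "Hadoop",
    "small compute-intensive kernels": "GPGPU / Intel Xeon Phi",
    "special OS / software stack": "HPC Cloud",
    "expanding / scaling an existing machine": "HPC Cloud",
}


def suggest_services(traits):
    """Return the services matching the given traits, without duplicates,
    in the order in which they are first matched."""
    suggestions = []
    for trait in traits:
        service = GUIDE[trait]  # raises KeyError for traits not on the slide
        if service not in suggestions:
            suggestions.append(service)
    return suggestions


print(suggest_services(["I/O intensive", "large multi-node jobs"]))
print(suggest_services(["many small jobs / task farming",
                        "special OS / software stack"]))
```

A workload with traits that map to more than one service gets several suggestions, which reflects that the services overlap rather than exclude each other.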
Getting Access
• Cartesius supercomputer:
- application via NWO: https://www.surfsara.nl/systems/cartesius/account
- requires project proposal
- will be peer reviewed
- reporting required
• LISA compute cluster:
- your application depends on your affiliation: https://www.surfsara.nl/systems/lisa/account
- if the application goes via NWO, it is similar to the Cartesius application above; otherwise, it only takes one (or two) e-mails
• National e-Infrastructure (among others Grid, Hadoop, HPC Cloud):
- application via SURFsara: https://e-infra.surfsara.nl
- technical evaluation
- reporting required
• PRACE (Partnership for Advanced Computing in Europe):
- http://www.prace-ri.eu
- DECI (Distributed European Computing Initiative) calls (Tier-1, Cartesius-like)
- Preparatory Access continuous call (Tier-0, cut-offs, three access types)
- Project Access calls (Tier-0, twice a year)
HPC Helpdesk
• Monday to Friday, between 9 am and 5 pm
- [email protected]
- 020 – 800 1400
• Questions
• Problems
- Access
- Hardware
- Software
- File restore (after accidental removal by owner …)
• User assistance
- Software configuration / installation
- Porting (limited: hours/days, not weeks)
- Debugging (limited)
- Optimization / tuning (limited or via SURFsara/PRACE)
• Software package installation requests
• Access requests (e-mail only)
• Any special requests
Application Support
• Regular user support
- Via HPC Helpdesk
- Typical effort: from a few minutes to a couple of days
• Application enabling for Dutch Compute Challenge Projects
- Potential effort by SURFsara staff: 1 to 6 person months per project
- Request as part of the computing time request through NWO
• Performance improvement of applications
- Typically meant for promising user applications
- Potential effort by SURFsara staff: 3 to 6 person months per project
- Use the request form on the SURFsara web site: https://www.surfsara.nl/support/application-support-for-cartesius-and-lisa
• Support for PRACE applications
- PRACE offers access to European world-class systems
- SURFsara participates in PRACE support in application enabling
• Visualization projects
• User training and workshops
• Please contact SURFsara at [email protected]
SURF Outreach & Support (SOS)
• SURF’s operating companies offer a world-class communication, computing and
support infrastructure to facilitate scientific and scholarly research
- data transfer
- compute
- (international) collaboration
- visualization
- storage
• For these services we aim to provide a single point of contact
for researchers and ICT staff
• SOS offers:
- to raise awareness among researchers of the possibilities of SURF
e-infrastructure for scientific research
- to foster collaboration and to enable research projects
What SURF can do for researchers
Activities for researchers and research support officers in 2014
• SURF Roadshow → visit universities and university medical centers
• Integrated services portfolio → description of services and support at www.surf.nl/support4research
• Central information point → direct advice and support via