Environment and Climate Change Canada HPC Renewal Project: Procurement Results

17th Workshop on HPC in Meteorology

ECMWF, Reading, UK

Alain St-Denis & Luc Corbeil

October 2016

Outline

• Background

• History

• Scope

• RFP

• Outcome

HPC Renewal for ECCC: Background

• Environment Canada is highly dependent on HPC to deliver its mandate: simulation of environmental forecasts for the health, safety, security and economic well-being of Canadians.

• Contract with IBM expiring with few remaining options to extend

• Linked to Meteorological Service of Canada (MSC) Renewal Treasury Board Submission

Component 1: Monitoring Networks

Component 2: Supercomputing capacity

Component 3: Weather Warnings and Forecast System

• Joint ECCC-SSC submission for Supercomputing Capacity

New player: Shared Services Canada

• Created in 2012 to take responsibility for email, networks and data centres for the whole Government of Canada.

• Supercomputing IT people working for ECCC transferred to SSC.

• Scope of the HPC team expanded to all science departments

• As in any reorganization, there are challenges and opportunities!

Shared Services Canada – Our Mandate

Shared Services Canada was formed to consolidate and streamline the delivery of IT infrastructure services, specifically email, data centre and network services. Our mandate is to do this so that federal organizations and their stakeholders have access to reliable, efficient and secure IT infrastructure services at the best possible value.

SSC will Innovate, ensure full Value for Money and achieve Service Excellence!

[Diagram: SSC Services support Departmental Programs, which in turn support Service to Canadians.]

A Bit of History

• ECCC has been using a supercomputer for weather forecasting and atmospheric science for more than half a century

[Chart: peak and sustained performance by year, in millions of floating-point operations per second (log scale), across ECCC's supercomputers: Bendix G20, IBM 360/65, CDC 7600, CDC 176, Cray 1, Cray X-MP/28, Cray X-MP/4-16, NEC SX-3/44, NEC SX-3/44R, NEC SX-4/16, NEC SX-4/80M3, NEC SX-5/32M2, NEC SX-6/80M10, IBM Power4, Power5 and Power7.]

A Bit of (More Recent) History

• Request for Information (Fall 2012)

• Invitation to Qualify (Fall 2013, 4 bidders qualified)

• Review and Refine Requirements (Summer 2014)

• Requests for Proposal (November 2014 – June 2015)

• Treasury Board Approval (April 2016)

• Contract Award (May 27, 2016)

Scope

Scope                                   In replacement of
Supercomputer clusters                  Two 8,192-core P7 clusters
Pre/Post-Processing clusters (PPP)      Two 640-core x86 custom clusters
Global Parallel Storage (Site-Store)    CNFS and ESS clusters
Near-Line Storage (HP-NLS)              StorNext-based archiving cluster
Home directories                        NetApp home directories

As well as

• Hosting of the Solution

• High Performance Interconnects

• Software & tools

• Maintenance & Support

• Training & Conversion support

• On-going High Availability

ECCC Supercomputing Procurement Requirements

• Contract for Hosted HPC Solution: 8.5 years + one 2.5-year option (transition year + two upgrades + one optional)

• Connectivity between HPC Solution Data Halls and Dorval

• No more than 70 km between Hall A, Hall B and Dorval (see the rough latency estimate below)

• Flexible Options for additional needs

[Diagram: Solution Data Hall A and Solution Data Hall B, each linked to the NCF in Dorval and to each other by redundant inter-hall links (x2), supporting on-going availability.]
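
The 70 km cap presumably exists to bound fibre latency between the solution data halls and Dorval; the deck does not state a latency budget, so the figures below are only a rough estimate assuming a fibre route close to 70 km and light propagating at roughly two-thirds of c:

    # Rough propagation-delay estimate for the 70 km inter-site limit.
    # Assumptions (not from the slides): fibre route length ~= the 70 km cap,
    # and signal speed in fibre ~= 2/3 of the speed of light in vacuum.
    C_VACUUM_KM_PER_MS = 299_792.458 / 1000           # km per millisecond
    C_FIBRE_KM_PER_MS = C_VACUUM_KM_PER_MS * 2.0 / 3.0

    distance_km = 70.0
    one_way_ms = distance_km / C_FIBRE_KM_PER_MS       # ~0.35 ms
    round_trip_ms = 2 * one_way_ms                      # ~0.70 ms

    print(f"one-way ~ {one_way_ms:.2f} ms, round trip ~ {round_trip_ms:.2f} ms")

At that distance the propagation delay stays well under a millisecond, which is consistent with keeping synchronization traffic between halls practical; the actual contractual rationale is not given in the deck.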

High Level Architecture

[Diagram: SCF Data Flow – Logical View (2014-10-07, LPT, HPN/DADS, SSC). Each Solution Data Hall hosts a Supercomputer with scratch storage, a Pre/Post-Processing cluster, a Site Store, home directories, an HP-NLS archive with disk cache, and out-of-band management; DATA feeds arrive from the NCF, with HPN data transfer and storage synchronization between the two halls.]
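
As a minimal, purely illustrative sketch (not a specification), the logical view above can be written down as a small data structure; the component names follow the diagram, while the grouping and flow labels are this summary's own:

    # Illustrative Python model of the SCF logical data-flow view.
    # Component names come from the diagram; the structure is a sketch only.
    HALL_COMPONENTS = [
        "Supercomputer", "Scratch",            # compute plus its local scratch
        "Pre/Post Processing (PPP)",
        "Site Store", "Home",                  # shared parallel and home storage
        "HP-NLS", "Cache",                     # near-line archive and its disk cache
        "Out-of-Band Management",
    ]

    SOLUTION = {
        "Solution Data Hall A": list(HALL_COMPONENTS),
        "Solution Data Hall B": list(HALL_COMPONENTS),
    }

    INTER_HALL_FLOWS = ["HPN Data Transfer", "Storage Synchronization"]
    EXTERNAL_FLOWS = ["DATA Feeds from the NCF into each hall"]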

Outcome

• IBM was awarded the contract

Evaluation based on benchmark performance on a fixed budget

• IBM's Proposal for initial system

Supercomputer: Cray XC-40, Intel Broadwell, Sonexion Lustre Storage

PPP: Cray CS-400, Intel Broadwell

Site-Store and Homes: IBM Elastic Storage Server (ESS, GPFS-based)

HP-NLS: based on IBM High Performance Storage System (HPSS)

Sizing

• Computing

About 35,000 Intel Broadwell cores per data hall (supercomputer and PPP combined)

• More than 40 PB of disk storage (a quick check follows this list)

2.5 PB scratch storage per supercomputer (one per data hall)

18 PB site store per data hall

1.1 PB disk cache to the archive per data hall

• More than 230 petabytes of tape storage (two copies)
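
The per-hall disk figures above are consistent with the "more than 40 PB" total; a quick sanity check, assuming the three listed tiers are the only disk pools being counted:

    # Sanity check of the disk sizing, assuming scratch, site store and the
    # archive disk cache are the only disk tiers included in the >40 PB figure.
    per_hall_pb = {
        "scratch": 2.5,          # per supercomputer, one per data hall
        "site_store": 18.0,
        "archive_cache": 1.1,
    }
    halls = 2

    total_disk_pb = halls * sum(per_hall_pb.values())
    print(f"total disk ~ {total_disk_pb:.1f} PB")    # 43.2 PB, i.e. more than 40 PB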

Comparison

[Bar chart: increase factors (scale 0–6) of the new solution over the current systems, for: core count, peak TFlops and sustained TFlops of supercomputer and PPP (vs P7 and current PPP), scratch storage in petabytes (vs P7), Site-Store and home storage in petabytes (vs current), and HP-NLS storage in petabytes (vs current tape capacity).]

The Newest Addition to a Long History

[Chart: Historical Performance, EC Supercomputers (Flops), peak and sustained on a log scale: Bendix G20, IBM 360/65, CDC 7600, CDC 176, Cray 1S, Cray X-MP/28, Cray X-MP/416, NEC SX-3/44, NEC SX-3/44R, NEC SX-4/16, NEC SX-4/80M3, NEC SX-5/32M2, NEC SX-6/80M10, IBM P4, IBM P5, IBM P7, IBM/XC-40.]

Resulting Architecture

HPC Implementation Milestones: Delivery to Acceptance

• Data Hall and Hosting Site Certification

• Functionality Testing (IT infra)

• Security Accreditation

• Performance testing

• Conversion of Operational codes (Automated Environmental Analysis & Production, AEAPPS)

• Meeting the above triggers a 30-day availability test (a scoring sketch follows the milestone flow below)

Inspection → Functionality Testing → Performance Testing → Conversion → RFU → Acceptance
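
The deck describes the availability test only as a 30-day period triggered once the earlier milestones are met; the sketch below shows one hypothetical way such a test could be scored, as the fraction of the window not lost to outages (the 99.5% threshold is illustrative, not a figure from the slides):

    from datetime import datetime, timedelta

    # Hypothetical scoring of a 30-day availability test. The pass threshold
    # below is an assumption for illustration; the contractual value is not
    # stated in the presentation.
    WINDOW = timedelta(days=30)
    TARGET_PERCENT = 99.5

    def availability_percent(outages):
        """Percentage of the 30-day window not covered by outage intervals."""
        downtime = sum(((end - start) for start, end in outages), timedelta())
        return 100.0 * (1.0 - downtime / WINDOW)

    outages = [
        (datetime(2016, 11, 3, 2, 0), datetime(2016, 11, 3, 5, 30)),  # 3.5 h example
    ]
    pct = availability_percent(outages)
    print(f"availability {pct:.3f}% -> {'pass' if pct >= TARGET_PERCENT else 'fail'}")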

Challenge

• Change the Supercomputer clusters, PPP clusters, archiving system and homes. All at once. Never been done

A lot of preparation work has been done ahead of time

♦ Most codes have already been ported to the Intel architecture

♦ Our General Purpose Science Clusters are available for PPP migration work

– Linux containers are being leveraged to smooth the transition (see the sketch below)
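
Containers are mentioned only at this level of detail in the deck; as a hedged illustration of the idea, a container image carrying the existing PPP software stack could be run on the new x86 clusters so that codes keep working while the native environment is rebuilt. The image name, mount points and use of docker below are assumptions, not ECCC/SSC specifics:

    import subprocess

    # Hypothetical example: run a PPP job inside a container that packages the
    # legacy software environment. Image name, mounts and the "docker" runtime
    # are illustrative assumptions only.
    IMAGE = "eccc-ppp-legacy-env:latest"

    def run_in_legacy_env(command):
        """Run a command inside the legacy-environment container."""
        return subprocess.call([
            "docker", "run", "--rm",
            "-v", "/home:/home",                  # hypothetical shared mounts
            "-v", "/site_store:/site_store",
            IMAGE, *command,
        ])

    if __name__ == "__main__":
        run_in_legacy_env(["./run_postprocessing.sh", "--cycle", "00Z"])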

Thank you!

Questions?
