Monash eResearch Centre
Swe Aung (Senior Research DevOps Engineer)
Gin Tan (Senior HPC Consultant)
vHPC on Nectar Research Cloud
HPC-AI Advisory Council, Perth, 28 Aug
Current Monash eResearch Landscape: Gov and NCRIS, AI and Compute, Middleware, Cloud, Disruptive Technology, Data, Virtual Laboratories, Self Service, Transparency, Data Management, Governance, Cloud Computing, Research Networks, Data Storage, Instrument Data Workflows, FAIR, Open Science, Accessibility, Infrastructure and Capacity, Modelling, Artificial Intelligence, Simulation, Design and Prototyping, Data Science, IoT and Sensors, Visualisation, Software Defined Everything, Data Tools, Techniques, Infrastructure, Contributors and Partners, Disciplines, Monash eResearch Projects, High Performance Computing, MapReduce, Cloud API, VR/AR, Hardware, Deep Learning, GPUs, Collaboration, ASICs
The Australian Research Cloud
The Nectar Research Cloud has been in operation since 2012. It provides flexible, scalable computing power to all Australian researchers, with computing infrastructure, software and services that allow the research community to store, access and process data, remotely and rapidly. Around 43,000 vCPUs are in use right now.
National eResearch Collaboration Tools and Resources project (Nectar)
MASSIVE M3 is a GPU-centric data processing cluster.
MASSIVE is an ISO quality-accredited, high-performance data processing facility.
Access via:
• merit allocation through the National Computational Merit Allocation Scheme (NCMAS)
• integration with national instruments based at various institutions
• national informatics infrastructure projects, including the Characterisation Virtual Laboratory
• Monash researchers
Overview of MASSIVE at https://www.massive.org.au/
Detailed technical information at https://docs.massive.org.au/
HPC for Characterisation and Imaging
MASSIVE
Infrastructure: 3 computers (M1, M2, M3), ~6000 cores, 298 GPUs, 3 PB fast parallel filesystem
Usage: 400+ active projects, 2,000+ user accounts, 100+ institutions across Australia
Interactive visualisation: 600+ users
HPC for Characterisation and Imaging
MASSIVE
Instrument Integration: integrating with key instrument facilities
– IMBL, XFM
– CryoEM
– MX
– MBI
– NCRIS: NIF, AMMRF
Large cohort of researchers new to HPC
Partners: Monash University, Australian Nuclear Science and Technology Organisation (ANSTO), Commonwealth Scientific and Industrial Research Organisation (CSIRO), University of Wollongong
Affiliate Partners: ARC Centre of Excellence in Integrative Brain Function, ARC Centre of Excellence in Advanced Molecular Imaging
Top HPC communities: classification of M3 projects
MASSIVE M3
The M3 cluster has been in production since 2016 at Monash University: a heterogeneous system to support heterogeneous workloads, operated in the research cloud.
Cores: 5356 across both Haswell and Skylake CPUs
NVIDIA GPU coprocessors for data processing and visualisation:
– 48 NVIDIA Tesla K80
– 40 NVIDIA Pascal P100
– 66 NVIDIA Pascal P4
– 60 NVIDIA Volta V100
– 6 NVIDIA DGX1-V
– 32 NVIDIA GRID K1 GPUs for medium and low end visualisation
Filesystem: a 3 petabyte Lustre parallel file system, usable after upgrade
Interconnect: 100 Gb/s Ethernet on Mellanox Spectrum switches running Cumulus Linux
Monash Research Cloud (R@CMon)
A high-level overview of the Monash research cloud cells in Nectar: monash-01 (commodity cloud), monash-02 (research cloud with specialist compute hardware) and monash-03 (a private cell that provides infrastructure for MASSIVE, MonARCH, SensiLab and hosting for partnerships).
vHPC Infrastructure Architecture Diagram
High-level architectural overview of monash-03:
– bare-metal provisioning using OpenStack Ironic
– redundant Neutron provider networks for the HPC cluster network, the public network, the restricted Monash VLAN, and advanced networking such as LBaaS and virtual routers
– VM instances are NUMA optimised, with host CPU passthrough to match all the CPU flags
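A minimal sketch of how one such provider network could be declared through the OpenStack Ansible modules; the cloud entry, network name, physical network label and VLAN ID below are hypothetical placeholders, not the production values.

```yaml
# Sketch only: declare a VLAN provider network for the HPC cluster fabric.
# The cloud entry, network name, physnet label and VLAN ID are placeholders.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create the HPC cluster provider network
      openstack.cloud.network:
        cloud: monash-03-admin                    # clouds.yaml entry (assumed)
        name: hpc-cluster-net                     # hypothetical network name
        provider_network_type: vlan
        provider_physical_network: physnet-hpc    # hypothetical physnet label
        provider_segmentation_id: 1234            # hypothetical VLAN ID
        shared: false
        state: present
```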
vHPC Compute Environment
Overview of an HPC compute node:
– NUMA-optimised HPC VM
– host CPU passthrough with all the CPU flags
– low-latency SR-IOV passthrough into the instances for the parallel filesystem
– Neutron provider network for the HPC cluster network
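In Nova terms, the NUMA optimisation is typically carried by flavor extra specs. A minimal sketch using the openstack.cloud Ansible collection; the flavor name and sizes are hypothetical, not the production M3 values.

```yaml
# Sketch only: a CPU-pinned, single-NUMA-node flavor for HPC instances.
# Flavor name and sizes are illustrative, not the production M3 values.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a NUMA-pinned HPC flavor
      openstack.cloud.compute_flavor:
        cloud: monash-03              # clouds.yaml entry (assumed)
        name: hpc.numa.24c            # hypothetical flavor name
        vcpus: 24
        ram: 98304                    # MB
        disk: 50
        extra_specs:
          "hw:cpu_policy": dedicated  # pin vCPUs to host cores
          "hw:numa_nodes": "1"        # keep the guest within one NUMA node
        state: present
```

Host CPU passthrough itself is typically a hypervisor-side setting (cpu_mode = host-passthrough in the [libvirt] section of nova.conf), and the SR-IOV virtual functions for the parallel filesystem are attached as Neutron ports with vnic_type=direct.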
Ansible (infrastructure as code)
● provisioning tool
● configuration management
● just run the playbooks
● portable!
● tags!
● OpenStack modules / Ansible Galaxy (see the provisioning sketch below)
● CMDB in the repos
Running Cumulus, Ubuntu, OpenStack and CentOS
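As a flavour of "just run the playbooks": a minimal sketch of launching one compute instance through the OpenStack Ansible modules. The cloud entry, image, flavor, key pair and network names are hypothetical placeholders.

```yaml
# Sketch only: provision one compute instance via the OpenStack Ansible modules.
# Cloud entry, image, flavor, key pair and network names are illustrative.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Launch an M3-style compute node instance
      openstack.cloud.server:
        cloud: monash-03              # clouds.yaml entry (assumed)
        name: m3-compute-001          # hypothetical instance name
        image: centos-7-hpc           # hypothetical image
        flavor: hpc.numa.24c          # e.g. the NUMA-pinned flavor sketched above
        key_name: hpc-admin           # hypothetical key pair
        network: hpc-cluster-net      # e.g. the provider network sketched above
        state: present
        wait: true
```

Configuration management then runs over the same inventory, which is where the tags and the CMDB-in-the-repos approach come in.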
Automation - both hypervisors and instances
- Now that we have the code to provision a cluster in the cloud, why do we still need to worry about hypervisor configuration?
- In this case we still do, because we use SR-IOV to present the card to the instances.
- The dependency of the driver on the firmware is essentially a time-consuming task for the sysadmin.
- Ansible AWX:
  - the central on-premise task management platform at Monash, providing internal teams with a secure, centralised management interface for patching and provisioning servers
  - scheduled workflows run Ansible playbooks to drain a compute node, upgrade the hypervisor, upgrade the instances and return the compute node to the queue (a sketch follows this list)
  - provisioning from the moment the hardware arrives in the data centre
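The drain/upgrade/return cycle can be expressed as a playbook. A minimal sketch, assuming a Slurm scheduler on the cluster and an inventory group of hypervisors; the group, variables, package name and upgrade steps are illustrative placeholders, not the production workflow.

```yaml
# Sketch only: drain -> upgrade -> resume cycle, assuming a Slurm scheduler.
# Inventory group, variables, package and steps are placeholders.
- hosts: hypervisors                  # hypothetical inventory group
  serial: 1                           # roll through one hypervisor at a time
  tasks:
    - name: Drain the compute node in Slurm
      command: >
        scontrol update nodename={{ compute_node }}
        state=drain reason="hypervisor maintenance"
      delegate_to: "{{ slurm_controller }}"

    - name: Upgrade the NIC driver/firmware package on the hypervisor
      apt:
        name: mlnx-ofed-kernel-dkms   # illustrative Mellanox OFED package
        state: latest

    - name: Reboot the hypervisor and wait for it to come back
      reboot:

    - name: Return the compute node to the queue
      command: >
        scontrol update nodename={{ compute_node }} state=resume
      delegate_to: "{{ slurm_controller }}"
```

Running this as a scheduled AWX workflow, one hypervisor at a time, keeps the queue available while the fleet is patched.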
Machine Learning - AI in Monash
● strong NVIDIA partnership
● apart from life science research, one of the biggest communities we have is machine learning
● supporting different machine learning techniques
● to help maximise GPU usage, checkpointing is used for ML jobs
● DGX machines: six in the queue, with five more to be added later this year
● AI platform to support the community
● running an AI study unit, giving 150 undergraduate students access to NVIDIA P100 GPUs in the campus cluster
● first MASSIVE GPU User Community event held on 17 July
M3 CPU & GPU Usage per Quarter - Machine Learning
Challenges and improvements
● there are cases where bare metal is still preferable
  ○ file servers
  ○ specialised machines, e.g. DGX
● virtualisation and OpenStack
  ○ bugs that are not reported yet or have not received enough attention to be fixed (e.g. Ubuntu KVM support for VMs with > 1 TB memory)
● I/O bottleneck: adding connectivity to other cloud instances, services, desktops and long-term archive
● Ansible journey:
  ○ adoption: we are using Ansible on all the infrastructure that we maintain
  ○ education: training and learning for teams, even outside the Monash eResearch Centre
  ○ knowledge organisation: continue to improve
Many thanks to all my colleagues in:
Nectar cloud: https://nectar.org.au/research-cloud
MASSIVE: https://www.massive.org.au
CVL: https://www.cvl.org.au
Monash eResearch: https://www.monash.edu/researchinfrastructure/eresearch/about
email: [email protected]