Monash eResearch Centre
Swe Aung (Senior Research DevOps Engineer)
Gin Tan (Senior HPC Consultant)
vHPC on Nectar Research Cloud
HPC-AI Advisory Council, Perth, 28 Aug
Current Monash eResearch Landscape: Gov and NCRIS, AI and Compute, Middleware, Cloud, Disruptive Technology, Data, Virtual Laboratories, Self Service, Transparency, Data Management, Governance, Cloud Computing, Research Networks, Data Storage, Instrument Data Workflows, FAIR, Open Science, Accessibility, Infrastructure and Capacity, Modelling, Artificial Intelligence, Simulation, Design and Prototyping, Data Science, IoT and Sensors, Visualisation, Software Defined Everything, Data Tools, Techniques, Infrastructure, Contributors and Partners, Disciplines, Monash eResearch Projects, High Performance Computing, MapReduce, Cloud API, VR/AR, Hardware, Deep Learning, GPUs, Collaboration, ASICs
The Australian Research Cloud
The Nectar Research Cloud has been in operation since 2012. It provides flexible, scalable computing power to all Australian researchers, with computing infrastructure, software and services that allow the research community to store, access and process data, remotely and rapidly. Around 43,000 vCPUs are in use right now.
National eResearch Collaboration Tools and Resources project (Nectar)
MASSIVE M3 is a GPU-centric data processing cluster.
MASSIVE is an ISO quality-accredited, high-performance data processing facility.
Access via:
• merit allocation through the National Computational Merit Allocation Scheme (NCMAS)
• integration with national instruments based at various institutions
• national informatics infrastructure projects, including the Characterisation Virtual Laboratory
• Monash researchers
Overview of MASSIVE at https://www.massive.org.au/
Detailed technical information at https://docs.massive.org.au/
HPC for Characterisation and Imaging
MASSIVE
Infrastructure: 3 computers (M1, M2, M3), ~6000 cores, 298 GPUs, 3 PB fast parallel filesystem
Usage: 400+ active projects, 2,000+ user accounts, 100+ institutions across Australia
Interactive visualisation: 600+ users
HPC for Characterisation and Imaging
MASSIVE
Instrument Integration: integrating with key instrument facilities
– IMBL, XFM
– CryoEM
– MX
– MBI
– NCRIS: NIF, AMMRF
Large cohort of researchers new to HPC
Partners: Monash University, Australian Nuclear Science and Technology Organisation (ANSTO), Commonwealth Scientific and Industrial Research Organisation (CSIRO), University of Wollongong
Affiliate Partners: ARC Centre of Excellence in Integrative Brain Function, ARC Centre of Excellence in Advanced Molecular Imaging
Top HPC communities: classification of M3 projects
MASSIVE M3
The M3 cluster has been in production since 2016 at Monash University: a heterogeneous system to support heterogeneous workloads, operated in the research cloud.
Cores: 5356 across both Haswell and Skylake CPUs
NVIDIA GPU coprocessors for data processing and visualisation:
– 48 NVIDIA Tesla K80
– 40 NVIDIA Pascal P100
– 66 NVIDIA Pascal P4
– 60 NVIDIA Volta V100
– 6 NVIDIA DGX1-V
– 32 NVIDIA GRID K1 GPUs for medium and low end visualisation
Filesystem: a 3 petabyte Lustre parallel file system, usable after upgrade
Interconnect: 100 Gb/s Ethernet on Mellanox Spectrum switches running Cumulus Linux
Monash Research Cloud (R@CMon)
A high-level overview of the Monash research cloud cells in Nectar: monash-01 (commodity cloud), monash-02 (research cloud with specialist compute hardware) and monash-03 (a private cell that provides infrastructure for MASSIVE, MonARCH, SensiLab and hosting for partnerships).
vHPC Infrastructure Architecture Diagram
High-level architectural overview of monash-03:
– bare-metal provisioning using OpenStack Ironic
– redundant Neutron provider networks for the HPC cluster network, the public network, the restricted Monash VLAN, and advanced networking such as LBaaS and virtual routers
– VM instances are NUMA optimised, with host CPU passthrough to match all the CPU flags
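A minimal sketch of how one such provider network could be declared through the OpenStack Ansible modules; the cloud entry, network name, physical network label and VLAN ID below are hypothetical placeholders, not the production values.

```yaml
# Sketch only: declare a VLAN provider network for the HPC cluster fabric.
# The cloud entry, network name, physnet label and VLAN ID are placeholders.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create the HPC cluster provider network
      openstack.cloud.network:
        cloud: monash-03-admin                    # clouds.yaml entry (assumed)
        name: hpc-cluster-net                     # hypothetical network name
        provider_network_type: vlan
        provider_physical_network: physnet-hpc    # hypothetical physnet label
        provider_segmentation_id: 1234            # hypothetical VLAN ID
        shared: false
        state: present
```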
vHPC Compute Environment
Overview of an HPC compute node:
– NUMA-optimised HPC VM
– host CPU passthrough with all the CPU flags
– low-latency SR-IOV passthrough into the instances for the parallel filesystem
– Neutron provider network for the HPC cluster network
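In Nova terms, the NUMA optimisation is typically carried by flavor extra specs. A minimal sketch using the openstack.cloud Ansible collection; the flavor name and sizes are hypothetical, not the production M3 values.

```yaml
# Sketch only: a CPU-pinned, single-NUMA-node flavor for HPC instances.
# Flavor name and sizes are illustrative, not the production M3 values.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a NUMA-pinned HPC flavor
      openstack.cloud.compute_flavor:
        cloud: monash-03              # clouds.yaml entry (assumed)
        name: hpc.numa.24c            # hypothetical flavor name
        vcpus: 24
        ram: 98304                    # MB
        disk: 50
        extra_specs:
          "hw:cpu_policy": dedicated  # pin vCPUs to host cores
          "hw:numa_nodes": "1"        # keep the guest within one NUMA node
        state: present
```

Host CPU passthrough itself is typically a hypervisor-side setting (cpu_mode = host-passthrough in the [libvirt] section of nova.conf), and the SR-IOV virtual functions for the parallel filesystem are attached as Neutron ports with vnic_type=direct.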
Ansible (infrastructure as code)
● provisioning tool
● configuration management
● just run the playbooks
● portable!
● tags!
● OpenStack modules / Ansible Galaxy (see the provisioning sketch below)
● CMDB in the repos
Running Cumulus, Ubuntu, OpenStack and CentOS
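As a flavour of "just run the playbooks": a minimal sketch of launching one compute instance through the OpenStack Ansible modules. The cloud entry, image, flavor, key pair and network names are hypothetical placeholders.

```yaml
# Sketch only: provision one compute instance via the OpenStack Ansible modules.
# Cloud entry, image, flavor, key pair and network names are illustrative.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Launch an M3-style compute node instance
      openstack.cloud.server:
        cloud: monash-03              # clouds.yaml entry (assumed)
        name: m3-compute-001          # hypothetical instance name
        image: centos-7-hpc           # hypothetical image
        flavor: hpc.numa.24c          # e.g. the NUMA-pinned flavor sketched above
        key_name: hpc-admin           # hypothetical key pair
        network: hpc-cluster-net      # e.g. the provider network sketched above
        state: present
        wait: true
```

Configuration management then runs over the same inventory, which is where the tags and the CMDB-in-the-repos approach come in.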
Automation - both hypervisors and instances
- Now that we have the code to provision a cluster in the cloud, why do we still need to worry about hypervisor configuration?
- In this case we still do, because we use SR-IOV to present the card to the instances.
- The dependency of the driver on the firmware is essentially a time-consuming task for the sysadmin.
- Ansible AWX:
  - the central on-premise task management platform at Monash, providing internal teams with a secure, centralised management interface for patching and provisioning servers
  - scheduled workflows run Ansible playbooks to drain a compute node, upgrade the hypervisor, upgrade the instances and return the compute node to the queue (a sketch follows this list)
  - provisioning from the moment the hardware arrives in the data centre
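The drain/upgrade/return cycle can be expressed as a playbook. A minimal sketch, assuming a Slurm scheduler on the cluster and an inventory group of hypervisors; the group, variables, package name and upgrade steps are illustrative placeholders, not the production workflow.

```yaml
# Sketch only: drain -> upgrade -> resume cycle, assuming a Slurm scheduler.
# Inventory group, variables, package and steps are placeholders.
- hosts: hypervisors                  # hypothetical inventory group
  serial: 1                           # roll through one hypervisor at a time
  tasks:
    - name: Drain the compute node in Slurm
      command: >
        scontrol update nodename={{ compute_node }}
        state=drain reason="hypervisor maintenance"
      delegate_to: "{{ slurm_controller }}"

    - name: Upgrade the NIC driver/firmware package on the hypervisor
      apt:
        name: mlnx-ofed-kernel-dkms   # illustrative Mellanox OFED package
        state: latest

    - name: Reboot the hypervisor and wait for it to come back
      reboot:

    - name: Return the compute node to the queue
      command: >
        scontrol update nodename={{ compute_node }} state=resume
      delegate_to: "{{ slurm_controller }}"
```

Running this as a scheduled AWX workflow, one hypervisor at a time, keeps the queue available while the fleet is patched.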
Machine Learning - AI in Monash
● strong NVIDIA partnership
● apart from life science research, one of the biggest communities we have is machine learning
● supporting different machine learning techniques
● to help maximise GPU usage, checkpointing is used for ML jobs
● DGX machines: six in the queue, with five more to be added later this year
● AI platform to support the community
● running an AI study unit, giving 150 undergraduate students access to NVIDIA P100 GPUs in the campus cluster
● first MASSIVE GPU User Community event held on 17 July
M3 CPU & GPU Usage per Quarter - Machine Learning
Challenges and improvements
● there are cases where bare metal is still preferable
  ○ file servers
  ○ specialised machines, e.g. DGX
● virtualisation and OpenStack
  ○ bugs that are not reported yet or have not received enough attention to be fixed (e.g. Ubuntu KVM support for VMs with > 1 TB memory)
● I/O bottleneck: adding connectivity to other cloud instances, services, desktops and long-term archive
● Ansible journey:
  ○ adoption: we are using Ansible on all the infrastructure that we maintain
  ○ education: training and learning for teams, even outside the Monash eResearch Centre
  ○ knowledge organisation: continue to improve
Many thanks to all my colleagues in:
Nectar cloud: https://nectar.org.au/research-cloud
MASSIVE: https://www.massive.org.au
CVL: https://www.cvl.org.au
Monash eResearch: https://www.monash.edu/researchinfrastructure/eresearch/about
email: [email protected]