HPC/HTC and Cloud: Making them work together efficiently Rajul Kumar Northeastern University [email protected]


Post on 17-Apr-2018


Page 1: HPC/HTC and Cloud - Home - OpenStack is open source ... · HPC/HTC and Cloud: Making them work ... Simple Case: Single node HTC jobs •High Throughput Computing (HTC) ... HPC cluster

HPC/HTC and Cloud: Making them work together efficiently

Rajul Kumar

Northeastern University

[email protected]


Our group

Rajul Kumar

Northeastern University

[email protected]

Evan Weinberg

Boston University

[email protected]

Chris Hill

Massachusetts Institute of Technology

[email protected]


HPC and Cloud convergence

High Performance Computing (HPC)

• HPC users have infinite demand for resources

Cloud

• Overprovisioned to meet peak workloads, so it stays mostly underutilized

Can we make HPC soak up these idle cycles without impacting the cloud workload?


Simple Case: Single node HTC jobs

• High Throughput Computing (HTC) jobs focus on efficient execution of loosely coupled tasks

• Backfilled HTC jobs get killed to release resources for HPC workload

• Invested compute cycles are lost, and the killed job requires a complete rerun

Instead, suspend the virtual machine running the jobs and resume it as and when resources become available
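To make the difference between the two preemption styles concrete, here is a minimal toy simulation. It is only a sketch: the `HtcJob` class, the `simulate` helper, the work units, and the burst sizes are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HtcJob:
    """A single-node HTC job that needs `total_work` units of compute (hypothetical model)."""
    total_work: int
    done: int = 0      # useful work retained so far
    wasted: int = 0    # work thrown away by preemption

    def run(self, units: int) -> None:
        # Never run past the end of the job.
        self.done = min(self.total_work, self.done + units)

    def preempt_by_kill(self) -> None:
        # Killing the job discards all progress; it must start over.
        self.wasted += self.done
        self.done = 0

    def preempt_by_suspend(self) -> None:
        # Suspending the VM freezes the job; progress is retained.
        pass


def simulate(policy: str, bursts: List[int], total_work: int = 100) -> HtcJob:
    """Run the job in bursts, preempting after every burst that leaves it unfinished."""
    job = HtcJob(total_work)
    for units in bursts:
        job.run(units)
        if job.done == job.total_work:
            break
        if policy == "kill":
            job.preempt_by_kill()
        else:
            job.preempt_by_suspend()
    return job


killed = simulate("kill", [40, 40, 40])
suspended = simulate("suspend", [40, 40, 40])
print(killed.done, killed.wasted)        # 0 120
print(suspended.done, suspended.wasted)  # 100 0
```

With three 40-unit windows, the killed job never finishes its 100 units and wastes every cycle it ran, while the suspended job completes in the third window.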


Implementation

HPC cluster OpenStack cloud

Resource monitor

HPC

HTC

Cloud


Implementation

HPC cluster OpenStack cloud

Resource monitor

OpenVPN


Implementation

Control daemon

HPC cluster OpenStack cloud

Resource monitors

OpenVPN


Implementation

Control daemon

HPC cluster OpenStack cloud

Resource monitors

OpenVPN

HPC jobs

HPC job arrives


Implementation

Control daemon

Resource monitors

OpenVPN

HPC cluster OpenStack cloud

HTC jobs moved to Cloud


Implementation

Control daemon

Resource monitors

OpenVPN

HPC cluster OpenStack cloud

Cloud utilization increases


Implementation

Control daemon

Resource monitors

OpenVPN

HPC cluster OpenStack cloud

HTC job suspended to release resources for cloud


Implementation

Control daemon

Resource monitors

OpenVPN

HPC cluster OpenStack cloud

Cloud utilization goes low


Implementation

Control daemon

Resource monitors

OpenVPN

HPC cluster OpenStack cloud

HTC jobs resumed on cloud
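The suspend and resume steps in this sequence could be driven by a loop along the following lines. This is a minimal sketch under stated assumptions: the `ControlDaemon` class, the VM names, and both thresholds are hypothetical, since the slides give neither numbers nor the daemon's actual logic.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical thresholds; the slides do not give numbers.
SUSPEND_ABOVE = 0.80  # suspend HTC VMs when the cloud is at least this busy
RESUME_BELOW = 0.50   # resume them once utilization drops to this level


@dataclass
class ControlDaemon:
    """Toy model of a suspend/resume decision for HTC virtual machines."""
    vm_state: Dict[str, str] = field(default_factory=dict)  # name -> "ACTIVE" or "SUSPENDED"
    log: List[str] = field(default_factory=list)

    def step(self, cloud_utilization: float) -> None:
        """React to one utilization reading from the resource monitors."""
        if cloud_utilization >= SUSPEND_ABOVE:
            self._set_all("ACTIVE", "SUSPENDED", "suspend")
        elif cloud_utilization <= RESUME_BELOW:
            self._set_all("SUSPENDED", "ACTIVE", "resume")
        # Between the two thresholds nothing changes (hysteresis).

    def _set_all(self, old: str, new: str, action: str) -> None:
        for name, state in self.vm_state.items():
            if state == old:
                # A real daemon would call the cloud API at this point, e.g. via
                # the `openstack server suspend` / `openstack server resume` CLI.
                self.vm_state[name] = new
                self.log.append(f"{action} {name}")


daemon = ControlDaemon(vm_state={"htc-vm-1": "ACTIVE", "htc-vm-2": "ACTIVE"})
for reading in [0.30, 0.85, 0.65, 0.40]:
    daemon.step(reading)
print(daemon.log)
```

The gap between the two thresholds is a design choice made for this sketch: it keeps the VMs from flapping between states when utilization hovers around a single cutoff.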


Modifications to Slurm

Slurm – A workload manager for HPC clusters

• Manages resource allocation and job scheduling

• Marks an unreachable node DOWN and removes its jobs

• Does the same for a suspended virtual node

Modified Slurm to manage suspended nodes and keep their job states intact
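The contrast between the stock and the modified behaviour can be modelled in a few lines. This is a toy sketch, not Slurm code: the `Node` class, the two handler functions, and the state labels other than DOWN are illustrative assumptions, and the slides do not describe how the actual patch works.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a compute node tracked by the scheduler."""
    name: str
    state: str = "ALLOCATED"
    jobs: List[int] = field(default_factory=list)


def on_unreachable(node: Node, job_state: Dict[int, str], vm_suspended: bool) -> None:
    """Handle a node that stopped responding.

    vm_suspended=False models the stock behaviour described on the slide;
    vm_suspended=True models the modified behaviour.
    """
    if vm_suspended:
        # Keep the jobs attached to the node so they can continue later.
        node.state = "SUSPENDED"  # illustrative label only
        for job_id in node.jobs:
            job_state[job_id] = "SUSPENDED"
    else:
        node.state = "DOWN"
        for job_id in node.jobs:
            job_state[job_id] = "REMOVED"
        node.jobs = []


def on_resumed(node: Node, job_state: Dict[int, str]) -> None:
    """Bring a suspended node and its retained jobs back."""
    if node.state == "SUSPENDED":
        node.state = "ALLOCATED"
        for job_id in node.jobs:
            job_state[job_id] = "RUNNING"


stock_jobs = {101: "RUNNING"}
stock_node = Node("vnode-1", jobs=[101])
on_unreachable(stock_node, stock_jobs, vm_suspended=False)

patched_jobs = {202: "RUNNING"}
patched_node = Node("vnode-2", jobs=[202])
on_unreachable(patched_node, patched_jobs, vm_suspended=True)
on_resumed(patched_node, patched_jobs)

print(stock_node.state, stock_jobs)      # DOWN {101: 'REMOVED'}
print(patched_node.state, patched_jobs)  # ALLOCATED {202: 'RUNNING'}
```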


Future prospects

• Harden and utilize full data center performance (hardware, network, etc.)

• Run multi-node jobs in a virtual environment

• Move jobs between virtual machines and bare-metal nodes

• Experiment with container frameworks


Conclusion

• Dynamic HPC/HTC cluster with minimal overhead and impact

• Better productive utilization of the HPC/HTC cluster

• Better resource utilization of the cloud

http://info.massopencloud.org