

A View from the Facility Operations Side on the Water/Air Cooling System of the K Computer

Jorji Nonaka, Keiji Yamamoto, Akiyoshi Kuroda, Toshiyuki Tsukamoto (RIKEN R-CCS)
Kazuki Koiso, Naohisa Sakamoto (Kobe University)

Abstract
The Operations and Computer Technologies Division at the RIKEN R-CCS is responsible for operating the entire HPC facility, which includes the supercomputer itself and its auxiliary subsystems, such as the power supply and the water/air cooling subsystems. Part of these subsystems will be reused for the next supercomputer, Fugaku, so a better understanding of their operational behavior and of their potential impact, especially on hardware failures and power consumption, would be greatly beneficial. In this poster, we present some preliminary impressions of the impact of the water/air cooling system on the K computer, focusing on the potential benefits of the low water and air temperatures used for the CPU (15 °C) and DRAM (17 °C), respectively, both produced by the chilled water cooling system. We expect the knowledge obtained to be helpful for decision support and/or operation planning for Fugaku.

Contact: Jorji Nonaka <[email protected]>

HPC Usability Development Unit (HUD Unit)
Operations and Computer Technologies Division

RIKEN Center for Computational Science

Acknowledgements
Part of the results were obtained by using the K computer at the RIKEN R-CCS. We are grateful to the colleagues at the RIKEN R-CCS who directly or indirectly collaborated in this work. We especially thank Fumiyoshi Shoji (Director of the Operations and Computer Technologies Division), Atsuya Uno (Unit Leader of the System Operations and Development Unit), and Shun Ito (currently at Fujitsu) for their helpful collaboration during the experiments, as well as the local Fujitsu staff for their supportive assistance.

CPU cooling water
Chilled water at 10 °C is used to control the CPU cooling water temperature (set to 15 °C). The graph shows the input and output water temperatures and the water flow inside a heat exchanger over one day.

Idle mode
The graph shows the impact of the cooling water temperature on the power consumption of an entire compute rack (T45) during an idle period of the K computer. We observed increases in energy consumption of around 1.75% and 3.5% when the cooling water temperature was raised to 20 °C and 25 °C, respectively.
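As a rough reading of the two idle-mode figures above, the sketch below computes the average increase per degree. It assumes that both percentages are measured relative to the 15 °C set point, which the poster does not state explicitly; the variable and function names are illustrative only.

```python
# Reported idle-mode increases in rack energy consumption (from the poster).
# Assumption: both values are relative to the 15 °C set point.
BASELINE_TEMP_C = 15.0
reported_increase_pct = {20.0: 1.75, 25.0: 3.5}


def increase_per_degree(temp_c: float, increase_pct: float) -> float:
    """Average percentage increase per °C above the baseline temperature."""
    return increase_pct / (temp_c - BASELINE_TEMP_C)


slopes = {t: increase_per_degree(t, pct) for t, pct in reported_increase_pct.items()}
for temp_c, slope in slopes.items():
    print(f"{temp_c:.0f} °C: {slope:.2f} % per °C")
```

Under that assumption both points give about 0.35% per °C, so the two measurements are consistent with a roughly linear trend over this range. Two points are not enough to confirm linearity.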

Benchmark applications
We used five benchmark applications with well-known behavior to evaluate the power consumption of an entire compute rack (T45). We observed a power consumption increase of less than 4% when the CPU cooling water temperature was raised by 10 °C (to 25 °C).

Conclusions
We observed in practice some of the theoretical benefits (in energy consumption and hardware failures) of using a low cooling water temperature (15±1 °C) when running the K computer. We also observed that, even when the CPU cooling water temperature is raised by 10 °C, the hardware may still operate within specification, with limited impact on the energy consumption and hardware failure rate. We expect the knowledge obtained to be helpful for decision support and operation planning for the next supercomputer, Fugaku.

Temperature variation inside a compute rack
CPU and cooling air temperature variation inside a compute rack (T45) during the execution of several benchmark applications: SLEEP (does nothing); PEK99 (CPU intensive); MEM72 (memory intensive); SUB09 (balanced CPU/memory use); and ADVMV (kernel from a production-grade application).

[Figure labels: CPU / ICC; SB / DRAM; Cooling Water (around 15 °C); Cooling Air (around 17 °C); Energy consumption; Hardware failure]

CPU and DRAM failures
Spatiotemporal distribution of the compute racks in which CPUs and DRAM were replaced due to hardware failure (from Feb. 2012 to May 2019). The accumulated number of failures per rack did not exceed three (CPU) and five (DRAM), and the racks with higher DRAM failure counts were concentrated in the neighborhood of rack T45.
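A per-rack tally like the one described above could be computed along the following lines. This is only a sketch: the record format and every entry in `replacements` are hypothetical and are not data from the poster.

```python
from collections import Counter

# Hypothetical replacement records: (rack ID, replaced component).
# These entries are made up for illustration; they are not poster data.
replacements = [
    ("T45", "DRAM"), ("T45", "DRAM"), ("T44", "DRAM"),
    ("T45", "CPU"), ("A01", "CPU"), ("T46", "DRAM"),
]


def failures_per_rack(records, component):
    """Count replacement events of one component type for each rack."""
    return Counter(rack for rack, part in records if part == component)


dram_counts = failures_per_rack(replacements, "DRAM")
cpu_counts = failures_per_rack(replacements, "CPU")

# Racks with the most DRAM replacements, and the per-rack CPU maximum.
print(dram_counts.most_common(3))
print(max(cpu_counts.values()))
```

Sorting racks by count with `most_common` is one simple way to see whether the highest counts cluster around a particular rack.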

[Diagram labels: Chilled Water (around 10 °C); Compute Rack; System Board; SPARC64 VIIIfx CPU; InterConnect Controller (ICC); DDR3 Memory (DRAM); Water-cooling module]

Evaluations
We used a single compute rack (T45), fitted with a power monitoring and logging device, and low-priority “Micro” class jobs to examine the temperature variation behavior and the energy consumption.
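A minimal sketch of how logged power samples could be turned into an energy figure and a relative increase is given below. It assumes the logger provides (time in seconds, power in watts) pairs, which the poster does not specify, and the sample values are hypothetical placeholders rather than measurements.

```python
def energy_wh(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule; returns Wh."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0


def relative_increase_pct(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


# Hypothetical one-hour logs at two cooling water temperatures (not poster data).
log_15c = [(0, 1000.0), (1800, 1000.0), (3600, 1000.0)]
log_25c = [(0, 1035.0), (1800, 1035.0), (3600, 1035.0)]

e15 = energy_wh(log_15c)
e25 = energy_wh(log_25c)
print(f"{e15:.1f} Wh -> {e25:.1f} Wh ({relative_increase_pct(e15, e25):.2f} % increase)")
```

The trapezoidal rule is used here only because it tolerates unevenly spaced samples; a real log format may call for a different approach.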