Lenovo Intelligent Computing Orchestration 5.1.0 Installation Guide
Publication Date: 05/04/2018
Version: 1.0



Contents

1 Overview
  1.1 Introduction to LiCO
  1.2 Operating Environment
  1.3 Prerequisites
  1.4 Instructions
2 Deploying cluster environment
  2.1 Installing an OS on the Management Node
  2.2 Deploying the OS on Other Nodes in the Cluster
    2.2.1 Configuring Environmental Variables
    2.2.2 Get Local Repository
    2.2.3 Installing Lenovo xCAT
    2.2.4 Prepare OS Mirrors for Other Nodes
    2.2.5 Set xCAT node information
    2.2.6 Add Host Resolution
    2.2.7 Configuring DHCP and DNS Services
    2.2.8 Installing a Node OS through the Network
    2.2.9 Checkpoint A
  2.3 Installing Infrastructure Software for Node
    2.3.1 List of Infrastructure Software to be installed
    2.3.2 Set Local Yum Repository for Management Node
    2.3.3 Configuring Local Yum Repository for Compute and Login Nodes
    2.3.4 Configuring LiCO Dependencies Repository
    2.3.5 Installing Slurm
    2.3.6 Configuring NFS
    2.3.7 Configuring NTP
    2.3.8 Installing CUDA
    2.3.9 Configuring Slurm
    2.3.10 Installing Ganglia
    2.3.11 Installing MPI
    2.3.12 Installing Singularity
    2.3.13 Checkpoint B
3 Installing LiCO Dependencies
  3.1 List of LiCO Dependencies to be installed
  3.2 Installing RabbitMQ
  3.3 Installing PostgreSQL
  3.4 Installing InfluxDB
  3.5 Installing Confluent
  3.6 Configuring user authentication
    3.6.1 Installing OpenLDAP-server
    3.6.2 Installing libuser
    3.6.3 Installing openldap-client
    3.6.4 Installing nss-pam-ldapd
  3.7 Installing Gmond GPU Plug-In
4 Installing LiCO
  4.1 List of LiCO Components to be installed
  4.2 Getting the LiCO Installation Package
  4.3 Configuring the Local Yum Repository for LiCO
  4.4 Installing the Management Node
  4.5 Installing the Login Node
  4.6 Installing the Compute Node
5 Configuring LiCO
  5.1 Configuring Service Account
  5.2 Configuring Cluster Nodes
    5.2.1 Room Information
    5.2.2 Logic Group Information
    5.2.3 Room Row Information
    5.2.4 Rack Information
    5.2.5 Chassis Information
    5.2.6 Node Information
  5.3 Configuring LiCO Services
    5.3.1 Infrastructure Configuration
    5.3.2 Database Configuration
    5.3.3 Login Configuration
    5.3.4 Storage Configuration
    5.3.5 Scheduler Configuration
    5.3.6 Alert Configuration
    5.3.7 Cluster Configuration
    5.3.8 Functional Configuration
  5.4 Configuring LiCO Components
    5.4.1 lico-vnc-mond
    5.4.2 lico-env
    5.4.3 lico-portal
    5.4.4 lico-ganglia-mond
    5.4.5 lico-confluent-proxy
    5.4.6 lico-confluent-mond
    5.4.7 lico-wechat-agent
  5.5 Initializing the System
  5.6 Initializing Users
  5.7 Importing System Images
6 Starting LiCO
7 Appendix
  7.1 Configuring VNC
  7.2 Configuring Confluent web console
    7.2.1 RHEL
    7.2.2 CentOS
  7.3 LiCO commands
    7.3.1 Set the LDAP administrator password
    7.3.2 Change user's role
    7.3.3 Resume user
    7.3.4 Import user
    7.3.5 Import AI image
  7.4 Cluster Service Summary
  7.5 Security improvement
    7.5.1 Binding setting
    7.5.2 Firewall setting
  7.6 slurm.conf
  7.7 gres.conf
  7.8 Chassis Model List
  7.9 Product List
  7.10 Import system image
    7.10.1 Create image
    7.10.2 Import images into LiCO as system level image
  7.11 Troubleshooting Slurm issues
  7.12 Update OS packages
  7.13 Using a newer kernel with RETPOLINE support


1 Overview

1.1 Introduction to LiCO

Lenovo Intelligent Computing Orchestration (LiCO) is infrastructure management software for high-performance computing (HPC) and artificial intelligence (AI). It provides cluster management and monitoring, job scheduling and management, cluster user management, account management, and file system management. With LiCO, users can centralize resource allocation in one supercomputing cluster, and the software can run HPC and AI jobs simultaneously. Users can operate LiCO by logging in to the management interface with a browser, or from the command line after logging in to a cluster login node with a Linux shell.

1.2 Operating Environment

Servers:

Lenovo ThinkSystem servers.

Operating System:

Red Hat Enterprise Linux (RHEL) 7.4 or CentOS 7.4

Client:

Browser: Chrome (v62.0 or higher) or Firefox (v56.0 or higher) is recommended.

Display resolution: 1280 x 800 is recommended.

1.3 Prerequisites

Before your installation, please refer to the LiCO best recipe to make sure the cluster hardware uses the proper firmware levels, drivers, and settings. You can get the best recipe document from the link below: https://support.lenovo.com/us/en/solutions/ht506408

Before your installation, please refer to the OSes part of the LeSI 18A_SI best recipe to install the OS security patches. You can get that best recipe document from the link below: https://datacentersupport.lenovo.com/us/en/solutions/HT506335

The installation described in this Guide is based on CentOS 7.4; its purpose is to give a quick overview of LiCO. For RHEL 7.4, you can follow similar steps.

Set up a CentOS/Red Hat base repository (online or local) on the management node. Unless stated otherwise in this Guide, all commands run on the management node. If you must open the firewall, refer to Cluster Service Summary to modify the firewall rules. The user is responsible for regularly updating the components and OS; regular patching and updating prevents security vulnerabilities. For how to update OS packages, refer to chapter 7.12 Update OS packages.

This document is for a typical cluster that contains management, login, and compute nodes, as shown in the figure below. However, LiCO also supports clusters that contain only management and compute nodes. For this kind of cluster, all LiCO modules normally installed on the login node must be installed on the management node.

Management node: the core of the HPC/AI cluster, providing primary functions such as cluster management, monitoring, scheduling, strategy management, and user and account management.

Compute node: as the name implies, the compute node completes computing tasks.

Login node: connects the cluster to the external network or another cluster. Users must use the login node to upload application data, develop and compile programs, and submit scheduled tasks.

Parallel file system: provides a shared storage function and is connected to the cluster nodes via a high-speed network. Parallel file system setup is outside the scope of this Guide; a simple NFS setup is used instead.

Node BMC interface: used to access a node's BMC system.

Node eth interface: the Ethernet interface used to manage the nodes in the cluster; it can also be used to transfer computing data.

High-speed network interface: optional; usually used to support the parallel file system, and can also be used to transfer computing data.


1.4 Instructions

This guide is a PDF document. To make sure that you can copy and paste command lines correctly, open this document with Adobe Acrobat Reader. Adobe Acrobat Reader is a free PDF viewer available from the official website: https://get.adobe.com/reader/

Replace the <*_USERNAME> and <*_PASSWORD> placeholders in this document with your actual username and password.


2 Deploying cluster environment

If the cluster environment already exists, you may skip this chapter, provided that the software in the infrastructure software list is already installed and the cluster can pass Checkpoint A and Checkpoint B.

2.1 Installing an OS on the Management Node

Install an official version of CentOS 7.4 on the management node and you can select the minimum installation.

2.2 Deploying the OS on Other Nodes in the Cluster

2.2.1 Configuring Environmental Variables

After logging into the management node, run the commands below to configure environmental variables for the entire installation process:

su root

cd ~

vi lico_env.local

Based on the following prompts, edit lico_env.local and save it. (In the final file, you may leave out the annotations starting with #.) Note: this Guide assumes that all nodes use the same BMC username and password; if they differ, modify them during the step Set xCAT node information.

# Management node hostname

sms_name="head"

# Set the domain name

domain_name="hpc.com"

# Set OpenLDAP domain name

lico_ldap_domain_name="dc=hpc,dc=com"

# IP address of management node in the cluster intranet

sms_ip="192.168.0.1"

# Web interface corresponding to the management node IP address

sms_eth_internal="eth0"

# Subnet mask in the cluster intranet. If all nodes in the cluster already have OS

# installed, retain the default configurations.

internal_netmask="255.255.0.0"

# BMC username and password

bmc_username="<BMC_USERNAME>"


bmc_password="<BMC_PASSWORD>"

# OS mirror pathway for xCAT

iso_path="/isos"

# Local Yum repository directory for OS

os_repo_dir="/install/custom/server"

sdk_repo_dir="/install/custom/sdk"

# Local Yum repository directory for xCAT

xcat_repo_dir="/install/custom/xcat"

# Local Yum repository directory for Lenovo OpenHPC

ohpc_repo_dir="/install/custom/ohpc"

# Local Yum repository directory for LiCO-dep

lico_dep_repo_dir="/install/custom/lico-dep"

# Local Yum repository directory for LiCO

lico_repo_dir="/install/custom/lico"

# Total compute nodes

num_computes="2"

# Prefix of compute node hostname. If OS has already been installed on all the

# nodes of the cluster, change the configuration according to actual conditions.

compute_prefix="c"

# Compute node hostname list. If OS has already been installed on all the

# nodes of the cluster, change the configuration according to actual conditions.

c_name[0]=c1

c_name[1]=c2

# Compute node IP list. If OS has already been installed on all the

# nodes of the cluster, change the configuration according to actual conditions.

c_ip[0]=192.168.0.6

c_ip[1]=192.168.0.16

# Network interface card MAC address corresponding to the compute node IP. If OS

# has already been installed on all the #nodes of the cluster, change the

# configuration according to actual conditions.

c_mac[0]=fa:16:3e:73:ec:50

c_mac[1]=fa:16:3e:27:32:c6

# Compute node BMC address list.

c_bmc[0]=192.168.1.6

c_bmc[1]=192.168.1.16

# Total login nodes

num_logins="1"

# Login node hostname list. If OS has already been installed on all the nodes

# of the cluster, change the configuration according to actual conditions..

l_name[0]=l1

# Login node IP list. If OS has already been installed on all the nodes

# of the cluster, change the configuration according to actual conditions.

l_ip[0]=192.168.0.15

#Network interface card MAC address corresponding to the login node IP.


# If OS has already been installed on all the nodes of the cluster, change

# the configuration according to actual conditions.

l_mac[0]=fa:16:3e:2c:7a:47

# Login node BMC address list.

l_bmc[0]=192.168.1.15

Run the following commands to make the configuration file take effect:

chmod 600 lico_env.local

source lico_env.local
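As a quick sanity check (not part of the official procedure), a small helper can confirm that the variables the later steps rely on are set and non-empty after sourcing lico_env.local; the function name and variable list below are illustrative.

```shell
# Hypothetical sanity check: verify that required lico_env.local variables
# are set and non-empty. Prints the missing ones and returns non-zero.
check_lico_env() {
  local missing=0 var val
  for var in "$@"; do
    eval "val=\${$var}"            # indirect lookup of the named variable
    if [ -z "$val" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return $missing
}

# Example, after `source lico_env.local`:
# check_lico_env sms_name domain_name sms_ip sms_eth_internal internal_netmask \
#   && echo "lico_env.local looks complete"
```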

After the cluster environment is set up, configure a public network IP on the login or management node so that the LiCO web portal can be reached from the external network.

2.2.2 Get Local Repository

Create a directory for storing the ISOs:

mkdir -p ${iso_path}

CentOS: Download CentOS-7-x86_64-Everything-1708.iso from the official website, copy it to the path ${iso_path}, and run the commands below:

# Run the command below to compute the checksum of the iso file, then compare
# it with the checksum published on the official website to make sure they match.

cd ${iso_path}

sha256sum CentOS-7-x86_64-Everything-1708.iso

cd ~

#mount image

mkdir -p ${os_repo_dir}

mount -o loop ${iso_path}/CentOS-7-x86_64-Everything-1708.iso ${os_repo_dir}

#configuration local repository

cat << eof > ${iso_path}/EL7-OS.repo

[EL7-OS]

name=el7-centos

enabled=1

gpgcheck=0

type=rpm-md

baseurl=file://${os_repo_dir}

eof

cp -a ${iso_path}/EL7-OS.repo /etc/yum.repos.d/
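The .repo stanza above follows a fixed pattern; if you need to create similar definitions for other directories, a helper like the following can generate them. This is a sketch, not part of the official guide; the function name is an assumption.

```shell
# Hypothetical helper: write a minimal local-repository definition matching
# the EL7-OS.repo stanza above. All arguments are caller-supplied.
make_local_repo() {  # usage: make_local_repo <repo-id> <repo-name> <dir> <outfile>
  cat > "$4" <<eof
[$1]
name=$2
enabled=1
gpgcheck=0
type=rpm-md
baseurl=file://$3
eof
}

# Example:
# make_local_repo EL7-OS el7-centos "${os_repo_dir}" /etc/yum.repos.d/EL7-OS.repo
```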


RHELS: Copy the RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso and RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso.MD5SUM files to the ${iso_path} directory and run the following commands:

#Check the validity of the iso file:

cd ${iso_path}

md5sum -c RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso.MD5SUM

cd ~

#mount image

mkdir -p ${os_repo_dir}

mount -o loop ${iso_path}/RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso ${os_repo_dir}

#configuration local repository

cat << eof > ${iso_path}/RHELS74-OS.repo

[RHELS7-OS]

name=RHELS7-OS

enabled=1

gpgcheck=0

type=rpm-md

baseurl=file://${os_repo_dir}

eof

cp -a ${iso_path}/RHELS74-OS.repo /etc/yum.repos.d/

2.2.3 Installing Lenovo xCAT

Download the package: https://hpc.lenovo.com/downloads/18a/xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-el7.tar.bz2

Upload the package to the management node, and then run the commands below to install xCAT:

# Create xcat local repository

yum install -y bzip2

mkdir -p $xcat_repo_dir

tar -xvf xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-el7.tar.bz2 -C $xcat_repo_dir

cd $xcat_repo_dir/lenovo-hpc-el7

./mklocalrepo.sh

cd ~

# install xcat

yum install -y xCAT

systemctl start xcatd

source /etc/profile.d/xcat.sh


2.2.4 Prepare OS Mirrors for Other Nodes

If all nodes in the cluster already have an OS installed, you may skip this step. CentOS: Run the following command to prepare the OS image for the other nodes:

copycds ${iso_path}/CentOS-7-x86_64-Everything-1708.iso

RHELS: Run the following command to prepare the OS image for the other nodes:

copycds ${iso_path}/RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso

CentOS: Run the commands below to confirm that the OS image has been copied.

lsdef -t osimage

#Output should be

centos7.4-x86_64-install-compute (osimage)

centos7.4-x86_64-netboot-compute (osimage)

centos7.4-x86_64-statelite-compute (osimage)

RHELS: Run the commands below to confirm that the OS image has been copied.

lsdef -t osimage

#Output should be

rhels7.4-x86_64-install-compute (osimage)

rhels7.4-x86_64-netboot-compute (osimage)

rhels7.4-x86_64-statelite-compute (osimage)

CentOS: The nouveau module is an accelerated open-source driver for NVIDIA cards. Following the NVIDIA official installation guide, this module should be disabled before installing the CUDA driver:

chdef -t osimage centos7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"

RHELS: The nouveau module is an accelerated open-source driver for NVIDIA cards. Following the NVIDIA official installation guide, this module should be disabled before installing the CUDA driver:

chdef -t osimage rhels7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"

2.2.5 Set xCAT node information


Run the commands below to import the compute node configuration in the lico_env.local file to xCAT:

for ((i=0; i<$num_computes; i++)); do
  mkdef -t node ${c_name[$i]} groups=compute,all arch=x86_64 netboot=xnba mgt=ipmi \
    bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${c_ip[$i]} mac=${c_mac[$i]} \
    bmc=${c_bmc[$i]} serialport=0 serialspeed=115200
done

Run the commands below to import the login node configuration in the lico_env.local file to xCAT:

for ((i=0; i<$num_logins; i++)); do
  mkdef -t node ${l_name[$i]} groups=login,all arch=x86_64 netboot=xnba mgt=ipmi \
    bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${l_ip[$i]} mac=${l_mac[$i]} \
    bmc=${l_bmc[$i]} serialport=0 serialspeed=115200
done
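Because mkdef writes directly into the xCAT database, it can be useful to print the generated command for each node and review the IP/MAC/BMC values first. The helper below is a hypothetical dry-run sketch using the same variables as the loops above; the function name is an assumption.

```shell
# Hypothetical dry-run: echo the mkdef command that would be run for one
# node, so its attributes can be reviewed before touching the xCAT database.
# Relies on bmc_username/bmc_password from lico_env.local being set.
print_mkdef() {  # usage: print_mkdef <name> <group> <ip> <mac> <bmc>
  echo "mkdef -t node $1 groups=$2,all arch=x86_64 netboot=xnba mgt=ipmi" \
       "bmcusername=${bmc_username} bmcpassword=${bmc_password}" \
       "ip=$3 mac=$4 bmc=$5 serialport=0 serialspeed=115200"
}

# Example, for the compute nodes:
# for ((i=0; i<$num_computes; i++)); do
#   print_mkdef "${c_name[$i]}" compute "${c_ip[$i]}" "${c_mac[$i]}" "${c_bmc[$i]}"
# done
```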

Note: If the BMC username and password are not the same on every node, run the following command to modify them:

tabedit ipmi

Run the command below to configure the root account password for the nodes (replace <ROOT_PASSWORD> with the password to be set):

chtab key=system passwd.username=root passwd.password=<ROOT_PASSWORD>

2.2.6 Add Host Resolution

If the cluster already has an OS installed and can resolve IP addresses through hostnames, skip this step. Otherwise, run the commands below to add host resolution:

chdef -t site domain=${domain_name}

chdef -t site master=${sms_ip}

chdef -t site nameservers=${sms_ip}

sed -i "/^\s*${sms_ip}\s*.*$/d" /etc/hosts

sed -i "/\s*${sms_name}\s*/d" /etc/hosts

echo "${sms_ip} ${sms_name} ${sms_name}.${domain_name} " >> /etc/hosts

makehosts
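To confirm that the resolution entries written by makehosts took effect, a check like the following can be used. It is a sketch that parses a hosts-format file directly (defaulting to /etc/hosts), not an official LiCO or xCAT tool.

```shell
# Hypothetical check: succeed if <hostname> maps to <expected-ip> in a
# hosts-format file (default /etc/hosts, as written by makehosts).
check_host() {  # usage: check_host <hostname> <expected-ip> [hosts-file]
  awk -v h="$1" -v ip="$2" '
    $1 == ip { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
    END { exit !found }
  ' "${3:-/etc/hosts}"
}

# Example:
# check_host "${sms_name}" "${sms_ip}" && echo "management node resolves"
```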

2.2.7 Configuring DHCP and DNS Services

If all nodes in the cluster have an OS installed, skip this step. Otherwise, run the following commands:

makenetworks

makedhcp -n

makedns -n


2.2.8 Installing a Node OS through the Network

If all nodes in the cluster have an OS installed, skip this step. CentOS: Run the commands below to set the OS image and install it through the network.

nodeset all osimage=centos7.4-x86_64-install-compute

rsetboot all net -u

rpower all reset

RHELS: Run the commands below to set the OS image and install it through the network.

nodeset all osimage=rhels7.4-x86_64-install-compute

rsetboot all net -u

rpower all reset

It takes several minutes to finish the OS installation. You can use the command below to check the progress:

nodestat all
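Rather than re-running nodestat by hand, a generic polling loop can wait for a condition to hold. The helper below is a sketch; the timeout values and the nodestat usage in the comment are assumptions (nodestat typically reports sshd once a node has finished installing and booted).

```shell
# Hypothetical polling helper: re-run a command every <interval> seconds
# until it succeeds, or give up after <timeout> seconds.
wait_until() {  # usage: wait_until <timeout-s> <interval-s> <command...>
  local deadline=$(( $(date +%s) + $1 )) interval=$2
  shift 2
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Example: wait up to 30 minutes for every node to reach sshd:
# wait_until 1800 30 sh -c '! nodestat all | grep -qv sshd'
```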

2.2.9 Checkpoint A

Run the commands below to check if installation is complete:

psh all uptime

#Output should be

c1: 05:03am up 0:02, 0 users, load average: 0.20, 0.13, 0.05

c2: 05:03am up 0:02, 0 users, load average: 0.20, 0.14, 0.06

l1: 05:03am up 0:02, 0 users, load average: 0.17, 0.13, 0.05

...

2.3 Installing Infrastructure Software for Node

2.3.1 List of Infrastructure Software to be installed

The installation node fields are expressed as follows: H: Management node, L: Login node, C: Compute node

Software     Component Name       Version  Service Name      Installation Node  Notes
nfs          nfs-utils            1.3.0    nfs-server        H
ntp          ntp                  4.2.6    ntpd              H
slurm        ohpc-slurm-server    1.3.3    munge, slurmctld  H
slurm        ohpc-slurm-client    1.3.3    munge, slurmd     C,L
ganglia      ganglia-gmond-ohpc   3.7.2    gmond             H,C,L
singularity  singularity-ohpc     2.4                        H
cuda         cudnn                7                          C                  Only needs to be installed on the GPU nodes
cuda         cuda                 9.1                        C                  Only needs to be installed on the GPU nodes
mpi          openmpi3-gnu7-ohpc   3.0.0                      H                  Install at least one of the three types of MPI
mpi          mpich-gnu7-ohpc      3.2                        H                  Install at least one of the three types of MPI
mpi          mvapich2-gnu7-ohpc   2.2                        H                  Install at least one of the three types of MPI

2.3.2 Set Local Yum Repository for Management Node

Download the package from https://hpc.lenovo.com/lico/downloads/5.1/Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar and upload it to the management node. Then run the commands below to configure the local Lenovo OpenHPC repository:

mkdir -p $ohpc_repo_dir

tar xvf Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar -C $ohpc_repo_dir

$ohpc_repo_dir/make_repo.sh

2.3.3 Configuring Local Yum Repository for Compute and

Login Nodes

Run the commands below to install the Yum toolkit:

psh all yum --setopt=\*.skip_if_unavailable=1 install -y yum-utils

Run the commands below to add a local repository:

cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo /var/tmp

sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo

sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo

echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/CentOS_7" >> /var/tmp/Lenovo.OpenHPC.local.repo

echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/CentOS_7/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo

# Distribute files


xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/yum.repos.d/
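The sed/echo sequence above simply swaps the repo file's baseurl and gpgkey over to the HTTP export served by the management node. A local sketch of the same rewrite on a scratch file, with hypothetical values standing in for ${sms_name} and ${ohpc_repo_dir}:

```shell
# Hypothetical stand-ins for the environment variables
sms_name=head
ohpc_repo_dir=install/custom/ohpc

# Scratch copy standing in for /var/tmp/Lenovo.OpenHPC.local.repo
repo=$(mktemp)
printf '[Lenovo.OpenHPC.local]\nbaseurl=file:///old/path\ngpgkey=file:///old/key\n' > "$repo"

# Delete the old URLs and append the HTTP ones served by the management node
sed -i '/^baseurl=/d' "$repo"
sed -i '/^gpgkey=/d' "$repo"
echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/CentOS_7" >> "$repo"
echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/CentOS_7/repodata/repomd.xml.key" >> "$repo"

grep '^baseurl=' "$repo"   # -> baseurl=http://head/install/custom/ohpc/CentOS_7
```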

psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros

Run the following command to disable yum access to external-network repositories. Perform this step according to your actual situation: if the operating system does not already have enough packages installed, disabling these repositories may cause subsequent installation steps to fail:

psh all yum-config-manager --disable CentOS\*

2.3.4 Configuring LiCO Dependencies Repository

Download the package from https://hpc.lenovo.com/lico/downloads/5.1/lico-dep-5.1.el7.x86_64.tgz and upload it to the management node. Run the commands below to configure the Yum repository for the management node. Make sure the management node has a local operating system Yum repository configured before performing the following actions:

mkdir -p $lico_dep_repo_dir

tar xvf lico-dep-5.1.el7.x86_64.tgz -C $lico_dep_repo_dir

$lico_dep_repo_dir/mklocalrepo.sh

If these components already exist in the cluster, check whether their versions are consistent with the list in section 3.1 List of LiCO Dependencies to be installed. Run the commands below to configure the Yum repository for the other nodes:

cp /etc/yum.repos.d/lico-dep.repo /var/tmp

sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo

sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo

echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo

echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL7" >> /var/tmp/lico-dep.repo

# Distribute files

xdcp all /var/tmp/lico-dep.repo /etc/yum.repos.d

2.3.5 Installing Slurm

Run the commands below to install the base package:

yum install -y lenovo-ohpc-base


Run the commands below to install Slurm:

yum install -y ohpc-slurm-server

Run the commands below to install the Slurm client:

psh all yum install -y ohpc-base-compute ohpc-slurm-client lmod-ohpc

The following optional command prevents non-root logins to the compute nodes unless the user logging in already has a Slurm job running on that node:

psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
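The `"\""` sequences in the psh command above only exist to carry the quotes through two layers of shell parsing; what each node ultimately executes is a plain echo append. A local sketch of the resulting file content, against a scratch path instead of /etc/pam.d/sshd:

```shell
# What each node effectively runs, shown against a scratch file
f=$(mktemp)
echo "account required pam_slurm.so" >> "$f"
cat "$f"   # -> account required pam_slurm.so
```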

2.3.6 Configuring NFS

Run the following commands to create the shared directory /opt/ohpc/pub. This directory is required. If you have already shared this directory from the management node and mounted it on all of the other nodes, you can skip this step.

#Management node sharing /opt/ohpc/pub for OpenHPC

yum install -y nfs-utils

echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports

exportfs -a

# Installing NFS for Cluster Nodes

psh all yum install -y nfs-utils

#Configure shared directory for cluster nodes

psh all mkdir -p /opt/ohpc/pub

psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab

#Mount shared directory

psh all mount /opt/ohpc/pub
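As with the pam_slurm line earlier, the escaping in the psh/echo fstab command only protects the quotes in transit; the line each node appends to /etc/fstab is an ordinary NFSv3 entry. A local sketch, with a hypothetical value standing in for ${sms_ip}:

```shell
sms_ip=192.168.0.1   # hypothetical management node IP
line="${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"
echo "$line"   # the entry that lands in each node's /etc/fstab
```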

Run the following commands to create the user shared directory. This document takes /home as an example; you can choose another directory.

#Management node sharing /home

echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports

exportfs -a

# if /home already mounted, unmount it first

psh all "sed -i '/ \/home /d' /etc/fstab"

psh all umount /home

#Configure shared directory for cluster nodes


psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab

#Mount shared directory

psh all mount /home

2.3.7 Configuring NTP

If the NTP service has already been configured for the nodes in the cluster, skip this step. Otherwise, run the commands below:

echo "server 127.127.1.0" >> /etc/ntp.conf

echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf

systemctl enable ntpd

systemctl start ntpd

psh all yum install -y ntp

psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf

psh all systemctl enable ntpd

psh all systemctl start ntpd

# check service

psh all "ntpq -p | tail -n 1"

2.3.8 Installing CUDA

Run the commands below to install CUDA and cuDNN on all the GPU compute nodes (if only a subset of nodes have GPUs, replace the "compute" argument in the psh commands with the node range corresponding to the GPU nodes).

1. Install CUDA: Download cuda_9.1.85_387.26_linux.run from https://developer.nvidia.com/cuda-downloads and copy it to the shared directory /home. If the operating system is configured to boot to a graphical desktop, run the commands below to configure it to boot to the text console, and then restart the system:

psh compute systemctl set-default multi-user.target

psh compute reboot

Download the NVIDIA driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/390.46/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm&lang=us&type=Tesla and copy it to the shared directory /home.


Run the commands below to install CUDA:

psh compute yum install -y kernel-devel gcc gcc-c++

psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override

2. Install cuDNN: Download cuDNN 7.0.5 (the downloaded package is cudnn-9.1-linux-x64-v7.tgz) from https://developer.nvidia.com/cudnn to the directory /root, then run the commands below to install it:

cd ~

tar -xvf cudnn-9.1-linux-x64-v7.tgz

xdcp compute cuda/include/cudnn.h /usr/local/cuda/include

xdcp compute cuda/lib64/libcudnn_static.a /usr/local/cuda/lib64

xdcp compute cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64

psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64/libcudnn.so.7"

psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcudnn.so"

psh compute chmod a+r /usr/local/cuda/include/cudnn.h

psh compute chmod a+r /usr/local/cuda/lib64/libcudnn*

3. Configure environment variables: Certain environment variables need to be set to ensure proper operation of the CUDA package, by modifying the configuration files in the commands below. Run these commands on the management node (even though CUDA is not installed there) to facilitate distributing these files, and the CUDA environment variables, to all of the GPU compute nodes in the cluster:

echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf

echo "export CUDA_HOME=/usr/local/cuda" >> /etc/profile.d/cuda.sh

echo "export PATH=/usr/local/cuda/bin:\$PATH" >> /etc/profile.d/cuda.sh

Distribute configuration files:

xdcp compute /etc/ld.so.conf.d/cuda.conf /etc/ld.so.conf.d/cuda.conf

xdcp compute /etc/profile.d/cuda.sh /etc/profile.d/cuda.sh

Run the commands below on the GPU nodes to determine if the GPU can be identified:

psh compute ldconfig

psh compute nvidia-smi

psh compute "cd /root/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery; make; ./deviceQuery" | xcoll
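One detail worth noting in the cuda.sh step above: the backslash in `\$PATH` keeps the variable unexpanded when the file is written, so it is resolved at each user's login instead of once at write time. A local sketch against a scratch file:

```shell
# Scratch file standing in for /etc/profile.d/cuda.sh
f=$(mktemp)
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> "$f"
cat "$f"   # -> export PATH=/usr/local/cuda/bin:$PATH  ($PATH still literal)
```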

4. Configure the CUDA driver to start automatically:

psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm

psh compute yum install -y cuda-drivers


#configuration

psh compute sed -i '/Wants=syslog.target/a\Before=slurmd.service' /usr/lib/systemd/system/nvidia-persistenced.service

psh compute systemctl daemon-reload

psh compute systemctl enable nvidia-persistenced

psh compute systemctl start nvidia-persistenced

2.3.9 Configuring Slurm

Download slurm.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ to /etc/slurm on the management node, and modify this file according to the instructions in section 7.6. Run the commands below to distribute the configuration:

xdcp all /etc/slurm/slurm.conf /etc/slurm/slurm.conf

xdcp all /etc/munge/munge.key /etc/munge/munge.key

Download gres.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ to /etc/slurm on the GPU nodes, and follow the instructions in section 7.7 to modify this file as needed. Non-GPU nodes do not need this file. Run the commands below to start the services:

#Start management node service

systemctl enable munge

systemctl enable slurmctld

systemctl restart munge

systemctl restart slurmctld

#Start other node service

psh all systemctl enable munge

psh all systemctl restart munge

psh all systemctl enable slurmd

psh all systemctl restart slurmd

2.3.10 Installing Ganglia

Install Ganglia on the management node by running the commands below:

# install Ganglia

yum install -y ganglia-gmond-ohpc


# Download gmond.conf from

#https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/management/, and copy it

#to the /etc/ganglia/ directory on the management node, then modify

# the hostname in the /etc/ganglia/gmond.conf file to the management

# node's hostname for the udp_send_channel setting.

echo net.core.rmem_max=10485760 > /usr/lib/sysctl.d/gmond.conf

/usr/lib/systemd/systemd-sysctl gmond.conf

sysctl -w net.core.rmem_max=10485760

# Install Ganglia on compute node

psh all yum install -y ganglia-gmond-ohpc

# Download gmond.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/,

# and copy it to the /var/tmp/ directory of the management node, then modify

# the hostname in the /var/tmp/gmond.conf file to the management

# node's hostname for the udp_send_channel setting.

#Distribute configuration

xdcp all /var/tmp/gmond.conf /etc/ganglia/gmond.conf

#Start management node service

systemctl enable gmond

systemctl start gmond

# start other nodes service

psh all systemctl enable gmond

psh all systemctl start gmond

# Run the command below to see if all the nodes are listed

gstat -a

2.3.11 Installing MPI

Run the commands below:

yum install -y openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc

The above commands will install three modules (OpenMPI, MPICH, and MVAPICH) to the system, and the user can use lmod to choose the specific MPI module to be used. OpenHPC provides a module package to set the default module. The following command will set the OpenMPI module as the default:

yum install -y lmod-defaults-gnu7-openmpi3-ohpc

To set the MPICH module as the default, run:

yum install -y lmod-defaults-gnu7-mpich-ohpc

To set the MVAPICH module as the default, run:

yum install -y lmod-defaults-gnu7-mvapich2-ohpc

The table below shows interconnect support for each MPI type in OpenHPC (x means supported):

                 Ethernet(TCP)   InfiniBand   Omni-Path
MPICH            x
MVAPICH2                         x
MVAPICH2(psm2)                                x
OpenMPI          x               x            x
OpenMPI(PMIx)    x               x            x

Note: If you want to use MVAPICH2 (psm2), install mvapich2-psm2-gnu7-ohpc. If you want to use OpenMPI (PMIx), install openmpi3-pmix-slurm-gnu7-ohpc. However, openmpi3-gnu7-ohpc and openmpi3-pmix-slurm-gnu7-ohpc are incompatible with each other, as are mvapich2-psm2-gnu7-ohpc and mvapich2-gnu7-ohpc.

2.3.12 Installing Singularity

Singularity is an HPC-facing lightweight container framework. Run the commands below to install Singularity:

yum install -y singularity-ohpc

Edit the /opt/ohpc/pub/modulefiles/ohpc file. In the “module try-add” block, add the content below as the last line:

module try-add singularity

In the “module del” block, add the content below as the first line:

module del singularity

Run the following command:

source /etc/profile.d/lmod.sh

Note: Changes to /opt/ohpc/pub/modulefiles/ohpc may be lost when the default modules are changed by installing an lmod-defaults* package. In that case, modify the /opt/ohpc/pub/modulefiles/ohpc file again or, alternatively, add "module try-add singularity" to the bottom of /etc/profile.d/lmod.sh.

2.3.13 Checkpoint B

Run the commands below to test if Slurm is installed normally:

sinfo

#Output should be

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

normal* up 1-00:00:00 2 idle c[1-2]

……

The status of all nodes should be ‘idle’; ‘idle*’ is not acceptable. Run the commands below to add a test account:

useradd -m test

echo "MERGE:" > syncusers

echo "/etc/passwd -> /etc/passwd" >> syncusers


echo "/etc/group -> /etc/group" >> syncusers

echo "/etc/shadow -> /etc/shadow" >> syncusers

xdcp all -F syncusers

Log in to the test account and run the Slurm distributed test program:

su - test

mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c

srun -n 8 -N 1 -w compute --pty /bin/bash

prun ./a.out

#Output should be

Master compute host = c1

Resource manager = slurm

Launch cmd = mpiexec.hydra -bootstrap slurm ./a.out

Hello, world (8 procs total)

--> Process # 0 of 8 is alive. -> c1

--> Process # 4 of 8 is alive. -> c2

--> Process # 1 of 8 is alive. -> c1

--> Process # 5 of 8 is alive. -> c2

--> Process # 2 of 8 is alive. -> c1

--> Process # 6 of 8 is alive. -> c2

--> Process # 3 of 8 is alive. -> c1

--> Process # 7 of 8 is alive. -> c2

Note: After finishing these commands, make sure you exit back to the root user on the management node.


3 Installing LiCO Dependencies

3.1 List of LiCO Dependencies to be installed

The installation node fields are expressed as follows: H: Management node, L: Login node, C: Compute node

Software Name      Component Name          Version   Service Name      Installation Node   Notes
rabbitmq           rabbitmq-server         3.6.15    rabbitmq-server   H
postgresql         postgresql-server       9.2.23    postgresql        H
influxdb           influxdb                1.4.2     influxdb          H
confluent          confluent               1.8.1     confluent         H
openldap           slapd-ssl-config        1.0.0     slapd             H
openldap           nss-pam-ldapd           0.8.13    nslcd             H,C,L
openldap           libuser                 0.60                        H
openldap           libuser-python          0.60                        H
gmond gpu plugin   gmond-ohpc-gpu-module   1.0.0                       C                   Only needs to be installed on the GPU node

3.2 Installing RabbitMQ

LiCO uses RabbitMQ as a message broker. Run the commands below to install:

#Install RabbitMQ on the management node

yum install -y rabbitmq-server

#Start RabbitMQ service

systemctl enable rabbitmq-server

systemctl start rabbitmq-server

3.3 Installing PostgreSQL

LiCO uses PostgreSQL as an object-relational database for data storage. Run the commands below to install:

#Install PostgreSQL on the management node

yum install -y postgresql-server


# Initialize the database; the password can be changed as needed.

su - postgres

echo <PG_PASSWORD> > /var/tmp/pwfile

initdb -U postgres --pwfile /var/tmp/pwfile /var/lib/pgsql/data

rm /var/tmp/pwfile

exit

#Starting PostgreSQL

systemctl enable postgresql

systemctl start postgresql

#Create LiCO database

export PGPASSWORD=<PG_PASSWORD>

psql -U postgres -c "CREATE DATABASE lico;"

3.4 Installing InfluxDB

LiCO uses InfluxDB as a time series database for storing monitoring data. Run the commands below to install it:

#Install InfluxDB

yum install -y influxdb

#Start InfluxDB

systemctl enable influxdb

systemctl start influxdb

Run the following commands to create the InfluxDB database and user:

#Enter the InfluxDB shell

influx

#create database

create database lico

#use database

use lico

# To create an administrator user, note that the password must be a string,

# otherwise an error is reported.

create user <INFLUX_USERNAME> with password '<INFLUX_PASSWORD>' with all privileges

# exit the InfluxDB shell

exit

#Enable authentication

sed -i '/auth-enabled = false/a\ auth-enabled = true' /etc/influxdb/config.toml

#Restart InfluxDB

systemctl restart influxdb


3.5 Installing Confluent

Run the commands below to install:

yum install -y python2-crypto

yum install -y confluent

# Start confluent

systemctl enable confluent

systemctl start confluent

# Create confluent account

confetty create /users/<CONFLUENT_USERNAME> password=<CONFLUENT_PASSWORD>

If you need to use the web console, refer to the appendix Configuring Confluent web console.

3.6 Configuring user authentication

3.6.1 Installing OpenLDAP-server

OpenLDAP is an open-source implementation of the Lightweight Directory Access Protocol. LiCO recommends using OpenLDAP to manage users; however, it also supports other authentication services compatible with Linux-PAM. If you have already configured OpenLDAP for the cluster, or another authentication service is being used, skip this step. Run the commands below:

#Install OpenLDAP

yum install -y slapd-ssl-config

slapadd -v -l /usr/share/openldap-servers/lico.ldif -f /etc/openldap/slapd.conf -b ${lico_ldap_domain_name}

# set password

# Get the key using the following command and enter <LDAP_PASSWORD> when prompted.

slappasswd

# Edit the file /etc/openldap/slapd.conf and replace the value of rootpw with the key obtained.

rootpw <ENCRYPTED_PASSWORD>

chown -R ldap:ldap /var/lib/ldap

chown ldap:ldap /etc/openldap/slapd.conf


#Edit configuration files

vi /etc/sysconfig/slapd

# Please make sure the next two lines are uncommented

SLAPD_URLS="ldapi:/// ldap:/// ldaps:///"

SLAPD_OPTIONS="-f /etc/openldap/slapd.conf"

#Start OpenLDAP service

systemctl enable slapd

systemctl start slapd

# check service

systemctl status slapd

3.6.2 Installing libuser

The libuser module is a useful toolkit for OpenLDAP. Installing it is optional; however, some commands in this document, such as ‘luseradd’, are implemented by libuser, so installing it is recommended. Run the commands below to install libuser:

yum install -y libuser libuser-python

Configure libuser:

vi /etc/libuser.conf

[import]

login_defs = /etc/login.defs

default_useradd = /etc/default/useradd

[defaults]

crypt_style = sha512

modules = ldap

create_modules = ldap

[userdefaults]

LU_USERNAME = %n

LU_GIDNUMBER = %u

LU_GECOS = %n

# Pay attention to modify option below

LU_HOMEDIRECTORY = /home/%n

LU_SHADOWNAME = %n

LU_SHADOWMIN = 0

LU_SHADOWMAX = 99999


[groupdefaults]

LU_GROUPNAME = %n

[files]

[shadow]

[ldap]

# modify <LDAP_ADDRESS> to management node IP

server = ldap://<LDAP_ADDRESS>

# Pay attention to modify option below

# make sure <DOMAIN> is the same as ${lico_ldap_domain_name} defined in lico_env.local

basedn = <DOMAIN>

userBranch = ou=People

groupBranch = ou=Group

binddn = uid=admin,<DOMAIN>

bindtype = simple

[sasl]

3.6.3 Installing openldap-client

Run the commands below:

echo "TLS_REQCERT never" >> /etc/openldap/ldap.conf

xdcp all /etc/openldap/ldap.conf /etc/openldap/ldap.conf

3.6.4 Installing nss-pam-ldapd

nss-pam-ldapd is a name service switch (NSS) module and pluggable authentication module (PAM). LiCO uses nss-pam-ldapd for user authentication. Install nss-pam-ldapd on the management node by running the commands below:

yum install -y nss-pam-ldapd authconfig

authconfig --useshadow --usemd5 --enablemkhomedir --disablecache --enablelocauthorize --disablesssd --disablesssdauth --enableforcelegacy --enableldap --enableldapauth --disableldaptls --ldapbasedn=${lico_ldap_domain_name} --ldapserver="ldap://${sms_name}" --updateall

echo "rootpwmoddn uid=admin,${lico_ldap_domain_name}" >> /etc/nslcd.conf

#Start management node service

systemctl enable nslcd

systemctl start nslcd


Install nss-pam-ldapd on other nodes, run the commands below:

psh all yum install -y nss-pam-ldapd authconfig

psh all authconfig --useshadow --usemd5 --enablemkhomedir --disablecache --enablelocauthorize --disablesssd --disablesssdauth --enableforcelegacy --enableldap --enableldapauth --disableldaptls --ldapbasedn="${lico_ldap_domain_name}" --ldapserver="ldap://${sms_name}" --updateall

psh all echo "\""rootpwmoddn uid=admin,${lico_ldap_domain_name}"\"" \>\> /etc/nslcd.conf

#Start other node services

psh all systemctl enable nslcd

psh all systemctl start nslcd

3.7 Installing Gmond GPU Plug-In

On all GPU nodes, run the commands below to install:

psh compute yum install -y gmond-ohpc-gpu-module

psh compute "ls /etc/ganglia/conf.d/*.pyconf|grep -v nvidia|xargs rm"

# Start gmond

psh compute systemctl restart gmond
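The ls | grep -v | xargs rm pipeline above keeps only the NVIDIA gmond module's configuration and deletes the rest. A sketch of the same pipeline against a scratch directory standing in for /etc/ganglia/conf.d:

```shell
# Scratch directory standing in for /etc/ganglia/conf.d
d=$(mktemp -d)
touch "$d/cpu.pyconf" "$d/mem.pyconf" "$d/nvidia.pyconf"

# Delete every .pyconf except the nvidia one
ls "$d"/*.pyconf | grep -v nvidia | xargs rm

ls "$d"   # -> nvidia.pyconf
```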


4 Installing LiCO

4.1 List of LiCO Components to be installed

The installation node fields are expressed as follows: H: Management node, L: Login node, C: Compute node

Software Name             Component Name         Version   Service Name          Installation Node   Notes
lico-core                 lico-core              5.1.0     lico                  H
lico-portal               lico-portal            5.1.0                           H,L
lico-core-extend          lico-confluent-proxy   1.0.0                           H
lico-core-extend          lico-vnc-proxy         1.0.0                           H
lico-core-extend          lico-ai-image          1.1.0                           H
lico-core-extend          lico-env               1.0.0     lico-env              H,C,L
lico-core-extend          lico-ai-expert         1.1.0                           C                   Only for AI functions
lico monitor              lico-ganglia-mond      1.0.0     lico-ganglia-mond     H
lico monitor              lico-confluent-mond    1.0.0     lico-confluent-mond   H
lico monitor              lico-vnc-mond          1.0.0     lico-vnc-mond         C                   Install if you need to run VNC
lico alarm notification   lico-sms-agent         1.1.0     lico-sms-agent        L                   Install if you need to send alerts via SMS
lico alarm notification   lico-wechat-agent      1.1.0     lico-wechat-agent     L                   Install if you need to send alerts via WeChat
lico alarm notification   lico-mail-agent        1.2.0     lico-mail-agent       L                   Install if you need to send alerts via email

4.2 Getting the LiCO Installation Package


Obtain the LiCO release package from the Lenovo ESD website (https://lenovoesd.flexnetoperations.com/control/lnvo/login). Contact a Lenovo salesperson for details on how to subscribe and obtain ESD authentication.

The LiCO 5.1.0 release package for EL7 is lico-release-5.1.0.el7.tar.gz. Upload the release package to the management node.

4.3 Configuring the Local Yum Repository for LiCO

Run the commands below to configure the local Yum repository for the management node:

mkdir -p $lico_repo_dir

tar zxvf lico-release-5.1.0.el7.tar.gz -C $lico_repo_dir --strip-components 1

cd $lico_repo_dir

./Makerepo

Run the commands below to configure the local Yum repository for the other nodes:

cp /etc/yum.repos.d/lico-release.repo /var/tmp

sed -i '/baseurl=/d' /var/tmp/lico-release.repo

echo "baseurl=http://${sms_name}/${lico_repo_dir}/RPMS" >> /var/tmp/lico-release.repo

#Distribute repo files

xdcp all /var/tmp/lico-release.repo /etc/yum.repos.d/

4.4 Installing the Management Node

Run the commands below to install the LiCO module on the management node:

yum install -y lico-core lico-mond lico-confluent-proxy lico-ai-expert lico-env lico-ai-image

If you need to provide web service on the management node, run the commands below:

yum install -y lico-portal

If you need to provide email, SMS, and WeChat service on the management node, run the commands below:

#Install email module

yum install -y lico-mail-agent

#Install SMS module

yum install -y lico-sms-agent

#Install WeChat module

yum install -y lico-wechat-agent

If you need to use the VNC component, run the following command:

yum install -y lico-vnc-proxy


4.5 Installing the Login Node

Run the commands below to install the LiCO module on the login node:

psh login yum install -y lico-env

If you need to provide web service on the login node, run the commands below:

psh login yum install -y lico-portal

If you need to provide email, SMS, and WeChat services on the login node, run the commands below:

#Install email module

psh login yum install -y lico-mail-agent

#Install SMS module

psh login yum install -y lico-sms-agent

#Install WeChat module

psh login yum install -y lico-wechat-agent

If you want to provide a basic compiling environment on the login node, run the commands below:

psh login yum groupinstall -y "Development Tools"

psh login yum install -y glibc-devel

Note: This is an optional step. To install these packages successfully, an Internet-based repository may be needed. Setting up a compiling environment is outside the scope of this document; set up the repository according to your network conditions.

4.6 Installing the Compute Node

Run the commands below to install the LiCO module on the compute node:

psh compute yum install -y lico-env lico-ai-expert

If you need to use the VNC component, refer to the appendix Configuring VNC.


5 Configuring LiCO

5.1 Configuring Service Account

On the management node, run the tool lico-passwd-tool. Follow the prompts to enter the usernames and passwords for PostgreSQL, InfluxDB, and Confluent to complete the configuration.

lico-passwd-tool

# Please fill in the following input according to the actual configuration

Please enter the postgres username:

Please enter the postgres password:

Please confirm the postgres password:

Please enter the influxdb username:

Please enter the influxdb password:

Please confirm the influxdb password:

Please enter the confluent username:

Please enter the confluent password:

Please confirm the confluent password:

5.2 Configuring Cluster Nodes

Before using LiCO, follow these steps to import the cluster information into the system. Run the command below:

cp /etc/lico/nodes.csv.example /etc/lico/nodes.csv

Edit the cluster information file:

/etc/lico/nodes.csv

We recommend downloading this file to a local computer and editing it with Excel or other spreadsheet software. When you are finished, upload it to the management node and overwrite the original file. The cluster information file consists of the following six parts:

5.2.1 Room Information

Room Information Table:


Enter only one piece of server room information in the fields below:

name Room Name

location_description Room Description

5.2.2 Logic Group Information

Managers can use logic groups to divide the nodes in the cluster into groups. Logic groups do not impact the use of compute resources or permission configurations. Logic Group Information Table:

Enter at least one logic group in the fields below:

name Logic Group Name

5.2.3 Room Row Information

A room row is a row of racks in the room; enter information for the rack row in which each cluster node is located. Row Information Table:

Enter at least one piece of row information in the fields below:

name Row Name (Cannot be repeated in the same room)

index Row Order (Must be a positive integer and cannot be repeated in the same room)

belonging_room Room Location (Use the name configured in the room information table)

5.2.4 Rack Information

Input rack information for the cluster node location. The rack information table is below:

Enter at least the information of one rack in the fields below:

name Rack Name (Cannot be repeated in the same room)

column Rack Location Column (Must be a positive integer and cannot be repeated in the same row)

belonging_row Rack Location Row Name (Use the name configured in the row information table)

5.2.5 Chassis Information

If there is a chassis in the cluster, enter the chassis information. The chassis information table is below:

Field descriptions are as follows:

name Chassis Name (Cannot be repeated in the same room)

belonging_rack Rack Location Name (Use the name of the configuration in the rack information table.)

location_u_in_rack The location of the chassis base in the rack (Unit: u). In a standard cabinet, the value should be between 1 and 42.

machine_type Chassis Type (Can use model number. See appendix of Chassis Model List).

5.2.6 Node Information

Enter information for all nodes in the cluster into the node information table.

Field descriptions are as follows:

name The node hostname does not need a domain name.

nodetype Node type; choose one of: Head (management node), Login (login node), Compute (compute node)

immip IP address of the node’s BMC system.


hostip IP address of the node on the host network.

machine_type Product name for the node. (For available product names, see appendix Product List).

ipmi_user XCC (BMC) Account for the Node

ipmi_pwd XCC (BMC) Password for the Node

belonging_service_node For large clusters, the service node to which this node belongs. If there is no service node, leave the field blank.

belonging_rack Node Location Rack Name (Add the configuration name to the rack information table)

belonging_chassis Node location chassis name (configure the chassis name in the chassis information table; leave blank if the node is not located in a chassis).

location_u Node location. If the node is in a chassis, enter the chassis slot in which the node is located; if the node is in a rack, enter the position of the node base in the rack (Unit: u).

width Node Width (Full: 1, Half: 0.5)

height Node Height (Unit: u)

groups Logic group names for the node (a node can belong to multiple logic groups; separate group names with ";"). Configure the logic group names in the logic group information table.
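For illustration, entries using the fields above might look like the following (all values are hypothetical; the actual column layout is defined by the node information table shipped with LiCO):

```
name   nodetype  immip       hostip     machine_type  ipmi_user  ipmi_pwd  belonging_service_node  belonging_rack  belonging_chassis  location_u  width  height  groups
head1  head      10.0.10.1   10.0.0.1   sr630         USERID     PASSW0RD                          rack1                              20          1      1       admin
c031   compute   10.0.10.31  10.0.0.31  sd530         USERID     PASSW0RD                          rack1           chassis1           1           0.5    1       compute
```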

5.3 Configuring LiCO Services

The LiCO service configuration file is located in:

/etc/lico/lico.ini

This configuration file controls the operating parameters of the LiCO background service components. Modify it as needed, with reference to the instructions below. If you change the configuration while LiCO is running, restart LiCO for the changes to take effect:

systemctl restart lico


Any setting not covered in the instructions below should only be modified after consulting service staff. Modifications made without such a consultation could prevent the system from running normally.

5.3.1 Infrastructure Configuration

The following parts of the infrastructure configuration are modifiable:

#Cluster domain settings

domain = hpc.com

5.3.2 Database Configuration

The following parts of the database configuration are modifiable:

#PostgreSQL address

db_host = 127.0.0.1

#PostgreSQL port

db_port = 5432

#PostgreSQL database name

db_name = lico

#InfluxDB address

influx_host = 127.0.0.1

#InfluxDB port

influx_port = 8086

#InfluxDB database name

influx_database = lico

5.3.3 Login Configuration

The following parts of the login configuration are modifiable:

If a user fails to log in more than login_fail_max_chance times, the system suspends that user for 45 minutes; a suspended user cannot log in even with valid credentials. An administrator can resume a suspended user from the command line or the web portal; see Resume user or the LiCO Administrator Guide.

#Maximum number of failed login attempts

login_fail_max_chance = 3

5.3.4 Storage Configuration

The following parts of the storage configuration are modifiable:


#Shared storage directory

#If strictly adhering to the shared directory configurations in this document, change

#to: share_dir = /home

share_dir = /home

5.3.5 Scheduler Configuration

The following parts of the scheduler configuration are modifiable:

#The scheduler configuration currently supports Slurm, LSF, and Torque. Slurm is the default.

scheduler_software = slurm

5.3.6 Alert Configuration

The following parts of the alert configuration are modifiable:

#WeChat proxy server address

wechat_agent_url = http://127.0.0.1:18090

#WeChat notification template ID

wechat_template_id = <WECHAT_TEMPLATE_ID>

#SMS proxy server address

sms_agent_url = http://127.0.0.1:18092

#Email proxy server address

mail_agent_url = http://127.0.0.1:18091

These settings only need to be configured if the WeChat, SMS, and email proxy modules are installed in the cluster. Obtain the <WECHAT_TEMPLATE_ID> from the following website: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1445241432

5.3.7 Cluster Configuration

The following parts of the cluster configuration are modifiable:

#Confluent port

confluent_port = 4005

5.3.8 Functional Configuration

The following parts of the functional configuration are modifiable:

[app:django]

#For the functional module used, modify based on the actual module purchased.

#If only using the HPC module, change to: use = hpc

#If only using the AI module, change to: use = ai

#After changing the configuration, you must run lico init to refresh the data tables.

use = hpc+ai


5.4 Configuring LiCO Components

5.4.1 lico-vnc-mond

Create the file /var/tmp/vnc-mond.ini and add the following configuration:

[vnc]

url=http://127.0.0.1:18083/session

timeout=30

Note: Change 127.0.0.1 to the actual IP of the management node. Then distribute the configuration file:

xdcp compute /var/tmp/vnc-mond.ini /etc/lico/vnc-mond.ini

5.4.2 lico-env

To configure the SSH login checks, run the following commands:

psh compute echo "\""auth required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/sshd

psh compute echo "\""account required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/sshd

To configure the su checks, run the following commands:

psh compute echo "\""auth required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/su

psh compute echo "\""account required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/su
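After these commands run, each compute node's /etc/pam.d/sshd and /etc/pam.d/su should end with two lines like the following, with ${sms_name} expanded to your management node's name (mgt01 below is a hypothetical example):

```
auth required pam_python.so pam_lico.py --url=http://mgt01:18080 --timeout=40 --ignore_conn_error
account required pam_python.so pam_lico.py --url=http://mgt01:18080 --timeout=40 --ignore_conn_error
```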

5.4.3 lico-portal

On nodes where the lico-portal module is installed and that must provide external web services, modify the files below so that service ports do not conflict. Edit the file /etc/nginx/nginx.conf and change the port to 8080:

listen 8080 default_server;

listen [::]:8080 default_server;

You can also change the default HTTPS port 443 to another port; modify it in the file /etc/nginx/conf.d/https.conf:

listen <port> ssl http2;

Note: make sure the port is not used by another application and is not blocked by the firewall.


Edit the file /etc/nginx/conf.d/sites-available/antilles.conf and change the first line:

set $lico_host 127.0.0.1;

Change 127.0.0.1 to the management node IP if lico-portal does not run on the management node.

To add custom shortcut links, edit the file /etc/lico/portal.conf; for the configuration format, refer to the file /etc/lico/portal.conf.example.

To hide the server version information, edit the file /etc/nginx/nginx.conf and add server_tokens off in the http block. For example:

http{

......

sendfile on;

server_tokens off;

……

}

5.4.4 lico-ganglia-mond

Edit the file /etc/lico/ganglia-mond.conf: change cfg_db_host 127.0.0.1 and cfg_db_port 5432 to point at the actual PostgreSQL service, and change host 127.0.0.1 and port 8086 to point at the actual InfluxDB service. If you followed this document, both run on the management node with default ports, and the configuration file should look as follows:

influxdb {

cfg_db_host 127.0.0.1

cfg_db_port 5432

cfg_db_name lico

host 127.0.0.1

port 8086

database lico

timeout 10

}

5.4.5 lico-confluent-proxy

Edit /etc/lico/confluent-proxy.ini and change the database section as follows:

[DEFAULT]

# database

db_host = 127.0.0.1

db_port = 5432


db_name = lico

Change db_host = 127.0.0.1 and db_port = 5432 to point at the actual PostgreSQL service. If you followed this document, PostgreSQL is installed on the management node with the default port. If there are multiple Confluent instances in the cluster, you also need to configure the [app:main] section as follows:

[app:main]

use = cluster-confluent-proxy

If you need to change information about the Confluent user, refer to Installing Confluent to create or change the user, and update the information according to the steps in Configuring Service Account.

5.4.6 lico-confluent-mond

Edit the file /etc/lico/confluent-mond.ini: change db_host = 127.0.0.1 and db_port = 5432 to point at the actual PostgreSQL service, and change host = 127.0.0.1 and port = 8086 to point at the actual InfluxDB service. If you followed this document, both are installed on the management node with default ports:

[database]

db_host = 127.0.0.1

db_port = 5432

db_name = lico

[influxdb]

host = 127.0.0.1

port = 8086

database = lico

timeout = 10

5.4.7 lico-wechat-agent

Edit file /etc/lico/wechat-agent as follows:

#The configurations below should be changed based on the specific environment

appid = <APPID>

secret = <SECRET>

For how to obtain <APPID> and <SECRET>, refer to: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1445241432



5.5 Initializing the System

Run the command below to initialize LiCO:

lico init

5.6 Initializing Users

Run the commands below to initialize a LiCO admin user by adding an LDAP user with a username and password. You can change the username and password as needed; if you do not want to use LDAP, skip this step.

luseradd <HPC_ADMIN_USERNAME> -P <HPC_ADMIN_PASSWORD>

psh all "su - <HPC_ADMIN_USERNAME> -c whoami" | xcoll

The luseradd command prompts you for the LDAP administrator password; enter the <LDAP_PASSWORD> you configured in section 3.6.1. Then import the user into LiCO by running the following command:

#Import user into LiCO as admin

lico user_import -u <HPC_ADMIN_USERNAME> -r admin

5.7 Importing System Images

Obtain images from your salesperson and refer to the appendix Import images into LiCO as system level image. Alternatively, you can create images yourself: refer to the appendix Create image, then to the appendix Import images into LiCO as system level image.


6 Starting LiCO

Run the commands below to start LiCO:

#If the management node has to provide web service, start Nginx.

systemctl enable nginx

systemctl start nginx

#If the login node has to provide web service, start Nginx.

psh login systemctl enable nginx

psh login systemctl start nginx

#Start LiCO-related services

systemctl start lico-ganglia-mond

systemctl start lico-confluent-mond

#Start LiCO

systemctl start lico

After LiCO starts, delete the file lico_env.local by running the following command:

rm -rf /root/lico_env.local

After the LiCO service has started, you can access LiCO through a web browser at https://<ip of login node>:<port>/ (the port you set in /etc/nginx/conf.d/https.conf in section 5.4.3). If the installation is correct, you will see the LiCO login page. Use the LDAP account created in section 5.6 to log in to LiCO.


7 Appendix

7.1 Configuring VNC

This module only needs to be installed on compute nodes that require VNC functionality. Run the following commands on those compute nodes:

yum install -y gdm tigervnc tigervnc-server

yum install -y lico-vnc-mond

Edit /etc/gdm/custom.conf on these compute nodes, and make the following change:

[xdmcp]

Enable=true

Run the following commands on these compute nodes to start VNC:

systemctl start lico-vnc-mond

vncserver -query localhost -securitytypes=none

If you need to install on all compute nodes, you can use the batch install command.

# install

psh compute yum install -y lico-vnc-mond

psh compute yum install -y gdm tigervnc tigervnc-server

# Distribution profile

xdcp compute /etc/gdm/custom.conf /etc/gdm/custom.conf

# start

psh compute systemctl start lico-vnc-mond

psh compute vncserver -query localhost -securitytypes=none

7.2 Configuring Confluent web console

To open a node's console from the LiCO web portal, configure that node as described below. After the configuration is complete, restart the node for it to take effect.


7.2.1 RHEL

Edit file /etc/default/grub, and append the following fields to the end of the GRUB_CMDLINE_LINUX line:

console=ttyS0,115200

If the node starts in UEFI mode, run the following command:

grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

If the node starts in legacy mode, run the following command:

grub2-mkconfig -o /boot/grub2/grub.cfg

7.2.2 CentOS

Edit file /etc/default/grub, and append the following fields to the end of the GRUB_CMDLINE_LINUX line:

console=ttyS0,115200

If the node starts in UEFI mode, run the following command:

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

If the node starts in legacy mode, run the following command:

grub2-mkconfig -o /boot/grub2/grub.cfg

7.3 LiCO commands

7.3.1 Set the LDAP administrator password

Note: this command works only when "use_libuser = true" is set in the file lico.ini.

lico setldappasswd

Please input your ldap password:

Please confirm the ldap password:

7.3.2 Change user’s role

lico user_changerole -u <ROLE_USERNAME> -r admin

Parameter interpretation:


-u Specify the username to modify

-r Specify the role to be set (admin/operator/user)

7.3.3 Resume user

lico user_resume <SUSPENDED_USERNAME>

Parameter interpretation:

<SUSPENDED_USERNAME> directly specifies the user to be resumed.

7.3.4 Import user

Refer to: Initializing Users

7.3.5 Import AI image

Refer to: Importing System Images

7.4 Cluster Service Summary

The table below summarizes the cluster services (installation node: H = management node, C = compute node, L = login node):

Software       | Component Name       | Service Name        | Default Port        | Installation Node
---------------|----------------------|---------------------|---------------------|------------------
lico           | lico-core            | lico                | 18080/tcp           | H
lico           | lico-ganglia-mond    | lico-ganglia-mond   | 8661/tcp,8662/tcp   | H
lico           | lico-confluent-proxy |                     | 18081/tcp           | H
lico           | lico-confluent-mond  | lico-confluent-mond |                     | H
lico           | lico-vnc-proxy       |                     | 18082/tcp,18083/tcp | C
lico           | lico-vnc-mond        | lico-vnc-mond       |                     | C
lico           | lico-sms-agent       | lico-sms-agent      | 18092/tcp           | L
lico           | lico-wechat-agent    | lico-wechat-agent   | 18090/tcp           | L
lico           | lico-mail-agent      | lico-mail-agent     | 18091/tcp           | L
lico dependent | nginx                | nginx               | 80/tcp,443/tcp      | L|H
lico dependent | rabbitmq             | rabbitmq-server     | 5672/tcp            | H
lico dependent | postgresql           | postgresql          | 5432/tcp            | H
lico dependent | confluent            | confluent           | 4005/tcp,13001/tcp  | H
lico dependent | influxdb             | influxdb            | 8086/tcp,8088/tcp   | H
ldap           | slapd                |                     | 389/tcp,636/tcp     | H
ldap           | nslcd                |                     |                     | H,C,L
cluster        | nfs                  | nfs                 | 2049/tcp            | H
cluster        | ntp                  | ntpd                |                     | H
cluster        | munge                |                     |                     | H,C
cluster        | slurm                | slurmctld           | 6817/tcp            | H
cluster        | slurm                | slurmd              | 6818/tcp            | C
cluster        | ganglia              | gmond               | 8649/tcp,8649/udp   | H,C,L

7.5 Security improvement

7.5.1 Binding setting

If you installed the system following this document, some components listen on all addresses by default. To improve system security, we recommend changing these default settings.

RabbitMQ

We recommend binding to the loopback address (127.0.0.1). Edit /etc/rabbitmq/rabbitmq.config and remove {"::1", 5672}, for example:

[

{

rabbit,

[

{

tcp_listeners, [{"127.0.0.1", 5672}]

}

]

}

]

PostgreSQL binds to the loopback address (127.0.0.1) by default. We do not recommend changing this default.

Confluent binds to the loopback address (127.0.0.1) by default. We do not recommend changing this default.

InfluxDB: we recommend binding to the loopback address (127.0.0.1). Edit /etc/influxdb/config.toml, uncomment the line #bind-address = ":8086" in the [http] part, and change it to bind-address = "127.0.0.1:8086", for example:

[http]


#Determines whether HTTP endpoint is enabled.

#enabled = true

#The bind address used by the HTTP service.

bind-address = "127.0.0.1:8086"

lico-core: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-core, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/antilles.ini and, in the command parameter of the antilles program, change "--bind :18080" to "--bind <INTERNAL_IP>:18080", for example:

[program:antilles]

command=/usr/bin/gunicorn --paste /etc/lico/confluent-proxy.ini --bind 172.20.0.14:18080 --log-config /etc/lico/confluent-proxy.ini --workers 1 --threads 50 --timeout 3600 --worker-class gevent --keep-alive 65 --log-level info --access-logfile - --error-logfile - --capture-output

lico-ganglia-mond: the default setting only trusts the loopback address (127.0.0.1). We do not recommend changing this default.

lico-confluent-proxy: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-confluent-proxy, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/confluent-proxy.ini and, in the command parameter of the confluent_proxy program, change "--bind :18081" to "--bind <INTERNAL_IP>:18081", for example:

[program:confluent_proxy]

command=/usr/bin/gunicorn --paste /etc/lico/confluent-proxy.ini --bind 172.20.0.14:18081 --log-config /etc/lico/confluent-proxy.ini --workers 1 --threads 50 --timeout 3600 --worker-class gevent --keep-alive 65 --log-level info --access-logfile - --error-logfile - --capture-output

lico-vnc-proxy: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-vnc-proxy, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/vncproxy.ini and, in the command parameter of the vncproxy program, change "--bind :18083" to "--bind <INTERNAL_IP>:18083"; the IP in the websockify parameter "--token-source" also needs to be changed to <INTERNAL_IP>, for example:

[program:vncproxy]

command=/usr/bin/gunicorn --paste /etc/lico/vnc-proxy.ini --bind 172.20.0.14:18083 --log-config /etc/lico/vnc-proxy.ini --workers 1 --timeout 3600 --worker-class gevent --keep-alive 65 --log-level info --access-logfile - --error-logfile - --capture-output

......

[program:websockify]

command=/usr/bin/websockify 18082 --token-plugin=JSONTokenApi --token-source='http://172.20.0.14:18083/lookup?token=%s'


lico-wechat-agent: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-wechat-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-wechat-agent and, in GUNICORN_CMD_ARGS, change "--bind :18090" to "--bind <INTERNAL_IP>:18090", for example:

# lico-wechat-agent environment file

GUNICORN_CMD_ARGS= \

--bind 172.20.0.14:18090 \

--log-config /etc/lico/wechat-agent.ini \

--workers 1 \

--threads 4 \

--worker-class gevent \

--timeout 3600 \

--keep-alive 65 \

--log-level info \

--access-logfile - \

--error-logfile - \

--capture-output True

lico-mail-agent: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-mail-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-mail-agent and, in GUNICORN_CMD_ARGS, change "--bind :18091" to "--bind <INTERNAL_IP>:18091", for example:

# lico-mail-agent environment file

GUNICORN_CMD_ARGS= \

--bind 172.20.0.14:18091 \

--log-config /etc/lico/mail-agent.ini \

--workers 1 \

--threads 4 \

--worker-class gevent \

--timeout 3600 \

--keep-alive 65 \

--log-level info \

--access-logfile - \

--error-logfile - \

--capture-output True

lico-sms-agent: we recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-sms-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-sms-agent and, in GUNICORN_CMD_ARGS, change "--bind :18092" to "--bind <INTERNAL_IP>:18092", for example:

# lico-sms-agent environment file

GUNICORN_CMD_ARGS= \

--bind 172.20.0.14:18092 \

--log-config /etc/lico/sms-agent.ini \

--workers 1 \

--timeout 3600 \

--keep-alive 65 \

--log-level info \

--access-logfile - \

--error-logfile - \

--capture-output True

7.5.2 Firewall setting

For system security, we recommend enabling the firewall on the management node and login nodes. If you set up the cluster and installed LiCO following this document, you can follow the steps below; otherwise, we recommend following the official firewalld documentation: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-configuring_firewalld

Run the commands below to install and enable the firewall:

yum install -y firewalld

systemctl enable firewalld

systemctl start firewalld

Management node

Run the commands below to add rules to the public zone:

#Add SSH service port

firewall-cmd --zone=public --add-port=22/tcp --permanent

#Add httpd service port

firewall-cmd --zone=public --add-port=80/tcp --permanent

#Add NFS service port

firewall-cmd --zone=public --add-port=2049/tcp --permanent

#Add Ganglia gmond port

firewall-cmd --zone=public --add-port=8649/udp --permanent

#Add Slurm slurmctld port

firewall-cmd --zone=public --add-port=6817/tcp --permanent

#Add OpenLDAP slapd port

firewall-cmd --zone=public --add-port=636/tcp --permanent


firewall-cmd --zone=public --add-port=389/tcp --permanent

#Add lico-confluent-proxy port

firewall-cmd --zone=public --add-port=18081/tcp --permanent

#Add lico-core port

firewall-cmd --zone=public --add-port=18080/tcp --permanent

#Add TensorBoard random binding port range

firewall-cmd --zone=public --add-port=20000-25000/tcp --permanent

Run the commands below to add the internal and external network interfaces into the public zone:

firewall-cmd --zone=public --add-interface=eth0 --permanent

firewall-cmd --zone=public --add-interface=eth1 --permanent

Note: eth0 and eth1 should be your internal and external network interfaces. Run the command below to apply the rules:

firewall-cmd --complete-reload

Login node

Run the commands below to add rules to the public zone:

#Add SSH service port

firewall-cmd --zone=public --add-port=22/tcp --permanent

#Add Nginx service port, you can adjust 8443 to your setting

firewall-cmd --zone=public --add-port=8443/tcp --permanent

Run the commands below to add the internal and external network interfaces into the public zone:

firewall-cmd --zone=public --add-interface=eth0 --permanent

firewall-cmd --zone=public --add-interface=eth1 --permanent

Note: eth0 and eth1 should be your internal and external network interfaces. Run the command below to apply the rules:

firewall-cmd --complete-reload

7.6 slurm.conf

The following describes the modifications to slurm.conf.

Cluster Name:

ClusterName=mycluster

Management Node Name:

ControlMachine=c031

GPU Scheduling: this entry is used when there are GPU nodes in the cluster. If there are no GPU nodes, delete this entry.


GresTypes=gpu,gpu_mem

Cluster Node Definitions: NodeName is the node name. Gres gives the number of GPUs and the GPU memory size of each node (if the node is not a GPU node, delete the Gres content). CPUs is the number of CPUs in the node. RealMemory is the node memory size (Unit: MB).

NodeName=c031 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN

NodeName=c032 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN

Partition Definitions: PartitionName is the name of the partition. Nodes lists the nodes in the partition. Default indicates whether this is the default partition; when a user submits a job without designating a partition, the default partition is used.

PartitionName=compute Nodes=c0[31-32] Default=YES MaxTime=INFINITE State=UP

PartitionName=compute1 Nodes=c0[31-32] Default=NO MaxTime=INFINITE State=UP
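Putting the entries above together, a minimal slurm.conf sketch for this hypothetical two-node GPU cluster would contain the following (only the entries discussed here; a working file requires additional settings):

```
ClusterName=mycluster
ControlMachine=c031
GresTypes=gpu,gpu_mem
NodeName=c031 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN
NodeName=c032 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN
PartitionName=compute Nodes=c0[31-32] Default=YES MaxTime=INFINITE State=UP
```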

EnforcePartLimits Definitions: If you want a job that requests more resources than the cluster provides to fail immediately with an error instead of remaining in the queue, use the configuration below:

EnforcePartLimits=ALL

For more details about configuring slurm.conf, refer to the official Slurm site:

https://slurm.schedmd.com/slurm.conf.html

7.7 gres.conf

This file describes the GPUs and GPU memory installed on the GPU nodes. The content of gres.conf may vary between GPU nodes. The "Count" attribute of the "gpu_mem" entry is the amount of GPU memory per GPU (Unit: MB). For example:

Name=gpu File=/dev/nvidia[0-3]

Name=gpu_mem Count=10000

7.8 Chassis Model List

Model Code | Model        | Number of Slots
d2         | D2 Enclosure | 4


7.9 Product List

Product Name | Corresponding Machine | Form Factor
sd530        | SD530                 | 0.5U (rack form factor)
sr630        | SR630                 | 1U
sr650        | SR650                 | 2U

7.10 Import system image

System-level container images can be used by all the users in the cluster. The following steps show how to create and import system-level container images.

7.10.1 Create image

LiCO is released with image bootstrap files for commonly used AI frameworks. A bootstrap file is similar to a Dockerfile; you can use these files to create images. They are located under /opt/lico/examples/image/ on the management node. The bootstrap files are:

Framework   | Version | CPU/GPU  | Comments
Caffe       | 1.0     | CPU      |
Caffe       | 1.0     | CUDA 9.1 | Supports P100 and V100. Caffe does not officially support CUDA 9, so we changed the Caffe makefile.
TensorFlow  | 1.6     | CPU      |
TensorFlow  | 1.6     | CUDA 9.0 | Supports P100 and V100. TensorFlow does not officially support CUDA 9.1, so we use CUDA 9.0.
Neon        | 2.4     | CPU      |
Intel-Caffe | 1.0.4   | CPU      |
MXNet       | 1.1     | CPU      |
MXNet       | 1.1     | CUDA 9.0 | Supports P100 and V100. MXNet does not officially support CUDA 9.1, so we use CUDA 9.0.

Note: If there are no GPU nodes in the cluster, you can only create CPU images.

Note: The GPU driver version of the cluster nodes should be 390.46.

Prepare one build node with at least 100 GB of free storage and Internet access. This node should have the same OS version and the same Singularity version (2.4, https://github.com/singularityware/singularity/releases/tag/2.4) as the nodes in the cluster. If you want to create GPU images, this node should also have the same GPU and GPU driver as the nodes in the cluster. Copy the bootstrap files from the management node to this build node, for example to a new directory /opt/images (Note: this directory and /var/tmp cannot be an NFS mount), and build the images. Ensure that squashfs-tools is installed.

cd /opt/images/

singularity build caffe-1.0-cpu.image caffe/caffe-1.0-cpu

singularity build caffe-1.0-gpu-cuda91.image caffe/caffe-1.0-gpu-cuda91

singularity build tensorflow-1.6-cpu.image tensorflow/tensorflow-1.6-cpu

singularity build tensorflow-1.6-gpu-cuda90.image tensorflow/tensorflow-1.6-gpu-cuda90

singularity build mxnet-1.1-cpu.image mxnet/mxnet-1.1-cpu

singularity build mxnet-1.1-gpu-cuda90.image mxnet/mxnet-1.1-gpu-cuda90

singularity build intel-caffe-1.0.4-cpu.image intel-caffe/intel-caffe-1.0.4-cpu

singularity build neon-2.4-cpu.image neon/neon-2.4-cpu

7.10.2 Import images into LiCO as system level image

Copy the created images to the management node, for example to the directory /opt/images (Note: this directory and /var/tmp cannot be an NFS mount). Then, as the root user, run the following commands to import the images into LiCO.

cd /opt/images

lico import_system_image caffe-cpu $PWD/caffe-1.0-cpu.image singularity caffe

lico import_system_image caffe-gpu $PWD/caffe-1.0-gpu-cuda91.image singularity caffe

lico import_system_image tensorflow-cpu $PWD/tensorflow-1.6-cpu.image singularity tensorflow

lico import_system_image tensorflow-gpu $PWD/tensorflow-1.6-gpu-cuda90.image singularity tensorflow

lico import_system_image mxnet-cpu $PWD/mxnet-1.1-cpu.image singularity mxnet

lico import_system_image mxnet-gpu $PWD/mxnet-1.1-gpu-cuda90.image singularity mxnet

lico import_system_image intel-caffe $PWD/intel-caffe-1.0.4-cpu.image singularity intel-caffe

lico import_system_image neon $PWD/neon-2.4-cpu.image singularity neon

7.11 Troubleshooting Slurm issues

Use the Slurm command sinfo to check the node status.

If the node status is drain, you can change the node status back to normal with the command scontrol update NodeName=host1 State=RESUME.

If the node status is down:

--Use the Slurm command scontrol show nodes to see detailed node information; the reason appears in the output of this command.

--Check whether all the nodes have the same slurm.conf file under /etc/slurm.

--Check whether the slurmd and munge services are active on all the nodes, and whether the slurmctld service is active on the management node.

--Check whether all the nodes have the same date and whether the ntpd service is active on all the nodes.

If you see the following warning text when using srun/prun to run an MPI program:

Failed to create a completion queue (CQ):

……

Error: Cannot allocate memory

Please check whether soft memlock and hard memlock are set to unlimited in the file /etc/security/limits.conf on the management node and compute nodes. If not, set them to unlimited and restart the nodes for the change to take effect:

* soft memlock unlimited

* hard memlock unlimited
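The presence of both settings can be verified with grep. The sketch below writes to a temporary file standing in for /etc/security/limits.conf so it is safe to run anywhere; on a live node, grep the real file and also check the effective limit with ulimit -l.

```shell
limits_file=$(mktemp)   # stand-in for /etc/security/limits.conf
cat << 'eof' >> "$limits_file"
* soft memlock unlimited
* hard memlock unlimited
eof

# Both the soft and the hard line should be present (count of 2).
count=$(grep -c 'memlock unlimited' "$limits_file")
echo "$count"
rm -f "$limits_file"
```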

7.12 Update OS packages

Check the latest version for CentOS/RHEL 7.4 on the web site http://mirror.centos.org/centos-7/. The steps below assume the latest version is 7.4.1708.

1. Prepare packages: For Red Hat Enterprise Linux, if you have a subscription, update packages from the Red Hat repository. For CentOS, prepare one CentOS 7.4 node that can access the Internet, then run the commands below to create the update package repository.

centos7_4_latest_version=7.4.1708

cat << eof > /etc/yum.repos.d/centos7_4_update.repo

[centos7_4_update]

name=centos7_4_update

baseurl=http://mirror.centos.org/centos/$centos7_4_latest_version/updates/x86_64/

mirrorlist=http://mirrorlist.centos.org/?release=$centos7_4_latest_version&arch=x86_64&repo=updates

gpgcheck=0

enabled=1

eof
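Note that the heredoc delimiter eof above is unquoted, so $centos7_4_latest_version is expanded when the repo file is written. A minimal demonstration of this behavior (file name and version value are illustrative only):

```shell
ver=7.4.1708
tmp=$(mktemp)
# Unquoted delimiter: $ver is expanded before the line is written to the file.
cat << eof > "$tmp"
baseurl=http://mirror.centos.org/centos/$ver/updates/x86_64/
eof
content=$(cat "$tmp")
echo "$content"
rm -f "$tmp"
```

Had the delimiter been quoted ('eof'), the literal string $ver would have been written instead, producing a broken baseurl.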

yum install -y createrepo

yum install -y yum-utils

mkdir -p /opt/update

cd /opt/update

reposync --download-metadata -r centos7_4_update -e ./cache -n -a x86_64 -d

createrepo .

rm -rf cache

tar -zcf update.tgz centos7_4_update repodata
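As a quick sanity check before uploading, the tarball should contain the two top-level entries archived above. The sketch below recreates the layout in a temporary directory standing in for /opt/update; on the real node, the archive also contains the packages synced by reposync.

```shell
workdir=$(mktemp -d)    # stand-in for /opt/update
mkdir -p "$workdir/centos7_4_update" "$workdir/repodata"
( cd "$workdir" && tar -zcf update.tgz centos7_4_update repodata )

# Both directories should appear in the archive listing.
listing=$(tar -tzf "$workdir/update.tgz" | sort -u)
echo "$listing"
rm -rf "$workdir"
```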

2. Update packages


Run the following command on the management node:

mkdir -p /install/custom/update

Upload the created update.tgz file to /install/custom/update on the management node, then run the following commands on the management node.

tar -xf update.tgz -C /install/custom/update

cat << eof > /etc/yum.repos.d/centos7_4_update.repo

[centos7_4_update]

name=centos7_4_update

baseurl=http://${sms_name}/install/custom/update

gpgcheck=0

enabled=1

eof

xdcp all /etc/yum.repos.d/centos7_4_update.repo /etc/yum.repos.d/centos7_4_update.repo

Run the below command on the management node to update package.

yum -y update --skip-broken

psh all yum -y update --skip-broken

7.13 Using a newer kernel with RETPOLINE support

If an updated kernel that has RETPOLINE support enabled is to be used on the system (for example, a kernel with mitigations for the Spectre/Meltdown security vulnerabilities), then in addition to the kernel update, the toolchain must be updated as well so that the NVIDIA driver can build against this kernel. glibc should also be updated. The following minimum update levels for the kernel, toolchain, and glibc include this support: 1. For RHEL:

https://access.redhat.com/errata/RHSA-2018:0395 https://access.redhat.com/errata/RHBA-2018:0408 https://access.redhat.com/errata/RHBA-2017:3296

2. For CentOS: https://lists.centos.org/pipermail/centos-announce/2018-March/022768.html https://lists.centos.org/pipermail/centos-announce/2018-March/022789.html https://lists.centos.org/pipermail/centos-announce/2017-December/022650.html

Make sure to set up and enable a yum repository for these packages before the steps in section 2.2 of this document for the management node, and before the steps in section 2.3.4 for the compute and managed nodes. This can be done as follows:

mkdir -p /install/custom/retpoline/RPMS

Place the RPMs from all of the above links for RHEL or CentOS in /install/custom/retpoline/RPMS.

cd /install/custom/retpoline/

yum install -y createrepo

createrepo RPMS

cat << eof > /etc/yum.repos.d/retpoline.repo

[retpoline]

name=retpoline

baseurl=file:///install/custom/retpoline/RPMS

gpgcheck=0

enabled=1

eof

cat << eof > /var/tmp/retpoline.repo

[retpoline]

name=retpoline

baseurl=http://${sms_name}/install/custom/retpoline/RPMS

gpgcheck=0

enabled=1

eof

xdcp all /var/tmp/retpoline.repo /etc/yum.repos.d/retpoline.repo

After this, install the new kernel on the management, compute, and login nodes, and reboot those nodes. Management node, before section 2.2:

yum update kernel

reboot

Compute and Login Nodes, before section 2.3.4:

psh all yum update -y kernel

psh all reboot
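After the reboot, it is worth confirming that the running kernel is at or above the required errata level. The sketch below compares two version strings with sort -V; both strings are hard-coded examples here (the required value is illustrative, not taken from this document), and on a live node you would set running=$(uname -r) and run psh all uname -r to check the other nodes.

```shell
# Example version strings only; on a real node use: running=$(uname -r)
required=3.10.0-693.21.1.el7.x86_64
running=3.10.0-862.el7.x86_64

# sort -V orders version strings numerically; if `required` sorts first
# (or the strings are equal), the running kernel is new enough.
lowest=$(printf '%s\n%s\n' "$required" "$running" | sort -V | head -n1)
result=$([ "$lowest" = "$required" ] && echo "kernel OK" || echo "kernel too old")
echo "$result"
```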