To set up an HPC cluster, you should always get the hardware ready first.
Here we demonstrate the setup of a small cluster with one head node and three child nodes. For simplicity we will assume the CPU architecture is x64, though a similar setup also works on ARM nodes such as the Raspberry Pi.
This setup scales to as many nodes as you have. You may also attach a separate network storage device to the system for large data I/O; here we will use the head node to act as the storage node.
It is far easier to have all cluster nodes run the same OS. To save the resources otherwise spent on graphics, I highly recommend Ubuntu Server for the child nodes. If you are comfortable with SSH and a CLI Linux environment, install Ubuntu Server on the head node as well; otherwise you may choose an Ubuntu Desktop with the same release number as the child nodes. This keeps the dependency environment identical across the whole cluster, so software can be installed consistently on all nodes.
Both GUI and CLI work the same afterwards, as long as all Slurm setup is done in a CLI environment. Linux newbies may feel more comfortable with the GUI version, where file editing can be done without knowledge of CLI editors like Nano or Vim. Under GUI Linux you can call up the CLI terminal with Ctrl + Alt + T.
It is essential to keep all nodes' IP addresses constant over time to guarantee stable communication between nodes. On most modern routers, you can log in to the admin interface and bind each device's IP to its MAC address under DHCP.
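If your router cannot bind IPs by MAC address, you can instead pin a static IP with netplan on each Ubuntu node. A minimal sketch, assuming the interface is named eth0, the node should take 192.168.0.101/24, and the gateway/DNS is 192.168.0.1 (the file name is arbitrary; adjust all values to your LAN):
sudo vim /etc/netplan/01-cluster.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: no
      addresses: [192.168.0.101/24]
      gateway4: 192.168.0.1
      nameservers:
        addresses: [192.168.0.1]
sudo netplan apply # activate the configuration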
Slurm expects hosts to be named with a specific pattern: <nodename><nodenumber>. When choosing hostnames for the nodes, it is convenient to name them systematically in order (e.g. node01, node02, node03, node04, …).
Now we may set up the hostname:
sudo hostname node01 # whatever name you chose
sudo vim /etc/hostname # change the hostname here too
sudo vim /etc/hosts # change the hostname to "node01"
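On systemd-based Ubuntu, hostnamectl can set the hostname and update /etc/hostname in one step (you still need to edit /etc/hosts by hand):
sudo hostnamectl set-hostname node01 # persists across reboots
hostnamectl # verify the change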
Node communication requires accurate time synchronization. The ntpdate package will periodically synchronize the OS time in the background.
sudo apt-get install ntpdate -y
sudo reboot
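To verify the clock afterwards, you can query a time server without changing anything; the reported offset should be a small fraction of a second:
ntpdate -q pool.ntp.org # query only, prints the current offset
timedatectl # shows whether the system clock is synchronized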
Repeat the procedure on all nodes, giving each a different node number.
The storage node is one of the three key components of an HPC cluster. For software and data to run on any node in the cluster, every node must be able to access the same files. In a large-scale cluster there is often a dedicated node for storage.
In this mini setup we will use the head node as the storage node. A specific folder will be exported as a network file system (NFS) and mounted on all nodes. If you have a separate network-attached storage (NAS) device, you may mount that on all nodes as NFS instead.
sudo mkdir /clusterfs #create NFS directory at /clusterfs
sudo chown -R nobody:nogroup /clusterfs #/clusterfs now owned by pseudo user
sudo chmod -R 777 /clusterfs #R/W permission for all users on the NFS directory
sudo apt install nfs-kernel-server -y
Edit /etc/exports by adding:
/clusterfs <ip-address>/24(rw,sync,no_root_squash,no_subtree_check)
where <ip-address> is the IP of the head node. You may check it in the router interface or via ifconfig. This setting allows any client in the same /24 subnet to mount the shared directory. For example, if the LAN address is 192.168.0.123, you will have
/clusterfs 192.168.0.123/24(rw,sync,no_root_squash,no_subtree_check)
rw: provides clients read/write access
sync: forces changes to be written on each transaction
no_root_squash: enables the root users of clients to write files with root permissions
no_subtree_check: prevents errors caused by a file being changed while another system is using it
Then export the share:
sudo exportfs -a
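You can confirm the export took effect from the head node itself (showmount is installed with the NFS server packages above):
sudo exportfs -v # lists active exports and their options
showmount -e localhost # should list /clusterfs for your subnet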
Now we have exported the NFS directory from the head node to the network. Each child node needs to mount it so that the cluster works with a single shared directory. Repeat the following procedure on all child nodes:
sudo apt install nfs-common -y
sudo mkdir /clusterfs #create NFS directory at /clusterfs
sudo chown -R nobody:nogroup /clusterfs #/clusterfs now owned by pseudo user
sudo chmod -R 777 /clusterfs #R/W permission for all users on the NFS directory
Edit /etc/fstab by adding:
<head-node-ip>:/clusterfs /clusterfs nfs defaults 0 0
This line mounts the shared directory on the head node to the “local” folder /clusterfs as NFS.
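Optionally, if a child node may boot while the head node is still down, adding the standard nofail mount option keeps the boot from hanging on the missing NFS share:
<head-node-ip>:/clusterfs /clusterfs nfs defaults,nofail 0 0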
sudo mount -a
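After mounting, a quick sanity check of the shared directory (assuming node01 is the head node and node02 a child): create a file on one node and look for it on another:
touch /clusterfs/hello.txt # run on node01
ls -l /clusterfs/hello.txt # run on node02: the same file appears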
Once you create a file in any node’s /clusterfs, it will be readable and writable on all other nodes.
You will need a NIS server in order to synchronize user accounts across the cluster network.
sudo apt-get install nis
sudo vim /etc/default/nis
Change the line
NISSERVER=master
sudo vim /etc/ypserv.securenets
Change the line
# This line gives access to everybody. PLEASE ADJUST!
# comment out
# 0.0.0.0 0.0.0.0
# add to the end: IP range you allow to access
255.255.255.0 10.0.0.0
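The two fields are a netmask followed by a network address. If your LAN is 192.168.0.0/24, as in the NFS example above, the line becomes:
255.255.255.0 192.168.0.0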
sudo vim /var/yp/Makefile
Change the lines
# line 52: change
MERGE_PASSWD=true
# line 56: change
MERGE_GROUP=true
sudo vim /etc/hosts
Add your own IP address and hostname for NIS (the example below uses the host dlp.srv.world at 10.0.0.30; substitute your head node's values):
127.0.0.1 localhost
# add own IP address for NIS
10.0.0.30 dlp.srv.world dlp
sudo /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. dlp.srv.world is in the list of NIS server hosts. Please continue to add
the names for the other hosts, one per line. When you are done with the
list, type a <control D>.
next host to add: dlp.srv.world
next host to add: # press Ctrl + D
The current list of NIS servers looks like this:
dlp.srv.world
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/srv.world/ypservers...
Running /var/yp/Makefile...
make[1]: Entering directory '/var/yp/srv.world'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating hosts.byname...
Updating hosts.byaddr...
Updating rpc.byname...
Updating rpc.bynumber...
Updating services.byname...
Updating services.byservicename...
Updating netid.byname...
Updating protocols.bynumber...
Updating protocols.byname...
Updating netgroup...
Updating netgroup.byhost...
Updating netgroup.byuser...
Updating shadow.byname... Ignored -> merged with passwd
make[1]: Leaving directory '/var/yp/srv.world'
dlp.srv.world has been set up as a NIS master server.
Now you can run ypinit -s dlp.srv.world on all slave server.
sudo systemctl restart nis
cd /var/yp
make
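From now on, whenever you add or change a user on the head node, rebuild the NIS maps so the change reaches the clients. A sketch with a hypothetical user alice:
sudo adduser alice # create the account on the NIS master
cd /var/yp && sudo make # rebuild and push the NIS maps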
Now configure each child node as an NIS client. Repeat the following on every child node:
sudo apt-get install nis
sudo vim /etc/yp.conf
Change the line
# ypserver ypserver.network.com
# add to the end: [domain name] [server] [NIS server's hostname]
domain srv.world server dlp.srv.world
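The client must also know its NIS domain name; on Debian/Ubuntu the nis package asks for it during installation and stores it in /etc/defaultdomain. Verify that it matches the server's domain:
cat /etc/defaultdomain # should print srv.world
nisdomainname # shows the NIS domain currently in effect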
sudo vim /etc/nsswitch.conf
Change the lines
# line 7: add like follows
passwd: compat systemd nis
group: compat systemd nis
shadow: compat nis
gshadow: files
hosts: files dns nis
sudo vim /etc/pam.d/common-session
Add the following line to the end so that a home directory is created automatically at first login:
session optional pam_mkhomedir.so skel=/etc/skel umask=077
sudo systemctl restart rpcbind nis
Log out and log back in to check that NIS works. You can then manage your NIS password:
yppasswd
Changing NIS account information for bionic on dlp.srv.world.
Please enter old password:
Changing NIS password for bionic on dlp.srv.world.
Please enter new password:
Please retype new password:
The NIS password has been changed on dlp.srv.world.
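Two more quick checks from any client: ypwhich prints the NIS server the client is bound to, and ypcat dumps a map so you can confirm that accounts are being served:
ypwhich # should print dlp.srv.world
ypcat passwd # lists the user accounts served over NIS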
The groundwork is now done. Next, we will install the job scheduler Slurm.