Running on multiple nodes (cluster/zmq)#
If you have an HPC cluster using the SLURM scheduler, you are on the wrong page and should go to the SLURM page instead. This page documents how to run the engine on a bare cluster of Linux machines without any scheduler.
Pre-requisites#
Network and security considerations#
The worker nodes should be isolated from the external network using either a dedicated internal network or a firewall. Additionally, access to the DbServer ports should be limited (again by internal LAN or firewall) so that external traffic is excluded.
The following ports must be open on the master node:
1908 for DbServer (or any other port allocated for the DbServer in the openquake.cfg)
1912-1920 for ZeroMQ receivers
8800 for the API/WebUI (optional)
The following port must be open on the worker nodes:
1909 for the ZeroMQ workerpools
The master node and the worker nodes must be able to communicate on the specified ports.
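One way to verify that the nodes can reach each other on these ports is a plain TCP connection test. The sketch below is not part of the engine: the helper names `port_is_open` and `expand_port_range` are hypothetical, and a port only shows as open while some process is actually listening on it.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, unreachable or timed out.
        return False


def expand_port_range(spec):
    """Expand a 'first-last' string such as '1912-1920' into a list of ints."""
    first, last = (int(part) for part in spec.split("-"))
    return list(range(first, last + 1))
```

For instance, calling `port_is_open(master_ip, 1908)` from a worker should return `True` once the DbServer is running on the master. A `False` result only says that the connection failed; it does not distinguish a firewall rule from a service that has not been started.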
Moreover, the user openquake on the master must be able to access the workers via ssh. This means that you have to generate and copy the ssh keys properly, and the first time you must connect to the workers manually. Then the engine will be able to start and stop zworker processes at each new calculation.
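Whether key-based login already works can be checked non-interactively. The following is a minimal sketch, assuming the OpenSSH client is installed on the master; the function names are hypothetical and are not engine APIs. With `BatchMode=yes` ssh fails instead of prompting, so a worker that was never connected to manually (unknown host key) is reported as not reachable.

```python
import subprocess


def ssh_check_command(host, user="openquake", timeout=5):
    """Build an ssh command that fails instead of prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt interactively
        "-o", f"ConnectTimeout={timeout}",  # give up quickly on dead hosts
        f"{user}@{host}",
        "true",                             # trivial remote command
    ]


def can_ssh(host, user="openquake"):
    """Return True if a non-interactive ssh login to host succeeds."""
    result = subprocess.run(
        ssh_check_command(host, user),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Run as user openquake on the master, `can_ssh(worker_ip)` should return `True` for every worker listed in the configuration before the first calculation is started.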
Storage requirements#
Storage requirements depend heavily on the type of calculations you want to run. A worker node needs only enough space for the operating system, the logs and the OpenQuake installation: less than 20 GB is usually enough. Workers can also be diskless (using iSCSI or NFS, for example).
On the master node you will also need space for:
the shared_dir directory (usually located under /home): it contains the calculations datastore (hdf5 files located in the oqdata folder)
the OpenQuake database (located under /var/lib/openquake/oqdata/): it contains only logs and metadata, the expected size is tens of megabytes
the temporary folder (/tmp). A different temporary folder can be customized via the openquake.cfg
On large installations we strongly suggest creating a separate partition for /home.
Swap partitions#
Having swap active on resources dedicated to the OpenQuake Engine is strongly discouraged because of the performance penalty when it is used. Swapping will likely increase the time required to complete the job by many orders of magnitude, leaving the job effectively stuck. It is much better to get a MemoryError and then reduce the size of the job.
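On Linux, whether swap is configured on a node can be read from `/proc/meminfo`. The helper below is an illustrative sketch with a hypothetical name, not an engine command; it parses the `SwapTotal` line, which the kernel reports in kB.

```python
def swap_total_kb(meminfo_text):
    """Return the SwapTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like 'SwapTotal:  2097148 kB'.
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")
```

Passing the contents of `/proc/meminfo` to `swap_total_kb` gives 0 when no swap is configured on the node, and a positive value otherwise.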
Installation#
Please use the Universal installation script in server mode or devel_server mode. The installer will save the Python code in the folder /opt/openquake/venv. Since /opt/openquake is exported to the workers via NFS, there will be no need to install anything on the worker nodes except Python.
OpenQuake Engine ‘master’ node configuration File#
Enable zmq distribution#
The following file (on all nodes) should be modified to enable zmq support:
/opt/openquake/openquake.cfg
[distribution]
# enable zmq only if you have a cluster
oq_distribute = zmq
[dbserver]
file = /var/lib/openquake/oqdata/db.sqlite3
# address of the dbserver
# on multi-node cluster it must be the IP or hostname
# of the master node (on the master node cfg too)
host = < IP address of master>
port = 1908
receiver_ports = 1912-1920
authkey = somethingstronger
[zworkers]
host_cores = < IP address of worker1> -1, < IP address of worker2> -1
ctrl_port = 1909
Notice that the -1 in < IP address of worker1> -1 means that all the cores in that worker will be used. You can use a number between 0 and the maximum number of available cores to limit the resource usage. The engine will automatically start and stop zmq processes on the worker nodes at each new calculation, provided the user openquake has ssh access to the workers. Please note that you must explicitly list the workers that you want to use.
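To make the `host_cores` format concrete, the sketch below reads a configuration of the shape shown above with Python's standard `configparser` and splits the setting into (host, cores) pairs. It illustrates the format only and is not how the engine itself parses the file; the IP addresses, the value 16 and the helper `parse_host_cores` are assumptions made for the example.

```python
import configparser

# Example values only; replace them with the addresses of your own nodes.
EXAMPLE_CFG = """
[distribution]
oq_distribute = zmq

[dbserver]
host = 192.168.2.10
port = 1908
receiver_ports = 1912-1920

[zworkers]
host_cores = 192.168.2.1 -1, 192.168.2.2 16
ctrl_port = 1909
"""


def parse_host_cores(value):
    """Split 'host cores, host cores, ...' into (host, cores) tuples."""
    pairs = []
    for item in value.split(","):
        host, cores = item.split()
        pairs.append((host, int(cores)))
    return pairs


cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE_CFG)
workers = parse_host_cores(cfg["zworkers"]["host_cores"])
for host, cores in workers:
    label = "all cores" if cores == -1 else f"{cores} cores"
    print(f"{host}: {label}")
```

With the example values, the first worker uses all of its cores and the second is limited to 16.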
NB: when using the zmq mechanism you should not touch the parameter serialize_jobs and keep it at its default value of true.
Configuring daemons#
The required systemd services are installed by the universal installer in the folder /etc/systemd/system/.
Master node#
OpenQuake Engine DbServer - openquake-dbserver.service
OpenQuake Engine WebUI - openquake-webui.service (optional)
Monitoring zmq#
oq workers status can be used to check the status of the worker nodes and the task distribution. An output like this is produced:
$ oq workers status
[('192.168.2.1', 1, 64), ('192.168.2.2', 7, 64), ('192.168.2.3', 7, 64)]
For each worker in the cluster you can see its IP address and the number of cores currently running tasks out of the number of cores available (for instance, on host 192.168.2.1 only 1 core out of 64 is busy, while on each of the other two workers 7 cores are busy).
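Since the output shown above has the form of a Python list of (host, running, total) tuples, it can be summarized with a few lines of code. This is a sketch that assumes the output is exactly a Python literal as in the example; `summarize_status` is a hypothetical helper, not an engine function.

```python
import ast


def summarize_status(output):
    """Parse a list of (host, running, total) tuples and report core usage."""
    rows = ast.literal_eval(output.strip())
    # Fraction of busy cores per host.
    summary = {host: running / total for host, running, total in rows}
    busy = sum(running for _, running, _ in rows)
    available = sum(total for _, _, total in rows)
    return summary, busy, available


sample = "[('192.168.2.1', 1, 64), ('192.168.2.2', 7, 64), ('192.168.2.3', 7, 64)]"
summary, busy, available = summarize_status(sample)
print(f"{busy} of {available} cores busy")  # prints "15 of 192 cores busy"
```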
There are a few useful commands to manage the workers, to be run as user openquake:
oq workers start is used to start the workers
oq workers stop is used to stop the workers nicely
oq workers kill is used to send a hard kill -9 to the workers
oq workers debug is used to test that the installation is correct
If a calculation is stuck in the “executing” state due to an IT problem (like the cluster running out of memory followed by an oq workers kill) you can fix its status with the command oq abort XXX, where XXX is the calculation ID.
Running calculations#
Jobs can be submitted through the master node using the oq engine command line interface, the API or the WebUI if active. See the documentation about how to run a calculation or about how to use the WebUI.