Running jobs
Before you start running jobs, it is important to be aware of the queueing system Slurm.
Slurm queue system
Slurm is a job scheduling system and is used to allocate resources and manage user jobs on AI-LAB. Jobs on AI-LAB can only be run through Slurm.
The primary method to run a job via Slurm is the srun command. Let's try launching a job on a compute node:
srun hostname
Waiting in queue
Upon execution, you might receive a notification indicating your job has been queued, awaiting resource availability:
srun: job X queued and waiting for resources
Once a compute node becomes available, you'll receive confirmation:
srun: job X has been allocated resources
The hostname command then executes on the allocated compute node and prints the node's name (e.g. ailab-l4-01).
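While a job is waiting, it can help to look at the queue from a second terminal. The following is a minimal sketch, assuming the standard Slurm squeue client is installed on the login node; the helper name show_my_jobs is hypothetical and not part of AI-LAB.

```shell
# Hypothetical helper: list your own jobs, or explain why that is not possible here.
show_my_jobs() {
  if command -v squeue >/dev/null 2>&1; then
    # -u restricts the listing to one user; the ST column shows PD (pending) or R (running).
    squeue -u "$(id -un)"
  else
    echo "squeue not found: run this on an AI-LAB login node"
  fi
}

show_my_jobs
```

A job that is still waiting for resources appears with state PD, and switches to R once it has been allocated a compute node.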
More Slurm commands
Additional Slurm options are available to customize your job submissions, such as setting the time limit for a job or specifying the number of CPUs or GPUs.
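As a sketch of what such customization can look like, the helper below assembles an srun command line from a time limit, a CPU count, and a memory request, and prints it for review instead of running it. --time, --cpus-per-task, and --mem are standard srun options; the helper name srun_cmd and the example values are hypothetical and are not AI-LAB recommendations.

```shell
# Hypothetical helper: print an srun command line with explicit resource requests.
# Usage: srun_cmd TIME CPUS MEM COMMAND...
srun_cmd() (
  time_limit="$1"; cpus="$2"; mem="$3"
  shift 3
  # --time: wall-clock limit, --cpus-per-task: CPU cores per task, --mem: memory per node
  echo "srun --time=$time_limit --cpus-per-task=$cpus --mem=$mem $*"
)

# Example values (assumptions): 10 minutes, 4 CPU cores, 8 GB of memory
srun_cmd 00:10:00 4 8G hostname
```

Once the printed values look right, the line can be pasted into the terminal to start the job.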
Executing a containerized job with Singularity
To run a task within a container using Singularity, we need to add specific parameters to the Slurm command.
As an example, let's try running print('hello world') using Python 3 within the tensorflow_24.03.sif container image from the /ceph/container/tensorflow directory.
srun singularity exec /ceph/container/tensorflow/tensorflow_24.03.sif python3 -c "print('hello world')"
srun is the Slurm command used to submit a job.
singularity is the command-line interface for interacting with Singularity.
exec is a sub-command that tells Singularity to execute a command inside the specified container.
/ceph/container/tensorflow/tensorflow_24.03.sif is the path to the container image.
python3 -c "print('hello world')" is the task that Singularity executes.
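The same structure applies to any task: only the last part of the command changes. The sketch below illustrates this with a hypothetical helper, in_container, that prints the full command for a given image and task without running it; the helper name is an assumption made for illustration only.

```shell
# Hypothetical helper: print the srun + singularity command for a given image and task.
# Usage: in_container IMAGE COMMAND...
in_container() (
  image="$1"
  shift
  echo "srun singularity exec $image $*"
)

IMAGE=/ceph/container/tensorflow/tensorflow_24.03.sif

# Same image as in the example above, with a different task:
# report the Python version inside the container.
in_container "$IMAGE" python3 --version
```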
While this command runs without problems, it uses only CPUs. The primary role of AI-LAB is to run software that utilises GPUs for computations. To run an application with a GPU, you need to allocate a GPU to the job using Slurm.
Allocating a GPU to your job
You can allocate a GPU to a job using the --gres=gpu option for Slurm. Additionally, you need to add the --nv option to Singularity to enable NVIDIA drivers in the container.
Let's try running a small Python script that benchmarks TensorFlow computing speed by performing a simple matrix multiplication of random data, with 1 GPU allocated.
First, copy benchmark_tensorflow.py from /ceph/course/claaudia/docs to your user directory (~/):
cp /ceph/course/claaudia/docs/benchmark_tensorflow.py ~/
Then let's try allocating 1 GPU to the job by adding --gres=gpu:1:
srun --gres=gpu:1 singularity exec --nv /ceph/container/tensorflow/tensorflow_24.03.sif python3 benchmark_tensorflow.py
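Before running a longer job, it can be useful to confirm that the allocated GPUs are actually visible inside the container. The sketch below prints a command that would do this with nvidia-smi, assuming that tool is made available in the container by the --nv option; the helper name gpu_check_cmd is hypothetical.

```shell
# Hypothetical helper: print a command that lists the GPUs visible inside the container.
# Usage: gpu_check_cmd GPU_COUNT IMAGE
gpu_check_cmd() (
  gpus="$1"; image="$2"
  # --gres=gpu:N asks Slurm for N GPUs; --nv exposes the NVIDIA driver to the container.
  echo "srun --gres=gpu:$gpus singularity exec --nv $image nvidia-smi"
)

gpu_check_cmd 1 /ceph/container/tensorflow/tensorflow_24.03.sif
```

When the printed command is run on AI-LAB, the number of GPUs it lists should match the number requested with --gres.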
Note that the above example allocates 1 GPU to the job. It is possible to allocate more, for example --gres=gpu:2 for two GPUs. Software for computing on GPUs is not necessarily able to utilise more than one GPU at a time. It is your responsibility to ensure that the software you run can indeed utilise as many GPUs as you allocate. Allocating more GPUs than your job can utilise is not allowed. Here is an example of a PyTorch script that can handle multiple GPUs.
Congratulations!
You've mastered the fundamentals of AI-LAB. Ready to take the next steps?