Running jobs
In this section, you will learn how to launch a job on a compute node (see the section System overview). We do this by submitting a job to our queuing system, Slurm. We will explore two commands for doing this.
The srun command
One way of running a job via Slurm is by utilizing the srun command. Let's try to launch a job on a compute node:
srun hostname
Notice how the hostname command is executed on the compute node that was allocated to our job. The output could be something like:
a256-t4-01.srv.aau.dk
Running hostname again - this time on the front end node (without srun):
hostname
ai-fe02.srv.aau.dk
The sbatch command
Another way of launching a job is with the sbatch command. This command can be launched with a batch-script or on-the-fly with the --wrap argument.
sbatch - on the fly
Let's try this approach by entering the following command:
sbatch --wrap="hostname"
Notice how the command output is not printed directly to the console, but instead we get a message from Slurm, telling us that our job was submitted to the queue:
Submitted batch job 737186
Once the job is finished, we will be able to find a file in the directory we launched our job in (slurm-737186.out). Let's print this file:
cat slurm-737186.out
a256-t4-02.srv.aau.dk
sbatch - with a batch-script
For better reproducibility, it is a good idea to launch your job with a batch-script. Let's try this with a minimal example. Assume that we have the file hostname.sh with the following content:
#!/usr/bin/env bash
#SBATCH --job-name=hostname
#SBATCH --partition=prioritized
#SBATCH --nodelist=a768-l40s-02
srun hostname
Here we will be launching the hostname command on the compute node a768-l40s-02.
We launch a job from these instructions with:
sbatch hostname.sh
Submitted batch job 737223
We can again use the cat command to print the content of the output file:
cat slurm-737223.out
a768-l40s-02.srv.aau.dk
Which one to use: srun vs sbatch
The most important difference between these two commands is:
- srun returns command output directly to the console.
- sbatch writes it to a file.
A job launched with srun is therefore dependent on the console session on the front end node, and will only run as long as the output can be printed directly to the console. If the console session is interrupted, the job is terminated. A job launched with sbatch does not depend on an external process in the same way, and will run until it finishes or is explicitly cancelled by the user.
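A job can be cancelled explicitly with Slurm's scancel command, using the job ID reported at submission. For example, cancelling the job submitted earlier would look like:
scancel 737223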
Don't launch jobs from within interactive sessions
It is not good practice to start your jobs from within interactive shell sessions (using srun --pty), as this will start a job that runs until it reaches the partition's time limit (6 days in the prioritized partition) - not until the actual work is finished. As a consequence, the resources allocated to the job will be occupied for longer than needed.
As stated in our Fair usage section, we want to encourage our users to be mindful of their resource consumption for the sake of their fellow researchers, and not occupy resources that others could have put to use.
Conclusion:
- srun is best suited for development, where you want the command output to be printed directly in the console.
- sbatch is best suited for long-running, unattended jobs.
Try to use sbatch as much as possible.
More Slurm commands
You can find additional Slurm options available to customize your job submissions, such as setting a time limit for a job, specifying the number of CPUs or GPUs, and more.
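As a sketch, a batch-script combining a few such options could look like the following; the job name, time limit, and resource values here are arbitrary examples rather than recommendations:
#!/usr/bin/env bash
#SBATCH --job-name=my-job        # name shown in the queue (example name)
#SBATCH --time=02:00:00          # time limit for the job (hh:mm:ss)
#SBATCH --cpus-per-task=4        # number of CPU cores for the job
#SBATCH --mem=16G                # amount of system memory
#SBATCH --gres=gpu:1             # number of GPUs
srun hostname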
Executing a containerized job with Singularity
To run a task within a container using Singularity, we need to add specific parameters to the Slurm command.
As an example, let's try running print('hello world') using Python3 within a tensorflow_24.03-tf2-py3.sif container image.
srun singularity exec tensorflow_24.03-tf2-py3.sif python3 -c "print('hello world')"
- srun is the Slurm command used to submit a job.
- singularity is the command-line interface for interacting with Singularity.
- exec is a sub-command that tells Singularity to execute a command inside the specified container.
- tensorflow_24.03-tf2-py3.sif is the path to the container image.
- python3 -c "print('hello world')" is the task that Singularity executes.
While this execution proceeds smoothly, it's important to note that the command exclusively utilises CPUs. The primary role of AI Cloud is to run software that utilises GPUs for computations. In order to run applications with a GPU, you need to allocate a GPU to your job using Slurm.
Allocating a GPU to your job
You can allocate a GPU to a job using the --gres=gpu option for Slurm. Additionally, you need to add the --nv option to Singularity to enable NVIDIA drivers in the container.
Let's try running a small Python script that performs a simple matrix multiplication of random data to benchmark TensorFlow computing speed with a GPU allocated.
Type nano and press ENTER (or use the editor of your choice), and enter the following code:
import tensorflow as tf
import time

def benchmark_tensorflow():
    # Create some random data
    input_data = tf.random.normal((10000, 10000))

    # Define a simple TensorFlow computation (for example, matrix multiplication)
    @tf.function
    def some_computation(x):
        return tf.matmul(x, x)

    # Warm-up to ensure graph optimizations are done
    _ = some_computation(input_data)

    # Run the computation and measure the time
    start_time = time.time()
    result = some_computation(input_data)
    end_time = time.time()

    # Print the elapsed time
    print("Time taken: {:.4f} seconds".format(end_time - start_time))

if __name__ == "__main__":
    benchmark_tensorflow()
Save by pressing CTRL + O, enter a file name, e.g. benchmark_tensorflow.py, and exit by pressing CTRL + X. Now you should have benchmark_tensorflow.py in your directory.
Let's try allocating 1 arbitrary available GPU to the job by adding --gres=gpu:1:
srun --gres=gpu:1 singularity exec --nv tensorflow_24.03-tf2-py3.sif python3 benchmark_tensorflow.py
Note that the above example allocates 1 GPU to the job. It is possible to allocate more, for example --gres=gpu:2 for two GPUs. Software for computing on GPU is not necessarily able to utilise more than one GPU at a time. It is your responsibility to ensure that the software you run can indeed utilise as many GPUs as you allocate. It is not allowed to allocate more GPUs than your job can utilise. Here is an example of a PyTorch script that can handle multiple GPUs.
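As a sketch, launching such a multi-GPU script inside a container with two GPUs allocated could look like the following, where multi_gpu_script.py is a placeholder name for a script that can actually use multiple GPUs:
srun --gres=gpu:2 singularity exec --nv tensorflow_24.03-tf2-py3.sif python3 multi_gpu_script.py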
Congratulations!
You've mastered the fundamentals of AI Cloud. Ready to take the next steps?