Running jobs

Before you start running jobs, it is important to be aware of the queueing system Slurm.

Slurm queue system

Slurm is a job scheduling system and is used to allocate resources and manage user jobs on AI Cloud. Jobs on AI Cloud can only be run through Slurm.

The primary way to run a job through Slurm is the srun command. Let's try launching a job on a compute node:

srun hostname

Waiting in queue

After running the command, you might see a message saying that your job has been queued and is waiting for resources:

srun: job X queued and waiting for resources

Once a compute node becomes available, you'll receive confirmation:

srun: job X has been allocated resources

The hostname command then executes on the allocated compute node and prints that node's hostname (e.g. a256-t4-02.srv.aau.dk).

More Slurm commands

Additional Slurm commands and options let you customize your job submissions, for example by setting the time limit for a job or specifying the number of CPUs or GPUs.
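As a rough illustration of what such options look like on the command line, the sketch below assembles an srun command in Python. The helper name build_srun_command is hypothetical and not part of Slurm or AI Cloud. The flags --time, --cpus-per-task, --mem and --gres are standard srun options, but the values used here are placeholders rather than recommended settings for AI Cloud.

```python
import shlex


def build_srun_command(command, time_limit=None, cpus=None, memory=None, gpus=None):
    """Return an srun command line as a single string (hypothetical helper)."""
    args = ["srun"]
    if time_limit is not None:
        args.append(f"--time={time_limit}")      # wall-clock limit, e.g. 01:00:00
    if cpus is not None:
        args.append(f"--cpus-per-task={cpus}")   # CPU cores per task
    if memory is not None:
        args.append(f"--mem={memory}")           # memory per node, e.g. 16G
    if gpus is not None:
        args.append(f"--gres=gpu:{gpus}")        # number of GPUs
    args.extend(command)                         # the program to run on the node
    return shlex.join(args)                      # quote arguments for the shell


# Placeholder values, chosen only to show the shape of the command.
print(build_srun_command(["hostname"], time_limit="01:00:00", cpus=4, memory="16G"))
```

The printed line can be pasted into the terminal as-is. Options you leave out are simply not added, so Slurm falls back to the cluster's defaults for them.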


Executing a containerized job with Singularity

To run a task within a container using Singularity, we need to add specific parameters to the Slurm command.

As an example, let's try running print('hello world') using Python3 within the tensorflow_24.03-tf2-py3.sif container image.

srun singularity exec tensorflow_24.03-tf2-py3.sif python3 -c "print('hello world')"
  • srun is the Slurm command used to submit a job.
  • singularity is the command-line interface for interacting with Singularity.
  • exec is a sub-command that tells Singularity to execute a command inside the specified container.
  • tensorflow_24.03-tf2-py3.sif is the path to the container image.
  • python3 -c "print('hello world')" is the task that singularity executes.

This command runs without problems, but note that it only uses CPUs. The primary role of AI Cloud is to run software that utilises GPUs for computations. To run an application on a GPU, you need to allocate a GPU to the job using Slurm.


Allocating a GPU to your job

You can allocate a GPU to a job using the --gres=gpu option for Slurm. Additionally, you need to add the --nv option to Singularity to enable NVIDIA drivers in the container.
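One way to see which GPUs a job has been given is to inspect the CUDA_VISIBLE_DEVICES environment variable from inside the job. The sketch below assumes that the cluster's Slurm configuration exports this variable for jobs started with --gres=gpu (a common setup, but not confirmed here for AI Cloud). The function name allocated_gpu_ids and the file name check_gpus.py are hypothetical.

```python
import os


def allocated_gpu_ids(environ=os.environ):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or [] if there are none.

    Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs that request GPUs.
    """
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    # The variable is a comma-separated list such as "0" or "0,1".
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = allocated_gpu_ids()
    print(f"{len(gpus)} GPU(s) visible to this job: {gpus}")
```

Saved as check_gpus.py, the script could be run the same way as any other job, for instance with srun --gres=gpu:1 singularity exec --nv tensorflow_24.03-tf2-py3.sif python3 check_gpus.py. It only reports what Slurm allocated; it does not verify that the NVIDIA drivers are usable inside the container.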

Let's try running a small Python script that performs a simple matrix multiplication of random data to benchmark TensorFlow computing speed with a GPU allocated.

Type nano and press ENTER (or use the editor of your choice), and enter the following code:

import tensorflow as tf
import time

def benchmark_tensorflow():
    # Create some random data
    input_data = tf.random.normal((10000, 10000))

    # Define a simple TensorFlow computation (for example, matrix multiplication)
    @tf.function
    def some_computation(x):
        return tf.matmul(x, x)

    # Warm-up to ensure graph optimizations are done
    _ = some_computation(input_data)

    # Run the computation and measure the time
    start_time = time.time()
    result = some_computation(input_data)
    end_time = time.time()

    # Print the elapsed time
    print("Time taken: {:.4f} seconds".format(end_time - start_time))

if __name__ == "__main__":
    benchmark_tensorflow()

Save by pressing CTRL + O, enter a file name (e.g. benchmark_tensorflow.py), and exit by pressing CTRL + X. You should now have benchmark_tensorflow.py in your directory.

Let's try allocating 1 arbitrary available GPU to the job by adding --gres=gpu:1:

srun --gres=gpu:1 singularity exec --nv tensorflow_24.03-tf2-py3.sif python3 benchmark_tensorflow.py

Note that the above example allocates 1 GPU to the job. It is possible to allocate more, for example --gres=gpu:2 for two GPUs. However, software that computes on GPUs is not necessarily able to utilise more than one GPU at a time. It is your responsibility to ensure that the software you run can indeed utilise as many GPUs as you allocate. You are not allowed to allocate more GPUs than your job can utilise. Here is an example of a PyTorch script that can handle multiple GPUs.


Congratulations!

You've mastered the fundamentals of AI Cloud. Ready to take the next steps?