Checking GPU utilisation
When you have launched a job on a GPU, it is good practice to verify that it is indeed utilising the GPU.
We can do this by logging in to the compute node and calling the nvidia-smi command.
Start by locating the node that your job is running on:
squeue --me
Which returns the following table:
 JOBID   PARTITION  NAME     USER    ST  TIME     NODES  NODELIST(REASON)
 768059  prioritiz  aicloud  nobody  R   0:06:02  1      a256-t4-01
The NODELIST(REASON) column shows which node your job is running on.
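If you only need the node name, squeue can also print just the relevant columns. A minimal sketch, assuming a reasonably recent Slurm version (the --me, --noheader and --format options are standard; %i is the job ID and %N the node list):
squeue --me --noheader --format="%i %N"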
Then log in to the node:
ssh -l user@domain.aau.dk a256-t4-01.srv.aau.dk
And call:
nvidia-smi
Which prints a table:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   4  NVIDIA L40S                    Off |   00000000:03:00.0 Off |                    0 |
| N/A   72C    P0            287W /  350W |   27963MiB /  46068MiB |     89%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
This table is a snapshot of the GPU devices (physical GPUs) allocated to your Slurm job. The utilisation values are measured over a short sampling period (up to roughly one second, depending on the GPU model).
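If your job has several GPUs allocated, the output can be restricted to a single device with nvidia-smi's -i/--id option, for example the GPU with index 4 from the table above:
nvidia-smi -i 4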
The most important parameters to note are:
- Volatile GPU-Util: how much of the sampling period the GPU was actively computing. As computations normally take place in batches, it is normal to see this value fluctuate (see the example after this list for per-second samples).
- Memory-Usage: how much of the available GPU memory is being consumed. Note that this is expressed in mebibytes (MiB, a binary unit), not megabytes (MB).
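Because the utilisation fluctuates, a single snapshot can be misleading. As a small example, nvidia-smi's dmon subcommand prints one sample per second; here selecting the utilisation (u) and memory (m) metric groups:
nvidia-smi dmon -s um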
Bonus tips:
Continuous updates
Prepending the watch command executes nvidia-smi every 2 seconds, giving us continuous updates on the GPU activity:
watch nvidia-smi
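The update interval can be adjusted with watch's -n option, for example a sample every 5 seconds:
watch -n 5 nvidia-smi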
Print only the important bits
Instead of printing the whole table, we can print only select values:
nvidia-smi --query-gpu=index,utilization.memory,utilization.gpu --format=csv
The watch command can of course also be prepended in order to get continuous updates.
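Alternatively, nvidia-smi can repeat the query itself with its --loop option, here every 2 seconds (press Ctrl+C to stop):
nvidia-smi --query-gpu=index,utilization.memory,utilization.gpu --format=csv --loop=2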