Checking GPU usage
Monitoring GPU usage is good practice for optimizing the performance of your running jobs, particularly if you intend to use multiple GPUs and want to verify that they are actually being used. This guide provides step-by-step instructions on how to monitor GPU usage using the nvidia-smi tool.
Start a job with GPU allocation
First, submit a job using srun or sbatch with one or more GPUs allocated and execute some code inside a Singularity container. In this example we will use the pytorch_24.03-py3.sif container image and a PyTorch benchmark script, torch_bm.py:
srun --gres=gpu:1 singularity exec --nv pytorch_24.03-py3.sif python3 torch_bm.py
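For reference, here is a minimal sketch of what a single-GPU benchmark in the spirit of torch_bm.py might look like. This is a hypothetical illustration only (the actual torch_bm.py may differ); it simply runs repeated large matrix multiplications to keep one GPU busy:

import time
import torch

# Hypothetical benchmark sketch: repeated large matrix multiplications
# produce the kind of sustained load that shows up as high GPU-Util below.
device = torch.device("cuda")
a = torch.randn(8192, 8192, device=device)
b = torch.randn(8192, 8192, device=device)

torch.cuda.synchronize()
start = time.time()
for _ in range(200):
    c = a @ b
torch.cuda.synchronize()
elapsed = time.time() - start

# A matrix multiplication of two n x n matrices costs about 2 * n^3 FLOPs.
tflops = 200 * 2 * 8192**3 / elapsed / 1e12
print(f"200 matmuls in {elapsed:.2f} s (~{tflops:.1f} TFLOP/s)")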
The torch_bm.py script is NOT optimized for utilizing multiple GPUs, so in this example we only allocate one GPU. A sketch of a PyTorch script that can handle multiple GPUs is shown below.
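This sketch illustrates one possible approach, using torch.nn.DataParallel, which replicates the model and splits each input batch across all GPUs visible to the job; the model, data, and training loop are made up for the example:

import torch
import torch.nn as nn

# Hypothetical toy model and training loop for illustration only.
device = torch.device("cuda")
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across the GPUs allocated to the job.
    model = nn.DataParallel(model)

data = torch.randn(512, 4096, device=device)
target = torch.randint(0, 10, (512,), device=device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()
print("final loss:", loss.item())

If you run a script like this with more than one GPU allocated (for example --gres=gpu:2), nvidia-smi should show activity on each allocated GPU. For larger training jobs, DistributedDataParallel is generally preferred over DataParallel, but it requires a launcher and is beyond the scope of this guide.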
Check job ID
Open another terminal session and check the status of your jobs using squeue --me to find the job ID of the job you just submitted.
squeue --me
Connect to running job interactively
Once you have identified the job ID (let's assume it's 1978 in this example), connect to the running job interactively using the following command to start a new shell.
srun --jobid 1978 --interactive --pty /bin/bash
Monitor GPU usage
Inside the interactive session of your job, start monitoring GPU usage using nvidia-smi with watch to update the output every second.
watch -n1 nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:01:00.0 Off |                    0 |
| N/A   44C    P0              36W /  72W |      245MiB / 23034MiB |     90%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:02:00.0 Off |                    0 |
| N/A   38C    P8              16W /  72W |        4MiB / 23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:41:00.0 Off |                    0 |
| N/A   38C    P8              16W /  72W |        4MiB / 23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:61:00.0 Off |                    0 |
| N/A   38C    P8              16W /  72W |        4MiB / 23034MiB |      0%      Default |
...
The most important column to notice here is the GPU-Util metric. In this example, the first GPU is operating at 90% utilization, which indicates excellent use of the GPU.
High Utilization (70-100%)
For many GPU-accelerated applications like deep learning training or scientific simulations, a high GPU utilization (often around 70-100%) during compute-intensive tasks is considered good. It indicates that the GPU is efficiently processing tasks without significant idle time.
Low to Moderate Utilization (10-40%)
In some cases, especially when the workload is less intensive or the application is idle waiting for data or other resources, the GPU utilization might be lower (e.g., 10-40%). This doesn't necessarily mean the GPU is underutilized or performing poorly; it could indicate a natural variation in workload or efficient scheduling of tasks.
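If you would rather record utilization over the course of a run than watch it interactively, you can also query the same counters from Python. The following is a minimal sketch using the pynvml bindings, assuming the nvidia-ml-py package is available in your container image:

import time
import pynvml  # provided by the nvidia-ml-py package (assumed to be installed)

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):  # take ten samples, one per second
    for i, handle in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {util.gpu:3d}% util, {mem.used // 2**20} MiB used")
    time.sleep(1)

pynvml.nvmlShutdown()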