Checking the status of compute nodes
Before you run a job, it is often useful to check the resource status of the compute nodes.
The sinfo command shows basic information about the partitions in the queue system and the states of the nodes in each partition.
sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
batch*       up     12:00:00    1      mix    nv-ai-04
batch*       up     12:00:00    8      idle   a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...
prioritized  up     6-00:00:00  8      idle   a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...
PARTITION can be understood as a distinct category or group of compute nodes, essentially serving as a separate queue for jobs. An asterisk (*) after the name marks the default partition.
AVAIL shows the availability of the partition, where up is the normal working state in which you can submit jobs to it.
TIMELIMIT shows the time limit imposed by each partition in HH:MM:SS format, or D-HH:MM:SS for limits of one day or more.
NODES shows how many nodes in the partition are in the listed state.
STATE shows which state the listed nodes are in:
  mix means that the nodes are partially full: some jobs are running on them and they still have available resources;
  idle means that they are completely vacant and have all resources available;
  allocated (displayed as alloc) means that they are completely occupied.
  Many other states are possible, most of which mean that something is wrong.
NODELIST shows the names of the compute nodes in that partition that are in the listed state.
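The column layout described above can be used to summarise sinfo output in a script. The following is a minimal sketch, assuming the default sinfo column order shown above; the function name idle_nodes_per_partition is hypothetical and not part of Slurm or of this cluster's tooling. It adds up the NODES column for every row whose STATE is idle, grouped by partition.

```shell
# Hypothetical helper: sum the NODES column for rows whose STATE is "idle",
# grouped by partition. Expects default sinfo output on stdin (header first).
idle_nodes_per_partition() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # drop the default-partition marker
             count[$1] += $4         # column 4 is NODES
         }
         END { for (p in count) print p, count[p] }' | sort
}

# Sample input copied from the sinfo output above; on the cluster you would
# run:  sinfo | idle_nodes_per_partition
idle_nodes_per_partition <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 12:00:00 1 mix nv-ai-04
batch* up 12:00:00 8 idle a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...
prioritized up 6-00:00:00 8 idle a256-t4-[01-02],i256-a10-06,i256-a40-[01-02]...
EOF
```

If you only need to see the idle rows rather than count them, sinfo can also filter by state itself with the --states option, for example sinfo --states=idle.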
You can also use the command scontrol show node or scontrol show node <node name> to show details about all nodes or a specific node, respectively.
scontrol show node a256-t4-01
NodeName=a256-t4-01 Arch=x86_64 CoresPerSocket=16
CPUAlloc=12 CPUTot=64 CPULoad=0.50
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:t4:6
...
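Because scontrol prints its details as key=value pairs, individual values are easy to extract. The sketch below is one possible way to do this, assuming the CPUAlloc and CPUTot fields appear as in the output above, with CPUAlloc printed before CPUTot; the function name free_cpus is hypothetical. It prints each node name followed by the number of CPUs that are not currently allocated.

```shell
# Hypothetical helper: report unallocated CPUs per node from
# `scontrol show node` output read on stdin.
free_cpus() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n = split($i, kv, "=")          # split each token into key and value
            if (n < 2) continue             # skip tokens that are not key=value
            if (kv[1] == "NodeName") name = kv[2]
            if (kv[1] == "CPUAlloc") alloc = kv[2]
            if (kv[1] == "CPUTot")   print name, kv[2] - alloc
        }
    }'
}

# Sample input copied from the output above; on the cluster you would run:
#   scontrol show node a256-t4-01 | free_cpus
free_cpus <<'EOF'
NodeName=a256-t4-01 Arch=x86_64 CoresPerSocket=16
CPUAlloc=12 CPUTot=64 CPULoad=0.50
EOF
```

Running the function without a node name on the scontrol side (scontrol show node | free_cpus) would give one line per node, which can be a quick way to find nodes with spare CPUs.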
The two commands sinfo and scontrol show node provide either too little or far too much detail for most situations. As an alternative, we provide the tool nodesummary, which aims to give a more intuitive overview of the used and available resources.
nodesummary