Running Jobs on AI-LAB

This guide will teach you how to run computational tasks on AI-LAB using the Slurm job scheduler. Slurm manages all the computing resources and ensures fair access for all users.

Understanding Slurm

Slurm is a job scheduling system that:

  • Manages resources: Allocates CPUs, GPUs, and memory to your jobs
  • Queues jobs: Organizes jobs when resources are busy
  • Ensures fairness: Prevents any single user from monopolizing resources
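
You can see these mechanisms at work from the front-end node using two standard Slurm commands (the exact output columns on AI-LAB may differ slightly from other clusters):

squeue -u $USER    # list your own queued and running jobs
sinfo              # show partitions and the state of the compute nodes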

Two Ways to Run Jobs

AI-LAB offers two methods for running jobs:

  1. srun - Interactive jobs for testing and debugging
  2. sbatch - Batch jobs for longer computations

When to Use Each Method

Method   Best For                             Duration            Interaction
srun     Testing, debugging, quick tasks      Short (< 1 hour)    Interactive
sbatch   Training models, long computations   Long (> 1 hour)     Non-interactive

Using srun (Interactive Jobs)

srun runs commands interactively on a compute node. Your terminal connects directly to the compute node, making it perfect for testing and debugging.

Basic srun Example

Let's start with a simple test:

srun hostname

This command will:

  1. Request a compute node
  2. Run the hostname command on that node
  3. Display the result
  4. Return you to the front-end node
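
If you want a full shell on the compute node instead of a single command, srun can allocate a pseudo-terminal. This is standard Slurm usage, though your site may recommend a specific invocation:

srun --pty bash    # open an interactive shell on a compute node; type 'exit' to return to the front-end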

What You'll See

When you run an srun command, you might see:

srun: job 12345 queued and waiting for resources
srun: job 12345 has been allocated resources
ailab-l4-01

This shows:

  • Your job ID (12345)
  • The job was queued (waiting for resources)
  • Resources were allocated
  • The hostname of the compute node (ailab-l4-01)
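
While a job is still waiting in the queue, you can check its state from another terminal on the front-end node (12345 is just the example job ID from above):

squeue -j 12345    # shows whether the job is pending (PD) or running (R)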

When to Use srun

Good for:

  • Testing commands and scripts
  • Debugging code
  • Quick computations
  • Interactive exploration

Not ideal for:

  • Long-running jobs (hours/days)
  • Jobs that need to run without you being connected
  • Production model training

Using sbatch (Batch Jobs)

sbatch is perfect for longer-running jobs. You create a script with your commands, submit it to the queue, and Slurm runs it when resources are available.

Creating a Job Script

Let's create a simple job script:

nano my_job.sh

Add this content:

my_job.sh
#!/bin/bash

#SBATCH --job-name=my_test_job  # Name of your job
#SBATCH --output=my_job.out     # Output file
#SBATCH --error=my_job.err      # Error file

# Your commands go here
hostname
echo "Hello from AI-LAB!"
date

Understanding the Script

  • #!/bin/bash: Tells the system to use bash shell
  • #SBATCH lines: Slurm directives that configure your job
  • Commands below: What you want to run
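
One detail worth knowing: Slurm only reads #SBATCH directives that appear before the first non-comment command in the script. A directive placed after a command is silently ignored, as this small sketch illustrates:

#!/bin/bash
#SBATCH --job-name=ok_job    # read by Slurm: appears before any command

hostname                     # first real command; directive parsing stops here

#SBATCH --time=01:00:00      # IGNORED: placed after a command, so Slurm never sees it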

Submitting the Job

sbatch my_job.sh

You'll see:

Submitted batch job 12345

What Happens Next

  1. Job is queued: Slurm adds your job to the queue
  2. Resources allocated: When available, Slurm assigns compute resources
  3. Job runs: Your script executes on the compute node
  4. Output saved: Results are written to your specified output file
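
While the job is queued or running, you can monitor it, and cancel it if you submitted it by mistake (again using the example job ID):

squeue -j 12345    # check the job's state in the queue
scancel 12345      # cancel the job if you no longer need it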

Checking Results

Once the job completes, check the output:

cat my_job.out    # View the output
cat my_job.err    # View any errors (if empty, no errors occurred)
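
If the job is still running, the output file is written as the job produces it, so you can follow it live:

tail -f my_job.out    # follow the output as it is written; press Ctrl+C to stop following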

When to Use sbatch

Perfect for:

  • Training machine learning models
  • Long data processing tasks
  • Jobs that take hours or days
  • Running jobs overnight or while you're away

Not needed for:

  • Quick tests or debugging
  • Interactive exploration
  • Commands that finish in minutes

Specifying Job Resources

Most jobs need specific resources like GPUs, memory, or time limits. You specify these using Slurm options.

Common Resource Options

Option            Description         Example               Notes
--mem             Memory allocation   --mem=24G             Max 24 GB per GPU
--cpus-per-task   CPU cores           --cpus-per-task=15    Max 15 CPUs per GPU
--gres            GPUs                --gres=gpu:1          Request 1 GPU
--time            Time limit          --time=01:00:00       1 hour (HH:MM:SS)

Resource Guidelines

Memory: Request enough memory for your data and model

  • Small models: --mem=8G
  • Large models: --mem=24G

CPUs: More CPUs can speed up data loading and preprocessing

  • Basic: --cpus-per-task=4
  • Intensive: --cpus-per-task=15

GPUs: Essential for deep learning

  • Single GPU: --gres=gpu:1
  • Multiple GPUs: --gres=gpu:2 (only if your code supports it)

Time: Set realistic time limits

  • Quick tests: --time=00:30:00 (30 minutes)
  • Training: --time=04:00:00 (4 hours)

Multi-GPU Usage

You can request multiple GPUs with --gres=gpu:2, but only if your code actually uses them. Allocating unused GPUs violates our Fair Usage Policy.
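
A quick way to confirm that the GPUs you requested are actually visible to your job is to run nvidia-smi inside the allocation (nvidia-smi is NVIDIA's standard monitoring tool; we assume it is installed on the GPU nodes):

srun --gres=gpu:2 --time=00:05:00 nvidia-smi    # the output should list exactly two GPUs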

Using Options with srun

Add options directly to your srun command:

srun --mem=24G --cpus-per-task=15 --gres=gpu:1 --time=01:00:00 hostname

Using Options with sbatch

Add options as #SBATCH directives in your script:

my_job.sh
#!/bin/bash

#SBATCH --job-name=my_training_job
#SBATCH --output=training.out
#SBATCH --error=training.err
#SBATCH --mem=24G
#SBATCH --cpus-per-task=15
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Your training commands here
python train_model.py

Now that you know how to run jobs on AI-LAB, let's move on to how to get applications and containers.