# Running Jobs on AI-LAB
This guide will teach you how to run computational tasks on AI-LAB using the Slurm job scheduler. Slurm manages all the computing resources and ensures fair access for all users.
## Understanding Slurm
Slurm is a job scheduling system that:
- Manages resources: Allocates CPUs, GPUs, and memory to your jobs
- Queues jobs: Organizes jobs when resources are busy
- Ensures fairness: Prevents any single user from monopolizing resources
## Two Ways to Run Jobs
AI-LAB offers two methods for running jobs: `srun` for interactive work and `sbatch` for batch submissions.
### When to Use Each Method
| Method | Best For | Duration | Interaction |
|---|---|---|---|
| `srun` | Testing, debugging, quick tasks | Short (< 1 hour) | Interactive |
| `sbatch` | Training models, long computations | Long (> 1 hour) | Non-interactive |
## Using srun (Interactive Jobs)

`srun` runs commands interactively on a compute node. Your terminal connects directly to the compute node, making it perfect for testing and debugging.
### Basic srun Example
Let's start with a simple test:
```bash
srun hostname
```
This command will:

- Request a compute node
- Run the `hostname` command on that node
- Display the result
- Return you to the front-end node
### What You'll See
When you run an srun command, you might see:
```
srun: job 12345 queued and waiting for resources
srun: job 12345 has been allocated resources
ailab-l4-01
```
This shows:
- Your job ID (12345)
- The job was queued (waiting for resources)
- Resources were allocated
- The hostname of the compute node (ailab-l4-01)
### When to Use srun
✅ Good for:
- Testing commands and scripts
- Debugging code
- Quick computations
- Interactive exploration
❌ Not ideal for:
- Long-running jobs (hours/days)
- Jobs that need to run without you being connected
- Production model training
## Using sbatch (Batch Jobs)

`sbatch` is perfect for longer-running jobs. You create a script with your commands, submit it to the queue, and Slurm runs it when resources become available.
### Creating a Job Script
Let's create a simple job script:
```bash
nano my_job.sh
```
Add this content:
```bash
#!/bin/bash
#SBATCH --job-name=my_test_job   # Name of your job
#SBATCH --output=my_job.out      # Output file
#SBATCH --error=my_job.err       # Error file

# Your commands go here
hostname
echo "Hello from AI-LAB!"
date
```
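If you prefer, the same script can be created non-interactively with a heredoc instead of `nano` — a sketch, using the same `my_job.sh` filename as above:

```shell
# Create the job script non-interactively (alternative to editing with nano)
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_test_job
#SBATCH --output=my_job.out
#SBATCH --error=my_job.err

hostname
echo "Hello from AI-LAB!"
date
EOF
```

The quoted `'EOF'` delimiter prevents the shell from expanding anything inside the script before it is written.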
### Understanding the Script

- `#!/bin/bash`: Tells the system to use the bash shell
- `#SBATCH` lines: Slurm directives that configure your job
- Commands below: What you want to run
### Submitting the Job
```bash
sbatch my_job.sh
```
You'll see:
```
Submitted batch job 12345
```
### What Happens Next

1. Job is queued: Slurm adds your job to the queue
2. Resources allocated: When available, Slurm assigns compute resources
3. Job runs: Your script executes on the compute node
4. Output saved: Results are written to your specified output file
### Checking Results
Once the job completes, check the output:
```bash
cat my_job.out   # View the output
cat my_job.err   # View any errors (if empty, no errors occurred)
```
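Since an empty error file means the job ran cleanly, a small shell check can report that at a glance — a sketch, not a Slurm feature:

```shell
# Report whether the job's error file is empty.
# [ -s FILE ] is true only if FILE exists and is non-empty.
if [ -s my_job.err ]; then
    echo "Job reported errors:"
    cat my_job.err
else
    echo "No errors recorded."
fi
```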
### When to Use sbatch
✅ Perfect for:
- Training machine learning models
- Long data processing tasks
- Jobs that take hours or days
- Running jobs overnight or while you're away
❌ Not needed for:
- Quick tests or debugging
- Interactive exploration
- Commands that finish in minutes
## Specifying Job Resources
Most jobs need specific resources like GPUs, memory, or time limits. You specify these using Slurm options.
### Common Resource Options
| Option | Description | Example | Notes |
|---|---|---|---|
| `--mem` | Memory allocation | `--mem=24G` | Max 24 GB per GPU |
| `--cpus-per-task` | CPU cores | `--cpus-per-task=15` | Max 15 CPUs per GPU |
| `--gres` | GPUs | `--gres=gpu:1` | Request 1 GPU |
| `--time` | Time limit | `--time=01:00:00` | 1 hour (HH:MM:SS) |
### Resource Guidelines

Memory: Request enough memory for your data and model
- Small models: `--mem=8G`
- Large models: `--mem=24G`

CPUs: More CPUs can speed up data loading and preprocessing
- Basic: `--cpus-per-task=4`
- Intensive: `--cpus-per-task=15`

GPUs: Essential for deep learning
- Single GPU: `--gres=gpu:1`
- Multiple GPUs: `--gres=gpu:2` (only if your code supports it)

Time: Set realistic time limits
- Quick tests: `--time=00:30:00` (30 minutes)
- Training: `--time=04:00:00` (4 hours)
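To illustrate the `HH:MM:SS` time format, here is a small bash helper (`to_seconds` is hypothetical, for illustration only — not a Slurm command) that converts a time limit to seconds:

```shell
# Hypothetical helper: convert an HH:MM:SS time limit to seconds
to_seconds() {
    IFS=: read -r h m s <<< "$1"
    # 10# forces base-10 so leading zeros (e.g. "08") are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds 00:30:00   # 30 minutes -> 1800
to_seconds 04:00:00   # 4 hours -> 14400
```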
### Multi-GPU Usage

You can request multiple GPUs with `--gres=gpu:2`, but only if your code actually uses them. Allocating unused GPUs violates our Fair Usage Policy.
### Using Options with srun
Add options directly to your srun command:
```bash
srun --mem=24G --cpus-per-task=15 --gres=gpu:1 --time=01:00:00 hostname
```
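If you reuse the same options often, one common bash pattern (a sketch, not an AI-LAB requirement) is to collect them in a variable:

```shell
# Collect frequently used Slurm options in a shell variable for reuse
RESOURCES="--mem=24G --cpus-per-task=15 --gres=gpu:1 --time=01:00:00"

# Then prefix any command with it, e.g.:
#   srun $RESOURCES hostname
# Dry run: print the full command that would be executed
echo "srun $RESOURCES hostname"
```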
### Using Options with sbatch

Add options as `#SBATCH` directives in your script:
```bash
#!/bin/bash
#SBATCH --job-name=my_training_job
#SBATCH --output=training.out
#SBATCH --error=training.err
#SBATCH --mem=24G
#SBATCH --cpus-per-task=15
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Your training commands here
python train_model.py
```
Now that you know how to run jobs on AI-LAB, the next step is learning how to get applications and containers.