AI-LAB Workshop
█████╗ ██╗ ██╗ █████╗ ██████╗
██╔══██╗██║ ██║ ██╔══██╗██╔══██╗
███████║██║█████╗██║ ███████║██████╔╝
██╔══██║██║╚════╝██║ ██╔══██║██╔══██╗
██║ ██║██║ ███████╗██║ ██║██████╔╝
╚═╝ ╚═╝╚═╝ ╚══════╝╚═╝ ╚═╝╚═════╝
██╗ ██╗ ██████╗ ██████╗ ██╗ ██╗███████╗██╗ ██╗ ██████╗ ██████╗ ███╗ ██████╗ ██╗ ██╗
██║ ██║██╔═══██╗██╔══██╗██║ ██╔╝██╔════╝██║ ██║██╔═══██╗██╔══██╗ ██╔██╗╚════██╗██║ ██║
██║ █╗ ██║██║ ██║██████╔╝█████╔╝ ███████╗███████║██║ ██║██████╔╝ ╚═╝╚═╝ █████╔╝███████║
██║███╗██║██║ ██║██╔══██╗██╔═██╗ ╚════██║██╔══██║██║ ██║██╔═══╝ ██╔═══╝ ╚════██║
╚███╔███╔╝╚██████╔╝██║ ██║██║ ██╗███████║██║ ██║╚██████╔╝██║ ███████╗ ██║
╚══╝╚══╝ ╚═════╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚══════╝ ╚═╝
############### High-Performance Computing for AAU students and teachers ###############
by Frederik Petri
CLAAUDIA
#############################################################################################
Did you apply for access?
If not, please go to https://forms.office.com/e/caEhCRmqVN and fill out the form :)
What is AI-LAB?
A GPU cluster available to students and teachers
Hardware
- 2 login nodes
- 11 compute nodes each with 8 NVIDIA L4 GPUs
- Ceph file system
Software
- Linux
- Containerization for applications
- Slurm queue system
What can you use AI-LAB for?
- Access GPU resources for deep learning
- Enhance applied machine learning courses
- Collaborate on AI experiments and learning
- Semester project training and support
Why use AI-LAB?
- Easy to access
- Unlimited GPU resources
- Store course-specific materials
- Getting-started guides https://hpc.aau.dk/ai-lab/getting-started/
- CLAAUDIA support https://serviceportal.aau.dk/
Other HPC options?
UCloud
- Access through a web browser
- Great for interactive work
- Large software selection
- Can handle sensitive data
- Read more: https://hpc.aau.dk/ucloud/
Other HPC options?
AI Cloud
- Large GPU cluster (AI-LAB's big brother)
- Very limited access for students
- Need a strong justification
- Read more: https://hpc.aau.dk/ai-cloud/
Terms of use
- Access is removed on August 1st
- 12 hour limit on jobs
- Must not be used for confidential or sensitive data
Let's begin using AI-LAB!
- Log in via your local terminal (replace user with your own AAU username):
ssh -l user@student.aau.dk ailab-fe01.srv.aau.dk
- Enter yes if you get the following message: Are you sure you want to continue connecting (yes/no/[fingerprint])?
- Enter your AAU password (it will not be visible)
Exercises
- Log in to AI-LAB
File management
- Store files in your user directory: /ceph/home/student.aau.dk/user
- Shared files in /ceph/container and /ceph/course
- Shared project directories in /ceph/project
Nice file commands to know
- ls [path]: List files in directory
- cd [path]: Change directory
- cp [source] [destination]: Copy a file within AI-LAB
- cp -r [source] [destination]: Copy a folder within AI-LAB
- mkdir [name]: Make a directory
- cat [filename]: Print out file content in terminal
- ~: Shortcut path to your directory, e.g. mkdir ~/stuff creates a folder called stuff in your directory.
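The commands above can be tried safely in a throwaway directory. A minimal sketch, assuming a standard Linux shell; the names stuff, notes.txt, copy.txt and backup are made up for illustration, and mktemp -d is used so nothing in your user directory is touched.

```shell
# Work in a temporary directory (hypothetical example, not part of AI-LAB itself).
scratch="$(mktemp -d)"
cd "$scratch"

mkdir stuff                      # make a directory
echo "hello" > stuff/notes.txt   # create a small file to experiment with
cp stuff/notes.txt copy.txt      # copy a single file
cp -r stuff backup               # copy a whole folder
ls                               # list the current directory
cat backup/notes.txt             # print the copied file's content
```

When you are done, the temporary directory can simply be left alone or deleted.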
Exercises
- Create a directory called workshop in your home directory using the mkdir command
- Copy the file /ceph/course/claaudia/docs/torch_bm.py to the workshop folder using the cp command
- Change directory to the workshop folder using the cd command
Create or edit files
- nano [filename]: Open or create a file
- Use the arrow keys to move around (don't click with your mouse!)
- Save the file: Press Ctrl + O, rename the file if you want, then hit Enter
- Exit Nano: Press Ctrl + X
Exercises
- Create a new file in the workshop folder called run.sh (job script) using nano
- Enter this:
#!/bin/bash
echo "Hello from AI-LAB!"
- Save it, and exit the file.
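Before handing a script to the queue, it can help to confirm that it runs at all. A minimal sketch, assuming a scratch directory created with mktemp -d (so your real run.sh is not touched); it recreates the two-line script with a here-document instead of nano and runs it directly with bash. The variable names workdir and output are made up for this example.

```shell
# Recreate the two-line job script in a temporary directory (illustrative only).
workdir="$(mktemp -d)"
cat > "$workdir/run.sh" <<'EOF'
#!/bin/bash
echo "Hello from AI-LAB!"
EOF

cat "$workdir/run.sh"                 # check what was written
output="$(bash "$workdir/run.sh")"    # run the script directly, without the queue
echo "$output"
```

Running a script directly like this only makes sense for trivial commands; real workloads belong in the queue system.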
The queue system!
- Slurm for job scheduling and resource allocation.
- Jobs are queued and executed when resources are available.
Nice commands to know
- sinfo: Current status of the compute nodes.
- nodesummary: Overview of allocated GPUs, CPUs, memory
- squeue: Overview of the current job queue
- squeue --me: Overview of your current job queue
How to submit a job?
- srun [one line command]: Runs a command interactively, taking over your terminal until the job finishes, e.g.: srun echo "Hello from AI-LAB!"
- sbatch [script]: Runs a sequence of commands in the background (best for long-running jobs)
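When a script is submitted, sbatch normally answers with a line of the form "Submitted batch job 1234", and by default Slurm writes the job's output to slurm-<jobid>.out in the directory you submitted from. A minimal sketch of capturing that ID; the helper name job_id_from_sbatch is made up for this example, and a script called run.sh is assumed to exist in the current directory. Outside AI-LAB the sketch only prints a notice.

```shell
# Extract the numeric job ID from sbatch's confirmation line.
# $1: the text sbatch printed, e.g. "Submitted batch job 1234"
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

if command -v sbatch >/dev/null 2>&1; then
    submit_msg="$(sbatch run.sh)"                  # assumes run.sh is in this folder
    job_id="$(job_id_from_sbatch "$submit_msg")"
    echo "Job $job_id submitted; output should appear in slurm-$job_id.out"
    squeue --me                                    # show your own jobs
else
    echo "sbatch not found; run this on an AI-LAB login node"
fi
```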
Exercises
- Run the job script run.sh using the sbatch command
- Use ls to list the files in the workshop folder
- There should be an output file called something like slurm-1234.out in the folder
- Print out the content of the output file using the cat command
Containers
- Containers bundle application code, libraries, and dependencies
- Easy to install and manage complex applications
- AI-LAB uses Singularity for container management
- Singularity container files (also called images) have the extension .sif
Getting containers
Downloading containers
- https://catalog.ngc.nvidia.com/
- https://hub.docker.com/
- https://cloud.sylabs.io/library/
Pre-downloaded containers
ls /ceph/container
Customizing containers
- Add Python packages via a virtual environment
- Build your own container image on your local computer
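To see which pre-downloaded images exist, a recursive search for .sif files is one option. A minimal sketch, assuming the images sit somewhere below /ceph/container (the pytorch_24.07.sif path used elsewhere in this workshop suggests subfolders); list_images is a made-up helper name.

```shell
# List all Singularity image files (*.sif) below a directory.
# $1: directory to search, e.g. /ceph/container
list_images() {
    dir="$1"
    if [ -d "$dir" ]; then
        find "$dir" -type f -name '*.sif' | sort
    else
        echo "Directory $dir not found; are you on AI-LAB?" >&2
    fi
}

list_images /ceph/container
```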
Using containers
- singularity exec [path/to/container] [action]: Tells Singularity to start a container and execute an action in that container instance, e.g.:
srun singularity exec /ceph/container/pytorch/pytorch_24.07.sif python3 torch_bm.py
- You need to add --nv and --gres=gpu:1 to enable NVIDIA libraries and allocate a GPU to the job:
srun --gres=gpu:1 singularity exec --nv /ceph/container/pytorch/pytorch_24.07.sif python3 torch_bm.py
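Since the GPU command is long, it can be convenient to wrap it in a small shell function. A minimal sketch, not an official AI-LAB tool: run_on_gpu and the DRY_RUN variable are made up for this example. With DRY_RUN=1, or when srun is not installed, the function only prints the command it would run.

```shell
# Run a command inside a container on one GPU.
# $1: path to the .sif image; remaining arguments: the command to execute
run_on_gpu() {
    container="$1"
    shift
    if [ "${DRY_RUN:-0}" = "1" ] || ! command -v srun >/dev/null 2>&1; then
        # Dry run: show the command instead of executing it.
        echo "Would run: srun --gres=gpu:1 singularity exec --nv $container $*"
    else
        srun --gres=gpu:1 singularity exec --nv "$container" "$@"
    fi
}

run_on_gpu /ceph/container/pytorch/pytorch_24.07.sif python3 torch_bm.py
```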
Let's put it all together in a job script
/ceph/course/claaudia/docs/run.sh:
#!/bin/bash
#SBATCH --job-name=torch_benchmark # Name of the job
#SBATCH --output=output_%j.log # Output file name (%j will be replaced with the job ID)
#SBATCH --error=error_%j.log # Error file name (%j will be replaced with the job ID)
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --mem=16G # Request 16G Memory
#SBATCH --cpus-per-task=15 # Request 15 CPUs
# Run the command
singularity exec --nv /ceph/container/pytorch/pytorch_24.07.sif python3 torch_bm.py
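Because the script above sets --output and --error, a job with ID 1234 would write to output_1234.log and error_1234.log rather than the default slurm-1234.out. A minimal sketch for viewing both files at once; show_job_logs is a made-up helper name, and 1234 is a placeholder job ID.

```shell
# Print the output and error logs of a job, if they exist in the current folder.
# $1: the job ID reported by sbatch or squeue
show_job_logs() {
    job_id="$1"
    for f in "output_${job_id}.log" "error_${job_id}.log"; do
        if [ -f "$f" ]; then
            echo "== $f =="
            cat "$f"
        else
            echo "== $f not found (yet) =="
        fi
    done
}

show_job_logs 1234   # replace 1234 with your own job ID
```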
The final exercises!
Let's run the torch_bm.py script using a pre-made job script.
- Copy the job script /ceph/course/claaudia/docs/run.sh to the workshop folder using cp /ceph/course/claaudia/docs/run.sh ~/workshop (this replaces the run.sh you created earlier)
- Use cat run.sh to check the content
- Run the job script using sbatch run.sh
- Use squeue --me to check if your job is running, indicated by an R below ST
- Locate the JOBID and cancel your job using scancel [JOBID]
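The ST column of squeue holds short state codes, and R is only one of them. A minimal sketch that translates a few common Slurm codes into words; explain_state is a made-up helper name, the list of codes is deliberately incomplete, and the squeue part only runs where squeue is installed.

```shell
# Translate a compact Slurm job state code into a short description.
# $1: state code as shown in the ST column
explain_state() {
    case "$1" in
        PD) echo "pending: waiting for resources" ;;
        R)  echo "running" ;;
        CG) echo "completing: finishing up" ;;
        *)  echo "in another state ($1); see the squeue manual page" ;;
    esac
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header; %i is the job ID and %t the compact state code.
    squeue --me -h -o "%i %t" | while read -r job_id state; do
        echo "Job $job_id is $(explain_state "$state")"
    done
fi
```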
################################################################################
████████╗██╗ ██╗ █████╗ ███╗ ██╗██╗ ██╗███████╗
╚══██╔══╝██║ ██║██╔══██╗████╗ ██║██║ ██╔╝██╔════╝
██║ ███████║███████║██╔██╗ ██║█████╔╝ ███████╗
██║ ██╔══██║██╔══██║██║╚██╗██║██╔═██╗ ╚════██║
██║ ██║ ██║██║ ██║██║ ╚████║██║ ██╗███████║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═══╝╚═╝ ╚═╝╚══════╝
################### If you need any help, please visit: #######################
https://serviceportal.aau.dk -> IT -> CLAAUDIA -> Support for CLAAUDIA services
################################################################################