LUMI
LUMI is a supercomputer located in Kajaani, Finland, and ranks among the world's top 10 supercomputers on the Top500 list. The true power of the system lies in its ability to scale jobs massively, but we also want to encourage our users to experiment with the system and to think of it as an extension of the compute capacity available to them as researchers at AAU.
Access
The system is funded by EuroHPC and the LUMI consortium, of which Denmark is a member. For this reason, CLAAUDIA can provide AAU users with direct access to the system.
Recommendations for acquiring compute time on LUMI
Acquiring LUMI resources should always be a two-step process:
- Acquire resources for testing out the system:
It is always good to do a test run on the system prior to reaching out for a larger grant. Being able to demonstrate that you can utilise the system effectively, and that your project fits the system, will greatly help your chances of being awarded the resources. Furthermore, acquiring too large a resource and leaving it unused may make that resource unavailable to others who could have put it to good use.
Make use of one of the following:
- AAU's local resource pool: Fill out our application form.
- EuroHPC: Read more on our page dedicated to this option.
- Acquire resources for actual project work:
When you have demonstrated that your application is a good fit for the system, you may reach out for a larger grant.
Make use of one of the following:
- AAU's local resource pool: Fill out our application form. Suitable for modest/large grants, depending on our budget.
- DeiC's national resource pool: Read more on our page dedicated to this option.
- EuroHPC: Read more on our page dedicated to this option.
User support
User support for the system is provided jointly by CLAAUDIA and the LUMI User Support Team (LUST).
With a EuroHPC grant, it is possible to apply for in-depth assistance from HPC experts through the EPICURE project.
Software
LUMI utilises two main software components:
- Slurm queueing system for distributing resources.
- Singularity container framework for containerising software.
Users familiar with AI Cloud or AI-LAB will find operating the system a familiar experience.
Hardware
Compute nodes
LUMI consists of multiple compute partitions. Two of the main ones are:
Partition | Number of nodes | Purpose | Node configuration |
---|---|---|---|
LUMI-C | 2048 | Scalable, demanding CPU operations | 128 AMD EPYC CPU cores per node, with 256, 512 or 1024 GB RAM |
LUMI-G | 2978 | Scalable, demanding GPU operations | 4 x AMD MI250X GPUs per node (128 GB GPU memory each) |
The table above is only intended to provide a rough overview of the hardware. A more complete overview can be found in the official LUMI documentation.
A note on AMD GPU hardware
The AMD ROCm ecosystem has matured a lot in recent years, and in most cases Nvidia-targeted code can be ported to an AMD system with only minor tweaks.
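As a small illustration (a sketch, assuming a GPU has been allocated to your job and that you are inside one of the ROCm-enabled PyTorch containers mentioned later on this page), the usual CUDA-style device checks in PyTorch work unchanged on LUMI's AMD GPUs, because the ROCm builds of PyTorch expose the same torch.cuda API:
# Run inside a ROCm PyTorch container on a LUMI-G node (illustrative)
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# The second line is expected to report an AMD Instinct MI250X device if a GPU is visible to the job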
Network
All compute nodes are equipped with a high-speed interconnect, which ensures high transfer speeds between compute nodes and storage partitions.
Storage
LUMI also has several different storage partitions serving different purposes. Users should note that these have different storage quotas and different billing rates, i.e. the allocated storage units (TB-hours) are spent at different rates depending on which storage partition is being used.
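As an illustrative calculation (the billing factors here are assumptions; the actual per-partition rates are listed in the official LUMI documentation): keeping 2 TB of data for 10 hours on a partition billed at a factor of 1 would consume 2 x 10 = 20 TB-hours of the storage allocation, while the same data on a partition billed at a factor of 10 would consume 200 TB-hours.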
Using the system
Before logging in
Assuming that you have decided to make use of AAU's local resource pool, follow the instructions in the letter of approval, sent out following your resource application. This involves completing AAU's identity verification procedure and uploading an SSH key to the system.
Log in to the system
Log in according to the official instructions. Please note that after uploading your SSH key, you may need to wait ~20 minutes for the server to synchronise.
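As a quick sketch (the key path and username are placeholders; use the login address and username given in the official instructions), logging in typically looks like this:
# Replace the key path and <username> with your own details
ssh -i ~/.ssh/id_ed25519 <username>@lumi.csc.fi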
Looking around
Optionally, git clone our LUMI-starter-pack:
git clone https://github.com/aau-claaudia/lumi-starter-pack.git
LUMI's operating system, Cray OS, is a variant of Linux, and the system can thus be navigated using regular GNU/Linux commands.
LUMI uses Modules to manage software environments. Loading modules essentially just alters the $PATH variable, allowing you to access additional software and/or versions. Learn about using modules here.
Each time you log in to the system, the lumi-tools module is automatically loaded, giving you access to commands like lumi-workspaces, lumi-allocations and lumi-check-quota, which allow you to inspect your project and its resource consumption.
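A minimal sketch of a first look around (the LUMI stack version shown here is an assumption; pick one of the versions listed by module avail):
# List available software modules and load a LUMI software stack
module avail
module load LUMI/24.03
# Inspect your projects, remaining compute allocations and storage quotas
lumi-workspaces
lumi-allocations
lumi-check-quota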
Transfer your files
Transfer your files according to the official instructions. We recommend rsync for its ability to resume a transfer if it gets interrupted.
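A minimal sketch of such a transfer (the local directory, username and project number are placeholders):
# Copy a local directory to your scratch area on LUMI; --partial keeps partially
# transferred files so that re-running the same command resumes the transfer
rsync -avz --partial --progress ./my_dataset/ <username>@lumi.csc.fi:/scratch/project_<number>/my_dataset/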
Don't forget - storage units are spent continuously
Please be mindful of the fact that storage units are spent continuously. There's no need to constantly move files around, but do keep an eye on your storage quota and get in touch with us if you are nearing its limits. We will likely be able to help you find more resources.
Prepare software environment
We recommend creating Singularity images with the tool Cotainr, which has been designed with LUMI in mind, e.g. it has the flag --system=lumi-g for selecting a base image suited for the LUMI-G partition.
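A minimal sketch of building an image with Cotainr from a conda environment file (the file names and environment contents are assumptions; see the Cotainr documentation for details):
# Make cotainr available (one way; see the LUMI docs for the recommended module stack)
module load LUMI cotainr
# Build a container image targeting the GPU partition from a conda environment specification
cotainr build my_container.sif --system=lumi-g --conda-env=my_env.yml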
You can also find pre-built container images in: /appl/local/containers/sif-images
Run your first job
It is generally recommended to launch your jobs with batch-scripts. We provide the following example, which you can use as a basis for your own batch scripts.
#!/bin/bash
#SBATCH --job-name=torch_bm
#SBATCH --account=project_415001489
#SBATCH --partition=small-g
#SBATCH --gpus=1
#SBATCH --cpus-per-task=15
#SBATCH --time=01:00:00
#SBATCH --output=out.%x_%j
#SBATCH --error=err.%x_%j
# Directories
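# The project's three storage areas on LUMI: persistent project storage (/project),
# large parallel scratch space (/scratch) and fast flash-based scratch (/flash)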
PROJECT="/project/$SLURM_JOB_ACCOUNT"
SCRATCH="/scratch/$SLURM_JOB_ACCOUNT"
FLASH="/flash/$SLURM_JOB_ACCOUNT"
mkdir -p $PROJECT $SCRATCH $FLASH
# Container image (here we are targeting one that comes preinstalled on the system)
lumi_images="/appl/local/containers/sif-images"
lumi_pytorch_base="$lumi_images/lumi-pytorch-rocm-6.2.3-python-3.12-pytorch-v2.5.1.sif"
CONTAINER="$lumi_pytorch_base"
# Script
SCRIPT="$PROJECT/torch_bm.py"
# The command to execute on the node(s); $WITH_CONDA activates the Python environment shipped inside the container
srun --chdir="$PROJECT" singularity exec --bind="$PROJECT,$SCRATCH,$FLASH" "$CONTAINER" bash -c "\$WITH_CONDA; python3 $SCRIPT"
Consider making the following adjustments:
- Decide on a good naming convention for your runs. Pass this to --job-name.
- Find your project account number using the command lumi-workspaces. Pass this to the --account parameter.
- Run the job on an appropriate compute partition:
  - View the partitions with the command: sinfo -o "%25P %5D %l"
  - Read about the compute hardware in the official documentation.
- Replace the paths to your CONTAINER and SCRIPT.
Finally, run this batch script with sbatch run.sh (or whatever you called the file).
Monitor the job
Confirm that it is running with:
squeue --me
Note the jobid reported by squeue and run the following to monitor the GPU activity (replace the example jobid with your own):
srun --overlap --jobid=7100665 rocm-smi
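Once the job has finished, you can also check its outcome and resource usage with Slurm's accounting command (the jobid is the same example value as above):
# Show state, runtime and exit code for the job and its steps
sacct -j 7100665 --format=JobID,JobName,Partition,State,Elapsed,ExitCode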