LUMI
LUMI is a supercomputer located in Kajaani, Finland, and ranks among the world's top 10 supercomputers on the Top500 list. The true power of the system lies in its ability to scale jobs massively, but we also want to encourage our users to experiment with the system and to think of it as an extension of the compute capacity available to them as researchers at AAU.
Access
The system is funded by EuroHPC and the LUMI consortium, of which Denmark is a member. For this reason, CLAAUDIA can provide AAU users with direct access to the system.
Recommendations for acquiring compute time on LUMI
Acquiring LUMI resources should always be a two-step process:
- Acquire resources for testing out the system:
It is always good to do a test run on the system prior to reaching out for a larger grant. Being able to demonstrate that you can utilise the system effectively, and that your project fits the system, will greatly help your chances of being awarded the resources. Furthermore, acquiring too large a resource and leaving it unused may make that resource unavailable to others who could have put it to good use.
Make use of one of the following:
- AAU's local resource pool: Fill out our application form.
- EuroHPC: Read more on our page dedicated to this option.
- Acquire resources for actual project work:
When you have demonstrated that your application is a good fit for the system, you may reach out for a larger grant.
Make use of one of the following:
- AAU's local resource pool: Fill out our application form. Suitable for modest/large grants, depending on our budget.
- DeiC's national resource pool: Read more on our page dedicated to this option.
- EuroHPC: Read more on our page dedicated to this option.
User support
User support for the system is provided jointly by CLAAUDIA and the LUMI User Support Team (LUST).
With a EuroHPC grant, it is possible to apply for in-depth assistance from HPC experts through the EPICURE project.
Software
LUMI utilises two main software components:
- Slurm queueing system for distributing resources.
- Singularity container framework for containerising software.
Users familiar with AI Cloud or AI-LAB will find operating the system a familiar experience.
Hardware
Compute nodes
LUMI consists of multiple compute partitions. Two of the main ones are:
Partition | Number of nodes | Purpose | Node configuration |
---|---|---|---|
LUMI-C | 2048 | Scalable, demanding CPU operations | 128 AMD EPYC CPU cores per node, with 256, 512 or 1024 GB RAM |
LUMI-G | 2978 | Scalable, demanding GPU operations | 4 x AMD MI250X GPUs per node (128 GB GPU memory each) |
The table above is only intended to provide a rough overview of the hardware. A more complete overview can be found in the official LUMI documentation.
A note on AMD GPU hardware
The AMD ROCm ecosystem has matured a lot in recent years, and in most cases Nvidia-targeted code can be ported to an AMD system with only minor tweaks.
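As a small illustration (a sketch, assuming a GPU has been allocated to your job and that you are inside one of the ROCm-enabled PyTorch containers mentioned later on this page), the usual CUDA-style device checks in PyTorch work unchanged on LUMI's AMD GPUs, because the ROCm builds of PyTorch expose the same torch.cuda API:
# Run inside a ROCm PyTorch container on a LUMI-G node (illustrative)
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# The second line is expected to report an AMD Instinct MI250X device if a GPU is visible to the job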
Network
All compute nodes are equipped with a high-speed interconnect, which ensures high transfer speeds between compute nodes and storage partitions.
Storage
LUMI also has several different storage partitions serving different purposes. Users should note that these have different storage quotas and different billing rates, i.e. the allocated storage units (TB-hours) are spent at different rates depending on which storage partition is being used.
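As an illustrative calculation (the billing factors here are assumptions; the actual per-partition rates are listed in the official LUMI documentation): keeping 2 TB of data for 10 hours on a partition billed at a factor of 1 would consume 2 x 10 = 20 TB-hours of the storage allocation, while the same data on a partition billed at a factor of 10 would consume 200 TB-hours.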
Using the system
Before logging in
Assuming that you have decided to make use of AAU's local resource pool, follow the instructions in the letter of approval, sent out following your resource application. This involves completing AAU's identity verification procedure and uploading an SSH key to the system.
Log in to the system
Log in according to the official instructions. Please note that after uploading your SSH key, you may need to wait ~20 minutes for the server to synchronise.
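As a quick sketch (the key path and username are placeholders; use the login address and username given in the official instructions), logging in typically looks like this:
# Replace the key path and <username> with your own details
ssh -i ~/.ssh/id_ed25519 <username>@lumi.csc.fi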
Looking around
Optionally, git clone our LUMI-starter-pack:
git clone https://github.com/aau-claaudia/lumi-starter-pack.git
LUMI's operating system, Cray OS, is a variant of Linux, and the system can thus be navigated using regular GNU/Linux commands.
LUMI uses Modules to manage software environments. Loading modules essentially just alters the $PATH variable, allowing you to access additional software and/or versions. Learn about using modules here.
Each time you log in to the system, the lumi-tools module is automatically loaded, giving you access to commands like lumi-workspaces, lumi-allocations and lumi-check-quota, which allow you to inspect your project and its resource consumption.
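A minimal sketch of a first look around (the LUMI stack version shown here is an assumption; pick one of the versions listed by module avail):
# List available software modules and load a LUMI software stack
module avail
module load LUMI/24.03
# Inspect your projects, remaining compute allocations and storage quotas
lumi-workspaces
lumi-allocations
lumi-check-quota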
Transfer your files
Transfer your files according to the official instructions. We recommend rsync for its ability to resume a transfer if it gets interrupted.
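A minimal sketch of such a transfer (the local directory, username and project number are placeholders):
# Copy a local directory to your scratch area on LUMI; --partial keeps partially
# transferred files so that re-running the same command resumes the transfer
rsync -avz --partial --progress ./my_dataset/ <username>@lumi.csc.fi:/scratch/project_<number>/my_dataset/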
Don't forget - storage units are spent continuously
Please be mindful of the fact that storage units are spent continuously. There's no need to constantly move files around, but do keep an eye on your storage quota and get in touch with us if you are nearing its limits. We will likely be able to help you find more resources.
Prepare software environment
We recommend creating Singularity images with the tool Cotainr, which has been designed with LUMI in mind, e.g. it has the flag --system=lumi-g for selecting a base image suited for the LUMI-G partition.
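A minimal sketch of building an image with Cotainr from a conda environment file (the file names and environment contents are assumptions; see the Cotainr documentation for details):
# Make cotainr available (one way; see the LUMI docs for the recommended module stack)
module load LUMI cotainr
# Build a container image targeting the GPU partition from a conda environment specification
cotainr build my_container.sif --system=lumi-g --conda-env=my_env.yml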
You can also find pre-built container images in: /appl/local/containers/sif-images
Run your first job
It is generally recommended to launch your jobs with batch-scripts. We provide the following example, which you can use as a basis for your own batch scripts.
#!/bin/bash
#SBATCH --job-name=torch_bm
#SBATCH --account=project_415001489
#SBATCH --partition=small-g
#SBATCH --gpus=1
#SBATCH --cpus-per-task=15
#SBATCH --time=01:00:00
#SBATCH --output=out.%x_%j
#SBATCH --error=err.%x_%j
# Directories
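# The project's three storage areas on LUMI: persistent project storage (/project),
# large parallel scratch space (/scratch) and fast flash-based scratch (/flash)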
PROJECT="/project/$SLURM_JOB_ACCOUNT"
SCRATCH="/scratch/$SLURM_JOB_ACCOUNT"
FLASH="/flash/$SLURM_JOB_ACCOUNT"
mkdir -p $PROJECT $SCRATCH $FLASH
# Container image (here we are targeting one that comes preinstalled on the system)
lumi_images="/appl/local/containers/sif-images"
lumi_pytorch_base="$lumi_images/lumi-pytorch-rocm-6.2.3-python-3.12-pytorch-v2.5.1.sif"
CONTAINER="$lumi_pytorch_base"
# Script
SCRIPT="$PROJECT/torch_bm.py"
# The command to execute on the node(s); $WITH_CONDA activates the Python environment shipped inside the container
srun --chdir="$PROJECT" singularity exec --bind="$PROJECT,$SCRATCH,$FLASH" "$CONTAINER" bash -c "\$WITH_CONDA; python3 $SCRIPT"
Consider making the following adjustments:
- Decide on a good naming convention for your runs. Pass this to --job-name.
- Find your project account number using the command lumi-workspaces. Pass this to the --account parameter.
- Run the job on an appropriate compute partition:
  - View the partitions with the command: sinfo -o "%25P %5D %l"
  - Read about the compute hardware in the official documentation.
- Replace the paths to your CONTAINER and SCRIPT.
Finally, run this batch script with sbatch run.sh (or whatever you called the file).
Monitor the job
Confirm that it is running with:
squeue --me
Note the jobid reported by squeue and run the following to monitor the GPU activity (replace the example jobid with your own):
srun --overlap --jobid=7100665 rocm-smi
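Once the job has finished, you can also check its outcome and resource usage with Slurm's accounting command (the jobid is the same example value as above):
# Show state, runtime and exit code for the job and its steps
sacct -j 7100665 --format=JobID,JobName,Partition,State,Elapsed,ExitCode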