Skip to content

System overview

Hardware

The AI-LAB platform is built around several key components, including two front-end nodes for managing tasks and code, and 11 compute nodes equipped with diverse hardware options.

In this overview, you will find a description of each major component of AI-LAB. Below, is a diagram illustrating the architecture of the AI-LAB platform.

flowchart LR
  subgraph id1[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Compute nodes</p>]
  direction TB
  A["<span><img src="/assets/img/server.svg"  width='25' height='25' >ailab-l4-[01-11]</span>"]
  end

  subgraph id2[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 16px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 10px;">AI-LAB</p>]
  direction TB
  subgraph id3[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Front-end nodes</p>]
    direction TB
    G["<span><img src="/assets/img/server.svg" width='25' height='25'>ailab-fe[01-02]</span>"]
    end
  id3 --> id1 

  subgraph id4[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">File storage</p>]
    direction TB
    E["<span><img src="/assets/img/server.svg" width='25' height='25'>Ceph</span>"]
    end

  id1 & id3 <--> id4
  end

  F[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]-- SSH --> id3

Front-end nodes

You start by logging into a front-end node, either ailab-fe01 or ailab-fe02. These nodes act as the gateway to the HPC system. Here, you can manage files, write and edit code, and prepare your computational tasks. It is important to note that front-end nodes are not intended for heavy computations; they are optimized for task preparation and interaction with the HPC environment.


Compute nodes

AI-LAB currently include the following compute nodes:

Node name CPU model Sockets Threads (Logical CPUs) Number of GPUs GPU Model RAM pr GPU (GB)
ailab-l4-[01-11] AMD EPYC 7543 32-Core 2 128 8 NVIDIA L4 24

Software

AI-LAB is based on Ubuntu Linux as its operating system. In practice, working on AI-LAB primarily takes place via a command-line interface.

AI-LAB leverages two primary software components: Slurm and Singularity. Understanding these tools and how they work together is crucial for efficiently utilizing the AI-LAB platform.

Slurm

Slurm is a powerful and highly configurable workload manager used for scheduling and managing compute jobs on AI-LAB. It provides essential features such as:

  • Job Scheduling: Allocating resources to jobs based on user requests and system policies.
  • Resource Management: Tracking and managing compute resources, ensuring optimal utilization.
  • Queue Management: Organizing jobs into queues, prioritizing and executing them based on policies and resource availability.

On AI-LAB, Slurm is responsible for managing the allocation and scheduling of compute resources, ensuring that user jobs are executed efficiently and fairly.

flowchart LR
  B["<span><img src="/assets/img/server.svg" width='25' height='25'>Front-end node</span>"]

  C["<span><img src="/assets/img/code-file.svg" width='25' height='25'>Job id 4</span>"]

  subgraph slurm[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 16px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 10px;">Slurm queue</p>]
    direction LR
    D1["<span><img src="/assets/img/code-file.svg" width='25' height='25'>Job id 4</span>"]
    D2["<span><img src="/assets/img/code-file.svg" width='25' height='25'>Job id 3</span>"]
    D3["<span><img src="/assets/img/code-file.svg" width='25' height='25'>Job id 2</span>"]
    D1 -.- D2 -.- D3
    end

  subgraph cluster[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Compute nodes</p>]
    direction LR

    E1["<span><img src="/assets/img/code-file.svg" width='25' height='25'>Job id 1</span>"]
    E2["<span><img src="/assets/img/server.svg"  width='25' height='25' >ailab-l4-01</span>"]

    E1 --> E2
    end

  B --> C --> slurm --> cluster

  style D1 stroke-dasharray: 5 5

Singularity

Singularity is a container platform designed for running applications on AI-LAB. Containers are lightweight, portable, and reproducible environments that bundle an application's code, libraries, and dependencies. Key features of Singularity include:

  • Compatibility: Running containers with high-performance computing workloads without requiring root privileges.
  • Portability: Enabling the same container to run on different systems without modification.
  • Integration with HPC Systems: Designed to work seamlessly with HPC job schedulers like Slurm.


Pre-Downloaded Containers on AI-LAB

AI-LAB provides a variety of pre-downloaded containers to help users get started quickly. These containers are stored in the /ceph/container directory. The list of available containers is periodically updated, and users can propose new containers by contacting the support team. Currently available container images includes:

  • PyTorch (CPU/GPU)
  • TensorFlow (CPU/GPU)
  • ImageMagick (CPU)
  • MATLAB (CPU/GPU)

Interconnection of Slurm and Singularity

On AI-LAB, Slurm and Singularity work together. Slurm handles the job scheduling and resource allocation, while Singularity ensures that the specified container environment is instantiated and the application runs with all its dependencies.

flowchart LR
  A[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]
  B["<span><img src="/assets/img/server.svg" width='25' height='25'>Front-end node</span>"]
  C["<span><img src="/assets/img/container.svg" width='25' height='25'>Singularity container job</span>"]
  D["<span><img src="/assets/img/queue.svg" width='25' height='25'>Slurm</span>"]
  E["<span><img src="/assets/img/server.svg" width='25' height='25'>Compute node</span>"]

  A-- SSH --> B --> C --> D --> E-- Result --> B

  style C stroke-dasharray: 5 5
  style D stroke-dasharray: 5 5

Storage

AI-LAB utilizes Ceph as its storage solution, providing a robust and scalable file system for your data needs. Your files are organized within the Ceph file system hierarchy, ensuring efficient access and management across the entire platform.


User Directory

Your user directory serves as the primary location for storing personal files and data. It is structured within the Ceph file system as follows:

  • /ceph AI-LAB's file system
    • home user home directories
      • [domain] e.g student.aau.dk
        • [user] your user directory

Here, [domain] represents your domain or institution (e.g., student.aau.dk), and [user] denotes your unique username on the platform. Any files you store within your user directory are private.

Storage quota

When users log in to AI-LAB for the first time, a user directory is created for them. These directories are allocated 1 TB of storage by default. When you log in to the platform, you can see your storage usage of the user directory at the very top line:

Current quota usage: 181GiB / 1.0TiB
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-169-generic x86_64)

* Documentation:  https://help.ubuntu.com
* Management:     https://landscape.canonical.com
* Support:        https://ubuntu.com/pro

System information as of Fri Mar 15 11:09:21 CET 2024

Shared Project Directories

AI-LAB fosters collaborative work through shared project directories. These directories enable multiple users to collaborate on projects by providing a centralized space for data sharing and collaboration. Shared project directories are organized under the project directory within the Ceph file system:

  • /ceph AI-LAB's file system
    • project shared project directories
      • project_X

Read our guide on how to make shared project directories.


Course Materials

To support educational activities, AI-LAB hosts course-specific materials within dedicated directories. These materials include lecture notes, assignments, datasets, and any resources relevant to the course curriculum. Course directories are structured under the course directory within the Ceph file system:

  • /ceph AI-LAB's file system
    • course directory with course specific material
      • Course 1. Introduction to TensorFLow
        • Images
        • tensorflow.sif
      • Course 2. ...

Students and instructors can access course materials effortlessly, enhancing the learning experience and facilitating hands-on exercises.


Ready-to-Use Applications

For convenience and efficiency, AI-LAB offers a collection of ready-to-use applications packaged as container images that can easily be copied to your user directoty. We aim to consistently update these images to the latest versions.

  • /ceph
    • container directory with ready-to-use applications
      • tensorflow/tensorflow.sif
      • pytorch/pytorch.sif
      • ...sif

If you have specific container image requests, we welcome your input. Please reach out to us via the AAU service portal and include "CLAAUDIA" and "AI-LAB" in the subject line.