
System overview

Hardware

The AI Cloud platform is built around several key components, including a front-end node for managing tasks and code, and a collection of compute nodes equipped with diverse hardware options (detailed in the table below).

In this overview, you will find a description of each major component of AI Cloud. Below is a diagram illustrating the architecture of the AI Cloud platform.

flowchart LR
  subgraph id1[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Compute nodes</p>]
  direction TB
  A["<span><img src="/assets/img/server.svg"  width='25' height='25' ><p>a256-t4-[01-03]</p><p>i256-a10-[06-10]</p><p>a256-a40-[04-07]</p><p>i256-a40-[01-02]</p><p>a512-l4-06</p><p>nv-ai-[01-03]</p><p>nv-ai-04</p><p>a768-l40s-[01-06]</p><p>a512-mi100-01</p>
  </span>"]
  end

  subgraph id2[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 16px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 10px;">AI Cloud</p>]
  direction TB
  subgraph id3[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Front-end node</p>]
    direction TB
    G["<span><img src="/assets/img/server.svg" width='25' height='25'>ai-fe02</span>"]
    end
  id3 --> id1 

  subgraph id4[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">File storage</p>]
    direction TB
    E["<span><img src="/assets/img/server.svg" width='25' height='25'>Ceph</span>"]
    end

  id1 & id3 <--> id4
  end

  F[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]-- SSH --> id3

Front-end node

You start by logging into the front-end node, ai-fe02. This node acts as the gateway to the HPC system. Here, you can manage files, write and edit code, and prepare your computational tasks. It is important to note that the front-end node is not intended for heavy computations; it is optimized for task preparation and interaction with the HPC environment.
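As an illustration, logging in from your own machine follows the usual SSH pattern. The fully qualified hostname and the username format shown here are assumptions, so consult your account instructions for the exact values:

# Hypothetical login example; substitute your own username and domain
ssh jdoe@student.aau.dk@ai-fe02.srv.aau.dk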


Compute nodes

AI Cloud currently includes the following compute nodes:

| Name | Nodes in total | GPUs per node | CPU cores per node | CPU HW threads | RAM per node | RAM per GPU | Local disk | NVLink / Infinity Fabric link | Primary usage |
|---|---|---|---|---|---|---|---|---|---|
| a256-t4-[01-03] | 3 | 6 (NVIDIA T4) | 32 (AMD EPYC) | 64 | 256 GB | 16 GB | - | No | Interactive / smaller single-GPU jobs |
| i256-a10-[06-10] | 5 | 4 (NVIDIA A10) | 32 (Intel Xeon) | 64 | 256 GB | 24 GB | - | No | Interactive / medium single-GPU jobs |
| a256-a40-[04-07] | 4 | 3 (NVIDIA A40) | 32 (AMD EPYC) | 32 | 256 GB | 48 GB | - | No | Large single-GPU jobs |
| i256-a40-[01-02] | 2 | 4 (NVIDIA A40) | 24 (Intel Xeon) | 24 | 256 GB | 48 GB | 6.4 TB /raid | Yes (2×2) | Large single-/multi-GPU jobs |
| a512-mi100-01 | 1 | 8 (AMD MI100) | 16 (AMD EPYC) | 32 | 512 GB | 32 GB | - | Yes (Infinity Fabric link) | Large / batch / multi-GPU jobs |
| a512-l4-06 | 6 | 8 (NVIDIA L4) | 64 (AMD EPYC) | 128 | 512 GB | 24 GB | - | No | Large / batch / multi-GPU jobs |
| a768-l40s-[01-06] | 6 | 8 (NVIDIA L40s) | 64 (AMD EPYC) | 128 | 768 GB | 48 GB | - | No | Large / batch / multi-GPU jobs |
| nv-ai-[01-03] | 3 | 16 (NVIDIA V100) | 48 (Intel Xeon) | 96 | 1470 GB | 32 GB | 30 TB /raid | Yes | Large / batch / multi-GPU jobs |
| nv-ai-04 | 1 | 8 (NVIDIA A100) | 128 (AMD EPYC) | 256 | 980 GB | 40 GB | 14 TB /raid | Yes | Large / batch / multi-GPU jobs |

Note

The compute nodes nv-ai-04, i256-a40-01, and i256-a40-02 are owned by specific research groups or centers which have first-priority access to them. Other users can only access them on a limited basis, and your jobs may be cancelled by higher-priority jobs. Users outside the prioritized groups can only use these nodes via the "batch" partition (use the option --partition=batch for your jobs).
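As a minimal sketch, submitting a job to this lower-priority partition looks as follows; the script name is just a placeholder:

# Submit a job script to the "batch" partition
sbatch --partition=batch my_job.sh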

Software

AI Cloud is based on Ubuntu Linux as its operating system. In practice, working on AI Cloud primarily takes place via a command-line interface.

AI Cloud leverages two primary software components: Slurm and Singularity. Understanding these tools and how they work together is crucial for efficiently utilizing the AI Cloud platform.

Slurm

Slurm is the workload manager used for scheduling and managing jobs on AI Cloud. It provides essential features such as:

  • Job Scheduling: Allocating resources to jobs based on user requests and system policies.
  • Resource Management: Tracking and managing compute resources, ensuring optimal utilization.
  • Queue Management: Organizing jobs into queues, prioritizing and executing them based on policies and resource availability.

On AI Cloud, Slurm is responsible for managing the allocation and scheduling of compute resources, ensuring that user jobs are executed efficiently and fairly.
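As a sketch of how a job is described to Slurm, a batch script might look like the following. The resource values are purely illustrative, not recommendations for this platform, and the commands at the end are placeholders:

#!/usr/bin/env bash
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=4         # CPU cores for the task
#SBATCH --mem=16G                 # RAM for the job
#SBATCH --time=01:00:00           # wall-clock time limit

# The commands below run on the compute node Slurm allocates
hostname
nvidia-smi

Such a script is submitted with sbatch, and squeue shows its state while it waits for and uses resources.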


Singularity

Singularity is the container platform used for running applications on AI Cloud. Containers are portable and reproducible environments that bundle an application's code, libraries, and dependencies. Key features of Singularity include:

  • Compatibility: Running containers with high-performance computing workloads without requiring root privileges.
  • Portability: Enabling the same container to run on different systems without modification.
  • Integration with HPC Systems: Designed to work seamlessly with HPC job schedulers like Slurm.
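A small sketch of typical Singularity usage; the image is an arbitrary example, not one provided by the platform:

# Pull a container image from Docker Hub into a local .sif file
singularity pull docker://ubuntu:22.04

# Run a command inside the container; no root privileges required
singularity exec ubuntu_22.04.sif cat /etc/os-release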

Interconnection of Slurm and Singularity

On AI Cloud, Slurm and Singularity work together. Slurm handles the job scheduling and resource allocation, while Singularity ensures that the specified container environment is instantiated and the application runs with all its dependencies.

flowchart LR
  A[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]
  B["<span><img src="/assets/img/server.svg" width='25' height='25'>Front-end node</span>"]
  C["<span><img src="/assets/img/container.svg" width='25' height='25'>Singularity container job</span>"]
  D["<span><img src="/assets/img/queue.svg" width='25' height='25'>Slurm</span>"]
  E["<span><img src="/assets/img/server.svg" width='25' height='25'>Compute node</span>"]

  A-- SSH --> B  --> D --> E --> C-- Result --> B

  style C stroke-dasharray: 5 5
  style D stroke-dasharray: 5 5
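Putting the two together, a hedged one-line sketch: Slurm allocates the resources, and Singularity instantiates the environment the application runs in. The image and script names below are placeholders, and the --nv flag exposes the host's NVIDIA driver inside the container:

# Ask Slurm for a GPU and run the application inside a container
srun --gres=gpu:1 --mem=16G singularity exec --nv pytorch.sif python train.py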

Storage

AI Cloud utilizes Ceph as its storage solution, providing a robust and scalable file system for your data needs. Your files are organized within the Ceph file system hierarchy, ensuring efficient access and management across the entire platform.


User Directory

Your user directory serves as the primary location for storing personal files and data. It is structured within the Ceph file system as follows:

  • / AI Cloud's file system
    • home user home directories
      • [domain] e.g student.aau.dk
        • [user] your user directory

Here, [domain] represents your domain or institution (e.g., student.aau.dk), and [user] denotes your unique username on the platform. Any files you store within your user directory are private.
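As an illustration, a user jdoe on the student.aau.dk domain (a hypothetical username) would land in a home directory like this:

# After logging in, your home directory is your private user directory
echo $HOME
# -> /home/student.aau.dk/jdoe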


Shared Project Directories

AI Cloud fosters collaborative work through shared project directories. These directories give multiple users a centralized space for sharing data and working on a project together. Shared project directories are organized under the project directory within the Ceph file system:

  • / AI Cloud's file system
    • home user home directories
      • project shared project directories
        • project_X your project directory

Your project directory is a subdirectory of this overall project directory hierarchy.

Go into the project directory:

cd /home/project

Before creating a directory for your group project, please consider naming the directory in a meaningful manner (i.e. after your group or research project). A project directory can be created in the following manner (swap out <name> for the actual name of your project):

mkdir <name> 

Please remember that these directories should be deleted when your project is finished and you no longer need them. They are not intended for long-term data storage.
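Since these directories are meant for collaboration, you will typically want them to be accessible to your collaborators. A generic sketch using standard Unix permissions; this assumes your collaborators share a Unix group with you, which is not covered by this overview:

# Give the owning group read/write/traverse access to the project directory
chmod -R g+rwX /home/project/<name>

# Set the setgid bit so new files inherit the directory's group
chmod g+s /home/project/<name>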

Storage quota expansions

When users log in to AI Cloud for the first time, a user directory is created for them. These directories are allocated 1 TB of storage by default. This should be plenty for most users, but should you need additional space, it is possible to apply for storage quota expansions for a limited time using our Storage quota expansions form.
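If you want to inspect your quota outside the login message, CephFS exposes quotas and usage as extended attributes. This is a sketch assuming the attr tools are installed and the attributes are readable on this system:

# Read the byte quota and current recursive usage of your home directory
getfattr -n ceph.quota.max_bytes $HOME
getfattr -n ceph.dir.rbytes $HOME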

Info

When you log in to the platform, you can see the storage usage of your user directory on the very first line:

Current quota usage: 181GiB / 1.0TiB
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-169-generic x86_64)

* Documentation:  https://help.ubuntu.com
* Management:     https://landscape.canonical.com
* Support:        https://ubuntu.com/pro

System information as of Fri Mar 15 11:09:21 CET 2024