# System overview
## Hardware
The AI Cloud platform is built around several key components, including a front-end node for managing tasks and code, and 27 compute nodes equipped with diverse hardware.
This overview describes each major component of AI Cloud. The diagram below illustrates the architecture of the platform.
```mermaid
flowchart LR
subgraph id1[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Compute nodes</p>]
direction TB
A["<span><img src="/assets/img/server.svg" width='25' height='25' ><p>a256-t4-[01-03]</p><p>i256-a10-[06-10]</p><p>a256-a40-[04-07]</p><p>i256-a40-[01-02]</p><p>a512-l4-06</p><p>nv-ai-[01-03]</p><p>nv-ai-04</p><p>a768-l40s-[01-06]</p><p>a512-mi100-01</p>
</span>"]
end
subgraph id2[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 16px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 10px;">AI Cloud</p>]
direction TB
subgraph id3[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Front-end node</p>]
direction TB
G["<span><img src="/assets/img/server.svg" width='25' height='25'>ai-fe02</span>"]
end
id3 --> id1
subgraph id4[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">File storage</p>]
direction TB
E["<span><img src="/assets/img/server.svg" width='25' height='25'>Ceph</span>"]
end
id1 & id3 <--> id4
end
F[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]-- SSH --> id3
```
### Front-end node
You start by logging into the front-end node, `ai-fe02.srv.aau.dk`. This node acts as the gateway to the cluster. Here, you can manage files, write and edit code, and prepare your computational tasks. Note that the front-end node is not intended for heavy computations: it has only a modest amount of resources, and crashing it affects all users on the system.
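Access is over SSH, for example (a minimal sketch; `<username>` is a placeholder for your AAU account name):

```sh
# Connect to the front-end node from your own machine
ssh <username>@ai-fe02.srv.aau.dk
```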
### Compute nodes
AI Cloud consists of the following compute nodes:
Name | Nodes in total | CPU cores per node | CPU type | RAM per node | GPUs per node | GPU type | RAM per GPU | Local Disk | NVLINK |
---|---|---|---|---|---|---|---|---|---|
a256-t4-[01-03] | 3 | 32 | AMD EPYC | 256 GB | 6 | NVIDIA T4 | 16 GB | - | No |
a256-a40-[04-07] | 4 | 32 | AMD EPYC | 256 GB | 3 | NVIDIA A40 | 48 GB | - | No |
i256-a10-[06-10] | 5 | 32 | Intel Xeon | 256 GB | 4 | NVIDIA A10 | 24 GB | - | No |
i256-a40-[01-02] | 2 | 24 | Intel Xeon | 256 GB | 4 | NVIDIA A40 | 48 GB | 6.4 TB /raid | Yes |
a512-l4-06 | 1 | 64 | AMD EPYC | 512 GB | 8 | NVIDIA L4 | 24 GB | - | No |
a768-l40s-[01-06] | 6 | 64 | AMD EPYC | 768 GB | 8 | NVIDIA L40s | 48 GB | - | No |
nv-ai-[01-03] | 3 | 48 | Intel Xeon | 1470 GB | 16 | NVIDIA V100 | 32 GB | 30 TB /raid | Yes |
nv-ai-04 | 1 | 128 | AMD EPYC | 980 GB | 8 | NVIDIA A100 | 40 GB | 14 TB /raid | Yes |
#### Further inspection of hardware
It is also possible to inspect the compute node hardware using the following commands (shown here for the node `a768-l40s-01`).

A detailed overview of the CPU:

```sh
srun -w a768-l40s-01 lscpu
```

The amount of memory (RAM) in the node:

```sh
srun -w a768-l40s-01 free -h
```

The GPUs in the node (`-G 1` requests one GPU for the job):

```sh
srun -w a768-l40s-01 -G 1 nvidia-smi
```
The compute nodes `i256-a40-01`, `i256-a40-02` and `nv-ai-04` are owned by specific research groups, and members of these groups have first-priority access to them. Other users can access them on a limited basis via the "batch" partition (use the option `--partition=batch` for your jobs). Such jobs run until a member of the aforementioned research groups requests the resources, at which point the batch job is interrupted.
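For example (a minimal sketch; the command to run is only illustrative), a short interactive job on this partition could look like:

```sh
# Request one GPU on the preemptible "batch" partition and show it
srun --partition=batch -G 1 nvidia-smi
```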
## Software
AI Cloud is essentially just a cluster of Linux servers that use Ubuntu as their operating system.
The recommended way of interacting with the platform is through the command line; you can find the essential commands for doing so on linuxjourney.com.
The platform relies on two primary software components: Slurm and Singularity. Learning how to use these tools is an integral part of learning how to use the system. Since they are commonly found on most HPC systems, knowing them will also make it possible to scale your project to a larger supercomputer.
### Slurm
Slurm is the queueing mechanism used for scheduling and managing resources on AI Cloud. It provides essential features such as:
- Job Scheduling: Allocating resources to jobs based on user requests and system policies.
- Resource Management: Tracking and managing compute resources, ensuring optimal utilization.
- Queue Management: Organizing jobs into queues, prioritizing and executing them based on policies and resource availability.
On AI Cloud, Slurm is responsible for managing the allocation and scheduling of compute resources, ensuring that user jobs are executed efficiently and fairly.
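As a minimal, illustrative sketch (the job name, resource amounts, and command are placeholders, and defaults on AI Cloud may differ), a Slurm batch script could look like this:

```sh
#!/bin/bash
#SBATCH --job-name=example   # name shown in the queue
#SBATCH --gres=gpu:1         # request one GPU
#SBATCH --mem=16G            # request 16 GB of system memory
#SBATCH --time=01:00:00      # maximum run time (hh:mm:ss)

# Everything below runs on the allocated compute node
nvidia-smi
```

Such a script would be submitted with `sbatch <script name>`, and the queue can be inspected with `squeue`.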
### Singularity
Singularity is a container platform designed for running applications on AI Cloud. Containers are portable and reproducible environments that bundle an application's code, libraries, and dependencies. Key features of Singularity include:
- Compatibility: Running containers with high-performance computing workloads without requiring root privileges.
- Portability: Enabling the same container to run on different systems without modification.
- Integration with HPC Systems: Designed to work seamlessly with HPC job schedulers like Slurm.
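For example (an illustrative sketch; the image and command are placeholders), an image can be pulled from Docker Hub and a command run inside the resulting container:

```sh
# Build a Singularity image file (.sif) from a Docker Hub image
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run a command inside the container
singularity exec pytorch.sif python3 -c "import torch; print(torch.__version__)"
```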
### Interconnection of Slurm and Singularity
On AI Cloud, Slurm and Singularity work together. Slurm handles the job scheduling and resource allocation, while Singularity ensures that the specified container environment is instantiated and the application runs with all its dependencies.
```mermaid
flowchart LR
A[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]
B["<span><img src="/assets/img/server.svg" width='25' height='25'>Front-end node</span>"]
C["<span><img src="/assets/img/container.svg" width='25' height='25'>Singularity container job</span>"]
D["<span><img src="/assets/img/queue.svg" width='25' height='25'>Slurm</span>"]
E["<span><img src="/assets/img/server.svg" width='25' height='25'>Compute node</span>"]
A-- SSH --> B --> D --> E --> C-- Result --> B
style C stroke-dasharray: 5 5
style D stroke-dasharray: 5 5
```
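In practice, this means the container is typically started through Slurm, for example (an illustrative sketch; the image, script, and resource request are placeholders):

```sh
# Ask Slurm for one GPU and run the containerised script on the allocated node;
# --nv exposes the NVIDIA driver and GPUs inside the container
srun --gres=gpu:1 singularity exec --nv pytorch.sif python3 train.py
```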
## Storage
### Network drive storage
All nodes in the cluster are connected to the same network storage system, so the same files are available on every node. There is no need to specify additional parameters to make your files available on the compute nodes or in container environments.
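For example (the file name is only illustrative), a file created on the front-end node is immediately visible from a compute node:

```sh
# On the front-end node: create a file in your home directory
echo "hello" > ~/hello.txt

# Read the same file from a compute node through Slurm
srun cat ~/hello.txt
```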
Check out our page Overview of directories to learn more.