# System overview
## Hardware
The AI Cloud platform is built around several key components, including a front-end node for managing tasks and code, and 27 compute nodes equipped with diverse hardware.
This overview describes each major component of AI Cloud. The diagram below illustrates the architecture of the platform.
```mermaid
flowchart LR
subgraph id1[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Compute nodes</p>]
direction TB
A["<span><img src="/assets/img/server.svg" width='25' height='25' ><p>a256-t4-[01-03]</p><p>i256-a10-[06-10]</p><p>a256-a40-[04-07]</p><p>i256-a40-[01-02]</p><p>a512-l4-06</p><p>nv-ai-[01-03]</p><p>nv-ai-04</p><p>a768-l40s-[01-06]</p><p>a512-mi100-01</p>
</span>"]
end
subgraph id2[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 16px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 10px;">AI Cloud</p>]
direction TB
subgraph id3[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">Front-end node</p>]
direction TB
G["<span><img src="/assets/img/server.svg" width='25' height='25'>ai-fe02</span>"]
end
id3 --> id1
subgraph id4[<p style="font-family: Barlow, sans-serif; font-weight: 800; font-size: 12px; text-transform: uppercase; color: #221a52; letter-spacing: 1px; margin: 5px;">File storage</p>]
direction TB
E["<span><img src="/assets/img/server.svg" width='25' height='25'>Ceph</span>"]
end
id1 & id3 <--> id4
end
F[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]-- SSH --> id3
```
### Front-end node
You start by logging into the front-end node, `ai-fe02.srv.aau.dk`. This node acts as the gateway to the cluster. Here, you can manage files, write and edit code, and prepare your computational tasks. Note that the front-end node is not intended for heavy computations: it has only a modest amount of resources, and crashing it affects all users on the system.
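Access is over SSH, for example (a minimal sketch; `<username>` is a placeholder for your AAU account name):

```sh
# Connect to the front-end node from your own machine
ssh <username>@ai-fe02.srv.aau.dk
```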
### Compute nodes
AI Cloud consists of the following compute nodes:
Name | Nodes in total | CPU cores per node | CPU type | RAM per node | GPUs per node | GPU type | RAM per GPU | Local Disk | NVLINK |
---|---|---|---|---|---|---|---|---|---|
a256-t4-[01-03] | 3 | 32 | AMD EPYC | 256 GB | 6 | NVIDIA T4 | 16 GB | - | No |
a256-a40-[04-07] | 4 | 32 | AMD EPYC | 256 GB | 3 | NVIDIA A40 | 48 GB | - | No |
i256-a10-[06-10] | 5 | 32 | Intel Xeon | 256 GB | 4 | NVIDIA A10 | 24 GB | - | No |
i256-a40-[01-02] | 2 | 24 | Intel Xeon | 256 GB | 4 | NVIDIA A40 | 48 GB | 6.4 TB /raid | Yes |
a512-l4-06 | 1 | 64 | AMD EPYC | 512 GB | 8 | NVIDIA L4 | 24 GB | - | No |
a768-l40s-[01-06] | 6 | 64 | AMD EPYC | 768 GB | 8 | NVIDIA L40s | 48 GB | - | No |
nv-ai-[01-03] | 3 | 48 | Intel Xeon | 1470 GB | 16 | NVIDIA V100 | 32 GB | 30 TB /raid | Yes |
nv-ai-04 | 1 | 128 | AMD EPYC | 980 GB | 8 | NVIDIA A100 | 40 GB | 14 TB /raid | Yes |
#### Further inspection of hardware
It is also possible to inspect the compute node hardware using the following commands (shown here for the node `a768-l40s-01`).

A detailed overview of the CPU:

```sh
srun -w a768-l40s-01 lscpu
```

The amount of memory (RAM) in the node:

```sh
srun -w a768-l40s-01 free -h
```

The GPUs in the node (`-G 1` requests one GPU for the job):

```sh
srun -w a768-l40s-01 -G 1 nvidia-smi
```
The compute nodes `i256-a40-01`, `i256-a40-02` and `nv-ai-04` are owned by specific research groups, and members of these groups have first-priority access to them. Other users can access them on a limited basis via the "batch" partition (use the option `--partition=batch` for your jobs). Such jobs run until a member of the aforementioned research groups requests the resources, at which point the batch job is interrupted.
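For example (a minimal sketch; the command to run is only illustrative), a short interactive job on this partition could look like:

```sh
# Request one GPU on the preemptible "batch" partition and show it
srun --partition=batch -G 1 nvidia-smi
```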
## Software
AI Cloud is essentially just a cluster of Linux servers that use Ubuntu as their operating system.
The recommended way of interacting with the platform is through the command line; you can find the essential commands for doing so on linuxjourney.com.
The platform relies on two primary software components: Slurm and Singularity. Learning how to use these tools is an integral part of learning how to use the system. Since they are commonly found on most HPC systems, knowing them will also make it possible to scale your project to a larger supercomputer.
### Slurm
Slurm is the queueing mechanism used for scheduling and managing resources on AI Cloud. It provides essential features such as:
- Job Scheduling: Allocating resources to jobs based on user requests and system policies.
- Resource Management: Tracking and managing compute resources, ensuring optimal utilization.
- Queue Management: Organizing jobs into queues, prioritizing and executing them based on policies and resource availability.
On AI Cloud, Slurm is responsible for managing the allocation and scheduling of compute resources, ensuring that user jobs are executed efficiently and fairly.
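As a minimal, illustrative sketch (the job name, resource amounts, and command are placeholders, and defaults on AI Cloud may differ), a Slurm batch script could look like this:

```sh
#!/bin/bash
#SBATCH --job-name=example   # name shown in the queue
#SBATCH --gres=gpu:1         # request one GPU
#SBATCH --mem=16G            # request 16 GB of system memory
#SBATCH --time=01:00:00      # maximum run time (hh:mm:ss)

# Everything below runs on the allocated compute node
nvidia-smi
```

Such a script would be submitted with `sbatch <script name>`, and the queue can be inspected with `squeue`.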
### Singularity
Singularity is a container platform designed for running applications on AI Cloud. Containers are portable and reproducible environments that bundle an application's code, libraries, and dependencies. Key features of Singularity include:
- Compatibility: Running containers with high-performance computing workloads without requiring root privileges.
- Portability: Enabling the same container to run on different systems without modification.
- Integration with HPC Systems: Designed to work seamlessly with HPC job schedulers like Slurm.
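For example (an illustrative sketch; the image and command are placeholders), an image can be pulled from Docker Hub and a command run inside the resulting container:

```sh
# Build a Singularity image file (.sif) from a Docker Hub image
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run a command inside the container
singularity exec pytorch.sif python3 -c "import torch; print(torch.__version__)"
```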
### Interconnection of Slurm and Singularity
On AI Cloud, Slurm and Singularity work together. Slurm handles the job scheduling and resource allocation, while Singularity ensures that the specified container environment is instantiated and the application runs with all its dependencies.
```mermaid
flowchart LR
A[<span><img src="/assets/img/person.svg" width='25' height='25'>User laptop</span>]
B["<span><img src="/assets/img/server.svg" width='25' height='25'>Front-end node</span>"]
C["<span><img src="/assets/img/container.svg" width='25' height='25'>Singularity container job</span>"]
D["<span><img src="/assets/img/queue.svg" width='25' height='25'>Slurm</span>"]
E["<span><img src="/assets/img/server.svg" width='25' height='25'>Compute node</span>"]
A-- SSH --> B --> D --> E --> C-- Result --> B
style C stroke-dasharray: 5 5
style D stroke-dasharray: 5 5
```
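In practice, this means the container is typically started through Slurm, for example (an illustrative sketch; the image, script, and resource request are placeholders):

```sh
# Ask Slurm for one GPU and run the containerised script on the allocated node;
# --nv exposes the NVIDIA driver and GPUs inside the container
srun --gres=gpu:1 singularity exec --nv pytorch.sif python3 train.py
```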
## Storage
### Network drive storage
All nodes in the cluster are connected to the same network storage system, so the same files are available on every node. There is no need to specify additional parameters to make your files available on the compute nodes or in container environments.
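For example (the file name is only illustrative), a file created on the front-end node is immediately visible from a compute node:

```sh
# On the front-end node: create a file in your home directory
echo "hello" > ~/hello.txt

# Read the same file from a compute node through Slurm
srun cat ~/hello.txt
```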
Check out our page Overview of directories to learn more.