Changes in accessing GPU on FRA1-1 Cloud EO-Lab

The following documentation describes the changes related to the implementation of a new GPU access mode in the FRA1-1 cloud.

The differences between GPU pass-through (PT) and vGPU modes

Until recently, GPU instances used pass-through mode, meaning that an entire physical GPU was directly assigned to a single virtual machine (VM). In this mode, the whole GPU is accessed exclusively by the NVIDIA driver running on that VM and is not shared among VMs.

NVIDIA Virtual GPU (vGPU) enables multiple virtual machines to have simultaneous, direct access to a single physical GPU, using the same NVIDIA drivers that are deployed on non-virtualized operating systems. vGPU thereby provides VMs with high graphics and computing performance and broad application compatibility, combined with the cost-effectiveness and scalability of sharing a GPU among multiple workloads. This is implemented using two virtualization modes and dedicated NVIDIA GRID drivers.

Introduced changes

Until now, users have had two flavors available for their instances, based on the NVIDIA A100 and RTX A6000 graphics cards, with the following configurations:

flavor name   vCPU   RAM      disk
gpu.a100      24     118 GB   64 GB
gpu.a6000     24     118 GB   800 GB

Both flavors were configured to access the GPU in pass-through mode, so an entire physical GPU was assigned to one VM. Additionally, it was the users’ responsibility to configure the instance, which involved installing NVIDIA drivers and, optionally, CUDA tools.

With these changes, new vGPU-based flavors are being introduced to the cloud. For better use of computing performance, instances based on vGPU flavors will have access to high-speed local SPDK storage based on NVMe disks. The last part of the flavor name denotes how many individual “parts” of a GPU the flavor receives, out of a maximum of 7 for an A100 or 8 for an A6000. For example, a vm.a6000.8 receives all of the processing time available on the GPU, while a vm.a6000.1 receives one eighth, or 12.5%.
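For illustration, a vGPU instance can be launched from the command line with the OpenStack CLI. This is a minimal sketch, assuming a configured OpenStack client; the image, key pair and network names in angle brackets are placeholders to be replaced with your own values:

openstack server create \
  --flavor vm.a100.1 \
  --image <preconfigured-NVIDIA-image> \
  --key-name <keypair> \
  --network <network> \
  my-vgpu-vm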

A100

The A100 GPUs operate in MIG (Multi-Instance GPU) virtualization mode.

MIG mode allows multiple vGPUs (and thereby VMs) to run in parallel on a single GPU, providing multiple users with separate GPU resources for optimal GPU utilization.

flavor name   vCPU   RAM      disk
vm.a100.1     1      14 GB    40 GB
vm.a100.2     3      28 GB    40 GB
vm.a100.3     6      56 GB    80 GB
vm.a100.7     12     112 GB   80 GB
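Once connected to a VM based on one of these flavors, the MIG-backed device assigned to it can be listed with:

nvidia-smi -L

The reported device name and UUID depend on the chosen flavor and will vary between VMs.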

A6000

RTX A6000 cards run in Time-Sliced virtualization mode. In Time-Sliced mode, each instance receives its own separate slice of VRAM, while GPU time is shared between instances by the scheduler.

flavor name   vCPU   RAM      disk
vm.a6000.1    1      14 GB    40 GB
vm.a6000.2    3      28 GB    40 GB
vm.a6000.4    6      56 GB    80 GB
vm.a6000.8    12     112 GB   80 GB
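Since each instance receives its own slice of VRAM, the amount of memory visible to the VM can be confirmed from inside it, for example with:

nvidia-smi --query-gpu=name,memory.total --format=csv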

Images

In addition, preconfigured images have been added.

[Image: list of the new preconfigured NVIDIA images]

All new NVIDIA images have dedicated drivers and CUDA tools preinstalled and include packages to support machine learning, such as TensorFlow, PyTorch, scikit-learn and R.

Using the above images is highly recommended when launching a new virtual machine with a vGPU flavor. Otherwise, the virtual graphics card may not be recognised.
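The available images can also be listed from the command line. This is a sketch, assuming a configured OpenStack CLI and that the image names contain “NVIDIA”:

openstack image list | grep -i nvidia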

Steps to be taken after launching a vGPU virtual machine

Connect to the vGPU virtual machine once it has been successfully launched.

To check the status of the vGPU device, enter the command:

nvidia-smi

The command’s output should provide information about the GPU in use, its current utilisation and running processes, as well as the driver and CUDA versions.
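If only specific fields are of interest, nvidia-smi also supports targeted queries. For example, the following prints the driver version together with the current utilisation and memory usage:

nvidia-smi --query-gpu=driver_version,utilization.gpu,memory.used,memory.total --format=csv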

To check the vGPU’s software licence, use the command:

nvidia-smi -q | grep -i license

When a licence is active, the command output will display the corresponding status and expiry date.

root@vm01:# nvidia-smi -q | grep -i license
    vGPU Software Licensed Product
    License Status  : Licensed (Expiry: 2022-10-20 12:40:45 GMT)

How to launch an environment with AI libraries on NVIDIA AI images

All additional libraries and tools in the NVIDIA AI images are installed in a conda environment called ai-support.

Conda is a package and environment manager that allows control of package dependencies. The currently pre-installed package versions are as follows:

  • Python: 3.10.4

  • TensorFlow: 2.9.1

  • PyTorch: 1.11.0+cu102

  • scikit-learn: 1.1.1

  • R: 4.2.0

The ai-support environment is activated with the command:

conda activate ai-support

Once the environment is active, all of the listed tools are available.
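As a quick sanity check, the package versions listed above can be confirmed from inside the environment; a minimal sketch:

conda activate ai-support
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import sklearn; print(sklearn.__version__)"

When the vGPU and its drivers are working correctly, torch.cuda.is_available() should return True.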

Note

The packages in this environment have their own dependencies, which may conflict with the dependencies of packages installed on the virtual machine’s operating system. For instance, the yum command in the CentOS 7 image will not work while the ai-support environment is active.

To avoid such conflicts, deactivate the environment before carrying out work outside of it:

conda deactivate

The conda command works for the users root and eouser.

To allow another user to use conda, add them to the conda group:

usermod -a -G conda username
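The change can be verified with the id command; note that the user must log in again for the new group membership to take effect:

id username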

Suggested resources:

  • NVIDIA vGPU User Guide

  • NVIDIA MIG User Guide

  • Miniconda manual