
Limitations of the GPU Nodes

Runtime Limit on the GPU Login Nodes

Since the GPUs of the GPU login nodes run in exclusive mode (see below), long-running GPU programs hinder other users' interactive GPU usage and degrade their experience. We therefore require Friendly Usage on all GPU login nodes. To guarantee high availability for all users, we may kill (with SIGXCPU) processes that use more than 10 minutes of GPU time on the GPU login nodes. We may also ban users who repeatedly abuse the GPU login nodes.

If your programs need more than 10 minutes of runtime, use salloc or start an interactive batch job with X11 forwarding.
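For example, an interactive session with one GPU could be requested roughly as follows (a minimal sketch using generic Slurm options; the exact account, partition, and time limit depend on your project and the current cluster configuration):

salloc --gres=gpu:1 --time=01:00:00

or, if you need X11 forwarding for graphical tools (assuming X11 support is enabled in the Slurm installation):

srun --x11 --gres=gpu:1 --time=01:00:00 --pty bash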

Exclusive GPU Mode

GPUs are assigned to users in "exclusive" mode ('exclusive process'). Trying to use a GPU that is assigned to a different user produces an error message. Some applications require you to specify the GPU device to use; in that case, you must specify the one assigned to you by Slurm. Other programs automatically use any available GPU device (if there is one). However, if you request a specific device number, you may have to wait until that device becomes available (and try again).

Keep in mind that debugging sessions may run on device 0 (the default), so you might experience problems if device 0 is not assigned to you.

If you would like to run on a certain GPU (e.g. for debugging on a non-default device), you can restrict the visible GPUs by setting an environment variable:

CUDA_VISIBLE_DEVICES=<GPU ID>

Remember that for batch jobs, Slurm decides which GPU you receive and you cannot change it.
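As a rough illustration: inside a batch job that requested a GPU via --gres=gpu, Slurm usually exports the assigned device in CUDA_VISIBLE_DEVICES, whereas on a login node you can restrict a run to one specific device yourself (my_gpu_app is just a placeholder for your application):

echo $CUDA_VISIBLE_DEVICES           # inside a job: device ID(s) assigned by Slurm
CUDA_VISIBLE_DEVICES=1 ./my_gpu_app  # on a login node: make only GPU 1 visible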

GPU Boost

GPU Boost means that the clock speed of a GPU is increased as long as its maximum power capacity is not fully used by the application. Although nvidia-smi reports auto boost as N/A, the default clocking behavior of the GPUs uses boost capabilities where possible. You can read the current clock frequencies with:

nvidia-smi -q -d CLOCK

If you want to disable CUDA auto boost, set the following environment variable:

export CUDA_AUTO_BOOST=0
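To verify the effect, you can poll the current SM and memory clocks while your application is running, for example once per second (query field names as understood by recent nvidia-smi versions):

nvidia-smi --query-gpu=clocks.sm,clocks.mem --format=csv -l 1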

NVLink Usage (Pascal and Volta GPUs)

While the GPUs are attached to the host via PCIe, the two Pascal/Volta GPUs within a host node are connected to each other via NVLink. You can check the link status with the nvidia-smi command:

ab123456@login18-g-1:~[505]$ nvidia-smi nvlink --status -i 0
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-388a149d-072d-881b-469c-a3e3c7c4dc19)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
ab123456@login18-g-1:~[506]$ nvidia-smi nvlink --status -i 1
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-142d0ca2-b09d-9c28-5583-7b3b749571f9)
         Link 0: <inactive>
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
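To get an overview of how the GPUs are connected to each other and to the host (PCIe vs. NVLink), you can also print the topology matrix; the exact output depends on the node:

nvidia-smi topo -m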

If you want to use NVLink, you might need to adapt your program accordingly. In CUDA, peer access between the two GPUs can be enabled as follows:

unsigned int flags = 0;  // reserved by CUDA, must be 0 for cudaDeviceEnablePeerAccess

// allow the current device (GPU 0) to access memory on GPU 1
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, flags);
...
// allow the current device (GPU 1) to access memory on GPU 0
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, flags);
...
// for example: copy data between both GPUs
cudaMemcpy(<dataDest>, <dataSrc>, <size>, cudaMemcpyDeviceToDevice); // be aware: this can be an asynchronous copy over NVLink!

Last modified on 20.10.2023

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Germany License.