Limitations GPU Cluster


Runtime Limit (Interactive Mode)

Since the GPUs in the interactive front ends run in exclusive mode (see below), you hinder other users when you run long programs interactively. We therefore require friendly usage on all GPU dialogue nodes. To guarantee availability for all users, we may kill (with SIGXCPU) processes that have used more than 10 minutes of GPU time on dialogue nodes. We may also ban users who repeatedly start such processes.

If you need a session with more than 10 minutes of GPU time, submit an interactive batch job with X11 forwarding that runs on one of the GPU compute nodes.


Exclusive Mode

On GPUs that are set to "exclusive" mode ('exclusive process'), you get an error message if you try to run your program on a GPU device that is already in use by another user. If possible, simply choose another device number for your program. For CUDA applications, do not set the device number explicitly in your program (e.g. with a call to cudaSetDevice) unless strictly necessary. Your program will then automatically use any available GPU device (if there is one). If you do set a specific device number, you will have to wait until that device becomes available and then try again.

Keep in mind that debugging sessions always run on device 0 (the default), so you may encounter the same problem there.

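If you would like to run on a certain GPU (e.g. to debug on a non-default device), you can mask the other GPUs by setting the environment variable CUDA_VISIBLE_DEVICES to the index (or a comma-separated list of indices) of the devices your program should see.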
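
A minimal sketch of such masking, assuming a Bash-like shell and assuming that the variable meant here is CUDA_VISIBLE_DEVICES, the standard variable of the CUDA runtime. The device index 1 and the program name ./my_cuda_app are placeholders chosen for illustration:

```shell
# Make only physical GPU 1 visible to CUDA programs started from this shell.
# Inside the program, that GPU is then enumerated as device 0.
export CUDA_VISIBLE_DEVICES=1

# Alternatively, restrict a single command only
# (./my_cuda_app is a hypothetical program name):
# CUDA_VISIBLE_DEVICES=1 ./my_cuda_app

# Show the current setting
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

Because the visible devices are renumbered starting from 0, a debugger that always attaches to device 0 would then work on the GPU selected here.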


GPU Boost

GPU boost means that the clock speed of a GPU is increased if its maximum power capacity is not fully used by the application. Although the tool nvidia-smi shows that auto boost is N/A, the default clocking behavior of the GPUs uses boost capabilities if possible. You can read the current clock frequency with:

nvidia-smi -q -d CLOCK

If you want to disable CUDA auto boost, set the environment variable CUDA_AUTO_BOOST to 0.
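
A short sketch, assuming a Bash-like shell and assuming that the variable meant here is CUDA_AUTO_BOOST as described in the CUDA documentation. Whether the setting takes effect may depend on the GPU model and driver:

```shell
# Request that auto boost is disabled for CUDA programs started from this shell.
export CUDA_AUTO_BOOST=0

# Show the current setting
echo "CUDA_AUTO_BOOST=$CUDA_AUTO_BOOST"
```

The resulting clock behavior can be checked while the application is running with the nvidia-smi clock query shown above.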


NVLink Usage (Pascal and Volta GPUs)

While the GPUs are attached to the host via PCIe, the two Pascal/Volta GPUs within a host node are connected to each other via NVLink. You can check the link status with the nvidia-smi command:

ab123456@login18-g-1:~[505]$ nvidia-smi nvlink --status -i 0
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-388a149d-072d-881b-469c-a3e3c7c4dc19)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
ab123456@login18-g-1:~[506]$ nvidia-smi nvlink --status -i 1
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-142d0ca2-b09d-9c28-5583-7b3b749571f9)
         Link 0: <inactive>
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>

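If you want to use NVLink, you might need to adapt your program accordingly. In CUDA, this could be done as follows, for example by enabling peer access between the two GPUs (the sketch assumes that they are CUDA devices 0 and 1):

unsigned int flags = 0; // reserved by CUDA, must be 0
// enable peer access in both directions (assumption: the GPUs are devices 0 and 1)
cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, flags);
cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, flags);
// for example: copy data between both GPUs
cudaMemcpy(<dataDest>, <dataSrc>, <size>, cudaMemcpyDeviceToDevice); // Be aware: This can be an asynchronous copy over NVLink!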

Last modified on 29.01.2021