Limitations of the GPU Nodes
Runtime Limit on the GPU Login nodes
Since the GPUs of the GPU login nodes run in exclusive mode (see below), long-running GPU programs hinder other users' interactive GPU usage and degrade their experience. We therefore require friendly usage on all GPU login nodes. To guarantee high availability for all users, we may kill (with SIGXCPU) processes that use more than 10 minutes of GPU time on the GPU login nodes. We may also ban users who repeatedly abuse the GPU login nodes.
Exclusive GPU Mode
GPUs are assigned to users in "exclusive" mode ('exclusive process'). Trying to use a GPU assigned to a different user produces an error message. Some applications require you to specify the GPU device to use; in that case you must select the one assigned to you by Slurm. Other programs can automatically use any available GPU device (if there is one). However, if you hard-code a specific device number, you will have to wait until that device becomes available (and try again).
Keep in mind that debugging sessions may run on device 0 (the default), so you might experience problems if device 0 is not assigned to you.
If you would like to run on a certain GPU (e.g. when debugging a non-default device), you can mask the other GPUs by setting an environment variable:
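For CUDA applications this is done with the CUDA_VISIBLE_DEVICES environment variable, which restricts the devices the application can see. The device index 1 below is only an illustration; use the index of the GPU actually assigned to you:

```shell
# Hide all GPUs except device 1; inside the application
# the remaining GPU then appears as device 0.
export CUDA_VISIBLE_DEVICES=1
```
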
Remember that for batch jobs, Slurm decides which GPU you receive and you cannot change it.
GPU Boost
GPU Boost increases the clock speed of a GPU whenever the application does not fully use its maximum power capacity. Although nvidia-smi reports auto boost as N/A, the default clocking behavior of the GPUs uses boost capabilities if possible. You can read the current clock frequencies with:
nvidia-smi -q -d CLOCK
If you want to disable CUDA auto boost, set the following environment variable:
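A minimal sketch: CUDA evaluates the CUDA_AUTO_BOOST environment variable, and setting it to 0 disables automatic boosting for applications started from that shell:

```shell
# Disable automatic clock boosting for CUDA applications
# launched from this shell session.
export CUDA_AUTO_BOOST=0
```
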
NVLink Usage (Pascal and Volta GPUs)
While the GPUs are attached to the host via PCIe, the two Pascal/Volta GPUs within a host node are connected to each other via NVLink. You can check the link status with:
ab123456@login18-g-1:~$ nvidia-smi nvlink --status -i 0
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-388a149d-072d-881b-469c-a3e3c7c4dc19)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
ab123456@login18-g-1:~$ nvidia-smi nvlink --status -i 1
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-142d0ca2-b09d-9c28-5583-7b3b749571f9)
         Link 0: <inactive>
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
If you want to use NVLink, you may need to adapt your program accordingly. In CUDA, peer-to-peer access between the two GPUs can be enabled as follows:
unsigned int flags = 0;
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, flags);   // allow device 0 to access device 1's memory
...
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, flags);   // allow device 1 to access device 0's memory
...
// for example, copy data between both GPUs
cudaMemcpy(<dataDest>, <dataSrc>, <size>, cudaMemcpyDeviceToDevice);
// Be aware: this can be an asynchronous copy over NVLink!