TensorFlow

Short Information

TensorFlow™ is an open source software library for numerical computation using data flow graphs. It is developed by the Google Brain team and is used for a wide range of applications, most prominently machine learning.


Detailed Information

1. Install

TensorFlow is a very rapidly developing piece of software. Its dependencies are often bleeding-edge; any update could lead to issues with installation and/or usability. Do not forget to make a backup of your installation prior to updating, so that you have a safe harbour to which you can return!

 

1.0. Use a Singularity container (with and without GPU, recommended for the reasonable)

NVIDIA offers a set of prebuilt containers with AI software already deployed. These containers are a reliable source of a working TensorFlow installation and offer low-effort access to this software: no installation work for you at all!  See https://help.itc.rwth-aachen.de/en/service/rhr4fjjutttf/article/76576413f58040e78acab4332d9e68e3/
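
A minimal sketch of how such a container can be used with Singularity (the container tag 21.02-tf2-py3 is only an example; check the NGC catalogue and the article linked above for the tags and the exact invocation supported on the cluster):

$ # pull a TensorFlow container image from the NVIDIA NGC registry
$ singularity pull docker://nvcr.io/nvidia/tensorflow:21.02-tf2-py3
$ # run Python inside the container; --nv makes the host GPUs visible
$ singularity exec --nv tensorflow_21.02-tf2-py3.sif python3 -c "import tensorflow as tf; print(tf.__version__)"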

 

1.1. Pre-Packaged (CPU only, recommended)

Only about 5% of all nodes in our cluster are equipped with GPUs. A modern, multi-core CPU is also a very powerful computing device. Starting your computations on CPUs is a good choice, as it avoids the complexity of GPU installation and usage.
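
When running on CPUs, it can pay off to match TensorFlow's thread pools to the cores your job was actually granted. A minimal sketch (the inter-op value of 2 is only a starting point; SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested):

import os
import tensorflow as tf

# use the number of cores granted by the batch system, falling back to all host cores
n_cores = int(os.environ.get('SLURM_CPUS_PER_TASK', os.cpu_count()))

# both calls must happen before the first TensorFlow operation is executed
tf.config.threading.set_intra_op_parallelism_threads(n_cores)  # threads within one op
tf.config.threading.set_inter_op_parallelism_threads(2)        # ops running concurrently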

To install the latest pre-packaged, CPU-only version locally into your $HOME/.local, use

Python 3

$ module switch intel gcc
$ module load python/3.8.7
$ pip3 install --user tensorflow_cpu
$ python3 -c "import tensorflow as tf; print(tf.__version__)"
2.4.1
$ python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2021-02-16 16:54:34.751286: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-16 16:54:34.751676: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU
instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf.Tensor(-807.85876, shape=(), dtype=float32)
$ python3 -c "import tensorflow as tf; print(\"Num GPUs Available: \", len(tf.config.list_physical_devices('GPU')))"
2021-02-16 16:52:19.003180: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
Num GPUs Available:  0

Starting with TensorFlow version 2.2, support for Python 2 has been dropped, so we deprecate it too. If in doubt, try older versions, e.g. tensorflow==1.12, and/or a different (older) version of Python. Sometimes --upgrade can help; on the other hand, it can break other (especially large/complicated) software chains, e.g. by installing a new version of NumPy/SciPy without support for Intel MKL, see details.
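
If the latest release causes trouble, pinning an exact, known-good version is usually the safest fallback (the version number below is only an example; consult the compatibility map linked below):

$ pip3 install --user "tensorflow_cpu==2.3.1"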

For other Python versions and older TensorFlow versions, refer to the official website: https://www.tensorflow.org/install/

Note on compatibility: As TensorFlow is quite often bleeding-edge development software, you sometimes need to find an older version compatible with the current Linux installation and the Python version used. The compatibility map may change daily! Check https://www.tensorflow.org/install/source#tested_build_configurations prior to installing/updating your TensorFlow; do not assume that an unlisted combination works.

 

1.2. Pre-Packaged (GPGPU, for the adventurous)

GPGPU versions are available via 'pip3 install', too. However, they are of limited use, as they obviously need GPGPUs, which are available on only a small fraction (5%) of nodes in the HPC Cluster; see Hardware of the RWTH Compute Cluster.

TensorFlow has a requirement on the 'compute capability' of GPGPUs, which may change with any update. GPUs of older design could at some point stop working with new versions of TensorFlow. Obviously, older GPUs are in most cases also slower.
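
You can query the compute capability of the GPUs TensorFlow actually sees from Python; a sketch (tf.config.experimental.get_device_details is available from TensorFlow 2.4 on):

import tensorflow as tf

# print name and compute capability for every GPU visible to TensorFlow
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get('device_name'), details.get('compute_capability'))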

Prior to installing a GPGPU-enabled version of TensorFlow, you must decide on a TensorFlow version, check the CUDA and cuDNN compatibility map at https://www.tensorflow.org/install/source#tested_build_configurations, be logged in on an appropriate (GPU-equipped) node, load the CUDA module, and load/install the cuDNN library:

 
Python 3

$ module unload intelmpi; module switch intel gcc
$ module load python/3.8.7
$ module load cuda/11.0
$ module load cudnn/8.0.5
$ pip3 install --user tensorflow      # 2021.02.15: now the GPU variant is mainstream, the CPU-only variant has its own name
$ python3
Python 3.8.7 (default, Feb  1 2021, 13:16:02)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
>>> import tensorflow as tf
>>> print(tf.__version__)
>>> tf.test.gpu_device_name()
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
>>> print(tf.reduce_sum(tf.random.normal([1000, 1000])))

 
Example output:
pk224850@login18-g-2:~[505]$ python3
Python 3.8.7 (default, Feb  1 2021, 13:16:02)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-02-16 17:23:44.189045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
>>> print(tf.__version__)
2.4.1
>>> tf.test.gpu_device_name()
2021-02-16 17:23:59.274961: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-16 17:23:59.277937: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-16 17:23:59.279132: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-02-16 17:23:59.481010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:23:59.481651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:62:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:23:59.481681: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-16 17:23:59.485868: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-16 17:23:59.485920: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-16 17:23:59.487867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-16 17:23:59.489259: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-16 17:23:59.492498: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-16 17:23:59.493764: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-16 17:23:59.494767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-16 17:23:59.497228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-02-16 17:23:59.497255: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-16 17:24:00.331180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-16 17:24:00.331208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
2021-02-16 17:24:00.331214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
2021-02-16 17:24:00.331218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
2021-02-16 17:24:00.333828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2021-02-16 17:24:00.335438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:1 with 14760 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
'/device:GPU:0'
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
2021-02-16 17:25:30.389305: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-16 17:25:30.390109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:25:30.390737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:62:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:25:30.390761: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-16 17:25:30.390792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-16 17:25:30.390800: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-16 17:25:30.390809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-16 17:25:30.390817: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-16 17:25:30.390825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-16 17:25:30.390832: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-16 17:25:30.390841: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-16 17:25:30.393174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
Num GPUs Available:  2
>>> print(tf.reduce_sum(tf.random.normal([1000, 1000])))
2021-02-16 17:26:12.464559: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-16 17:26:12.465364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:26:12.465972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:62:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-02-16 17:26:12.465992: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-02-16 17:26:12.466017: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-02-16 17:26:12.466026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-02-16 17:26:12.466034: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-02-16 17:26:12.466042: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-02-16 17:26:12.466050: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-02-16 17:26:12.466058: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-02-16 17:26:12.466066: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-02-16 17:26:12.468265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2021-02-16 17:26:12.468304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-16 17:26:12.468310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
2021-02-16 17:26:12.468314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N Y
2021-02-16 17:26:12.468318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   Y N
2021-02-16 17:26:12.470059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2021-02-16 17:26:12.470665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14760 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
tf.Tensor(-54.4209, shape=(), dtype=float32)
>>>
 
Known issue:

Sometimes, TensorFlow on a GPU node fails to acquire the GPUs, with a message like the one below.

>>> print(tf.__version__)
2.1.0
>>> tf.test.gpu_device_name()
2020-06-04 14:36:14.070893: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-04 14:36:14.079830: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-06-04 14:36:14.081897: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x39b4c10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-04 14:36:14.081919: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 14:36:14.084737: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-04 14:36:14.104121: W tensorflow/compiler/xla/service/platform_util.cc:276] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-06-04 14:36:14.180907: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a813e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-04 14:36:14.180930: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-06-04 14:36:14.183748: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Invalid argument: device CUDA:0 not supported by XLA service
2020-06-04 14:36:14.184112: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
zsh: abort (core dumped)  python3

The root cause is the unavailability of (at least some) GPU(s), which are locked by other users. Check this using 'nvidia-smi':

pk224850@login18-g-1:~[501]$ nvidia-smi
Thu Jun  4 14:36:34 2020      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   38C    P0    62W / 300W |   1074MiB / 16160MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   35C    P0    38W / 300W |     11MiB / 16160MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                                

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     41713      C   ...v/versions/3.7.6/envs/pc_aug/bin/python  1063MiB |
+-----------------------------------------------------------------------------+

Here you can see a process running on a GPU, which is therefore locked, because the GPUs are in exclusive-process mode on the front-end nodes.

Less convenient is a state like this:

pk224850@login18-g-2:~[513]$ nvidia-smi 
Thu Jun  4 14:37:23 2020      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |    318MiB / 16160MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |    318MiB / 16160MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                                
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Here you can see no process locking the GPUs, but note the non-zero memory usage. Highly likely something is fishy on that node. Try again in half an hour; if the error persists, ask support to reboot the node.

IMPORTANT NOTE: TensorFlow with GPU support takes over the memory management of all (visible) devices attached to the host, even if no GPU is actually used. Because the GPUs in the interactive front ends are process-exclusive, other users cannot use the idling GPUs. Please refer to the next section for advice on how to fix this by making unused GPUs invisible to the program. On interactive front ends, exit the Python session immediately after your test finishes (otherwise Python and TensorFlow will keep the GPUs locked). If no free GPUs are available on a front-end node (e.g. due to a test by another user), your attempt to load a GPU-capable version of TensorFlow will end in an error and a SIGSEGV with a core dump of Python (see 'known issue' above). Test the availability of GPUs using the 'nvidia-smi' command prior to trying GPU-capable TensorFlow. Or, switch to the CPU-only version of TensorFlow.
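
If your program really needs only part of a GPU's memory, you can ask TensorFlow to allocate memory on demand instead of grabbing everything up front. A minimal sketch; note that on process-exclusive GPUs the device is still locked by your process, so hiding unused GPUs via CUDA_VISIBLE_DEVICES (see section 2) remains necessary:

import tensorflow as tf

# allocate GPU memory on demand instead of reserving it all at start-up;
# must be called before the GPUs are initialized
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)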

 

1.3. Build from Sources (for brave hearts)

You can also build TensorFlow directly from sources. This allows you to specify the CPU/GPU-features you want to use, which might allow better optimization. This installation method is very tedious and difficult, so only use it if you know what you are doing. See here for more information.

 
 
 

2. Using TensorFlow (Interactive Systems)

IMPORTANT NOTE: TensorFlow with GPU support takes over the memory management of all (visible) GPU devices attached to the host, even if no GPU is actually used. Because the GPUs in the Cluster are process-exclusive, other users cannot use the idling GPUs. This section explains how to fix this by making unused GPUs invisible to the program.

To execute your TensorFlow program in a terminal, first enter the containing folder and then start your program with python. For example, if your TensorFlow program is called program.py and you saved it in a subfolder named subfolder in your home directory, you can use the following commands:

$ cd $HOME/subfolder/
$ python program.py
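
If you do not have a program at hand yet, a minimal test program could look like this (a hypothetical example; save it e.g. as program.py):

# program.py: minimal installation test
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))

# a trivial computation to confirm that TensorFlow actually works
print(tf.reduce_sum(tf.random.normal([1000, 1000])))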
 

2.1. Using CPU

In case your TensorFlow version supports GPU acceleration (meaning you installed tensorflow-gpu instead of tensorflow), make sure to make all system GPUs invisible to TensorFlow by executing this in your shell:

$ export CUDA_VISIBLE_DEVICES=""

Otherwise, TensorFlow will block the GPUs for all users, even if your program does not use them.
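
Alternatively, the same can be achieved from inside Python, before TensorFlow initializes the devices; a sketch using the standard TensorFlow 2 API:

import tensorflow as tf

# hide all GPUs from TensorFlow; must run before any device is used
tf.config.set_visible_devices([], 'GPU')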

 

2.2. Using one of two GPUs

To use a GPU, you need to have the correct version of TensorFlow installed (tensorflow-gpu). Also, you need to log into an interactive host with GPUs, for example the host login-g.

If you only want to use one of the two GPUs, you need to make the other one invisible; otherwise it will not be available to other users. Depending on which GPU is free at the time, use one of the following:

$ export CUDA_VISIBLE_DEVICES=0 # OR
$ export CUDA_VISIBLE_DEVICES=1

Please note that TensorFlow only utilizes two GPUs if you explicitly assign tasks to each GPU, as in the example below (see also https://www.tensorflow.org/guide/gpu ):

import tensorflow as tf

c = []

# Iterate through two GPUs
for d in ['/device:GPU:0', '/device:GPU:1']:
  with tf.device(d):
    # x: column vector with six elements
    x = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[6, 1])

    # a: 0-dimensional scalar
    a = tf.constant(7.0)

    # add a*x to the list c
    c.append(tf.scalar_mul(a, x))

# sum up results from both GPUs on the CPU
with tf.device('/CPU:0'):
  total = tf.add_n(c)

# With TensorFlow 2.x (eager execution) the result can be printed directly;
# the TF1 tf.Session()/sess.run() idiom is no longer needed
print(total)

For more detailed information about how to use one or more GPUs with TensorFlow, see https://www.tensorflow.org/guide/distributed_training
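
For standard Keras training, tf.distribute.MirroredStrategy replicates the model on all visible GPUs without manual device placement; a minimal sketch:

import tensorflow as tf

# data-parallel training across all GPUs visible to TensorFlow
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # model and variables must be created inside the strategy scope
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer='sgd', loss='mse')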

 

3. Using TensorFlow (Batch Mode)

For general information on how to use the batch system, see here.

If you do not need GPU acceleration, you need to have tensorflow installed, not tensorflow-gpu, which requires CUDA. Likewise, if you want to utilize GPU acceleration, you need tensorflow-gpu.

In your batch script, before executing your TensorFlow application, you have to load the Python module:

$ module load python

After loading all necessary modules, your script should execute your application. If your Python file is called program.py and lies in a subfolder called subfolder of your home directory, you can execute it like this:

$ cd $HOME/subfolder/
$ python program.py

See the next section for complete, ready-to-use examples of batch scripts.

 

3.1. Using one or more GPUs in a batch job

To use a GPU, you need to have the correct (GPU-enabled) version of TensorFlow installed.

In your batch script, before executing your TensorFlow application, you have to load the modules used by the installation step, for example the Python module, as well as CUDA and cuDNN:

$ module load python
$ module load cuda/11.0
$ module load cudnn/8.0.5

(This is not needed for a Singularity containerised version; there you only need to load the container.)

You also need to request a GPU in your batch script. If you want to use one GPU of the cluster, you can use the following:

#SBATCH --gres=gpu:1

Remember that if you need only one GPU, your job must not be exclusive and must request fewer than 24 cores.

If you need two GPUs, use instead:

#SBATCH --gres=gpu:2

For other GPU configuration options, see here.

After loading all necessary modules, your script should execute your application. If your Python file is called program.py and lies in a subfolder called subfolder of your home directory, you can execute it like this:

$ cd $HOME/subfolder/
$ python3 program.py

See the next section for complete, ready-to-use examples of batch scripts.

 

4. Example Batch Scripts

4.1. CPU-only Tensorflow Job

#!/usr/local_rwth/bin/zsh

### Job name
#SBATCH --job-name=MyTensorFlowJob

### Output path for stdout and stderr
### %J is the job ID
#SBATCH --output output_%J.txt

### Request the time you need for execution. The full format is D-HH:MM:SS
### You must at least specify minutes OR days and hours and may add or
### leave out any other parameters
#SBATCH --time=80

### Request the amount of memory you need for your job.
### You can specify this in either MB (1024M) or GB (4G).
#SBATCH --mem-per-cpu=2G

### Request the number of parallel threads for your job
#SBATCH --ntasks=24

### if needed: switch to your working directory (where you saved your program)
#cd $HOME/subfolder/

### Load python module
module load python

### If you only want to use a CPU, the CUDA and cuDNN modules are
### unnecessary. Make sure you have 'tensorflow' installed, because
### using 'tensorflow-gpu' without those modules will crash your
### program.

### Execute your application
python3 program.py
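
Assuming you saved the script above e.g. as tensorflow_cpu_job.sh (the file name is arbitrary), submit it with sbatch; the job ID it reports (the number below is illustrative) also appears in the name of the output file:

$ sbatch tensorflow_cpu_job.sh
Submitted batch job 12345678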


4.2. TensorFlow Job for one (or more) GPUs

#!/usr/local_rwth/bin/zsh
 

### Job name
#SBATCH --job-name=MyTensorFlowJob

### Output path for stdout and stderr
### %J is the job ID
#SBATCH --output=output_%J.txt

### Request the time you need for execution. The full format is D-HH:MM:SS
### You must at least specify minutes OR days and hours and may add or
### leave out any other parameters
#SBATCH --time=80

### Request the amount of memory you need for your job.
### You can specify this in either MB (1024M) or GB (4G).
#SBATCH --mem-per-cpu=2G

### Request a host with a Volta GPU
### If you need two GPUs, change the number accordingly
#SBATCH --gres=gpu:volta:1

### if needed: switch to your working directory (where you saved your program)
#cd $HOME/subfolder/

### Load modules
module load python
module load cuda/11.0
module load cudnn/8.0.5

### Make sure you have 'tensorflow-gpu' installed, because using
### 'tensorflow' will lead to your program not using the requested
### GPU.

### Execute your application
python3 program.py

 

5. Tutorials and Examples

See here for TensorFlow tutorials provided by Google.

Last modified: 29.06.2021
