Using the GPU cards

Using Machine Learning and Deep Learning frameworks

Machine Learning frameworks are available on OSIRIM as containers.

To allow several versions of CUDA/cuDNN/Python to coexist and to avoid dependency conflicts between machine learning libraries, each framework runs in a dedicated container. Although Docker is the most common containerization solution, we use Singularity, a containerization solution better suited to HPC clusters (https://www.sylabs.io/docs/).

The Singularity (.SIF) framework images are built on Ubuntu 16.04 and 18.04 base images, in which the following components are installed:
  * CUDA 9.0, 9.2 and 10.0
  * cuDNN 7.1 and 7.1.4
  * gcc 5.4 and cmake 3.5
  * Miniconda for Python 2 and 3, with a set of modules
  * OpenCV (for CPU and GPU)

Ready-to-use Singularity (.SIF) images are available in /logiciels/containerCollections/.

Available Frameworks:

Container           OS            CUDA   cuDNN   TensorFlow   Keras   Theano   PyTorch

CUDA 9
keras-tf.sif        Ubuntu 16.04  9.2    7.1     1.6          2.1.4   -        -
keras-th.sif        Ubuntu 16.04  9.2    7.1     -            2.2.4   1.0.3    -
pytorch.sif         Ubuntu 16.04  9.0    7.1     -            -       -        0.4.1
tf.sif              Ubuntu 16.04  9.2    7.1     1.6          -       -        -
th.sif              Ubuntu 16.04  9.2    7.1     -            -       1.0.3    -
vanilla_9.0.sif     Ubuntu 16.04  9.0    7.1     -            -       -        -
vanilla_9.2.sif     Ubuntu 16.04  9.2    7.1     -            -       -        -

CUDA 10
keras-tf.sif        Ubuntu 18.04  10.0   7.1.4   1.12.0       2.2.4   -        -
pytorch.sif         Ubuntu 18.04  10.0   7.1.2   -            -       -        0.4.2
pytorch_1.0.1.sif   Ubuntu 18.04  10.0   7.3.1   -            -       -        1.0.1
tf.sif              Ubuntu 18.04  10.0   7.1.4   1.12.0       -       -        -
vanilla_10.0.sif    Ubuntu 18.04  10.0   7.1.4   -            -       -        -

To learn more about the software installed and available in each container, consult the README file in the CUDA9 and CUDA10 directories of /logiciels/containerCollections/.
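You can also query the versions directly from a container; for example (a quick sketch, assuming nvcc and the framework are on the image's default PATH, as in the CUDA10 TensorFlow image):

$ singularity exec /logiciels/containerCollections/CUDA10/tf.sif nvcc --version
$ singularity exec /logiciels/containerCollections/CUDA10/tf.sif python2 -c "import tensorflow as tf; print(tf.__version__)"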

A Singularity container is run with the command 'singularity exec', followed by the container image and the processing to be performed inside the container:

singularity exec /logiciels/containerCollections/CUDA10/tf.sif $HOME/mon_code.sh

Note that your user environment variables are available inside the containers, as are the /users, /projets and /logiciels directories.
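For example, you can verify this from inside a container (an illustrative check; the choice of image is arbitrary):

$ singularity exec /logiciels/containerCollections/CUDA10/tf.sif bash -c 'echo $HOME && ls /users /projets /logiciels'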
 

Using frameworks in a Slurm batch job:

The available frameworks can run on either CPUs or GPUs, so you can run the containers with Slurm on any OSIRIM partition: ‘24CPUNodes’, ‘64CPUNodes’ or ‘GPUNodes’.
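For instance, here is a minimal sketch of a CPU-only submission (script and file names are illustrative; only the partition choice and the absence of the --gres options differ from the GPU examples below):

#!/bin/sh

#SBATCH --job-name=CPU-Tensorflow-Singularity-Test
#SBATCH --output=ML-%j-Tensorflow-CPU.out
#SBATCH --error=ML-%j-Tensorflow-CPU.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=64CPUNodes

srun singularity exec /logiciels/containerCollections/CUDA10/tf.sif python2 "$HOME/tf-script.py"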

In the examples below, we want to take advantage of GPUs and run the processing on the ‘GPUNodes’ partition. To tell Slurm that we want to use GPUs, two parameters must be specified:

#SBATCH --gres=gpu:1  (the number of cards you want to use, 4 max per server)
#SBATCH --gres-flags=enforce-binding

 

Examples of using the TensorFlow and Keras frameworks:

Content of slurm_job_tf.sh for TensorFlow execution:

#!/bin/sh

#SBATCH --job-name=GPU-Tensorflow-Singularity-Test
#SBATCH --output=ML-%j-Tensorflow.out
#SBATCH --error=ML-%j-Tensorflow.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
 
srun singularity exec /logiciels/containerCollections/CUDA10/tf.sif python2 "$HOME/tf-script.py"

Execution: [bob@co2-slurm-client ~]$ sbatch slurm_job_tf.sh
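You can then follow the job and read the output file declared in the script (the job ID shown is illustrative):

[bob@co2-slurm-client ~]$ squeue -u $USER
[bob@co2-slurm-client ~]$ cat ML-<jobid>-Tensorflow.out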

Content of slurm_job_ke.sh for running Keras on Theano:

#!/bin/sh

#SBATCH --job-name=GPU-keras-Singularity-Test
#SBATCH --output=ML-%j-keras.out
#SBATCH --error=ML-%j-keras.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
 
srun singularity exec /logiciels/containerCollections/CUDA9/keras-th.sif python3 "$HOME/tf-script.py"

Execution: [bob@co2-slurm-client ~]$ sbatch slurm_job_ke.sh
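The same pattern applies to the other images listed above; for example, a sketch for the CUDA 10 PyTorch 1.0.1 container (the script name and the choice of python3 are assumptions; check the README mentioned above for the exact interpreter):

#!/bin/sh

#SBATCH --job-name=GPU-pytorch-Singularity-Test
#SBATCH --output=ML-%j-pytorch.out
#SBATCH --error=ML-%j-pytorch.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding

# python3 is an assumption here; see the README in the CUDA10 directory
srun singularity exec /logiciels/containerCollections/CUDA10/pytorch_1.0.1.sif python3 "$HOME/pytorch-script.py"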

 

Installation of additional packages:

The Singularity containers encapsulate specific Deep Learning and Machine Learning frameworks for both Python 2 and Python 3, but you may want to use libraries that are not available by default in the containers provided.

To install additional packages, you can use virtualenv, pip or conda (see the quick check sketched after this list):

  • For Python 2, use conda2, pip2 or virtualenv2
  • For Python 3, use conda3, pip3 or virtualenv3
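To see which of these tools an image provides, you can list them from the container (a quick sketch, assuming they are on the image's default PATH):

$ singularity exec /logiciels/containerCollections/CUDA9/tf.sif bash -c 'which conda2 conda3 pip2 pip3 virtualenv2 virtualenv3'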

Below is the procedure to follow:

First, create a virtual environment from your $HOME directory by opening a shell in the container:

$ singularity shell /logiciels/containerCollections/CUDA9/tf.sif

(tf.sif) → $ virtualenv2 --system-site-packages $HOME/ENVNAME
## or for CONDA,
(tf.sif) → $ conda2 create -n ENVNAME python=2.7 miniconda

The --system-site-packages option makes all the packages already installed with the Python interpreter used to create the virtualenv (Python 2 in this example) available inside the virtual environment.

Then, once the virtual environment has been created, you can install the desired packages (example with the Shogun machine learning package):

$ singularity shell /logiciels/containerCollections/CUDA9/tf.sif
(tf.sif) → $ conda2 activate  ENVNAME
(tf.sif) → $ conda2 install -c conda-forge shogun
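Alternatively, if you created a virtualenv rather than a conda environment, you can install packages with pip from inside the container (a sketch; scikit-learn is only an illustrative package):

$ singularity shell /logiciels/containerCollections/CUDA9/tf.sif
(tf.sif) → $ source $HOME/ENVNAME/bin/activate
(tf.sif) → $ pip2 install scikit-learn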

Finally, you can use the new packages in your processing from a SLURM job:

Execution: [bob@co2-slurm-client ~]$ sbatch slurm_job_shogun.sh

Content of slurm_job_shogun.sh:

#!/bin/sh
 
#SBATCH --job-name=GPU-shogun-Singularity-Test
#SBATCH --output=ML-%j-shogun.out
#SBATCH --error=ML-%j-shogun.err
 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes
#SBATCH --gres=gpu:4
#SBATCH --gres-flags=enforce-binding
 
srun singularity exec /logiciels/containerCollections/CUDA9/tf.sif $HOME/ENVNAME/bin/python "$HOME/shogun-script.py"
 
#OR, with conda2
 
srun singularity exec /logiciels/containerCollections/CUDA9/tf.sif $HOME/conda/envs/ENVNAME/bin/python "$HOME/shogun-script.py"