
Weights&Biases - Hydra

Weights&Biases and Hydra are two tools used in machine learning projects. Weights&Biases lets you save information about your experiments in the cloud: metadata, system data, model weights, and of course your metrics and logs. Hydra is a configuration management tool that lets you build command line interfaces and write robust, readable configuration files. The two tools work well together, but setting them up on Jean Zay is not straightforward. This example shows how to set up both tools on Jean Zay for a TensorFlow project.


To run this example, you need to clone the jean-zay-doc repo in your $WORK directory:

cd $WORK &&\
git clone

You can then install the requirements:

module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt


To run the example on SLURM, issue the following command from the example directory:

python --multirun hydra/launcher=base +hours=1

SLURM parametrization

Different parameters can be set for the SLURM job using the hydra.launcher config group. For example, to launch a longer job, you can use:

python --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'

If you want to use more GPUs:

python --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4


This example requires a Weights&Biases account. wandb runs offline because the compute nodes are not connected to the internet. To upload the results to the cloud, you need to sync them manually with the wandb sync run_dir command, where run_dir is the directory of a run. The run directories are located in $SCRATCH/wandb/jean-zay-doc, but this can be changed using the wandb.dir config variable. You can also run a script on a front node to sync the runs before they are finished, for example using the script here.
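The manual sync step can be scripted. The sketch below is only an illustration, not the script referenced above: it lists the offline run directories under a wandb directory and builds one wandb sync command per run. The helper name build_sync_commands is hypothetical, and the offline-run-* directory pattern is an assumption about how wandb names offline runs, so adjust it to match what you see on disk.

```python
import os
from pathlib import Path


def build_sync_commands(wandb_dir, pattern="offline-run-*"):
    """Return one `wandb sync <run_dir>` command per offline run directory.

    `pattern` is an assumption about how offline run directories are named.
    """
    root = Path(wandb_dir)
    if not root.is_dir():
        return []
    # Search recursively, keeping only directories that match the pattern.
    run_dirs = sorted(p for p in root.rglob(pattern) if p.is_dir())
    return [["wandb", "sync", str(run_dir)] for run_dir in run_dirs]


if __name__ == "__main__":
    # Default location taken from the text; falls back to "." if $SCRATCH is unset.
    wandb_dir = Path(os.environ.get("SCRATCH", ".")) / "wandb" / "jean-zay-doc"
    for command in build_sync_commands(wandb_dir):
        # Print the commands instead of running them, so they can be checked first.
        print(" ".join(command))
```

The commands are only printed here. On a front node, each list could be passed to subprocess.run once the output looks right.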

Hydra and submitit outputs

The outputs created by Hydra and submitit are located in the multirun directory. You can change this location by setting the hydra.dir config variable.

Batch jobs

To batch multiple similar jobs, you can use the sweep feature of Hydra. For example, if you want to run multiple trainings with different batch sizes, you can do the following:

python --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128

This extends to a grid search over the Cartesian product of several parameters, for example:

python --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
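To see how many jobs such a sweep submits, it can help to expand the Cartesian product by hand. The sketch below uses only the Python standard library and does not call Hydra. The helper name expand_sweep is hypothetical, and the order in which Hydra actually schedules the jobs may differ from the order printed here.

```python
from itertools import product


def expand_sweep(sweep):
    """Expand {key: [values]} into one list of 'key=value' overrides per job."""
    keys = list(sweep)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*(sweep[key] for key in keys))
    ]


# Same parameters as the command above: 3 batch sizes x 2 optimizers.
sweep = {
    "fit.batch_size": [32, 64, 128],
    "compile.optimizer": ["rmsprop", "adam"],
}
jobs = expand_sweep(sweep)
for index, overrides in enumerate(jobs):
    print(index, " ".join(overrides))
```

With three batch sizes and two optimizers, the sweep expands to six jobs, each of which requests its own SLURM allocation. Keep this multiplication in mind before adding more swept parameters.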

Similar resources



To Weights&Biases:
- MLflow
- TensorBoard

To Hydra:
- argparse
- click