Weights&Biases - Hydra

Weights&Biases and Hydra are two tools used in machine learning projects. Weights&Biases lets you save information about your experiments in the cloud: metadata, system data, model weights, and of course your metrics and logs. Hydra is a configuration management tool that allows you to build command line interfaces and create robust and readable configuration files. The two tools work well together, but setting them up on Jean Zay is not straightforward. This example shows how to set up both tools on Jean Zay with a TensorFlow training script.

Installation

To run this example, you need to clone the jean-zay-doc repository into your $WORK directory:

cd $WORK &&\
git clone https://github.com/jean-zay-users/jean-zay-doc.git

You can then install the requirements:

module purge
module load tensorflow-gpu/py3/2.6.0
pip install --user -r $WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra/requirements.txt

Run

To run the example on SLURM, issue the following command from the example directory ($WORK/jean-zay-doc/docs/examples/tf/tf_wandb_hydra):

python train_mnist.py --multirun hydra/launcher=base +hours=1

SLURM parametrization

The parameters of the SLURM job can be set through the hydra.launcher config group. For example, to launch a longer job, you can use:

python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3'

To use more GPUs:

python train_mnist.py --multirun hydra/launcher=base +hours=10 hydra.launcher.qos='qos_gpu-t3' hydra.launcher.gpus_per_node=4
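
When the same launch is repeated with different SLURM settings, it can help to assemble the command line programmatically. The following is a minimal sketch that only uses the overrides shown above (+hours, hydra.launcher.qos, hydra.launcher.gpus_per_node); the helper name build_launch_command is hypothetical and is not part of the example repository.

```python
import shlex


def build_launch_command(hours, qos=None, gpus_per_node=None, script="train_mnist.py"):
    """Assemble the multirun command line from a few SLURM settings.

    Hypothetical helper: it only concatenates the overrides shown in this
    page and does not validate them against the launcher config.
    """
    args = ["python", script, "--multirun", "hydra/launcher=base", f"+hours={hours}"]
    if qos is not None:
        args.append(f"hydra.launcher.qos={qos}")
    if gpus_per_node is not None:
        args.append(f"hydra.launcher.gpus_per_node={gpus_per_node}")
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(args)


print(build_launch_command(hours=10, qos="qos_gpu-t3", gpus_per_node=4))
```

The printed line can be pasted into a shell on a front node, or passed to subprocess.run after splitting it with shlex.split.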

Weights&Biases

This requires a Weights&Biases account. wandb runs in offline mode because the compute nodes are not connected to the internet. To upload the results to the cloud, you need to sync them manually with the wandb sync run_dir command. The run directories are located in $SCRATCH/wandb/jean-zay-doc, but this can be changed using the wandb.dir config variable. You can also run a script on a front node to sync the runs before they are finished, for example using the script here.
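
The manual sync step can be sketched as a small script run on a front node. This is an illustration, not the script referred to above: the function name build_sync_commands is hypothetical, and it assumes that offline runs are stored in sub-directories whose names start with offline-run-, which should be checked against the actual content of the run directory. The commands are only printed, not executed.

```python
import os
from pathlib import Path


def build_sync_commands(wandb_dir):
    """Return one `wandb sync` command (as an argument list) per offline run.

    Assumption: offline runs live in sub-directories named "offline-run-*",
    possibly nested below `wandb_dir`.
    """
    root = Path(wandb_dir)
    if not root.is_dir():
        return []
    run_dirs = sorted(p for p in root.rglob("offline-run-*") if p.is_dir())
    return [["wandb", "sync", str(run_dir)] for run_dir in run_dirs]


if __name__ == "__main__":
    # Default location taken from the text; falls back to "." if $SCRATCH is unset.
    default_dir = os.path.join(os.environ.get("SCRATCH", "."), "wandb", "jean-zay-doc")
    for command in build_sync_commands(default_dir):
        print(" ".join(command))
```

Each printed command can then be run by hand, or executed with subprocess.run(command, check=True) once the list looks right.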

Hydra and submitit outputs

The outputs created by Hydra and submitit are located in the multirun directory. You can change this location by setting the hydra.dir config variable.

Batch jobs

To batch multiple similar jobs, you can use the sweep feature of Hydra. For example, if you want to run several trainings with different batch sizes, you can do the following:

python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128

This extends to a grid search over the Cartesian product of several parameters, for example:

python train_mnist.py --multirun hydra/launcher=base +hours=1 fit.batch_size=32,64,128 compile.optimizer=rmsprop,adam
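
To see how many jobs such a sweep submits, the Cartesian product can be enumerated by hand. The sketch below uses only the Python standard library and the values from the command above; it illustrates the grid, not Hydra's own sweeper, and the ordering shown is that of itertools.product, which is not guaranteed to match Hydra's job numbering.

```python
from itertools import product

# Sweep values copied from the command above.
sweep = {
    "fit.batch_size": [32, 64, 128],
    "compile.optimizer": ["rmsprop", "adam"],
}


def expand_sweep(sweep):
    """Return one list of 'key=value' overrides per point of the grid."""
    keys = list(sweep)
    return [
        [f"{key}={value}" for key, value in zip(keys, values)]
        for values in product(*(sweep[key] for key in keys))
    ]


jobs = expand_sweep(sweep)
for job_index, overrides in enumerate(jobs):
    print(job_index, " ".join(overrides))
```

With three batch sizes and two optimizers the grid contains 3 x 2 = 6 combinations, so each additional swept parameter multiplies the number of SLURM jobs.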

Similar resources

References

Alternatives

To Weights&Biases:
- MLflow
- TensorBoard

To Hydra:
- argparse
- click