We are researchers and engineers in AI (a very vague term, but oh well...) who have managed to get access to Jean Zay and think it can be a very useful cluster for your AI research.

At the time of writing (November 2020), the GPU part of Jean Zay is very much underused, and we think user-contributed documentation could help people navigate the access procedure and learn the few tips and tricks needed to be productive on such a cluster.

This is meant to be a collaborative doc: if you spot errors or things that could be improved, open an issue or, even better, a Pull Request (PR)!

We use Gitter for chat. Don't hesitate to get involved there and ask questions!


In the medium term, more material could be added to discuss tips and tricks, limitations, workarounds, etc. on Jean Zay. In particular, feel free to share tutorials, tools and scripts that help users make more productive use of the Jean Zay cluster, e.g.:

  • how to make your code use checkpointing so that long-running jobs can continue past the 20-hour wall time limit;
  • how to make sure your code can leverage the hardware optimally (e.g. with mixed precision and Tensor Cores);
  • how to make sure that your processing is not limited by suboptimal data access patterns on the disks or inefficient pre-processing on the CPUs;
  • how to do efficient hyper-parameter tuning at scale;
  • how to synchronize your code between your local computer and the cluster.
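
To give an idea of what the checkpointing item could look like, here is a minimal sketch in plain Python. It is not Jean Zay specific and makes several assumptions: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `stop_after` parameter and the toy `state` dictionary are all hypothetical, and `pickle` stands in for whatever saving mechanism your framework provides. The idea is simply to save the state after each step and to reload it at startup, so that a job killed at the wall time limit can be resubmitted and carry on where it stopped.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it over the target.

    The rename is atomic, so a job killed in the middle of a write
    leaves the previous checkpoint intact instead of a corrupt file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, stop_after=None):
    """Toy loop: `total` is a stand-in for model and optimizer state.

    `stop_after` (hypothetical) simulates the job being interrupted
    after a given number of steps in this run.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    steps_done = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            break
        state["total"] += state["step"]  # the "work" of one step
        state["step"] += 1
        steps_done += 1
        save_checkpoint(state, path)
    return state
```

Calling `run(10, "ckpt.pkl", stop_after=4)` and then `run(10, "ckpt.pkl")` should give the same final state as a single uninterrupted run. In a real training script you would probably checkpoint every N steps or every few minutes rather than at every step, since writing large model states is not free.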

Generic advice

  • There are big differences in the way of working between traditional HPC (High Performance Computing) users and AI users. For example, most traditional "serious" HPC clusters do not have access to the internet. Yes, you read that correctly: people in traditional HPC do not need internet access to work on their problems.
  • So far every interaction we have had with Jean Zay user support has been very positive. Even if there may be some frustration (on both sides), try to be both pedagogical and constructive when you send an email to