This page contains answers to frequently asked questions about GaNDLF.
### Where do I start?

The usage guide provides a good starting point for you to understand the application of GaNDLF. If you have any questions, please feel free to post a support request, and we will do our best to address it ASAP.
### Why am I getting the error `importlib.metadata.PackageNotFoundError: GANDLF`?

This means that GaNDLF was not installed correctly. Please ensure you have followed the installation guide properly, and verify the installation by running `gandlf verify-install` after activating the correct virtual environment. If you are still having issues, please feel free to post a support request, and we will do our best to address it ASAP.
### Which parts of GaNDLF are customizable?

Virtually all of them! For more details, please see the usage guide and our extensive samples. All available options are documented in the `config_all_options.yaml` file.
### Can I run GaNDLF on a high-performance computing (HPC) cluster?

Yes, GaNDLF has been run successfully on an SGE cluster and on another managed with Kubernetes. Please post a question with more details, such as the type of scheduler, and we will do our best to address it.
### Can I track the per-epoch training performance?

Yes, look for the `logs_*.csv` files in the output directory. They are arranged in accordance with the cross-validation configuration, with separate files for each data cohort (training/validation/testing), and they contain the values for all requested performance metrics, which are defined per problem type.
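As a sketch, these logs can be inspected with a few lines of Python. Note that the column names used below (`epoch_no`, `loss`, `dice`) are assumptions for illustration; the real columns depend on the metrics requested in your configuration:

```python
import csv
import io

# A miniature stand-in for one of GaNDLF's logs_*.csv files; the exact
# column names here are assumptions and depend on your configuration.
sample_log = """epoch_no,loss,dice
0,0.91,0.42
1,0.74,0.55
2,0.60,0.63
"""

# Parse the per-epoch rows and find the epoch with the best dice score.
rows = list(csv.DictReader(io.StringIO(sample_log)))
best = max(rows, key=lambda row: float(row["dice"]))
print(f"best epoch: {best['epoch_no']}, dice: {best['dice']}")
```

In practice you would open the real file from the output directory instead of the inline string.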
### Why is so much RAM being consumed during training?

If you have `data_preprocessing` enabled, GaNDLF loads all of the resized images into memory as tensors. Depending on your dataset (resolution, size, number of modalities), this can lead to high RAM usage. To avoid this, enable the memory saver mode by setting the `memory_save_mode` flag in the configuration; the resized images will then be written to disk instead.
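As a minimal sketch, enabling the flag might look like this in the configuration (`memory_save_mode` is the flag described above; the `resize` values are illustrative placeholders):

```yaml
data_preprocessing:
  resize: [128, 128, 128]  # illustrative preprocessing step
memory_save_mode: True     # write resized images to disk instead of keeping them in RAM
```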
### How can I resume training from a previous checkpoint?

GaNDLF allows you to resume training from a previous checkpoint in two ways:

1. If the `--resume` CLI parameter is passed to `gandlf run`, only the model weights and state dictionary are preserved; the parameters and data are taken from the new options on the CLI. This is helpful when you have updated the training data or some compatible options in the parameters.
2. If both `--resume` and `--reset` are skipped in `gandlf run`, the model weights, state dictionary, and all previously saved information (parameters, training/validation/testing data) are used to resume training.

### How can I update GaNDLF?

Run `pip install --upgrade gandlf`
to get the latest version of GaNDLF, or, if you are interested in the nightly builds, run `pip install --upgrade --pre gandlf`. Alternatively, if you installed from source, run `git pull` from the base `GaNDLF` directory to get the latest master of GaNDLF, and follow this up with `pip install -e .` after activating the appropriate virtual environment to ensure the updates get picked up.

### How can I federate my model using OpenFL?

Please see https://mlcommons.github.io/GaNDLF/usage/#federating-your-model-using-openfl.
### How can I federate my model evaluation using MedPerf?

Please see https://mlcommons.github.io/GaNDLF/usage/#federating-your-model-evaluation-using-medperf.
### I was using GaNDLF version `0.0.19` or earlier, and I am facing issues after updating to `0.0.20` or later. What should I do?

Please read the migration guide to understand the changes that have been made to GaNDLF. If you have any questions, please feel free to post a support request.
### Why am I getting a version mismatch error for my configuration?

This is a safety feature that ensures a tight integration between the configuration used to define a model and the code version used to perform the training. Ensure that all requirements are satisfied, then check the `version` key in the configuration and make sure it matches the output of `gandlf run --version`.
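For illustration, the `version` key pins a supported range of code versions. This sketch is based on common GaNDLF sample configurations, and the version numbers below are placeholders; check your own `gandlf run --version` output:

```yaml
version:
  {
    minimum: 0.0.20,
    maximum: 0.0.20
  }
```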
### Why are the results of my `global_*` classification metrics unexpected?

The classification metrics are based on TorchMetrics [ref], and this is an issue that is documented on their side [ref]. Please use either the `per_class_weighted` or the `per_class_average` metrics for final evaluation.
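Independent of that TorchMetrics issue, per-class metrics are also generally more informative on imbalanced data. A generic sketch (plain Python, not GaNDLF code) of how a globally pooled accuracy and a per-class average can disagree:

```python
# Ground truth and predictions for an imbalanced two-class problem:
# class 0 has 8 samples, class 1 has 2, and the model always predicts 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

# A "global" accuracy pools every sample together.
global_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# A per-class average (macro) scores each class separately, then averages.
def recall_for(cls):
    indices = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in indices) / len(indices)

per_class_average = (recall_for(0) + recall_for(1)) / 2

print(global_acc)         # 0.8: looks strong
print(per_class_average)  # 0.5: exposes that class 1 is never predicted
```

The pooled score hides the minority class entirely, which is why per-class variants are preferred for final evaluation.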
### What if I have another question?

Please post a support request.