Git is Not Enough
Version control and multi-user collaboration are problems largely solved by git for classic codebases. Unfortunately, git alone is not enough to handle the lifecycle of a modern ML (research) project, where many different problems arise:
- Data versioning: can you recover the pre-processed data a model was trained on? What if the data is still a work in progress?
- Hyperparameter comparison: can you reliably say which hyperparameters are the best?
- Model comparison: can you identify which approach or model is the best?
- Sweeps: can you easily search for the best hyperparameters and models?
- Code organization and reproducibility: how steep is the learning curve of the codebase?
You have to tackle all of these problems at once, which turns the setup of each new project into a jumble.
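To make the questions above concrete, here is a minimal, standard-library-only sketch of what "versioning the data" and "comparing hyperparameters" involve when done by hand. Everything in it is an illustrative assumption: `fingerprint`, `toy_score`, and `sweep` are hypothetical helpers, the scoring function is a stand-in for a real training run, and none of this is the API of any tool or of this template.

```python
import hashlib
import itertools
import json


def fingerprint(records):
    """Return a short, stable hash of the pre-processed data (hypothetical helper)."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def toy_score(lr, hidden):
    """Stand-in for a real training run; by construction it peaks at lr=0.01, hidden=64."""
    return -abs(lr - 0.01) * 100 - abs(hidden - 64) / 64


def sweep(data, grid):
    """Run every hyperparameter combination and record it next to the data hash."""
    data_id = fingerprint(data)
    names = sorted(grid)
    runs = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append({"data": data_id, "params": params, "score": toy_score(**params)})
    return runs


# Toy data and search space, both made up for illustration.
data = [{"x": 1.0, "y": 0}, {"x": 2.5, "y": 1}]
grid = {"lr": [0.1, 0.01, 0.001], "hidden": [32, 64]}

runs = sweep(data, grid)
best = max(runs, key=lambda run: run["score"])
print(best["params"], "trained on data", best["data"])
```

Because every run stores the data hash alongside its hyperparameters and score, runs stay comparable only when they share the same hash. Maintaining this bookkeeping by hand in each project is the kind of repeated effort that dedicated tools are meant to remove.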
Luckily, many great tools have been developed to solve or alleviate these obstacles. Examples are PyTorch Lightning to organize your code, DVC for data versioning, Weights & Biases to compare and analyze your experiments, Hydra for configurations and sweeps, and Streamlit to interact with and showcase your system.
These tools must work together in every project. The glue between them is scaffolding that is not project-specific, so it can and should be abstracted.
nn-template is exactly this: a generic template to bootstrap your project, enforcing code best practices.
It provides boilerplate code for:
- PyTorch Lightning, a lightweight PyTorch wrapper for high-performance AI research.
- Hydra, a framework for elegantly configuring complex applications.
- DVC, track large files, directories, or ML models. Think “Git for data”.
- Weights & Biases, organize and analyze machine learning experiments (educational account available).
- Streamlit, turns data scripts into shareable web apps in minutes.
You can click here to start a project with this template.