Jupyter notebooks are very useful for quickly prototyping ideas in data science projects because they show code, the output of that code, and a narrative explaining its logic right next to each other. But when these notebooks are committed to version control systems like git, code output containing sensitive information (such as passwords, access tokens, or sensitive data) can easily end up stored in places where it does not belong and where it is accessible to others.
This post explains how to make sure that sensitive information is never committed to version control by automatically stripping output from Jupyter notebooks using nbstripout before every commit.
The problem
What we are trying to avoid is having sensitive output like in the following example in version control where it can be read by others:
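As a rough illustration of how this happens, the sketch below builds a hypothetical notebook as a Python dictionary in the notebook v4 JSON layout. The token value, the environment variable name, and the cell id are all made up; the point is only that whatever a cell prints is stored in the .ipynb file next to the code.

```python
import json

# Hypothetical notebook with one code cell whose printed output leaks a secret.
# The token value, variable name, and cell id are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "id": "a1b2c3d4",
            "execution_count": 1,
            "metadata": {},
            "source": ["import os\n", "print(os.environ['API_TOKEN'])"],
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": ["s3cr3t-token-123\n"],
                }
            ],
        }
    ],
}

# This is the text git would store if the .ipynb file were committed as-is.
serialized = json.dumps(notebook, indent=1)

# The secret is part of the file content, not just of the running session.
leaked = "s3cr3t-token-123" in serialized
print(leaked)
```

Note that the source code of the cell never contains the token itself; it only ends up in the file because the output is saved together with the cell.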
Of course, if you manually strip all sensitive information before EVERY single commit: no problem. But we all know what happens when you just want to finish and commit before having lunch, getting a coffee, or leaving work for the weekend: you forget to do exactly that and disaster strikes.
The solution
Instead, we want to ensure that only the following is committed to version control without ever having to remember:
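To make the desired result concrete, here is a minimal sketch of output stripping in plain Python. It is not how nbstripout is implemented (the real tool handles more cases), and the function name strip_outputs and the example notebook are hypothetical; it only shows that code cells keep their source while outputs and execution counts are cleared.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with all code cell outputs removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop everything the cell printed or displayed
            cell["execution_count"] = None  # drop the In[n] counter as well
    return stripped


# Hypothetical notebook whose output contains a made-up secret.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "id": "a1b2c3d4",
            "execution_count": 1,
            "metadata": {},
            "source": ["import os\n", "print(os.environ['API_TOKEN'])"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["s3cr3t-token-123\n"]}
            ],
        }
    ],
}

clean = strip_outputs(example)
clean_text = json.dumps(clean, indent=1)
```

After stripping, the serialized notebook still contains the code that reads the token, but no longer the token value itself.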
I recommend ensuring this behavior by:
- Installing pre-commit to manage git pre-commit hooks
- Installing nbstripout as a pre-commit hook
- Running pre-commit install to set things up
- Done.
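The setup steps above revolve around a .pre-commit-config.yaml file in the repository root. The sketch below generates such a file from Python. It assumes the hook is pulled from the nbstripout repository on GitHub with the hook id nbstripout; the helper names are hypothetical, and the rev value is only a placeholder that should be pinned to whichever release you actually want to use.

```python
from pathlib import Path

# Assumption: the hook is taken from the upstream nbstripout repository.
NBSTRIPOUT_REPO = "https://github.com/kynan/nbstripout"


def render_precommit_config(rev: str) -> str:
    """Render a minimal pre-commit configuration that runs nbstripout."""
    return (
        "repos:\n"
        f"  - repo: {NBSTRIPOUT_REPO}\n"
        f"    rev: {rev}\n"
        "    hooks:\n"
        "      - id: nbstripout\n"
    )


def write_precommit_config(repo_root: Path, rev: str) -> Path:
    """Write the configuration to <repo_root>/.pre-commit-config.yaml."""
    target = repo_root / ".pre-commit-config.yaml"
    target.write_text(render_precommit_config(rev), encoding="utf-8")
    return target


# Placeholder release tag; replace it with the version you want to pin.
config_text = render_precommit_config(rev="0.6.1")
print(config_text)
```

With the file in place (for example via write_precommit_config(Path("."), rev="0.6.1")), running pre-commit install in the repository registers the hook with git.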
From now on, nbstripout runs before every commit and strips any output from .ipynb files.
Since the pre-commit hook works directly in git, it is also triggered automatically by the version control plugins in your favorite IDE.
Caveat
nbstripout modifies the working copy of your notebook.
The main caveat is that an active development session can be disrupted because cells have to be re-run more often to regenerate their output. I think this is a minor problem because:
a) only cell outputs are stripped, while variables and the kernel's working memory are not cleared, and b) it’s a good idea to frequently run notebooks top to bottom after restarting the kernel anyway, simply to make sure that things work as expected.