Description
Depending on the version of matplotlib, the way it was installed, the way other dependencies were installed and a bunch of other random factors, the results of our image comparison tests can vary slightly.
The differences are enough to fall well outside the tolerance window, even though the results are still perfectly valid.
So we have a set of multiple valid baseline images for most of our tests, but don't really have a good way to actually test if any of them matches.
Right now, we are using our own fork of pytest-mpl, which does support testing against multiple baselines: https://github.com/OGGM/pytest-mpl
But maintaining that has been a consistent pain, and it's getting harder the more upstream pytest-mpl evolves.
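To make the request concrete, here is a minimal sketch of the behaviour we are after: a test passes as soon as the generated image matches any one of several baselines. It is not the code from our fork and not part of the pytest-mpl API. The names `matches_any_baseline` and `_mpl_compare` are hypothetical, the default tolerance is arbitrary, and the only real library call is `matplotlib.testing.compare.compare_images`, which returns `None` when two images agree within `tol`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def _mpl_compare(expected: str, actual: str, tol: float) -> Optional[str]:
    # Imported lazily so the helper below can also run with a custom comparator.
    from matplotlib.testing.compare import compare_images

    # compare_images returns None on a match, otherwise a message string.
    return compare_images(expected, actual, tol=tol)


def matches_any_baseline(
    actual: str,
    baselines: Iterable[str],
    tol: float = 2.0,  # arbitrary default, chosen for illustration only
    compare: Callable[[str, str, float], Optional[str]] = _mpl_compare,
) -> Tuple[Optional[str], List[str]]:
    """Hypothetical helper: return (matching baseline or None, failure messages)."""
    failures: List[str] = []
    for baseline in baselines:
        message = compare(str(baseline), str(actual), tol)
        if message is None:
            # The first baseline within tolerance is enough to pass.
            return baseline, failures
        failures.append(message)
    # No baseline matched; the caller can report every failure message.
    return None, failures
```

In a test this would be used roughly as `matched, failures = matches_any_baseline(result_png, candidate_pngs)` followed by `assert matched is not None, failures`, where `result_png` and `candidate_pngs` are placeholders for the generated image and the list of accepted baselines.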
This must be a general issue that other projects run into as well. What is the proper way to deal with it?