Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Hi,
As a user of Visual Studio Code, I was happy to see that pandas supports devcontainers, i.e., a way to develop and run pandas with all its dependencies in an isolated docker container with vscode integration.
My assumption is that the devcontainer configuration should allow contributors, new and experienced, to automate tedious tasks away (install dependencies, build pandas, configure vscode), and accelerate pandas development.
However, I felt that the current implementation had some shortcomings:
- It did not allow me, as a new contributor to pandas, to jump right into pandas development: it did not build pandas automatically (or check for preconditions, such as git tags being present in the fork), and it suffered from filesystem permission issues because it was acting as root within the container.
- I wasn't sure how close the environment I would be developing in would be to the one used in GitHub Actions CI workflows, which could complicate troubleshooting when a commit works in the devcontainer but not on GitHub, or vice versa.
- It didn't fully configure vscode to be compatible with pandas' style guidelines (e.g., 88 character line limit) and toolchain (e.g., pre-commit hooks)
- Many of the vscode configurations were deprecated and due to be replaced.
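To illustrate the kind of precondition check I have in mind for the first point, here is a minimal sketch, not what the devcontainer currently does. It assumes that release tags in the repository match the pattern v*, and that the pandas repository is configured as a remote named upstream; the function names are hypothetical.

```python
import subprocess
import sys


def parse_tags(git_output: str) -> list[str]:
    """Split the output of `git tag --list` into non-empty tag names."""
    return [line.strip() for line in git_output.splitlines() if line.strip()]


def list_version_tags(repo_dir: str = ".") -> list[str]:
    """Return the tags matching 'v*' (assumed release-tag pattern) in repo_dir."""
    result = subprocess.run(
        ["git", "tag", "--list", "v*"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. when repo_dir is not a repository
    )
    return parse_tags(result.stdout)


def main() -> int:
    """Return 0 if version tags are present, 1 (with a hint) otherwise."""
    tags = list_version_tags()
    if not tags:
        # 'upstream' is an assumed remote name pointing at the pandas repository.
        print(
            "No version tags found in this clone. "
            "Try: git fetch upstream --tags",
            file=sys.stderr,
        )
        return 1
    print(f"Found {len(tags)} version tag(s); OK to build.")
    return 0
```

A devcontainer setup step could call main() before building and stop early with the hint, rather than failing later in the build with a less obvious error.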
Feature Description / Request for Comments
I opened PR #54845 to address some of these points, though I don't claim to have solved them completely.
@mroeschke referred to the history of pandas' Dockerfile, e.g., #49981, where docker image size and build times were of concern.
The image size in PR #54845 stands at roughly 1.2 GB. Build time is a few minutes for the docker image itself, plus on the order of ten minutes to install the python mamba environment and build pandas when first launched as a vscode devcontainer.
To better understand how to improve devcontainer support for pandas contributors, I'd like to ask the following:
- What is the Dockerfile use case today, aside from being part of the devcontainer setup? Which workflows is the docker image a part of, and what are their constraints?
- Is there an official docker image built and published by pandas? If so, when are new builds triggered?
Alternative Solutions
Depending on the discussion, one could have, e.g.,
- Separate Dockerfiles for the vscode devcontainer use case and for the other use cases
- A common Dockerfile containing largely static OS-level setup (e.g., C++ build dependencies), while the rest, which evolves more quickly in the pandas repo, is left to devcontainer-specific files and/or the docker image user
- A Dockerfile that contains a pre-built pandas distribution sourced from the main branch. In this case I think one would want to automatically build and publish docker images with each merge, so that users (i.e., contributors) can fetch a ready-made image and start developing.
What are your thoughts on this?