ENH: devcontainer & docker: use cases and improvements #54862

Open
@DavidToneian

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Hi,

as a user of Visual Studio Code, I was happy to see that pandas supports devcontainers, i.e., a way to develop and run pandas with all its dependencies in an isolated docker container with vscode integration.

My assumption is that the devcontainer configuration should allow contributors, new and experienced, to automate tedious tasks away (install dependencies, build pandas, configure vscode), and accelerate pandas development.

However, I felt that the current implementation had some shortcomings:

  1. It did not let me, as a new contributor, jump right into pandas development: it did not build pandas automatically or check preconditions (such as git tags being present in the fork), and it suffered from filesystem permission issues because it ran as root within the container.
  2. I wasn't sure how close the devcontainer environment would be to the one used in the GitHub Actions CI workflows, which could make troubleshooting harder when a commit works in the devcontainer but not on GH, or vice versa.
  3. It didn't fully configure vscode to be compatible with pandas' style guidelines (e.g., 88 character line limit) and toolchain (e.g., pre-commit hooks).
  4. Many of the vscode configuration settings were deprecated and need to be replaced.
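To illustrate the kind of precondition check meant in point 1, here is a minimal sketch in Python. It is not what the PR or pandas actually does: the function names are hypothetical, the tag pattern assumes release tags of the form `v2.1.0`, and the hint in the message assumes the main repository is configured as a git remote named `upstream`.

```python
import re
import subprocess

# Assumption: release tags look like "v2.1.0". The pattern is illustrative only.
VERSION_TAG = re.compile(r"^v\d+\.\d+\.\d+")


def has_version_tags(tags):
    """Return True if at least one tag looks like a release tag."""
    return any(VERSION_TAG.match(tag) for tag in tags)


def list_git_tags(repo_dir="."):
    """Return the tags of the git repository located at repo_dir."""
    result = subprocess.run(
        ["git", "tag", "--list"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def check_preconditions(repo_dir="."):
    """Collect human-readable problems to report before building (hypothetical helper)."""
    problems = []
    if not has_version_tags(list_git_tags(repo_dir)):
        problems.append(
            "No version tags found in this clone. If it is a fork, fetch the tags "
            "from the main repository, e.g. `git fetch upstream --tags` "
            "(assuming a remote named `upstream`)."
        )
    return problems


if __name__ == "__main__":
    for problem in check_preconditions():
        print(problem)
```

A devcontainer setup could run something like this before the build step and stop with a clear message instead of failing later in the build.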

Feature Description / Request for Comments

I opened PR #54845 to address some of these points, though I don't claim to have solved them completely.

@mroeschke made reference to the history of pandas' Dockerfile, e.g., #49981, where docker image size and build times were of concern.

The image size of PR #54845 stands at roughly 1.2 GB, and build time at a few minutes for the docker image itself, plus on the order of ten minutes to have it install the python mamba environment and build pandas when first launched as a vscode devcontainer.

To better understand how to improve devcontainer support for pandas contributors, I'd like to ask the following:

  1. What is the Dockerfile use case today, aside from being part of the devcontainer setup? Which workflows is the docker image a part of, and what are their constraints?
  2. Is there an official docker image built and published by pandas? If so, when are new builds triggered?

Alternative Solutions

Depending on the discussion, one could have, e.g.,

  1. Separate Dockerfiles for the vscode devcontainer use case, and for the other use cases
  2. A common Dockerfile containing largely-static OS-level setup (e.g., C++ build dependencies), while the rest that evolves more quickly in the pandas repo is left for devcontainer-specific files and/or the docker image user
  3. A Dockerfile that contains a pre-built pandas distribution sourced from the main branch -- in this case I think one would want to automatically build and publish docker images with each merge, so that users (i.e., contributors) can fetch a ready-made image and start developing.

What are your thoughts on this?

Labels: Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
