pandas-gbq auth proposal #161

Closed
@tswast

Description

Overview

The current auth flows for pandas-gbq are a bit confusing and hard to customize.

Final desired state. The pandas_gbq module should have the following (changes in bold):

  • read_gbq(query, project_id [optional], index_col=None, col_order=None, reauth, verbose [deprecated], private_key [deprecated], auth_local_webserver, dialect='legacy', configuration [optional], credentials [new param, optional])
  • to_gbq(dataframe, destination_table, project_id [optional], chunksize=None, verbose [deprecated], reauth, if_exists='fail', private_key [deprecated], auth_local_webserver, table_schema=None, credentials [new param, optional])
  • CredentialsCache (and WriteOnlyCredentialsCache, NoopCredentialsCache) - new class (and subclasses) for configuring user credentials caching behavior
  • context - global singleton with "client" property for caching default client in-memory.
  • get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False) - Helper function to get user authentication credentials.
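
A quick end-to-end sketch of that desired state (hypothetical; the credentials parameter and get_user_credentials helper described above do not exist yet):

    import pandas_gbq

    # Proposed helper: run the installed-app OAuth2 flow and cache the result.
    credentials = pandas_gbq.get_user_credentials()

    # Proposed parameter: pass the credentials explicitly to read_gbq.
    df = pandas_gbq.read_gbq(
        "SELECT name FROM `my-project.my_dataset.my_table`",
        project_id="my-project",
        credentials=credentials,
        dialect="standard",
    )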

Tasks:

  • Add authentication documentation with examples.
  • Add optional credentials parameter to read_gbq, taking a google.cloud.bigquery.Client object.
  • Add optional credentials parameter to to_gbq, taking a google.cloud.bigquery.Client object.
  • Add pandas_gbq.get_user_credentials() helper for fetching user credentials with installed-app OAuth2 flow.
  • Add pandas_gbq.CredentialsCache and related subclasses for managing user credentials cache.
  • Add pandas_gbq.context global for caching a default Client in-memory. Add examples for manually setting pandas_gbq.context.client (so that default project and other values like location can be set).
  • Update minimum google-cloud-bigquery version to 0.32.0 so that the project ID in the client can be overridden when creating query & load jobs. (Done in "ENH: Add location parameter to read_gbq and to_gbq" #185)
  • Deprecate private_key argument. Show examples of how to do the same thing by passing Credentials to the Client constructor.
  • Deprecate PANDAS_GBQ_CREDENTIALS_FILE environment variable. Show example using pandas_gbq.get_user_credentials with credentials_cache argument.
  • Deprecate reauth argument. Show examples using pandas_gbq.get_user_credentials with the credentials_cache argument and WriteOnlyCredentialsCache or NoopCredentialsCache. Edit: No reason to deprecate reauth, since we don't need to complicate pandas-gbq's auth with pydata-google-auth's implementation details.
  • Deprecate auth_local_webserver argument. Show example using pandas_gbq.get_user_credentials with the auth_local_webserver argument. Edit: No reason to deprecate auth_local_webserver, as that feature is still needed. We don't actually want to force people to use pydata-google-auth for the default credentials case.

Background

pandas-gbq has its own auth flows, which include but are distinct from "application default credentials".

See issue: #129

Current (0.4.0) state of pandas-gbq auth:

  1. Use a service account key file passed in as the private_key parameter. The parameter accepts either the key's JSON contents (as bytes) or a path to the key file (see the sketch after this list).
  2. Use application default credentials.
    1. Use service account key at GOOGLE_APPLICATION_CREDENTIALS environment variable.
    2. Use service account associated with Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
  3. Use user authentication.
    1. Attempt to load user credentials from cache stored at ~/.config/pandas_gbq/bigquery_credentials.dat or in path specified by PANDAS_GBQ_CREDENTIALS_FILE environment variable.
    2. Do 3-legged OAuth flow.
    3. Cache the user credentials to disk.
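
For reference, a minimal example of the private_key flow from step 1 above, as it works in 0.4.0:

    import pandas_gbq

    # Current (0.4.0) behavior: private_key is either the JSON key contents
    # or a path to a service account key file.
    df = pandas_gbq.read_gbq(
        "SELECT 1",
        project_id="my-project",
        private_key="/path/to/service-account-key.json",
    )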

Why does pandas-gbq do user auth at all? Aren't application default credentials enough?

  • It's difficult in some environments to set the right environment variables, so a way to explicitly provide credentials is desired.
  • BigQuery does resource-based billing, so it is possible to use user-based authentication.
    • User-based authentication eliminates the unnecessary step of creating a service account.
    • A user with the BigQuery User IAM role wouldn't be allowed to create a service account.
    • Often datasets are shared with a specific user. Querying with user account credentials will allow them to access those shared datasets / tables.
    • User-based authentication is more intuitive in shared notebook environments like Colab, where the compute credentials might be associated with a service account in a shadow project or not available at all.

Problems with the current flow

  • The credentials order isn't always ideal.
  • It's not possible to specify user credentials in environments where application default credentials are available.
  • If someone is familiar with the google-auth library, the current auth mechanisms do not allow passing in an arbitrary Credentials object.
  • It is verbose and error-prone to pass in explicit service account credentials every time. See "Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time" (#103) for a feature request for more configurable defaults.
    • Error-prone? More than once, I and other pandas-gbq contributors have forgotten to add a private_key argument to a call in a test, resulting in surprising failures in CI builds.
  • It's not possible to override the scopes for the credentials. For example, it is useful to add Drive / Sheets scopes for querying external data sources.

Proposal

Document default auth behavior

Current behavior (not changing, except for deprecations).

  1. Use client if passed in.
  2. Deprecated. Use private_key to create a Client if passed in. Use google-auth and credentials argument instead.
  3. Attempt to create client using application default credentials. Intersphinx link to google.auth.default
  4. Attempt to construct client using user credentials (project_id parameter must be passed in). Link to pandas_gbq.get_user_credentials().

New default auth behavior.

  • 1b. If no client is passed in, attempt to use the global client at pandas_gbq.context (similar to google.cloud.bigquery.magics.context). If there is no client in the global context, run steps 2-4 and store the client it creates in the global context.
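
A hypothetical sketch of configuring that global context up front (assuming the proposed pandas_gbq.context.client property):

    import pandas_gbq
    from google.cloud import bigquery

    # Proposed: set the default client once, so the project (and eventually
    # other defaults such as location) need not be repeated on every call.
    pandas_gbq.context.client = bigquery.Client(project="my-project")

    df = pandas_gbq.read_gbq("SELECT 1")  # uses the client from the global context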

Add client parameter to read_gbq and to_gbq

The new client parameter, if provided, would bypass all other credentials fetching mechanisms.
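
For example (hypothetical, using the proposed parameter name from this section):

    import pandas_gbq
    from google.cloud import bigquery

    # An explicitly constructed Client skips all other credential discovery.
    client = bigquery.Client(project="my-project")
    df = pandas_gbq.read_gbq("SELECT 1", client=client)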

Why a Client and not an explicit Credentials object?

  • A Client contains a default project (see the feature request for default projects, "Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time" #103) and will eventually handle other defaults, such as location, encryption configuration, and maximum bytes billed.
  • A Client object supports more BigQuery operations than will ever be exposed by pandas-gbq (creating datasets, modifying ACLs, other property updates). Passing this in as a parameter could hint to developers that they can use the Client directly for those things.
  • It makes it clearer that the BigQuery magic command is provided by google-cloud-bigquery, not pandas-gbq.

Helpers for user-based authentication

No helpers are needed for default credentials or service account credentials, because these can easily be constructed with the google-auth library. Link to samples in the docs for constructing these.
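
For illustration, both kinds of credentials can already be built with google-auth directly:

    import google.auth
    from google.oauth2 import service_account

    # Application default credentials, discovered from the environment.
    default_credentials, default_project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )

    # Explicit service account credentials from a key file.
    sa_credentials = service_account.Credentials.from_service_account_file(
        "/path/to/key.json",
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )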

pandas_gbq.get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False):

Behavior:

  1. If credentials_cache is None, construct a pandas_gbq.CredentialsCache with default arguments.
  2. Attempt to load credentials from the cache.
  3. If credentials can't be loaded, start the 3-legged OAuth2 flow for installed applications. Use the provided client secrets if given, otherwise use the pandas-gbq client secrets. Use the command-line flow by default; use the localhost webserver flow if use_localhost_webserver is True.
  4. If no credentials could be fetched, raise an AccessDenied error (the existing behavior of GbqConnector.get_user_account_credentials()).
  5. Save the credentials to the cache.
  6. Return the credentials.
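
A hypothetical usage sketch of this helper, adding the Drive scope (useful for querying Sheets-backed external tables) and constructing a Client from the result:

    import pandas_gbq
    from google.cloud import bigquery

    # Proposed helper; the scope list and parameter names follow the
    # signature above and are not final.
    credentials = pandas_gbq.get_user_credentials(
        scopes=[
            "https://www.googleapis.com/auth/bigquery",
            "https://www.googleapis.com/auth/drive",
        ],
        use_localhost_webserver=True,
    )

    client = bigquery.Client(project="my-project", credentials=credentials)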

pandas_gbq.CredentialsCache

Constructor takes optional credentials_path.

If credentials_path is not provided, set self._credentials_path to the first available of:

  • The path in the PANDAS_GBQ_CREDENTIALS_FILE environment variable - show a deprecation warning that this environment variable will be ignored at a later date.
  • The default user credentials path, ~/.config/pandas_gbq/bigquery_credentials.dat.

Methods

  • load() - load credentials from self._credentials_path, refresh them, and return them. Return None if credentials are not found.
  • save(credentials) - write credentials as JSON to self._credentials_path.
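
A rough sketch of what such a class could look like (illustrative only; the cache file format and refresh handling are simplified):

    import json
    import os

    import google.auth.transport.requests
    from google.oauth2.credentials import Credentials

    _DEFAULT_PATH = os.path.join(
        os.path.expanduser("~"), ".config", "pandas_gbq", "bigquery_credentials.dat"
    )

    class CredentialsCache(object):
        """Sketch of the proposed on-disk cache for user credentials."""

        def __init__(self, credentials_path=None):
            self._credentials_path = credentials_path or os.environ.get(
                "PANDAS_GBQ_CREDENTIALS_FILE", _DEFAULT_PATH
            )

        def load(self):
            """Load and refresh cached credentials; return None if not found."""
            try:
                with open(self._credentials_path) as fp:
                    info = json.load(fp)
            except (IOError, ValueError):
                return None
            credentials = Credentials.from_authorized_user_info(info)
            credentials.refresh(google.auth.transport.requests.Request())
            return credentials

        def save(self, credentials):
            """Write the credentials to disk as JSON."""
            info = {
                "refresh_token": credentials.refresh_token,
                "client_id": credentials.client_id,
                "client_secret": credentials.client_secret,
            }
            with open(self._credentials_path, "w") as fp:
                json.dump(info, fp)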

pandas_gbq.WriteOnlyCredentialsCache

Same as CredentialsCache, but load() is a no-op. Equivalent to "force reauth" in current versions.

pandas_gbq.NoopCredentialsCache

Satisfies the credentials cache interface, but does nothing. Useful for shared systems where you want credentials to stay in memory (e.g. Colab).
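
How the cache variants might combine with the proposed helper (hypothetical):

    import pandas_gbq

    # Force a fresh 3-legged OAuth flow but still persist the new credentials
    # (roughly what reauth=True does today).
    credentials = pandas_gbq.get_user_credentials(
        credentials_cache=pandas_gbq.WriteOnlyCredentialsCache()
    )

    # Keep credentials in memory only, never writing them to disk (e.g. Colab).
    credentials = pandas_gbq.get_user_credentials(
        credentials_cache=pandas_gbq.NoopCredentialsCache()
    )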

Deprecations

Some time should be given (a 1-year deprecation period?) for folks to migrate to the new client argument. The deprecated arguments might still be used in scripts and older notebooks, and private_key in particular is also a parameter upstream in pandas.

Deprecate the PANDAS_GBQ_CREDENTIALS_FILE environment variable

Log a deprecation warning suggesting pandas_gbq.get_user_credentials with a pandas_gbq.CredentialsCache passed as the credentials_cache argument.

Deprecate private_key argument

Log a deprecation warning suggesting google.oauth2.service_account.Credentials.from_service_account_info instead of passing in bytes and google.oauth2.service_account.Credentials.from_service_account_file instead of passing in a path.

Add / link to service account examples in the docs.
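
A possible migration example for those docs (the credentials parameter is the proposed one, not final):

    from google.oauth2 import service_account
    import pandas_gbq

    # Before (deprecated):
    #   pandas_gbq.read_gbq("SELECT 1", project_id="my-project",
    #                       private_key="/path/to/key.json")

    # After: build Credentials with google-auth and pass them in explicitly.
    credentials = service_account.Credentials.from_service_account_file(
        "/path/to/key.json"
    )
    df = pandas_gbq.read_gbq(
        "SELECT 1", project_id="my-project", credentials=credentials
    )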

Deprecate reauth argument

Log a deprecation warning suggesting that users create a client with credentials from pandas_gbq.get_user_credentials and a pandas_gbq.WriteOnlyCredentialsCache.

Add user authentication examples in the docs.

Deprecate auth_local_webserver argument

Log a deprecation warning suggesting that users create a client with credentials from pandas_gbq.get_user_credentials and set the auth_local_webserver argument there.

Add user authentication examples in the docs.

/cc @craigcitro @maxim-lian
