Description
Overview
The current auth flows for pandas-gbq are a bit confusing and hard to customize.
Final desired state: the `pandas_gbq` module should have the following (changes in bold):
- `read_gbq(query, project_id [optional], index_col=None, col_order=None, reauth, verbose [deprecated], private_key [deprecated], auth_local_webserver, dialect='legacy', configuration [optional], credentials [new param, optional])`
- `to_gbq(dataframe, destination_table, project_id [optional], chunksize=None, verbose [deprecated], reauth, if_exists='fail', private_key [deprecated], auth_local_webserver, table_schema=None, credentials [new param, optional])`
- CredentialsCache (and WriteOnlyCredentialsCache, NoopCredentialsCache) - new class (and subclasses) for configuring user credentials caching behavior
- context - global singleton with "client" property for caching default client in-memory.
- get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False) - Helper function to get user authentication credentials.
Tasks:
- Add authentication documentation with examples.
- Add optional `credentials` parameter to `read_gbq`, taking a `google.cloud.bigquery.Client` object.
- Add optional `credentials` parameter to `to_gbq`, taking a `google.cloud.bigquery.Client` object.
- Add `pandas_gbq.get_user_credentials()` helper for fetching user credentials with the installed-app OAuth2 flow.
- Add pandas_gbq.CredentialsCache and related subclasses for managing user credentials cache.
- Add pandas_gbq.context global for caching a default Client in-memory. Add examples for manually setting pandas_gbq.context.client (so that default project and other values like location can be set).
- Update minimum google-cloud-bigquery version to 0.32.0 so that the project ID in the client can be overridden when creating query & load jobs. (Done in ENH: Add location parameter to read_gbq and to_gbq #185)
- Deprecate the `private_key` argument. Show examples of how to do the same thing by passing Credentials to the Client constructor.
- Deprecate the `PANDAS_GBQ_CREDENTIALS_FILE` environment variable. Show an example using `pandas_gbq.get_user_credentials` with the `credentials_cache` argument.
* [ ] ~~Deprecate `reauth` argument. Show examples using `pandas_gbq.get_user_credentials` with the `credentials_cache` argument and `WriteOnlyCredentialsCache` or `NoopCredentialsCache`.~~ Edit: No reason to deprecate `reauth`, since we don't need to complicate pandas-gbq's auth with pydata-google-auth's implementation details.
* [ ] ~~Deprecate `auth_local_webserver` argument. Show an example using `pandas_gbq.get_user_credentials` with the `auth_local_webserver` argument.~~ Edit: No reason to deprecate `auth_local_webserver`, as that feature is still needed. We don't actually want to force people to use pydata-google-auth for the default credentials case.
Background
pandas-gbq has its own auth flows, which include but are distinct from "application default credentials".
See issue: #129
Current (0.4.0) state of pandas-gbq auth:
- Use a service account key file passed in as the `private_key` parameter. The parameter can be either JSON bytes or a file path.
- Use application default credentials.
  - Use the service account key at the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
  - Use the service account associated with Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
- Use user authentication.
  - Attempt to load user credentials from the cache stored at `~/.config/pandas_gbq/bigquery_credentials.dat` or in the path specified by the `PANDAS_GBQ_CREDENTIALS_FILE` environment variable.
  - Do the 3-legged OAuth flow.
  - Cache the user credentials to disk.
Why does pandas-gbq do user auth at all? Aren't application default credentials enough?
- It's difficult in some environments to set the right environment variables, so a way to explicitly provide credentials is desired.
- BigQuery does resource-based billing, so it is possible to use user-based authentication.
- User-based authentication eliminates the unnecessary step of creating a service account.
- A user with the BigQuery User IAM role wouldn't be allowed to create a service account.
- Often datasets are shared with a specific user. Querying with user account credentials will allow them to access those shared datasets / tables.
- User-based authentication is more intuitive in shared notebook environments like Colab, where the compute credentials might be associated with a service account in a shadow project or not available at all.
Problems with the current flow
- The credentials order isn't always ideal.
- It's not possible to specify user credentials in environments where application default credentials are available.
- If someone is familiar with the google-auth library, the current auth mechanisms do not allow passing in an arbitrary Credentials object.
- It is verbose and error-prone to pass in explicit service account credentials every time. See #103 ("Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time") for a feature request for more configurable defaults.
  - Error-prone? More than once have I and the other pandas-gbq contributors forgotten to add a `private_key` argument to a call in a test, resulting in surprising failures in CI builds.
- It's not possible to override the scopes for the credentials. For example, it is useful to add Drive / Sheets scopes for querying external data sources.
Proposal
Document default auth behavior
Current behavior (not changing, except for deprecations):
1. Use the client if passed in.
2. *Deprecated.* Use `private_key` to create a Client if passed in. Use google-auth and the `credentials` argument instead.
3. Attempt to create a client using application default credentials. (Intersphinx link to `google.auth.default`.)
4. Attempt to construct a client using user credentials (the `project_id` parameter must be passed in). Link to `pandas_gbq.get_user_credentials()`.

New default auth behavior:
- 1b. If a client is not passed in, attempt to use the global client at `pandas_gbq.context` (similar to `google.cloud.bigquery.magics.context`). If there is no client in the global context, run steps 2-4 and set the client they create in the global context.
Add `client` parameter to `read_gbq` and `to_gbq`

The new `client` parameter, if provided, would bypass all other credentials-fetching mechanisms.
Why a Client and not an explicit Credentials object?
- A Client contains a default project (see #103, "Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time", for the feature request for default projects) and will eventually handle other defaults, such as location, encryption configuration, and maximum bytes billed.
- A Client object supports more BigQuery operations than will ever be exposed by pandas-gbq (creating datasets, modifying ACLs, other property updates). Passing this in as a parameter could hint to developers that they can use the Client directly for those things.
- It is clearer that the BigQuery magic command is provided by google-cloud-bigquery, not pandas-gbq.
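As a usage sketch (hedged: the `client` parameter is proposed, not released, and the function/project names are illustrative; imports are deferred so the snippet parses without the libraries installed):

```python
def query_with_client(query, project="my-billing-project"):
    """Illustrative calling convention for the proposed parameter.

    Requires google-cloud-bigquery and a pandas-gbq release that
    accepts `client`; both are assumptions, hence the deferred imports.
    """
    from google.cloud import bigquery
    import pandas_gbq

    # The Client carries the default project (and, later, location etc.).
    client = bigquery.Client(project=project)
    return pandas_gbq.read_gbq(query, client=client)  # proposed parameter
```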
Helpers for user-based authentication
No helpers are needed for default credentials or service account credentials because these can easily be constructed with the google-auth library. Link to samples for constructing these from the docs.
`pandas_gbq.get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False)`:
1. If `credentials_cache` is None, construct a `pandas_gbq.CredentialsCache` with default arguments.
2. Attempt to load credentials from the cache.
3. If credentials can't be loaded, start the 3-legged OAuth2 flow for installed applications. Use the provided client secrets if given; otherwise use the pandas-gbq client secrets. Use the command-line flow by default; use a localhost webserver if `use_localhost_webserver` is True.
4. If no credentials could be fetched, raise an `AccessDenied` error (existing behavior of `GbqConnector.get_user_account_credentials()`).
5. Save the credentials to the cache.
6. Return the credentials.
pandas_gbq.CredentialsCache

The constructor takes an optional `credentials_path`. If `credentials_path` is not provided, set `self._credentials_path` to `PANDAS_GBQ_CREDENTIALS_FILE` (and show a deprecation warning that this environment variable will be ignored at a later date), falling back to the default user credentials path at `~/.config/pandas_gbq/bigquery_credentials.dat`.
Methods
- `load()` - load credentials from `self._credentials_path`, refresh them, and return them; return None if credentials are not found.
- `save(credentials)` - write the credentials as JSON to `self._credentials_path`.
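A minimal sketch of the class, assuming plain JSON payloads (the real implementation would store google-auth credentials and refresh them in `load()`):

```python
import json
import os
from pathlib import Path


class CredentialsCache:
    """Sketch of the proposed pandas_gbq.CredentialsCache."""

    def __init__(self, credentials_path=None):
        if credentials_path is None:
            # Deprecated fallback: PANDAS_GBQ_CREDENTIALS_FILE (a warning
            # would be logged that it will be ignored at a later date).
            credentials_path = os.environ.get("PANDAS_GBQ_CREDENTIALS_FILE")
        if credentials_path is None:
            # Default user credentials path.
            credentials_path = (Path.home() / ".config" / "pandas_gbq"
                                / "bigquery_credentials.dat")
        self._credentials_path = Path(credentials_path)

    def load(self):
        """Return cached credentials, or None when none are found.

        The real implementation would also refresh them here."""
        try:
            return json.loads(self._credentials_path.read_text())
        except (OSError, ValueError):
            return None

    def save(self, credentials):
        """Write the credentials as JSON to self._credentials_path."""
        self._credentials_path.parent.mkdir(parents=True, exist_ok=True)
        self._credentials_path.write_text(json.dumps(credentials))
```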
pandas_gbq.WriteOnlyCredentialsCache
Same as CredentialsCache, but load() is a no-op. Equivalent to "force reauth" in current versions.
pandas_gbq.NoopCredentialsCache
Satisfies the credentials cache interface, but does nothing. Useful for shared systems where you want credentials to stay in memory (e.g. Colab).
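The two subclasses only need to override the cache interface; a sketch against a minimal in-memory stand-in for the base class (the real base would be disk-backed):

```python
class CredentialsCache:
    """Minimal in-memory stand-in for the proposed base class."""

    def __init__(self):
        self._credentials = None

    def load(self):
        return self._credentials

    def save(self, credentials):
        self._credentials = credentials


class WriteOnlyCredentialsCache(CredentialsCache):
    """Saves but never loads: every call re-runs the OAuth flow,
    equivalent to "force reauth" in current versions."""

    def load(self):
        return None


class NoopCredentialsCache(CredentialsCache):
    """Neither loads nor saves, so credentials never touch disk;
    useful on shared systems such as Colab."""

    def load(self):
        return None

    def save(self, credentials):
        pass
```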
Deprecations
Some time should be given (a 1-year deprecation?) for folks to migrate to the new `client` argument. The deprecated arguments might be used in scripts and older notebooks, and `private_key` is also a parameter upstream in pandas.

Deprecate the PANDAS_GBQ_CREDENTIALS_FILE environment variable

Log a deprecation warning suggesting `pandas_gbq.get_user_credentials` with a `pandas_gbq.CredentialsCache` argument.
Deprecate private_key argument

Log a deprecation warning suggesting `google.oauth2.service_account.Credentials.from_service_account_info` instead of passing in bytes, and `google.oauth2.service_account.Credentials.from_service_account_file` instead of passing in a path.
Add / link to service account examples in the docs.
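The migration could look like this sketch. The function name is illustrative (not pandas-gbq API), and the google-auth import is deferred on the assumption that google-auth is available at call time:

```python
import json
import os


def credentials_from_private_key(private_key):
    """Illustrative replacement for the deprecated private_key argument.

    Accepts JSON bytes/str or a file path, mirroring the old parameter.
    """
    from google.oauth2 import service_account

    if os.path.exists(private_key):
        # Previously: read_gbq(sql, private_key="/path/to/key.json")
        return service_account.Credentials.from_service_account_file(private_key)
    # Previously: read_gbq(sql, private_key=b'{"type": "service_account", ...}')
    return service_account.Credentials.from_service_account_info(
        json.loads(private_key))
```

The resulting Credentials object would then be passed to the `google.cloud.bigquery.Client` constructor.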
Deprecate reauth argument

Log a deprecation warning suggesting creating a client using credentials from `pandas_gbq.get_user_credentials` and a `pandas_gbq.WriteOnlyCredentialsCache`.
Add user authentication examples in the docs.
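A sketch of the suggested replacement (hedged: the proposed helpers do not exist yet, so the import is deferred and the function name is illustrative):

```python
def reauth_credentials():
    """Illustrative replacement for reauth=True under this proposal."""
    import pandas_gbq

    # A write-only cache ignores cached tokens on load, forcing a fresh
    # OAuth flow, while still saving the new credentials for later runs.
    return pandas_gbq.get_user_credentials(
        credentials_cache=pandas_gbq.WriteOnlyCredentialsCache())
```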
Deprecate auth_local_webserver argument

Log a deprecation warning suggesting creating a client using credentials from `pandas_gbq.get_user_credentials`, setting its `use_localhost_webserver` argument there.
Add user authentication examples in the docs.
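Again as a hedged sketch (the proposed helper does not exist yet, so the import is deferred and the function name is illustrative):

```python
def browser_auth_credentials():
    """Illustrative replacement for auth_local_webserver=True."""
    import pandas_gbq

    # The webserver choice moves from read_gbq/to_gbq onto the helper.
    return pandas_gbq.get_user_credentials(use_localhost_webserver=True)
```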
/cc @craigcitro @maxim-lian