Description
Overview
The current auth flows for pandas-gbq are a bit confusing and hard to customize.
Final desired state: the `pandas_gbq` module should have the following (changes in bold):
- `read_gbq(query, project_id [optional], index_col=None, col_order=None, reauth, verbose [deprecated], private_key [deprecated], auth_local_webserver, dialect='legacy', configuration [optional], credentials [new param, optional])`
- `to_gbq(dataframe, destination_table, project_id [optional], chunksize=None, verbose [deprecated], reauth, if_exists='fail', private_key [deprecated], auth_local_webserver, table_schema=None, credentials [new param, optional])`
- CredentialsCache (and WriteOnlyCredentialsCache, NoopCredentialsCache) - new class (and subclasses) for configuring user credentials caching behavior
- context - global singleton with "client" property for caching default client in-memory.
- get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False) - Helper function to get user authentication credentials.
Tasks:
- Add authentication documentation with examples.
- Add optional `credentials` parameter to `read_gbq`, taking a `google.cloud.bigquery.Client` object.
- Add optional `credentials` parameter to `to_gbq`, taking a `google.cloud.bigquery.Client` object.
- Add `pandas_gbq.get_user_credentials()` helper for fetching user credentials with the installed-app OAuth2 flow.
- Add pandas_gbq.CredentialsCache and related subclasses for managing user credentials cache.
- Add pandas_gbq.context global for caching a default Client in-memory. Add examples for manually setting pandas_gbq.context.client (so that default project and other values like location can be set).
- Update minimum google-cloud-bigquery version to 0.32.0 so that the project ID in the client can be overridden when creating query & load jobs. (Done in ENH: Add location parameter to read_gbq and to_gbq #185)
- Deprecate the `private_key` argument. Show examples of how to do the same thing by passing Credentials to the Client constructor.
- Deprecate the `PANDAS_GBQ_CREDENTIALS_FILE` environment variable. Show an example using `pandas_gbq.get_user_credentials` with the `credentials_cache` argument.
* [ ] ~~Deprecate `reauth` argument. Show examples using `pandas_gbq.get_user_credentials` with the `credentials_cache` argument and `WriteOnlyCredentialsCache` or `NoopCredentialsCache`.~~ Edit: No reason to deprecate `reauth`, since we don't need to complicate pandas-gbq's auth with pydata-google-auth's implementation details.
* [ ] ~~Deprecate `auth_local_webserver` argument. Show an example using `pandas_gbq.get_user_credentials` with the `auth_local_webserver` argument.~~ Edit: No reason to deprecate `auth_local_webserver`, as that feature is still needed. We don't actually want to force people to use pydata-google-auth for the default credentials case.
Background
pandas-gbq has its own auth flows, which include but are distinct from "application default credentials".
See issue: #129
Current (0.4.0) state of pandas-gbq auth:
- Use a service account key file passed in as the `private_key` parameter. The parameter can be either JSON bytes or a file path.
- Use application default credentials.
  - Use the service account key at the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
  - Use the service account associated with Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
- Use user authentication.
  - Attempt to load user credentials from the cache stored at `~/.config/pandas_gbq/bigquery_credentials.dat` or in the path specified by the `PANDAS_GBQ_CREDENTIALS_FILE` environment variable.
  - Do the 3-legged OAuth flow.
  - Cache the user credentials to disk.
Why does pandas-gbq do user auth at all? Aren't application default credentials enough?
- It's difficult in some environments to set the right environment variables, so a way to explicitly provide credentials is desired.
- BigQuery does resource-based billing, so it is possible to use user-based authentication.
- User-based authentication eliminates the unnecessary step of creating a service account.
- A user with the BigQuery User IAM role wouldn't be allowed to create a service account.
- Often datasets are shared with a specific user. Querying with user account credentials will allow them to access those shared datasets / tables.
- User-based authentication is more intuitive in shared notebook environments like Colab, where the compute credentials might be associated with a service account in a shadow project or not available at all.
Problems with the current flow
- The credentials order isn't always ideal.
- It's not possible to specify user credentials in environments where application default credentials are available.
- If someone is familiar with the google-auth library, the current auth mechanisms do not allow passing in an arbitrary Credentials object.
- It is verbose and error-prone to pass in explicit service account credentials every time. See #103 ("Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time") for a feature request for more configurable defaults.
  - Error-prone? More than once have I and the other pandas-gbq contributors forgotten to add a `private_key` argument to a call in a test, resulting in surprising failures in CI builds.
- It's not possible to override the scopes for the credentials. For example, it is useful to add Drive / Sheets scopes for querying external data sources.
Proposal
Document default auth behavior
Current behavior (not changing, except for deprecations):
1. Use the client if passed in.
2. *Deprecated.* Use `private_key` to create a Client if passed in. Use google-auth and the `credentials` argument instead.
3. Attempt to create a client using application default credentials. (Intersphinx link to `google.auth.default`.)
4. Attempt to construct a client using user credentials (the `project_id` parameter must be passed in). Link to `pandas_gbq.get_user_credentials()`.

New default auth behavior:
- 1b. If a client is not passed in, attempt to use the global client at `pandas_gbq.context` (similar to `google.cloud.bigquery.magics.context`). If there is no client in the global context, run steps 2-4 and set the client they create in the global context.
Add `client` parameter to `read_gbq` and `to_gbq`

The new `client` parameter, if provided, would bypass all other credentials-fetching mechanisms.
Why a Client and not an explicit Credentials object?
- A Client contains a default project (see #103, "Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time", for the feature request for default projects) and will eventually handle other defaults, such as location, encryption configuration, and maximum bytes billed.
- A Client object supports more BigQuery operations than will ever be exposed by pandas-gbq (creating datasets, modifying ACLs, other property updates). Passing this in as a parameter could hint to developers that they can use the Client directly for those things.
- It is clearer that the BigQuery magic command is provided by google-cloud-bigquery, not pandas-gbq.
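As a usage sketch (hedged: the `client` parameter is proposed, not released, and the function/project names are illustrative; imports are deferred so the snippet parses without the libraries installed):

```python
def query_with_client(query, project="my-billing-project"):
    """Illustrative calling convention for the proposed parameter.

    Requires google-cloud-bigquery and a pandas-gbq release that
    accepts `client`; both are assumptions, hence the deferred imports.
    """
    from google.cloud import bigquery
    import pandas_gbq

    # The Client carries the default project (and, later, location etc.).
    client = bigquery.Client(project=project)
    return pandas_gbq.read_gbq(query, client=client)  # proposed parameter
```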
Helpers for user-based authentication
No helpers are needed for default credentials or service account credentials because these can easily be constructed with the google-auth library. Link to samples for constructing these from the docs.
`pandas_gbq.get_user_credentials(scopes=None, credentials_cache=None, client_secrets=None, use_localhost_webserver=False)`:
1. If `credentials_cache` is None, construct a `pandas_gbq.CredentialsCache` with default arguments.
2. Attempt to load credentials from the cache.
3. If credentials can't be loaded, start the 3-legged OAuth2 flow for installed applications. Use the provided client secrets if given; otherwise use the pandas-gbq client secrets. Use the command-line flow by default; use a localhost webserver if `use_localhost_webserver` is True.
4. If no credentials could be fetched, raise an `AccessDenied` error (existing behavior of `GbqConnector.get_user_account_credentials()`).
5. Save the credentials to the cache.
6. Return the credentials.
pandas_gbq.CredentialsCache

The constructor takes an optional `credentials_path`. If `credentials_path` is not provided, set `self._credentials_path` to `PANDAS_GBQ_CREDENTIALS_FILE` (and show a deprecation warning that this environment variable will be ignored at a later date), falling back to the default user credentials path at `~/.config/pandas_gbq/bigquery_credentials.dat`.
Methods
- `load()` - load credentials from `self._credentials_path`, refresh them, and return them; return None if credentials are not found.
- `save(credentials)` - write the credentials as JSON to `self._credentials_path`.
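A minimal sketch of the class, assuming plain JSON payloads (the real implementation would store google-auth credentials and refresh them in `load()`):

```python
import json
import os
from pathlib import Path


class CredentialsCache:
    """Sketch of the proposed pandas_gbq.CredentialsCache."""

    def __init__(self, credentials_path=None):
        if credentials_path is None:
            # Deprecated fallback: PANDAS_GBQ_CREDENTIALS_FILE (a warning
            # would be logged that it will be ignored at a later date).
            credentials_path = os.environ.get("PANDAS_GBQ_CREDENTIALS_FILE")
        if credentials_path is None:
            # Default user credentials path.
            credentials_path = (Path.home() / ".config" / "pandas_gbq"
                                / "bigquery_credentials.dat")
        self._credentials_path = Path(credentials_path)

    def load(self):
        """Return cached credentials, or None when none are found.

        The real implementation would also refresh them here."""
        try:
            return json.loads(self._credentials_path.read_text())
        except (OSError, ValueError):
            return None

    def save(self, credentials):
        """Write the credentials as JSON to self._credentials_path."""
        self._credentials_path.parent.mkdir(parents=True, exist_ok=True)
        self._credentials_path.write_text(json.dumps(credentials))
```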
pandas_gbq.WriteOnlyCredentialsCache
Same as CredentialsCache, but load() is a no-op. Equivalent to "force reauth" in current versions.
pandas_gbq.NoopCredentialsCache
Satisfies the credentials cache interface, but does nothing. Useful for shared systems where you want credentials to stay in memory (e.g. Colab).
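The two subclasses only need to override the cache interface; a sketch against a minimal in-memory stand-in for the base class (the real base would be disk-backed):

```python
class CredentialsCache:
    """Minimal in-memory stand-in for the proposed base class."""

    def __init__(self):
        self._credentials = None

    def load(self):
        return self._credentials

    def save(self, credentials):
        self._credentials = credentials


class WriteOnlyCredentialsCache(CredentialsCache):
    """Saves but never loads: every call re-runs the OAuth flow,
    equivalent to "force reauth" in current versions."""

    def load(self):
        return None


class NoopCredentialsCache(CredentialsCache):
    """Neither loads nor saves, so credentials never touch disk;
    useful on shared systems such as Colab."""

    def load(self):
        return None

    def save(self, credentials):
        pass
```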
Deprecations
Some time should be given (a 1-year deprecation?) for folks to migrate to the new `client` argument. The deprecated arguments might be used in scripts and older notebooks, and `private_key` is also a parameter upstream in pandas.

Deprecate the PANDAS_GBQ_CREDENTIALS_FILE environment variable

Log a deprecation warning suggesting `pandas_gbq.get_user_credentials` with a `pandas_gbq.CredentialsCache` argument.
Deprecate private_key argument

Log a deprecation warning suggesting `google.oauth2.service_account.Credentials.from_service_account_info` instead of passing in bytes, and `google.oauth2.service_account.Credentials.from_service_account_file` instead of passing in a path.
Add / link to service account examples in the docs.
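The migration could look like this sketch. The function name is illustrative (not pandas-gbq API), and the google-auth import is deferred on the assumption that google-auth is available at call time:

```python
import json
import os


def credentials_from_private_key(private_key):
    """Illustrative replacement for the deprecated private_key argument.

    Accepts JSON bytes/str or a file path, mirroring the old parameter.
    """
    from google.oauth2 import service_account

    if os.path.exists(private_key):
        # Previously: read_gbq(sql, private_key="/path/to/key.json")
        return service_account.Credentials.from_service_account_file(private_key)
    # Previously: read_gbq(sql, private_key=b'{"type": "service_account", ...}')
    return service_account.Credentials.from_service_account_info(
        json.loads(private_key))
```

The resulting Credentials object would then be passed to the `google.cloud.bigquery.Client` constructor.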
Deprecate reauth argument

Log a deprecation warning suggesting creating a client using credentials from `pandas_gbq.get_user_credentials` and a `pandas_gbq.WriteOnlyCredentialsCache`.
Add user authentication examples in the docs.
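A sketch of the suggested replacement (hedged: the proposed helpers do not exist yet, so the import is deferred and the function name is illustrative):

```python
def reauth_credentials():
    """Illustrative replacement for reauth=True under this proposal."""
    import pandas_gbq

    # A write-only cache ignores cached tokens on load, forcing a fresh
    # OAuth flow, while still saving the new credentials for later runs.
    return pandas_gbq.get_user_credentials(
        credentials_cache=pandas_gbq.WriteOnlyCredentialsCache())
```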
Deprecate auth_local_webserver argument

Log a deprecation warning suggesting creating a client using credentials from `pandas_gbq.get_user_credentials`, setting its `use_localhost_webserver` argument there.
Add user authentication examples in the docs.
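Again as a hedged sketch (the proposed helper does not exist yet, so the import is deferred and the function name is illustrative):

```python
def browser_auth_credentials():
    """Illustrative replacement for auth_local_webserver=True."""
    import pandas_gbq

    # The webserver choice moves from read_gbq/to_gbq onto the helper.
    return pandas_gbq.get_user_credentials(use_localhost_webserver=True)
```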
/cc @craigcitro @maxim-lian