
ENH: Allow partial table schema in to_gbq() table_schema (#218) #257

Merged 6 commits on Mar 12, 2019
7 changes: 6 additions & 1 deletion docs/source/changelog.rst
@@ -20,6 +20,11 @@ Internal changes
- Use ``to_dataframe()`` from ``google-cloud-bigquery`` in the ``read_gbq()``
function. (:issue:`247`)

Enhancements
~~~~~~~~~~~~
- Allow ``table_schema`` in :func:`to_gbq` to contain only a subset of columns,
with the rest being populated using the DataFrame dtypes (:issue:`218`)
(contributed by @johnpaton)

.. _changelog-0.9.0:

@@ -237,4 +242,4 @@ Initial release of transferred code from `pandas <https://github.com/pandas-dev/p
Includes patches since the 0.19.2 release on pandas with the following:

- :func:`read_gbq` now allows query configuration preferences `pandas-GH#14742 <https://github.com/pandas-dev/pandas/pull/14742>`__
- :func:`read_gbq` now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss `pandas-GH#14064 <https://github.com/pandas-dev/pandas/pull/14064>`__ and `pandas-GH#14305 <https://github.com/pandas-dev/pandas/pull/14305>`__
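
To illustrate the enhancement described in the changelog entry above, here is a minimal sketch of a ``to_gbq`` call that supplies only a partial ``table_schema``. The project id and destination table are hypothetical; columns not named in the schema fall back to dtype-based inference.

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {
        "name": ["a", "b"],
        "count": [1, 2],
        "event_date": ["2019-03-01", "2019-03-02"],  # object dtype
    }
)

# Only "event_date" needs an explicit override; "name" and "count"
# are inferred from the DataFrame dtypes (STRING and INTEGER).
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",  # hypothetical destination table
    project_id="my-project",  # hypothetical project id
    table_schema=[{"name": "event_date", "type": "DATE"}],
)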
21 changes: 16 additions & 5 deletions pandas_gbq/gbq.py
@@ -939,9 +939,11 @@ def to_gbq(
         'STRING'},...]``.
         If schema is not provided, it will be
         generated according to dtypes of DataFrame columns.
-        If schema is provided, it must contain all DataFrame columns.
-        pandas_gbq.gbq._generate_bq_schema() may be used to create an initial
-        schema, though it doesn't preserve column order.
+        If schema is provided, it may contain all or a subset of DataFrame
+        columns. If a subset is provided, the rest will be inferred from
+        the DataFrame dtypes.
+        pandas_gbq.gbq._generate_bq_schema() may be used to create an
+        initial schema, though it doesn't preserve column order.
         See BigQuery API documentation on available names of a field.

         .. versionadded:: 0.3.1
@@ -1023,10 +1025,13 @@ def to_gbq(
         credentials=connector.credentials,
     )

+    default_schema = _generate_bq_schema(dataframe)
     if not table_schema:
-        table_schema = _generate_bq_schema(dataframe)
+        table_schema = default_schema
     else:
-        table_schema = dict(fields=table_schema)
+        table_schema = _update_bq_schema(
+            default_schema, dict(fields=table_schema)
+        )

     # If table exists, check if_exists parameter
     if table.exists(table_id):
@@ -1091,6 +1096,12 @@ def _generate_bq_schema(df, default_type="STRING"):
     return schema.generate_bq_schema(df, default_type=default_type)


+def _update_bq_schema(schema_old, schema_new):
+    from pandas_gbq import schema
+
+    return schema.update_schema(schema_old, schema_new)
Contributor:
Do we need this function? Should we import from schema directly? Or is there a circular import?

Contributor Author:

I was just following the pattern used for the only other schema function, which is imported the same way in gbq._generate_bq_schema. As far as I'm concerned it's fine to get rid of it. Let me know and I'll make the change.

Contributor:

I'll defer to Chesterton's fence; we can clean up later if @tswast knows

Collaborator:

_generate_bq_schema is only in gbq.py for backwards compatibility. We should use the update_schema method from the schema module directly.

I've sent #259 to clean this up (and also improve the docs for this feature).



class _Table(GbqConnector):
    def __init__(
        self,
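For reference, a rough sketch of the merge step ``to_gbq`` now performs, using the two private helpers from the diff above. The DataFrame and the commented results are illustrative, assuming the default dtype mapping (``int64`` to INTEGER, ``object`` to STRING).

import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame({"name": ["x"], "count": [1]})

default_schema = gbq._generate_bq_schema(df)
# roughly: {"fields": [{"name": "name", "type": "STRING"},
#                      {"name": "count", "type": "INTEGER"}]}

# A partial user schema overrides only the fields it names.
merged = gbq._update_bq_schema(
    default_schema, dict(fields=[{"name": "count", "type": "FLOAT"}])
)
# "count" becomes FLOAT; "name" keeps its inferred STRING type.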
29 changes: 29 additions & 0 deletions pandas_gbq/schema.py
@@ -31,3 +31,32 @@ def generate_bq_schema(dataframe, default_type="STRING"):
        )

    return {"fields": fields}


def update_schema(schema_old, schema_new):
    """
    Given an old BigQuery schema, update it with a new one.

    Where a field name is the same, the new will replace the old. Any
    new fields not present in the old schema will be added.

    Arguments:
        schema_old: the old schema to update
        schema_new: the new schema which will overwrite/extend the old
    """
    old_fields = schema_old["fields"]
    new_fields = schema_new["fields"]
    output_fields = list(old_fields)

    field_indices = {field["name"]: i for i, field in enumerate(output_fields)}

    for field in new_fields:
        name = field["name"]
        if name in field_indices:
            # replace old field with new field of same name
            output_fields[field_indices[name]] = field
        else:
            # add new field
            output_fields.append(field)

    return {"fields": output_fields}
46 changes: 46 additions & 0 deletions tests/unit/test_schema.py
@@ -54,3 +54,49 @@
def test_generate_bq_schema(dataframe, expected_schema):
    schema = pandas_gbq.schema.generate_bq_schema(dataframe)
    assert schema == expected_schema


@pytest.mark.parametrize(
    "schema_old,schema_new,expected_output",
    [
        (
            {"fields": [{"name": "col1", "type": "INTEGER"}]},
            {"fields": [{"name": "col2", "type": "TIMESTAMP"}]},
            {
                "fields": [
                    {"name": "col1", "type": "INTEGER"},
                    {"name": "col2", "type": "TIMESTAMP"},
                ]
            },
        ),
        (
            {"fields": [{"name": "col1", "type": "INTEGER"}]},
            {"fields": [{"name": "col1", "type": "BOOLEAN"}]},
            {"fields": [{"name": "col1", "type": "BOOLEAN"}]},
        ),
        (
            {
                "fields": [
                    {"name": "col1", "type": "INTEGER"},
                    {"name": "col2", "type": "INTEGER"},
                ]
            },
            {
                "fields": [
                    {"name": "col2", "type": "BOOLEAN"},
                    {"name": "col3", "type": "FLOAT"},
                ]
            },
            {
                "fields": [
                    {"name": "col1", "type": "INTEGER"},
                    {"name": "col2", "type": "BOOLEAN"},
                    {"name": "col3", "type": "FLOAT"},
                ]
            },
        ),
    ],
)
def test_update_schema(schema_old, schema_new, expected_output):
    output = pandas_gbq.schema.update_schema(schema_old, schema_new)
    assert output == expected_output