Description
As mentioned in #74, around July 11th pandas-gbq builds started failing this test: `test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace`.

I reviewed the test failure, and my initial thought is that a recent change in the BigQuery backend triggered it. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload `StreamingInsertError` is raised; the second upload is successful.
You can easily confirm this when running the test locally. The test failure no longer appears when I change

```python
connector.load_data(dataframe, dataset_id, table_id, chunksize)
```

at https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056 to

```python
try:
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
except:
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
```
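
If we did want to paper over this inside pandas-gbq, a bounded retry would probably be safer than a bare `except`. A minimal sketch, assuming `StreamingInsertError` is importable from `pandas_gbq.gbq` and that the variables are the ones already in scope at the line linked above; the retry count and delay are arbitrary choices for illustration:

```python
import time

from pandas_gbq.gbq import StreamingInsertError  # assumption: exception lives in gbq.py

# Retry the upload a bounded number of times, waiting briefly between
# attempts, instead of silently swallowing every exception.
for attempt in range(3):
    try:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)
        break
    except StreamingInsertError:
        if attempt == 2:
            raise
        time.sleep(30)
```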
Based on this behaviour, I believe that you now need to upload the data twice after changing the schema. This could be a regression on the BigQuery side, since re-uploading the data wasn't required before.
I was also able to reproduce this issue with the `google-cloud-bigquery` package using the following code:
```python
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project='<your_project_id>')

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('test_table', SCHEMA)
if table.exists():
    try:
        table.delete()
    except:
        pass
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create the table
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break
```
The output was:

```
[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
Retrying
```
Prior to July 11th (or so), the retry wasn't required.
One thing that `google-cloud-bigquery` does is return the streaming insert errors rather than raising `StreamingInsertError` like we do in `pandas-gbq`. See https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826.

We could follow a similar behaviour and have `to_gbq` return the streaming insert errors rather than raising `StreamingInsertError`, leaving it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
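
To illustrate, here is a sketch of what the calling pattern might look like if `to_gbq` returned the insert errors. The return value is hypothetical (it does not exist today); the rest uses the existing `pandas_gbq.to_gbq` signature, and the project id is a placeholder:

```python
import pandas as pd
import pandas_gbq

df = pd.DataFrame({'name': [u'Phred Phlyntstone', u'Wylma Phlyntstone'],
                   'age': [u'32', u'29']})

# Hypothetical: to_gbq returns the list of streaming insert errors instead of
# raising StreamingInsertError.
insert_errors = pandas_gbq.to_gbq(
    df, 'test_dataset.test_table',
    project_id='<your_project_id>', if_exists='replace')

# The caller decides what to do with the errors, e.g. retry the upload by
# appending to the (now recreated) table instead of replacing it again.
if insert_errors:
    print(insert_errors)
    insert_errors = pandas_gbq.to_gbq(
        df, 'test_dataset.test_table',
        project_id='<your_project_id>', if_exists='append')
```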