Description
As mentioned in #74, around July 11th pandas-gbq builds started failing this test: `test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace`.

I reviewed the test failure, and my initial thought is that a recent change in the BigQuery backend triggered it. The issue is related to deleting and recreating a table with a different schema. Currently we force a delay of 2 minutes when a table with a modified schema is recreated. This delay is suggested in this StackOverflow post and this entry in the BigQuery issue tracker. Based on my limited testing, it seems that in addition to waiting 2 minutes, you also need to upload the data twice in order to see the data in BigQuery. During the first upload `StreamingInsertError` is raised; the second upload is successful.
You can easily confirm this when running the test locally. The test failure no longer appears when I change

```python
connector.load_data(dataframe, dataset_id, table_id, chunksize)
```

at https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056 to

```python
try:
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
except:
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
```
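
If we did want to paper over this inside pandas-gbq, a bounded retry would probably be safer than a bare `except`. A minimal sketch, assuming `StreamingInsertError` is importable from `pandas_gbq.gbq` and that the variables are the ones already in scope at the line linked above; the retry count and delay are arbitrary choices for illustration:

```python
import time

from pandas_gbq.gbq import StreamingInsertError  # assumption: exception lives in gbq.py

# Retry the upload a bounded number of times, waiting briefly between
# attempts, instead of silently swallowing every exception.
for attempt in range(3):
    try:
        connector.load_data(dataframe, dataset_id, table_id, chunksize)
        break
    except StreamingInsertError:
        if attempt == 2:
            raise
        time.sleep(30)
```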
Based on this behaviour, I believe that you now need to upload the data twice after changing the schema. This could be a regression on the BigQuery side, since re-uploading the data wasn't required before.
I was also able to reproduce this issue with the `google-cloud-bigquery` package using the following code:
```python
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
import time

client = bigquery.Client(project='<your_project_id>')

dataset = client.dataset('test_dataset')
if not dataset.exists():
    dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('test_table', SCHEMA)
if table.exists():
    try:
        table.delete()
    except:
        pass
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', 32),
    (u'Wylma Phlyntstone', 29),
]
table.insert_data(ROWS_TO_INSERT)

# Now change the schema
SCHEMA = [
    SchemaField('name', 'STRING', mode='required'),
    SchemaField('age', 'STRING', mode='required'),
]
table = dataset.table('test_table', SCHEMA)

# Delete the table, wait 2 minutes and re-create the table
table.delete()
time.sleep(120)
table.create()

ROWS_TO_INSERT = [
    (u'Phred Phlyntstone', '32'),
    (u'Wylma Phlyntstone', '29'),
]
for _ in range(5):
    insert_errors = table.insert_data(ROWS_TO_INSERT)
    if len(insert_errors):
        print(insert_errors)
        print('Retrying')
    else:
        break
```
The output was:

```
[{'index': 0, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}, {'index': 1, 'errors': [{u'debugInfo': u'generic::not_found: no such field.', u'reason': u'invalid', u'message': u'no such field.', u'location': u'name'}]}]
Retrying
```
Prior to July 11th (or so), the retry wasn't required.
One thing that `google-cloud-bigquery` does is return the streaming insert errors rather than raising `StreamingInsertError` like we do in `pandas-gbq`. See https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/table.py#L826.

We could follow a similar behaviour and have `to_gbq` return the streaming insert errors rather than raising `StreamingInsertError`, leaving it up to the user to check for streaming insert errors and retry if needed: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L1056
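
To illustrate, here is a sketch of what the calling pattern might look like if `to_gbq` returned the insert errors. The return value is hypothetical (it does not exist today); the rest uses the existing `pandas_gbq.to_gbq` signature, and the project id is a placeholder:

```python
import pandas as pd
import pandas_gbq

df = pd.DataFrame({'name': [u'Phred Phlyntstone', u'Wylma Phlyntstone'],
                   'age': [u'32', u'29']})

# Hypothetical: to_gbq returns the list of streaming insert errors instead of
# raising StreamingInsertError.
insert_errors = pandas_gbq.to_gbq(
    df, 'test_dataset.test_table',
    project_id='<your_project_id>', if_exists='replace')

# The caller decides what to do with the errors, e.g. retry the upload by
# appending to the (now recreated) table instead of replacing it again.
if insert_errors:
    print(insert_errors)
    insert_errors = pandas_gbq.to_gbq(
        df, 'test_dataset.test_table',
        project_id='<your_project_id>', if_exists='append')
```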