Description
A small, complete example of the issue
Recently we have discovered that streaming inserts into Google BigQuery tables are best effort and can be delayed by up to 90 minutes. They have no transactional guarantees. As such, to_gbq() returning is no guarantee that the data is actually queryable.
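A minimal sketch of the failure mode (project and table names are placeholders, and the table is assumed to already exist):

```python
import pandas as pd

df = pd.DataFrame({'a': range(10000)})

# to_gbq() returns once the streaming insert is accepted, but the
# rows may sit in the streaming buffer for a long time afterwards.
df.to_gbq('my_dataset.my_table', project_id='my-project',
          if_exists='append')

# A query run immediately afterwards can see fewer rows than were
# just written.
result = pd.read_gbq(
    'SELECT COUNT(*) AS n FROM my_dataset.my_table',
    project_id='my-project')
print(result['n'][0])  # may be < 10000
```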
I propose that we switch the loading from tableData.insertAll() to a batch BigQuery load job, either via a direct HTTP upload or by staging the data in a Cloud Storage bucket. The latter is less attractive because it requires bucket permissions on top of BigQuery permissions, but I'm not sure of the size limits for the HTTP upload path.
However, atomicity seems important in the case where we immediately execute a BigQuery query against results just uploaded via to_gbq(), and seems worth the tradeoff in load time.
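For illustration, a sketch of the proposed batch-load semantics, using the google-cloud-bigquery client (not the apiclient-based code path to_gbq() uses today; project and table names are placeholders): the load job either commits all rows atomically or fails, and the data is queryable as soon as the job finishes.

```python
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame({'a': range(10000)})
client = bigquery.Client(project='my-project')

# A load job, unlike a streaming insert, is atomic: job.result()
# blocks until the job commits, or raises if it failed.
job = client.load_table_from_dataframe(
    df, 'my-project.my_dataset.my_table')
job.result()

# The rows are immediately visible to a follow-up query.
query = client.query(
    'SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`')
print(list(query.result())[0].n)  # 10000
```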
At the very least we could make this an option.
Thoughts?
(not sure how to label this IO:google)
Output of pd.show_versions()
pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None