
Change to_gbq() from stream to batch loading #14670

Closed
@nicku33

Description


Recently we discovered that streaming loads into Google BigQuery tables are best-effort and can be delayed by up to 90 minutes, with no transactional guarantees. As such, to_gbq() returning is no guarantee that the data is actually queryable.

I propose that we switch the loading from tableData.insertAll() to a batch BigQuery load job, either via a direct HTTP upload or by staging the data in a Cloud Storage bucket. The latter is less appealing because it also requires bucket permissions, but I'm not sure about the size limits for the HTTP upload version.

However, atomicity seems important in the case where we immediately execute a BigQuery query against the results just uploaded via to_gbq(), and seems worth the tradeoff in load time (a sketch of what this could look like follows).
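For reference, a minimal sketch of the batch-load path using the google-cloud-bigquery client library (not the apiclient-based code pandas currently uses); the project id and table names here are made up:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Run a load job instead of tabledata.insertAll(); rows are committed
# atomically when the job finishes, so a follow-up query sees them.
job = client.load_table_from_dataframe(df, "my_dataset.my_table")
job.result()  # block until the load job commits; raises on failure
```

Because job.result() blocks until the job either commits or fails, a query issued right after it returns would see all of the uploaded rows, which is exactly the guarantee the streaming path lacks.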

At the least, we could include an option (sketched below).
Thoughts?
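Something like this, where `method` is a hypothetical keyword that does not exist in to_gbq() today:

```python
# Hypothetical opt-in, sketched for discussion only: 'batch' would run
# a load job and block until it commits, while 'streaming' would keep
# the current tabledata.insertAll() behaviour.
df.to_gbq(
    "my_dataset.my_table",
    project_id="my-project",
    method="batch",  # hypothetical keyword; default could stay "streaming"
)
```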

(Not sure how to label this; IO:google?)

Output of pd.show_versions()

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None
