Commit 56caedb

Schaer, Jacob C (jacobschaer) authored and committed
Redesigned pandas.io.gbq to remove bq.py as a dependency for everything except unit testing. Minor API changes were also introduced.
1 parent f8b101c commit 56caedb

File tree: 9 files changed (+664 lines, −948 lines)

ci/requirements-2.6.txt

Lines changed: 3 additions & 1 deletion
@@ -4,7 +4,6 @@ python-dateutil==1.5
 pytz==2013b
 http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
 html5lib==1.0b2
-bigquery==2.0.17
 numexpr==1.4.2
 sqlalchemy==0.7.1
 pymysql==0.6.0
@@ -15,3 +14,6 @@ xlwt==0.7.5
 openpyxl==2.0.3
 xlsxwriter==0.4.6
 xlrd==0.9.2
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2

ci/requirements-2.7.txt

Lines changed: 3 additions & 1 deletion
@@ -19,5 +19,7 @@ lxml==3.2.1
 scipy==0.13.3
 beautifulsoup4==4.2.1
 statsmodels==0.5.0
-bigquery==2.0.17
 boto==2.26.1
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2
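
Both CI requirement files swap the old ``bigquery==2.0.17`` client for the same three libraries. As a quick sanity check (illustrative only, not part of the commit), the new dependencies should be importable under these module names:

    # Verify the three packages that replace bq.py's `bigquery` distribution.
    import httplib2                         # HTTP transport for the Google API client
    import gflags                           # python-gflags, used by oauth2client's auth flow
    from apiclient.discovery import build   # service factory from google-api-python-client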

doc/source/install.rst

Lines changed: 3 additions & 1 deletion
@@ -112,7 +112,9 @@ Optional Dependencies
   :func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
   distributions will have xclip and/or xsel immediately available for
   installation.
-* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
+* Google's `python-gflags` and `google-api-python-client`
+  * Needed for :mod:`~pandas.io.gbq`
+* `httplib2`
   * Needed for :mod:`~pandas.io.gbq`
 * One of the following combinations of libraries is needed to use the
   top-level :func:`~pandas.io.html.read_html` function:

doc/source/io.rst

Lines changed: 47 additions & 51 deletions
@@ -3373,83 +3373,79 @@ Google BigQuery (Experimental)
 The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
 analytics web service to simplify retrieving results from BigQuery tables
 using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape derived from the source table. Additionally,
-DataFrames can be uploaded into BigQuery datasets as tables
-if the source datatypes are compatible with BigQuery ones.
+DataFrame with a shape and data types derived from the source table.
+Additionally, DataFrames can be appended to existing BigQuery tables if
+the destination table is the same shape as the DataFrame.
 
 For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__
 
-As an example, suppose you want to load all data from an existing table
-: `test_dataset.test_table`
-into BigQuery and pull it into a DataFrame.
+As an example, suppose you want to load all data from an existing BigQuery
+table `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
+function.
 
 .. code-block:: python
 
-   from pandas.io import gbq
-
    # Insert your BigQuery Project ID Here
-   # Can be found in the web console, or
-   # using the command line tool `bq ls`
+   # Can be found in the Google web console
    projectid = "xxxxxxxx"
 
-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
 
-The user will then be authenticated by the `bq` command line client -
-this usually involves the default browser opening to a login page,
-though the process can be done entirely from command line if necessary.
-Datasets and additional parameters can be either configured with `bq`,
-passed in as options to `read_gbq`, or set using Google's gflags (this
-is not officially supported by this module, though care was taken
-to ensure that they should be followed regardless of how you call the
-method).
+You will then be authenticated to the specified BigQuery account
+via Google's OAuth2 mechanism. In general, this is as simple as following the
+prompts in a browser window, which will be opened for you. Should the browser
+not be available, or fail to launch, a code will be provided to complete the
+process manually. Additional information on the authentication mechanism can
+be found `here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__
 
-Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+You can define which column from BigQuery to use as an index in the
+destination DataFrame as well as a preferred column order as follows:
 
 .. code-block:: python
 
-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                              index_col='index_column_name',
-                             col_order='[col1, col2, col3,...]', project_id = projectid)
+                             col_order=['col1', 'col2', 'col3'], project_id = projectid)
 
-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+Finally, you can append data to a BigQuery table from a pandas DataFrame
+using the :func:`~pandas.io.to_gbq` function. This function uses the
+Google streaming API, which requires that your destination table already
+exists in BigQuery. Your DataFrame should match the destination table in
+column order, structure, and data types. DataFrame indexes are not
+supported. By default, rows are streamed to BigQuery in chunks of 10,000
+rows, but you can pass other chunk values via the ``chunksize`` argument.
+You can also see the progress of your post via the ``verbose`` flag, which
+defaults to ``True``. The HTTP response code from Google BigQuery can be
+successful (200) even if the append failed. For this reason, if there is a
+failure to append to the table, the complete error response from BigQuery
+is returned, which can be quite long given that it provides a status for
+each row. You may want to start with smaller chunks to test that the size
+and types of your DataFrame match your destination table, which makes
+debugging simpler.
 
 .. code-block:: python
 
    df = pandas.DataFrame({'string_col_name' : ['hello'],
                           'integer_col_name' : [1],
                           'boolean_col_name' : [True]})
-   schema = ['STRING', 'INTEGER', 'BOOLEAN']
-   data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                           if_exists='fail', schema = schema, project_id = projectid)
-
-To add more rows to this, simply:
-
-.. code-block:: python
-
-   df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
-                           'integer_col_name' : [2],
-                           'boolean_col_name' : [False]})
-   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
+   df.to_gbq('my_dataset.my_table', project_id = projectid)
 
-.. note::
+The BigQuery SQL query language has some oddities; see `here <https://developers.google.com/bigquery/query-reference>`__
 
-   A default project id can be set using the command line:
-   `bq init`.
+While BigQuery uses SQL-like syntax, it has some important differences from
+traditional databases, both in functionality and API limitations (size and
+quantity of queries or uploads) and in how Google charges for use of the
+service. You should refer to the Google documentation often, as the service
+is still changing and evolving. BigQuery is best for analyzing large sets of
+data quickly, but it is not a direct replacement for a transactional database.
 
-   There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
-   see `here <https://developers.google.com/bigquery/query-reference>`__
-
-   You can access the management console to determine project id's by:
-   <https://code.google.com/apis/console/b/0/?noredirect>
+You can access the management console to determine project id's at:
+<https://code.google.com/apis/console/b/0/?noredirect>
 
 .. warning::
 
-   To use this module, you will need a BigQuery account. See
-   <https://cloud.google.com/products/big-query> for details.
-
-   As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
-   but any client changes will not make it into 0.13.1. See:
-   http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+   To use this module, you will need a valid BigQuery account. See
+   <https://cloud.google.com/products/big-query> for details on the
+   service.
 
 .. _io.stata:
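
Supplementing the documented example above, a sketch of an append that exercises the new ``chunksize`` and ``verbose`` options (the project id and table name are placeholders; illustrative only, not part of the commit):

    import pandas as pd

    projectid = "xxxxxxxx"  # placeholder Google BigQuery project ID

    df = pd.DataFrame({'string_col_name': ['hello'],
                       'integer_col_name': [1],
                       'boolean_col_name': [True]})

    # Stream to an existing, identically-structured table in small chunks;
    # smaller chunks make shape and type mismatches easier to debug.
    df.to_gbq('my_dataset.my_table', project_id=projectid,
              chunksize=500,  # default is 10,000 rows per chunk
              verbose=True)   # report progress as chunks post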

doc/source/v0.14.1.txt

Lines changed: 5 additions & 8 deletions
@@ -154,14 +154,11 @@ Performance
 Experimental
 ~~~~~~~~~~~~
 
-``pandas.io.data.Options`` has gained a ``get_all_data method``, and now consistently returns a multi-indexed ``DataFrame`` (:issue:`5602`). See :ref:`the docs<remote_data.yahoo_options>`
-
-.. ipython:: python
-
-   from pandas.io.data import Options
-   aapl = Options('aapl', 'yahoo')
-   data = aapl.get_all_data()
-   data.iloc[0:5, 0:5]
+- ``io.gbq.read_gbq`` and ``io.gbq.to_gbq`` were refactored to remove the
+  dependency on the Google ``bq.py`` command line client. This submodule
+  now uses ``httplib2`` and the Google ``apiclient`` and ``oauth2client`` API client
+  libraries which should be more stable and, therefore, reliable than
+  ``bq.py`` (:issue:`6937`).
 
 .. _whatsnew_0141.bug_fixes:

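For context, the OAuth2-plus-discovery pattern these client libraries provide looks roughly like the sketch below. This is illustrative of the mechanism, not the module's exact code; the client id and secret are placeholders. The ``oauth2client.tools.run`` helper is also why ``python-gflags`` appears in the new requirements.

    import httplib2
    from apiclient.discovery import build            # google-api-python-client
    from oauth2client.client import OAuth2WebServerFlow
    from oauth2client.file import Storage
    from oauth2client.tools import run

    # Describe the application and the BigQuery scope it requests.
    flow = OAuth2WebServerFlow(client_id='xxxxxxxx.apps.googleusercontent.com',
                               client_secret='xxxxxxxx',
                               scope='https://www.googleapis.com/auth/bigquery',
                               redirect_uri='urn:ietf:wg:oauth:2.0:oob')

    # Cache credentials on disk so the browser prompt only happens once.
    storage = Storage('bigquery_credentials.dat')
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)  # opens a browser; falls back to a pasted code

    # Build an authorized BigQuery v2 service object.
    http = credentials.authorize(httplib2.Http())
    service = build('bigquery', 'v2', http=http)
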
pandas/core/frame.py

Lines changed: 29 additions & 33 deletions
@@ -669,47 +669,43 @@ def to_dict(self, outtype='dict'):
         else:  # pragma: no cover
             raise ValueError("outtype %s not understood" % outtype)
 
-    def to_gbq(self, destination_table, schema=None, col_order=None,
-               if_exists='fail', **kwargs):
+    def to_gbq(self, destination_table, project_id=None, chunksize=10000,
+               verbose=True, reauth=False):
         """Write a DataFrame to a Google BigQuery table.
 
-        If the table exists, the DataFrame will be appended. If not, a new
-        table will be created, in which case the schema will have to be
-        specified. By default, rows will be written in the order they appear
-        in the DataFrame, though the user may specify an alternative order.
+        THIS IS AN EXPERIMENTAL LIBRARY
+
+        If the table exists, the dataframe will be written to the table using
+        the defined table schema and column types. For simplicity, this method
+        uses the Google BigQuery streaming API. The to_gbq method chunks data
+        into a default chunk size of 10,000. Failures return the complete error
+        response which can be quite long depending on the size of the insert.
+        There are several important limitations of the Google streaming API
+        which are detailed at:
+        https://developers.google.com/bigquery/streaming-data-into-bigquery.
 
         Parameters
-        ---------------
+        ----------
+        dataframe : DataFrame
+            DataFrame to be written
         destination_table : string
-            name of table to be written, in the form 'dataset.tablename'
-        schema : sequence (optional)
-            list of column types in order for data to be inserted, e.g.
-            ['INTEGER', 'TIMESTAMP', 'BOOLEAN']
-        col_order : sequence (optional)
-            order which columns are to be inserted, e.g. ['primary_key',
-            'birthday', 'username']
-        if_exists : {'fail', 'replace', 'append'} (optional)
-            - fail: If table exists, do nothing.
-            - replace: If table exists, drop it, recreate it, and insert data.
-            - append: If table exists, insert data. Create if does not exist.
-        kwargs are passed to the Client constructor
-
-        Raises
-        ------
-        SchemaMissing :
-            Raised if the 'if_exists' parameter is set to 'replace', but no
-            schema is specified
-        TableExists :
-            Raised if the specified 'destination_table' exists but the
-            'if_exists' parameter is set to 'fail' (the default)
-        InvalidSchema :
-            Raised if the 'schema' parameter does not match the provided
-            DataFrame
+            Name of table to be written, in the form 'dataset.tablename'
+        project_id : str
+            Google BigQuery Account project ID.
+        chunksize : int (default 10000)
+            Number of rows to be inserted in each chunk from the dataframe.
+        verbose : boolean (default True)
+            Show percentage complete
+        reauth : boolean (default False)
+            Force Google BigQuery to reauthenticate the user. This is useful
+            if multiple accounts are used.
+
         """
 
         from pandas.io import gbq
-        return gbq.to_gbq(self, destination_table, schema=None, col_order=None,
-                          if_exists='fail', **kwargs)
+        return gbq.to_gbq(self, destination_table, project_id=project_id,
+                          chunksize=chunksize, verbose=verbose,
+                          reauth=reauth)
 
     @classmethod
     def from_records(cls, data, index=None, exclude=None, columns=None,
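
An illustrative call against the redesigned method (placeholder names; the destination table must already exist with a matching structure, since ``schema``, ``col_order``, and ``if_exists`` are gone):

    import pandas as pd

    df = pd.DataFrame({'string_col_name': ['hello'],
                       'integer_col_name': [1],
                       'boolean_col_name': [True]})

    # reauth=True forces a fresh OAuth2 flow, which is useful when switching
    # between multiple Google accounts on the same machine.
    df.to_gbq('my_dataset.my_table',
              project_id='xxxxxxxx',  # placeholder project ID
              reauth=True)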
