bug: loading DATE columns with extreme values results in incorrect data (pyarrow.lib.ArrowInvalid casting error in v3) #45

Closed
@tswast

Description

Environment details

  • OS type and version: macOS (darwin, per the test session output below)
  • Python version: 3.9.5 (per the test session output below)
  • pip version: not captured
  • google-cloud-bigquery version: 3.0.0b1 (output of pip show google-cloud-bigquery below)

Name: google-cloud-bigquery
Version: 3.0.0b1
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /Users/swast/src/github.com/googleapis/python-bigquery
Requires: grpcio, google-api-core, google-cloud-bigquery-storage, google-cloud-core, google-resumable-media, packaging, proto-plus, protobuf, pyarrow, requests
Required-by: pandas-gbq, sqlalchemy-bigquery, pybigquery, ibis-bigquery

Steps to reproduce

Try to load the following DataFrame:

import datetime

import pandas

input_df = pandas.DataFrame(
    {
        "row_num": [1, 2, 3],
        "date_col": [datetime.date(1, 1, 1), datetime.date(1970, 1, 1), datetime.date(9999, 12, 31)],
        "datetime_col": [datetime.datetime(1, 1, 1), datetime.datetime(1970, 1, 1), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)],
        "timestamp_col": [datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc), datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=datetime.timezone.utc)],
    },
    columns=["row_num", "date_col", "datetime_col", "timestamp_col"],
)
table_schema = [
    {"name": "row_num", "type": "INTEGER"},
    {"name": "date_col", "type": "DATE"},
    {"name": "datetime_col", "type": "DATETIME"},
    {"name": "timestamp_col", "type": "TIMESTAMP"},
]
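
The failing call path is pandas-gbq's parquet load. A minimal sketch of the call that triggers the trace below, with placeholder project and table IDs:

import pandas_gbq

# Placeholder destination and project. The failing system test passes the
# DataFrame and schema above to to_gbq with api_method="load_parquet", which
# routes through google.cloud.bigquery.Client.load_table_from_dataframe.
pandas_gbq.to_gbq(
    input_df,
    "my_dataset.extreme_dates_repro",
    project_id="my-project",
    table_schema=table_schema,
    api_method="load_parquet",
)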

Stack trace

(dev-3.9) ➜  python-bigquery-pandas git:(issue365-datetime-out-of-bounds) ✗ pytest tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema'[load_parquet-issue365-extreme-datetimes]'
============================================================================= test session starts =============================================================================
platform darwin -- Python 3.9.5, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /Users/swast/src/github.com/googleapis/python-bigquery-pandas
plugins: cov-2.12.1, asyncio-0.15.1, anyio-3.3.0, requests-mock-1.9.3
collected 1 item                                                                                                                                                              

tests/system/test_to_gbq.py F                                                                                                                                           [100%]

================================================================================== FAILURES ===================================================================================
____________________________________________ test_dataframe_round_trip_with_table_schema[load_parquet-issue365-extreme-datetimes] _____________________________________________

client = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination_table_ref = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411')
location = None
schema = {'fields': [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]}

    def load_parquet(
        client: bigquery.Client,
        dataframe: pandas.DataFrame,
        destination_table_ref: bigquery.TableReference,
        location: Optional[str],
        schema: Optional[Dict[str, Any]],
    ):
        job_config = bigquery.LoadJobConfig()
        job_config.write_disposition = "WRITE_APPEND"
        job_config.source_format = "PARQUET"
    
        if schema is not None:
            schema = pandas_gbq.schema.remove_policy_tags(schema)
            job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
            dataframe = cast_dataframe_for_parquet(dataframe, schema)
    
        try:
>           client.load_table_from_dataframe(
                dataframe, destination_table_ref, job_config=job_config, location=location,
            ).result()

pandas_gbq/load.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411'), num_retries = 6
job_id = '2592298a-57af-4944-b7f4-a93f58ee2af1', job_id_prefix = None, location = None, project = None
job_config = <google.cloud.bigquery.job.load.LoadJobConfig object at 0x19b966d60>, parquet_compression = 'SNAPPY', timeout = None

    def load_table_from_dataframe(
        self,
        dataframe,
        destination: Union[Table, TableReference, str],
        num_retries: int = _DEFAULT_NUM_RETRIES,
        job_id: str = None,
        job_id_prefix: str = None,
        location: str = None,
        project: str = None,
        job_config: LoadJobConfig = None,
        parquet_compression: str = "snappy",
        timeout: TimeoutType = DEFAULT_TIMEOUT,
    ) -> job.LoadJob:
        """Upload the contents of a table from a pandas DataFrame.
    
        Similar to :meth:`load_table_from_uri`, this method creates, starts and
        returns a :class:`~google.cloud.bigquery.job.LoadJob`.
    
        .. note::
    
            REPEATED fields are NOT supported when using the CSV source format.
            They are supported when using the PARQUET source format, but
            due to the way they are encoded in the ``parquet`` file,
            a mismatch with the existing table schema can occur, so
            REPEATED fields are not properly supported when using ``pyarrow<4.0.0``
            using the parquet format.
    
            https://github.com/googleapis/python-bigquery/issues/19
    
        Args:
            dataframe (pandas.DataFrame):
                A :class:`~pandas.DataFrame` containing the data to load.
            destination (google.cloud.bigquery.table.TableReference):
                The destination table to use for loading the data. If it is an
                existing table, the schema of the :class:`~pandas.DataFrame`
                must match the schema of the destination table. If the table
                does not yet exist, the schema is inferred from the
                :class:`~pandas.DataFrame`.
    
                If a string is passed in, this method attempts to create a
                table reference from a string using
                :func:`google.cloud.bigquery.table.TableReference.from_string`.
    
        Keyword Arguments:
            num_retries (Optional[int]): Number of upload retries.
            job_id (Optional[str]): Name of the job.
            job_id_prefix (Optional[str]):
                The user-provided prefix for a randomly generated
                job ID. This parameter will be ignored if a ``job_id`` is
                also given.
            location (Optional[str]):
                Location where to run the job. Must match the location of the
                destination table.
            project (Optional[str]):
                Project ID of the project of where to run the job. Defaults
                to the client's project.
            job_config (Optional[google.cloud.bigquery.job.LoadJobConfig]):
                Extra configuration options for the job.
    
                To override the default pandas data type conversions, supply
                a value for
                :attr:`~google.cloud.bigquery.job.LoadJobConfig.schema` with
                column names matching those of the dataframe. The BigQuery
                schema is used to determine the correct data type conversion.
                Indexes are not loaded.
    
                By default, this method uses the parquet source format. To
                override this, supply a value for
                :attr:`~google.cloud.bigquery.job.LoadJobConfig.source_format`
                with the format name. Currently only
                :attr:`~google.cloud.bigquery.job.SourceFormat.CSV` and
                :attr:`~google.cloud.bigquery.job.SourceFormat.PARQUET` are
                supported.
            parquet_compression (Optional[str]):
                [Beta] The compression method to use if intermittently
                serializing ``dataframe`` to a parquet file.
    
                The argument is directly passed as the ``compression``
                argument to the underlying ``pyarrow.parquet.write_table()``
                method (the default value "snappy" gets converted to uppercase).
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
    
                If the job config schema is missing, the argument is directly
                passed as the ``compression`` argument to the underlying
                ``DataFrame.to_parquet()`` method.
                https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
            timeout (Optional[float]):
                The number of seconds to wait for the underlying HTTP transport
                before using ``retry``.
    
        Returns:
            google.cloud.bigquery.job.LoadJob: A new load job.
    
        Raises:
            TypeError:
                If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
                class.
        """
        job_id = _make_job_id(job_id, job_id_prefix)
    
        if job_config:
            _verify_job_config_type(job_config, google.cloud.bigquery.job.LoadJobConfig)
            # Make a copy so that the job config isn't modified in-place.
            job_config_properties = copy.deepcopy(job_config._properties)
            job_config = job.LoadJobConfig()
            job_config._properties = job_config_properties
    
        else:
            job_config = job.LoadJobConfig()
    
        supported_formats = {job.SourceFormat.CSV, job.SourceFormat.PARQUET}
        if job_config.source_format is None:
            # default value
            job_config.source_format = job.SourceFormat.PARQUET
    
        if (
            job_config.source_format == job.SourceFormat.PARQUET
            and job_config.parquet_options is None
        ):
            parquet_options = ParquetOptions()
            # default value
            parquet_options.enable_list_inference = True
            job_config.parquet_options = parquet_options
    
        if job_config.source_format not in supported_formats:
            raise ValueError(
                "Got unexpected source_format: '{}'. Currently, only PARQUET and CSV are supported".format(
                    job_config.source_format
                )
            )
    
        if location is None:
            location = self.location
    
        # If table schema is not provided, we try to fetch the existing table
        # schema, and check if dataframe schema is compatible with it - except
        # for WRITE_TRUNCATE jobs, the existing schema does not matter then.
        if (
            not job_config.schema
            and job_config.write_disposition != job.WriteDisposition.WRITE_TRUNCATE
        ):
            try:
                table = self.get_table(destination)
            except core_exceptions.NotFound:
                pass
            else:
                columns_and_indexes = frozenset(
                    name
                    for name, _ in _pandas_helpers.list_columns_and_indexes(dataframe)
                )
                job_config.schema = [
                    # Field description and policy tags are not needed to
                    # serialize a data frame.
                    SchemaField(
                        field.name,
                        field.field_type,
                        mode=field.mode,
                        fields=field.fields,
                    )
                    # schema fields not present in the dataframe are not needed
                    for field in table.schema
                    if field.name in columns_and_indexes
                ]
    
        job_config.schema = _pandas_helpers.dataframe_to_bq_schema(
            dataframe, job_config.schema
        )
    
        if not job_config.schema:
            # the schema could not be fully detected
            warnings.warn(
                "Schema could not be detected for all columns. Loading from a "
                "dataframe without a schema will be deprecated in the future, "
                "please provide a schema.",
                PendingDeprecationWarning,
                stacklevel=2,
            )
    
        tmpfd, tmppath = tempfile.mkstemp(
            suffix="_job_{}.{}".format(job_id[:8], job_config.source_format.lower())
        )
        os.close(tmpfd)
    
        try:
    
            if job_config.source_format == job.SourceFormat.PARQUET:
                if job_config.schema:
                    if parquet_compression == "snappy":  # adjust the default value
                        parquet_compression = parquet_compression.upper()
    
>                   _pandas_helpers.dataframe_to_parquet(
                        dataframe,
                        job_config.schema,
                        tmppath,
                        parquet_compression=parquet_compression,
                        parquet_use_compliant_nested_type=True,
                    )

../python-bigquery/google/cloud/bigquery/client.py:2616: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
bq_schema = [SchemaField('row_num', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('date_col', 'DATE', 'NULLABLE', None, (), ...e_col', 'DATETIME', 'NULLABLE', None, (), None), SchemaField('timestamp_col', 'TIMESTAMP', 'NULLABLE', None, (), None)]
filepath = '/var/folders/gg/ry6k1t_157xbkwtn2rnnfddc006gf0/T/tmpip3lgjt8_job_2592298a.parquet', parquet_compression = 'SNAPPY', parquet_use_compliant_nested_type = True

    def dataframe_to_parquet(
        dataframe,
        bq_schema,
        filepath,
        parquet_compression="SNAPPY",
        parquet_use_compliant_nested_type=True,
    ):
        """Write dataframe as a Parquet file, according to the desired BQ schema.
    
        This function requires the :mod:`pyarrow` package. Arrow is used as an
        intermediate format.
    
        Args:
            dataframe (pandas.DataFrame):
                DataFrame to convert to Parquet file.
            bq_schema (Sequence[Union[ \
                :class:`~google.cloud.bigquery.schema.SchemaField`, \
                Mapping[str, Any] \
            ]]):
                Desired BigQuery schema. Number of columns must match number of
                columns in the DataFrame.
            filepath (str):
                Path to write Parquet file to.
            parquet_compression (Optional[str]):
                The compression codec to use by the the ``pyarrow.parquet.write_table``
                serializing method. Defaults to "SNAPPY".
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
            parquet_use_compliant_nested_type (bool):
                Whether the ``pyarrow.parquet.write_table`` serializing method should write
                compliant Parquet nested type (lists). Defaults to ``True``.
                https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table
    
                This argument is ignored for ``pyarrow`` versions earlier than ``4.0.0``.
        """
        import pyarrow.parquet  # type: ignore
    
        kwargs = (
            {"use_compliant_nested_type": parquet_use_compliant_nested_type}
            if _helpers.PYARROW_VERSIONS.use_compliant_nested_type
            else {}
        )
    
        bq_schema = schema._to_schema_fields(bq_schema)
>       arrow_table = dataframe_to_arrow(dataframe, bq_schema)

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:660: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
bq_schema = [SchemaField('row_num', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('date_col', 'DATE', 'NULLABLE', None, (), ...e_col', 'DATETIME', 'NULLABLE', None, (), None), SchemaField('timestamp_col', 'TIMESTAMP', 'NULLABLE', None, (), None)]

    def dataframe_to_arrow(dataframe, bq_schema):
        """Convert pandas dataframe to Arrow table, using BigQuery schema.
    
        Args:
            dataframe (pandas.DataFrame):
                DataFrame to convert to Arrow table.
            bq_schema (Sequence[Union[ \
                :class:`~google.cloud.bigquery.schema.SchemaField`, \
                Mapping[str, Any] \
            ]]):
                Desired BigQuery schema. The number of columns must match the
                number of columns in the DataFrame.
    
        Returns:
            pyarrow.Table:
                Table containing dataframe data, with schema derived from
                BigQuery schema.
        """
        column_names = set(dataframe.columns)
        column_and_index_names = set(
            name for name, _ in list_columns_and_indexes(dataframe)
        )
    
        bq_schema = schema._to_schema_fields(bq_schema)
        bq_field_names = set(field.name for field in bq_schema)
    
        extra_fields = bq_field_names - column_and_index_names
        if extra_fields:
            raise ValueError(
                "bq_schema contains fields not present in dataframe: {}".format(
                    extra_fields
                )
            )
    
        # It's okay for indexes to be missing from bq_schema, but it's not okay to
        # be missing columns.
        missing_fields = column_names - bq_field_names
        if missing_fields:
            raise ValueError(
                "bq_schema is missing fields from dataframe: {}".format(missing_fields)
            )
    
        arrow_arrays = []
        arrow_names = []
        arrow_fields = []
        for bq_field in bq_schema:
            arrow_names.append(bq_field.name)
            arrow_arrays.append(
>               bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
            )

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:605: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

series = 0    1754-08-30
1    1970-01-01
2    1816-03-29
Name: date_col, dtype: dbdate, bq_field = SchemaField('date_col', 'DATE', 'NULLABLE', None, (), None)

    def bq_to_arrow_array(series, bq_field):
        if bq_field.field_type.upper() == "GEOGRAPHY":
            arrow_type = None
            first = _first_valid(series)
            if first is not None:
                if series.dtype.name == "geometry" or isinstance(first, _BaseGeometry):
                    arrow_type = pyarrow.binary()
                    # Convert shapey geometry to WKB binary format:
                    series = series.apply(_to_wkb)
                elif isinstance(first, bytes):
                    arrow_type = pyarrow.binary()
            elif series.dtype.name == "geometry":
                # We have a GeoSeries containing all nulls, convert it to a pandas series
                series = pandas.Series(numpy.array(series))
    
            if arrow_type is None:
                arrow_type = bq_to_arrow_data_type(bq_field)
        else:
            arrow_type = bq_to_arrow_data_type(bq_field)
    
        field_type_upper = bq_field.field_type.upper() if bq_field.field_type else ""
    
        if bq_field.mode.upper() == "REPEATED":
            return pyarrow.ListArray.from_pandas(series, type=arrow_type)
        if field_type_upper in schema._STRUCT_TYPES:
            return pyarrow.StructArray.from_pandas(series, type=arrow_type)
>       return pyarrow.Array.from_pandas(series, type=arrow_type)

../python-bigquery/google/cloud/bigquery/_pandas_helpers.py:330: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:904: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:252: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/array.pxi:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <DateArray>
[datetime.date(1754, 8, 30),  datetime.date(1970, 1, 1),
 datetime.date(1816, 3, 29)]
Length: 3, dtype: dbdate, type = DataType(date32[day])

    def __arrow_array__(self, type=None):
        """Convert to an Arrow array from dbdate data.
    
        See:
        https://pandas.pydata.org/pandas-docs/stable/development/extending.html#compatibility-with-apache-arrow
        """
        array = pyarrow.array(self._ndarray, type=pyarrow.timestamp("ns"))
>       return pyarrow.compute.cast(
            array, type if type is not None else pyarrow.date32(),
        )

../python-db-dtypes-pandas/db_dtypes/__init__.py:277: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = <pyarrow.lib.TimestampArray object at 0x19bbd7c40>
[
  1754-08-30 22:43:41.128654848,
  1970-01-01 00:00:00.000000000,
  1816-03-29 05:56:08.066277376
]
target_type = DataType(date32[day]), safe = True

    def cast(arr, target_type, safe=True):
        """
        Cast array values to another data type. Can also be invoked as an array
        instance method.
    
        Parameters
        ----------
        arr : Array or ChunkedArray
        target_type : DataType or type string alias
            Type to cast to
        safe : bool, default True
            Check for overflows or other unsafe conversions
    
        Examples
        --------
        >>> from datetime import datetime
        >>> import pyarrow as pa
        >>> arr = pa.array([datetime(2010, 1, 1), datetime(2015, 1, 1)])
        >>> arr.type
        TimestampType(timestamp[us])
    
        You can use ``pyarrow.DataType`` objects to specify the target type:
    
        >>> cast(arr, pa.timestamp('ms'))
        <pyarrow.lib.TimestampArray object at 0x7fe93c0f6910>
        [
          2010-01-01 00:00:00.000,
          2015-01-01 00:00:00.000
        ]
    
        >>> cast(arr, pa.timestamp('ms')).type
        TimestampType(timestamp[ms])
    
        Alternatively, it is also supported to use the string aliases for these
        types:
    
        >>> arr.cast('timestamp[ms]')
        <pyarrow.lib.TimestampArray object at 0x10420eb88>
        [
          1262304000000,
          1420070400000
        ]
        >>> arr.cast('timestamp[ms]').type
        TimestampType(timestamp[ms])
    
        Returns
        -------
        casted : Array
        """
        if target_type is None:
            raise ValueError("Cast target type must not be None")
        if safe:
            options = CastOptions.safe(target_type)
        else:
            options = CastOptions.unsafe(target_type)
>       return call_function("cast", [arr], options)

/usr/local/Caskroom/miniconda/base/envs/dev-3.9/lib/python3.9/site-packages/pyarrow/compute.py:297: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/_compute.pyx:527: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/_compute.pyx:337: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pyarrow/error.pxi:143: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to date32[day] would lose data: -6795364578871345152

pyarrow/error.pxi:99: ArrowInvalid

The above exception was the direct cause of the following exception:

method_under_test = functools.partial(<function to_gbq at 0x19ba4aee0>, project_id='swast-scratch', credentials=<google.oauth2.credentials.Credentials object at 0x19b997160>)
random_dataset_id = 'python_bigquery_pandas_tests_system_20211201161931_9106f8', bigquery_client = <google.cloud.bigquery.client.Client object at 0x19b97efd0>
input_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
expected_df =    row_num    date_col                datetime_col                     timestamp_col
0        1  0001-01-01         00...         1970-01-01 00:00:00+00:00
2        3  9999-12-31  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
table_schema = [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]
api_method = 'load_parquet', api_methods = {'load_csv', 'load_parquet'}

    @pytest.mark.parametrize(
        ["input_df", "expected_df", "table_schema", "api_methods"], DATAFRAME_ROUND_TRIPS
    )
    def test_dataframe_round_trip_with_table_schema(
        method_under_test,
        random_dataset_id,
        bigquery_client,
        input_df,
        expected_df,
        table_schema,
        api_method,
        api_methods,
    ):
        if api_method not in api_methods:
            pytest.skip(f"{api_method} not supported.")
        if expected_df is None:
            expected_df = input_df
        table_id = f"{random_dataset_id}.round_trip_w_schema_{random.randrange(1_000_000)}"
>       method_under_test(
            input_df, table_id, table_schema=table_schema, api_method=api_method
        )

tests/system/test_to_gbq.py:232: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas_gbq/gbq.py:1095: in to_gbq
    connector.load_data(
pandas_gbq/gbq.py:546: in load_data
    chunks = load.load_chunks(
pandas_gbq/load.py:218: in load_chunks
    load_parquet(client, dataframe, destination_table_ref, location, schema)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

client = <google.cloud.bigquery.client.Client object at 0x19b997fd0>
dataframe =    row_num    date_col                datetime_col                     timestamp_col
0        1  1754-08-30         00...         1970-01-01 00:00:00+00:00
2        3  1816-03-29  9999-12-31 23:59:59.999999  9999-12-31 23:59:59.999999+00:00
destination_table_ref = TableReference(DatasetReference('swast-scratch', 'python_bigquery_pandas_tests_system_20211201161931_9106f8'), 'round_trip_w_schema_687411')
location = None
schema = {'fields': [{'name': 'row_num', 'type': 'INTEGER'}, {'name': 'date_col', 'type': 'DATE'}, {'name': 'datetime_col', 'type': 'DATETIME'}, {'name': 'timestamp_col', 'type': 'TIMESTAMP'}]}

    def load_parquet(
        client: bigquery.Client,
        dataframe: pandas.DataFrame,
        destination_table_ref: bigquery.TableReference,
        location: Optional[str],
        schema: Optional[Dict[str, Any]],
    ):
        job_config = bigquery.LoadJobConfig()
        job_config.write_disposition = "WRITE_APPEND"
        job_config.source_format = "PARQUET"
    
        if schema is not None:
            schema = pandas_gbq.schema.remove_policy_tags(schema)
            job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
            dataframe = cast_dataframe_for_parquet(dataframe, schema)
    
        try:
            client.load_table_from_dataframe(
                dataframe, destination_table_ref, job_config=job_config, location=location,
            ).result()
        except pyarrow.lib.ArrowInvalid as exc:
>           raise exceptions.ConversionError(
                "Could not convert DataFrame to Parquet."
            ) from exc
E           pandas_gbq.exceptions.ConversionError: Could not convert DataFrame to Parquet.

pandas_gbq/load.py:131: ConversionError
=========================================================================== short test summary info ===========================================================================
FAILED tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_parquet-issue365-extreme-datetimes] - pandas_gbq.exceptions.ConversionError: Could not ...
============================================================================== 1 failed in 4.85s ==============================================================================
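
The cast failure can be isolated from BigQuery entirely by round-tripping the same dates through the "dbdate" extension dtype that google-cloud-bigquery 3.x uses for DATE columns. A minimal sketch, assuming db_dtypes 0.3.x (the version current when this was filed), where out-of-range dates wrap silently instead of raising:

import datetime

import db_dtypes  # noqa: F401  (registers the "dbdate" pandas extension dtype)
import pandas
import pyarrow

# Dates outside the datetime64[ns] range (roughly years 1677-2262) overflow
# when stored as nanosecond timestamps, which is where the wrapped values in
# the trace come from (0001-01-01 shows up as 1754-08-30).
series = pandas.Series(
    [datetime.date(1, 1, 1), datetime.date(1970, 1, 1), datetime.date(9999, 12, 31)],
    dtype="dbdate",
)
print(series)

# DateArray.__arrow_array__ converts via timestamp[ns] and then performs a
# safe cast to date32[day], which raises the ArrowInvalid seen above.
pyarrow.Array.from_pandas(series, type=pyarrow.date32())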

Labels

api: bigquery (related to the googleapis/python-db-dtypes-pandas API) · priority: p1 · type: bug
