BUG: read_json broken for S3 URL with non-null chunksize #47659

Open
@dungba88

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.read_json(path_or_buf="s3://...json", lines=True, chunksize=100)

Issue Description

This issue happens when using Pandas read_json with s3fs, with a non-null chunksize. There's a similar report for the null chunksize case.

Running the code above results in this error:

TypeError: initial_value must be str or None, not bytes

I found out it's due to this method, which passes the bytes returned for the S3 URL straight to StringIO:

    def _preprocess_data(self, data):
        """
        At this point, the data either has a `read` attribute (e.g. a file
        object or a StringIO) or is a string that is a JSON document.
        If self.chunksize, we prepare the data for the `__next__` method.
        Otherwise, we read it into memory for the `read` method.
        """
        if hasattr(data, "read") and not (self.chunksize or self.nrows):
            with self:
                data = data.read()
        if not hasattr(data, "read") and (self.chunksize or self.nrows):
            data = StringIO(data)  # <-- fails here when data is bytes

        return data
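
The failing step can be reproduced without S3 or pandas. The sketch below assumes that the data reaching the marked line is a bytes object, which is what the error message suggests; `wrap_for_chunking` is a hypothetical stand-in for that line, not a pandas function.

```python
from io import StringIO


def wrap_for_chunking(data):
    """Hypothetical stand-in for the marked line: wrap raw data in a StringIO."""
    return StringIO(data)


# Assumption: the S3 read hands back bytes rather than str.
payload = b'{"a": 1}\n{"a": 2}\n'

error_message = None
try:
    wrap_for_chunking(payload)
except TypeError as exc:
    # StringIO only accepts str (or None) as its initial value.
    error_message = str(exc)

print(error_message)
```

Passing the same payload as a str does not raise, which is why the problem only shows up when the data arrives as bytes.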

The fix is simple: just change the marked line above to:

            data = StringIO(ensure_str(data))
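
A minimal, pandas-free sketch of what this change does, assuming the data is UTF-8 encoded bytes. `to_text` and `preprocess_for_chunks` are hypothetical names used only for illustration; they approximate the intent of `ensure_str` and the chunked branch of `_preprocess_data`, and are not the actual pandas implementation.

```python
from io import StringIO


def to_text(data, encoding="utf-8"):
    """Hypothetical helper: decode bytes to str, pass str through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def preprocess_for_chunks(data):
    """Simplified stand-in for the chunked branch of _preprocess_data."""
    if not hasattr(data, "read"):
        # Decode first so StringIO always receives a str.
        data = StringIO(to_text(data))
    return data


# Bytes input no longer raises; each JSON line can be read back in turn.
buf = preprocess_for_chunks(b'{"a": 1}\n{"a": 2}\n')
lines = [line.strip() for line in buf]
print(lines)
```

Objects that already have a `read` attribute pass through untouched, so only the raw str or bytes case is affected.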

I will put together a PR.

Expected Behavior

Using pandas read_json with an S3 URL and a non-null chunksize should work.

Installed Versions

It happens with the current version (1.4.3).

Metadata

    Labels

    Bug; IO JSON (read_json, to_json, json_normalize); IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues)
