Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd

df = pd.read_json(path_or_buf="s3://...json", lines=True, chunksize=100)
```
Issue Description
This issue happens when using pandas read_json with s3fs and a non-null chunksize. There is a similar report for the null-chunksize case.
Running the code above results in the following error:
TypeError: initial_value must be str or None, not bytes
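The error itself is easy to reproduce in isolation: io.StringIO only accepts text, while the handle s3fs hands back yields bytes. A minimal sketch, independent of S3:

```python
from io import StringIO

# StringIO accepts str only.
StringIO('{"a": 1}\n')  # works

# Bytes (what a binary file handle, e.g. an s3fs object, returns from read())
# raise the same TypeError shown above.
StringIO(b'{"a": 1}\n')  # TypeError: initial_value must be str or None, not bytes
```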
I found out it is caused by this method:
```python
def _preprocess_data(self, data):
    """
    At this point, the data either has a `read` attribute (e.g. a file
    object or a StringIO) or is a string that is a JSON document.

    If self.chunksize, we prepare the data for the `__next__` method.
    Otherwise, we read it into memory for the `read` method.
    """
    if hasattr(data, "read") and not (self.chunksize or self.nrows):
        with self:
            data = data.read()
    if not hasattr(data, "read") and (self.chunksize or self.nrows):
        data = StringIO(data)  # <-- fails here: `data` is bytes, not str

    return data
```
I found the fix is simple: just change the line above to `data = StringIO(ensure_str(data))`.
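As a quick sanity check of the idea (a sketch assuming `ensure_str` is imported from pandas.core.dtypes.common, where pandas defines it; it decodes bytes to str and leaves str values untouched):

```python
from io import StringIO

from pandas.core.dtypes.common import ensure_str

raw = b'{"a": 1}\n{"a": 2}\n'   # what a binary (e.g. s3fs) handle yields

buf = StringIO(ensure_str(raw))  # no TypeError: bytes are decoded first
print(buf.read())
```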
I will put together a PR.
Expected Behavior
Using pandas read_json with an S3 URL and a non-null chunksize should work.
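For completeness, the expected chunked-iteration pattern looks like this (a sketch using a hypothetical local JSON Lines file, records.jsonl, in place of the S3 path; the reader supports the context manager protocol, as the `with self:` in the quoted method shows):

```python
import pandas as pd

# The reader yields DataFrames of up to `chunksize` rows each.
with pd.read_json("records.jsonl", lines=True, chunksize=100) as reader:
    for chunk in reader:
        print(chunk.shape)
```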
Installed Versions
This happens with the current release (1.4.3).