-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Switch arrow type for string array to large string #56220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 11 commits
752e368
d813e8d
ed07537
c0c42a8
3196f32
e807652
1341ffc
6dc3f20
848f7ed
7244889
a24ebac
3d90cc7
46d7f16
3fdf256
c2bd9d2
a01d4e5
a22e625
847b74c
47fda87
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -172,9 +172,15 @@ def _convert_arrays_to_dataframe( | |
) | ||
if dtype_backend == "pyarrow": | ||
pa = import_optional_dependency("pyarrow") | ||
arrays = [ | ||
ArrowExtensionArray(pa.array(arr, from_pandas=True)) for arr in arrays | ||
] | ||
import pyarrow.compute as pc | ||
|
||
result_arrays = [] | ||
for arr in arrays: | ||
pa_array = pa.array(arr, from_pandas=True) | ||
if arr.dtype == "string": | ||
pa_array = pc.cast(pa_array, pa.string()) | ||
mroeschke marked this conversation as resolved.
Show resolved
Hide resolved
|
||
result_arrays.append(ArrowExtensionArray(pa_array)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the reason for this cast? (and maybe add a comment about it) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. arrow is inferring this as regular strings, I think we had failing tests without this cast There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I'm still confused about this as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm the comment above was incorrect, its like this: We are now using large_string in our String Extension arrays, e.g. if you convert this to an ArrowExtensionArray it will also be large_string. This is inconsistent with the other I/O methods where ArrowExtensionArray is still pa.string, that's why I am casting it back here. I am happy to change this as well, but rather in a follow up There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah okay that makes sense. I'm OK with this then but would be good to have a |
||
arrays = result_arrays # type: ignore[assignment] | ||
if arrays: | ||
df = DataFrame(dict(zip(list(range(len(columns))), arrays))) | ||
df.columns = columns | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -83,11 +83,14 @@ def test_constructor_not_string_type_value_dictionary_raises(chunked): | |||||
ArrowStringArray(arr) | ||||||
|
||||||
|
||||||
@pytest.mark.xfail( | ||||||
reason="dict conversion does not seem to be implemented for large string in arrow" | ||||||
) | ||||||
@pytest.mark.parametrize("chunked", [True, False]) | ||||||
def test_constructor_valid_string_type_value_dictionary(chunked): | ||||||
pa = pytest.importorskip("pyarrow") | ||||||
|
||||||
arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.utf8())) | ||||||
arr = pa.array(["1", "2", "3"], pa.dictionary(pa.int32(), pa.large_string())) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
(it's only the python->arrow converter that doesn't seem to implement this, but creating a dictionary array with large string in pyarrow itself is certainly supported) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Additionally, it looks a bit strange that we actually allow creating a string column backed by a dictionary array. It would be nice that long-term we support this, but right now many operations will just fail (eg all string compute functions from pyarrow will fail on a dictionary[string] type). I think for fixing #53951, instead of allowing dictionary to pass through, we should rather convert the dictionary to a plain string array? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can do this as a follow up, but I don't think that this is a real use case anyway There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The report in #53951 is a real use case, though (and that will now create such dictionary-backed string column), AFAIU But indeed for a different issue/PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. isn't this also happening on main? maybe I am misunderstanding something |
||||||
if chunked: | ||||||
arr = pa.chunked_array(arr) | ||||||
|
||||||
|
Uh oh!
There was an error while loading. Please reload this page.