BUG: Fix #15344 by backporting ujson usage of PEP 393 API #15360

Closed

Conversation

tobgu
Contributor

@tobgu tobgu commented Feb 9, 2017

Make use of the PEP 393 API to avoid expanding single-byte ASCII characters into four-byte unicode characters when encoding objects to JSON.
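
For context, a minimal sketch of the idea, not the actual ujson change; `encode_str_utf8` is a hypothetical helper and the real encoder's error handling is omitted:

```c
#include <Python.h>

/* Hypothetical helper sketching the PEP 393 fast path: for compact ASCII
 * strings the canonical 1-byte storage is already valid UTF-8, so the JSON
 * writer can read it directly instead of expanding to Py_UNICODE first. */
static const char *
encode_str_utf8(PyObject *str, Py_ssize_t *len)
{
    if (PyUnicode_IS_COMPACT_ASCII(str)) {
        *len = PyUnicode_GET_LENGTH(str);
        /* No copy and no widening: the ASCII payload is handed out as-is. */
        return (const char *)PyUnicode_1BYTE_DATA(str);
    }
    /* General case: returns a UTF-8 buffer cached on the object; it never
     * materializes a 4-bytes-per-character Py_UNICODE representation. */
    return PyUnicode_AsUTF8AndSize(str, len);
}
```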

@jreback
Contributor

jreback commented Feb 9, 2017

lgtm. can you add a whatsnew note? (Bug fix is fine)

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Feb 9, 2017
@tobgu tobgu force-pushed the backport-ujson-compact-ascii-encoding branch from 17290c9 to 4e8e2ff on February 9, 2017 21:34
@jreback
Contributor

jreback commented Feb 10, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/200143310

we have a c-linter :<

@tobgu
Contributor Author

tobgu commented Feb 10, 2017

Ouch :-). Hopefully fixed now.

@jreback jreback added this to the 0.20.0 milestone Feb 10, 2017
@jreback jreback closed this in e884072 Feb 10, 2017
@jreback
Contributor

jreback commented Feb 10, 2017

thanks @tobgu

keep em coming!

I am still puzzled why encoding to a new UTF-8 string, which returns a new reference, is affecting existing references. There may be something deep going on here that I don't understand. If you have time, I would search for / post an issue on CPython to understand the semantics. It's also possible that this is the reason this function is being deprecated (because of these side effects).

PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
Return value: New reference.
Encode the Py_UNICODE buffer s of the given size using UTF-8 and return a Python bytes object. Return NULL if an exception was raised by the codec.

Part of the old-style Py_UNICODE API; please migrate to using PyUnicode_AsUTF8String() or PyUnicode_AsUTF8AndSize().
Deprecated since version 3.3, will be removed in version 4.0.
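
For reference, the migration the docs point to looks roughly like this on the 3.6-era API relevant here; a sketch only, with `migrate_example` being a hypothetical function and `obj` standing in for whatever str the encoder is handling (error checks omitted):

```c
#include <Python.h>

static void
migrate_example(PyObject *obj)
{
    /* Old pattern (deprecated): PyUnicode_AS_UNICODE lazily builds a wide
     * Py_UNICODE buffer and caches it on obj as a side effect, and a new
     * bytes object is allocated for the UTF-8 encoding. */
    PyObject *encoded = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(obj),
                                             PyUnicode_GET_SIZE(obj), NULL);
    Py_XDECREF(encoded);

    /* Suggested replacement: a UTF-8 buffer cached on obj itself, borrowed
     * and valid for as long as obj is alive, with no Py_UNICODE expansion. */
    Py_ssize_t size;
    const char *utf8 = PyUnicode_AsUTF8AndSize(obj, &size);
    (void)utf8;
}
```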

@tobgu
Contributor Author

tobgu commented Feb 10, 2017

I think the culprit in this case is the PyUnicode_AS_UNICODE macro, which is described as inefficient and something you should migrate away from:
https://www.python.org/dev/peps/pep-0393/#porting-guidelines
https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_UNICODE

It seems like there's actually an allocation of a full strlen*4 byte unicode buffer here, which would be the root cause of the observed behaviour:
https://github.com/python/cpython-mirror/blob/9aa885c9df5b6895f70272b6abc24ef0b1fb8b46/Objects/unicodeobject.c#L4089

Haven't dug into it further than that...
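
For illustration, a hedged sketch of the side effect being discussed; `demo_wide_cache_side_effect` is hypothetical, not part of the patch, and applies only to Python versions (< 3.12) where the wide-string API still exists:

```c
#include <Python.h>

static void
demo_wide_cache_side_effect(PyObject *ascii_str)
{
    /* A compact ASCII str stores roughly one byte per character. */
    Py_ssize_t nchars = PyUnicode_GET_LENGTH(ascii_str);

    /* Reading the deprecated wide buffer forces CPython to allocate a
     * Py_UNICODE (wchar_t) copy, roughly nchars * sizeof(wchar_t) bytes,
     * and cache it on ascii_str itself. */
    Py_UNICODE *wide = PyUnicode_AS_UNICODE(ascii_str);

    /* The cached copy is only freed when ascii_str is deallocated, so every
     * existing reference to the string now carries the extra allocation --
     * which is what showed up as DataFrame memory growth after to_json(). */
    (void)nchars;
    (void)wide;
}
```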

Thanks @jreback and @chris-b1 for helping out with this! It would have been a show stopper for the application that I'm currently working on. Hope to see it released soon! :-)

@jreback
Contributor

jreback commented Feb 10, 2017

thanks for looking @tobgu, that is some nasty code.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Make use of the PEP 393 API to avoid expanding single byte ascii
characters into four byte unicode characters when encoding objects to
json.

closes pandas-dev#15344

Author: Tobias Gustafsson <tobias.l.gustafsson@gmail.com>

Closes pandas-dev#15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 APIs for compact ascii
Linked issue (closed by this PR): Calling DataFrame.to_json() increases data frame memory usage in Python 3.6 (pandas-dev#15344)