Closed
Description
In any situation where you are ingesting content into ElasticSearch that you do not directly control, there is the possibility that someone somewhere along the lines has improperly encoded UTF-16 or another encoding to UTF-8 through the use of surrogate pairs.
These are not supported in UTF-8, in Python, or in ElasticSearch, and will cause mountains of errors like this:
File \"/usr/lib/python3.6/site-packages/elasticsearch/transport.py\", line 295, in perform_request
body = body.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd9e' in position 19281: surrogates not allowed
The fix is quite simple, surrogate escapes must be backslash replaced in Python3:
try:
body.encode("utf-8")
except UnicodeEncodeError as e:
if e.reason == 'surrogates not allowed':
body = body.encode('utf-8', "backslashreplace").decode('utf-8')
I will open a pull request with this fix.
Metadata
Metadata
Assignees
Labels
No labels