Skip to content

Surrogates from poorly formed Unicode breaks transport layer #611

Closed
@tylerjharden

Description

@tylerjharden

In any situation where you are ingesting content into ElasticSearch that you do not directly control, there is the possibility that someone somewhere along the lines has improperly encoded UTF-16 or another encoding to UTF-8 through the use of surrogate pairs.

These are not supported in UTF-8, in Python, or in ElasticSearch, and will cause mountains of errors like this:

File \"/usr/lib/python3.6/site-packages/elasticsearch/transport.py\", line 295, in perform_request
   body = body.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd9e' in position 19281: surrogates not allowed

The fix is quite simple, surrogate escapes must be backslash replaced in Python3:

try:
    body.encode("utf-8")
except UnicodeEncodeError as e:
    if e.reason == 'surrogates not allowed':
        body = body.encode('utf-8', "backslashreplace").decode('utf-8')

I will open a pull request with this fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions