Expose JSON body fields as typed parameters #1680

Closed
@sethmlarson

Motivation

Today with the Elasticsearch client you have to write a JSON blob to the body parameter to pass any values within the HTTP body.
Unfortunately, this is where some of the most complex data structures in the Elasticsearch API live (see query DSL, aggregations),
which means we're unable to provide a good window into this opaque object via types or auto-complete.

This is what writing a search query looks like today:

from elasticsearch import Elasticsearch
client = Elasticsearch("https://localhost:9200")

client.search(
    index="test-index",
    size=10,
    body={
        "runtime_mappings": {
            "day_of_week": {
                "type": "keyword",
                "script": "emit(doc['@timestamp'].value.dayOfWeekEnum)"
            }
        },
        "aggs": {
            "day_of_week_count": {
                "value_count": {
                    "field": "day_of_week"
                }
            }
        }
    }
)

And here is what the type hints for the search API look like today:

def search(
    index: Optional[Any] = ...,
    body: Optional[Any] = ...,
    size: Optional[Any] = ...,
    ...,
): ...

Something to note for later: the size parameter is always serialized in the query string in this example.

Proposed Solution

With the expanded JSON body fields implemented, the same search API call can be written like so:

client.search(
    index="test-index",
    size=10,
    runtime_mappings={
        "day_of_week": {
            "type": "keyword",
            "script": "emit(doc['@timestamp'].value.dayOfWeekEnum)"
        }
    },
    aggs={
        "day_of_week_count": {
            "value_count": {
                "field": "day_of_week"
            }
        }
    }
)

Notice how the body fields and the fields that are serialized elsewhere, like index, are all at the same level. You're writing actual Python code instead of wrangling JSON at the top level of the API.

And here is what the function signature and types for the search API would look like:

def search(
    index: Optional[Union[str, List[str]]] = ...,
    size: Optional[int] = ...,
    runtime_mappings: Optional[Mapping[str, Any]] = ...,
    aggs: Optional[Mapping[str, Any]] = ...,
    ...,
    body: Optional[Any] = ...,
): ...

The differences that I want to highlight:

  • There is still a body parameter for backwards compatibility with queries written before this change. This won't be a breaking change: any queries that use the body parameter will still function exactly as they do today. The body parameter will be deprecated on APIs that support expanded body fields.
  • Using the Elasticsearch specification we get better type hints for parameters that already exist. This is easiest to see with the index and size parameters: previously they were both Optional[Any], and now they are Optional[Union[str, List[str]]] and Optional[int] respectively. This gives you a lot more confidence when writing API calls that you're using the right types the first time.
  • Not shown, but the size parameter would be encoded in the HTTP request body instead of the query string. This is an improvement over the current serialization strategy because space within the HTTP request target (URL path + query) is limited. In the past we've seen errors from users using Scroll IDs, which can be quite verbose, and we currently have a work-around to serialize Scroll IDs in the HTTP body. Another motivator is that any value serialized in the query string instead of the request body isn't affected by HTTP body compression.
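The last point can be illustrated with a rough, standard-library-only sketch. This is not the client's actual serialization code, and the scroll ID below is a made-up placeholder; because it is a repeated string it compresses far better than a real scroll ID would.

```python
import gzip
import json
from urllib.parse import urlencode

# Made-up, artificially long scroll ID (an assumption for illustration only).
scroll_id = "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4" * 200
params = {"scroll": "1m", "scroll_id": scroll_id}

# Strategy 1: values in the query string, so they count against
# whatever limit applies to the HTTP request target.
target_with_query = "/_search/scroll?" + urlencode(params)

# Strategy 2: the same values in a JSON body; the request target stays short.
target_with_body = "/_search/scroll"
body = json.dumps(params).encode("utf-8")

# Only the body can benefit from HTTP body compression.
compressed_body = gzip.compress(body)

print("request target with query string:", len(target_with_query), "bytes")
print("request target with JSON body:   ", len(target_with_body), "bytes")
print("JSON body raw / gzip:            ", len(body), "/", len(compressed_body), "bytes")
```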

Nuances and Future Improvements

These improvements are a first step towards a fully-typed Elasticsearch Python client and will let users start down the path of adding richer types to their Elasticsearch code. However, there are a few things to note:

When will DeprecationWarnings start?

I'm unsure whether they should be emitted in the next release (7.15) or whether I should wait until more APIs are supported. It's a tough balance between alerting users to upcoming changes and new features, and making DeprecationWarnings difficult to avoid in general usage of the client.

Not all APIs can take advantage right away

For an API to have expanded body fields it must be completely defined within the Elasticsearch specification. There are a lot of APIs that aren't completely defined yet, so rollout of this change may take some time while the specification is filled in.

In these cases APIs will be generated using the previous process and have a simple body parameter that won't raise a DeprecationWarning for the time being.

When the body parameter is used, return to old behavior

This means that all previous code written will use the old behavior until updated to not use the body parameter. A DeprecationWarning will be raised in cases where the body parameter could be replaced by expanded fields.

If both a body field and the body parameter are passed, a ValueError will be raised as this configuration isn't supported.
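A minimal sketch of the rules in this section, assuming a hypothetical helper named build_search_body that is not part of the client and handles only two body fields. The real generated code may be structured quite differently; the point is only the three outcomes: expanded fields alone, body alone with a DeprecationWarning, or both together with a ValueError.

```python
import warnings
from typing import Any, Dict, Mapping, Optional


def build_search_body(
    *,
    body: Optional[Any] = None,
    runtime_mappings: Optional[Mapping[str, Any]] = None,
    aggs: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Hypothetical helper: resolve 'body' against expanded body fields."""
    # Collect only the expanded fields the caller actually passed.
    fields: Dict[str, Any] = {
        name: value
        for name, value in (
            ("runtime_mappings", runtime_mappings),
            ("aggs", aggs),
        )
        if value is not None
    }

    if body is not None:
        if fields:
            # Mixing both styles is unsupported.
            raise ValueError(
                "Can't use 'body' together with body fields: "
                + ", ".join(sorted(fields))
            )
        # Old behavior: pass the body through untouched, but warn.
        warnings.warn(
            "The 'body' parameter is deprecated for this API, "
            "use the individual body fields instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return body

    # New behavior: the body is assembled from the expanded fields.
    return fields or None


print(build_search_body(aggs={"count": {"value_count": {"field": "day_of_week"}}}))
```

Callers that want to find remaining uses of body in their own code could run their tests with `python -W error::DeprecationWarning`, which turns the warning into an exception.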

Some fields that were once serialized to the query are now serialized to the body

This is true! However I believe that API compatibility won't be broken by this behavior change as Elasticsearch will treat the two different serializations the same. Some examples of this are _source, size, from, etc.

Deeply nested objects are still untyped

For example, the runtime_mappings object is typed as Mapping[str, Any] where instead it should be typed as Mapping[str, TypedDict[type: Union[Mapping[str, Any], str], format: str, script: Optional[str]]].

The aim is to reduce the number of Any types in type hints where possible, but this initial implementation uses only built-in Python types, as that is as far as we can go in the general case. This means only scalars (int, float, str, etc.) plus Union, Optional, List, Mapping, and Any.

typing.TypedDict was added in Python 3.8, but it is missing a few features that are critical for describing the complex structures of the Elasticsearch API. Specifically, it lacks the ability to mark one or more keys as optional without using the total=False parameter, which makes all keys optional. I'll continue to watch the Python typing space for additional improvements there.
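To make the limitation concrete, here is a small sketch for Python 3.8+. The class names are hypothetical, and which keys of a runtime field are required is an assumption made for illustration (here only type). The closest workaround available, described in PEP 589, splits the keys across two classes, which gets verbose quickly for deeply nested structures.

```python
from typing import Any, Mapping, TypedDict, Union


class RuntimeFieldAllOptional(TypedDict, total=False):
    # total=False makes *every* key optional, including 'type',
    # which this sketch assumes should be required.
    type: Union[Mapping[str, Any], str]
    format: str
    script: str


# Workaround from PEP 589: required keys go in a total=True base class,
# optional keys go in a total=False subclass.
class _RuntimeFieldRequired(TypedDict):
    type: Union[Mapping[str, Any], str]


class RuntimeField(_RuntimeFieldRequired, total=False):
    format: str
    script: str


# At runtime these are plain dicts; the key checks are done by a type checker.
runtime_mappings: Mapping[str, RuntimeField] = {
    "day_of_week": {
        "type": "keyword",
        "script": "emit(doc['@timestamp'].value.dayOfWeekEnum)",
    }
}

print(runtime_mappings["day_of_week"]["type"])
```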

In the future defining our own objects and types may be required to represent these complex types.

What about conflicts with per-request parameters and body fields?

Parameters like api_key are available on every API so that different authentication config can be defined per request.
We want to continue to support code written this way so for now APIs that have a conflict between per-request parameters
and body fields will continue to use the old behavior of a single body parameter.

APIs that have these conflicts will not have expanded body fields for now. There is a future improvement in the works to solve this issue.

What about bodies that aren't JSON?

These APIs won't be changed and will continue to have a body parameter for 7.x.

The future is keyword-only

In Python 3 keyword-only arguments were added, which make a function automatically raise a TypeError if it is called without passing those arguments by keyword. This is a fantastic feature because it makes all code written with a library much more readable and maintainable. It also makes my job as a library maintainer much easier, as I no longer have to worry about breaking code with respect to the order of parameters, only that they are there.

Currently type stubs define all parameters as keyword-only (except required path parameters) but because these are stubs and not function signatures there's no enforcement of this unless you opt-in with mypy or another type-checking tool.

Starting in 8.x keyword-only arguments will be used for all parameters so users should switch over to keyword arguments (as has always been recommended) as soon as possible!
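For readers unfamiliar with the syntax, here is a short sketch of keyword-only enforcement. The search function below is a hypothetical stand-in that only echoes its arguments, not the client's real method; the bare * in the signature is what makes every parameter after it keyword-only.

```python
from typing import Any, Dict, List, Optional, Union


def search(
    *,  # everything after the bare '*' must be passed by keyword
    index: Optional[Union[str, List[str]]] = None,
    size: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for an API method; it only echoes its arguments."""
    return {"index": index, "size": size}


# Keyword arguments work, in any order.
print(search(size=10, index="test-index"))

# Positional arguments are rejected by Python itself, with no type checker needed.
try:
    search("test-index", 10)  # type: ignore[misc]
except TypeError as exc:
    print(f"TypeError: {exc}")
```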
