Skip to content

Performance issue when populate millions documents #86

Open
@ahivert

Description

@ahivert
Problem

When we need to put a lot of documents in index, we need to use queryset_pagination meta option to paginate. Django pagination need a sorted queryset with order_by (cf doc) otherwise same pk can be present more than once and others missing (like #71).

Put order_by on queryset will make django paginator call order_by for each page. Call order_by on huge queryset (like 10 millions) will lead to a huge perfomance issue.

Temporary solution:

We can override _get_actions method (from django_elasticsearch_dsl.documents.DocType) to not use django paginator when a queryset is passed. More over because of the way a database index work, we should first fetch only pks, and then do sub request based on it.

from django.db.models.query import QuerySet


def _get_actions(self, object_list, action):
    if self._doc_type.queryset_pagination and isinstance(object_list, QuerySet):
        pks = object_list.order_by('pk').values_list('pk', flat=True)
        len_pks = len(pks)
        for start_pk_index in range(0, len_pks, self._doc_type.queryset_pagination + 1):
            end_pk_index = start_pk_index + self._doc_type.queryset_pagination
            if end_pk_index >= len_pks:
                end_pk_index = len_pks - 1
            ranged_qs = object_list.filter(pk__range=[
                pks[start_pk_index],
                pks[end_pk_index]
            ])
            for object_instance in ranged_qs:
                yield self._prepare_action(object_instance, action)
    else:
        yield from super()._get_actions(object_list, action)

Available to make the PR if needed.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions