Unable to set _id in bulk index with raw source documents #2861

Closed
@diegobenincasa

I tried to bulk index a batch of raw JSON records into ES and needed to set a custom _id for each of them. Indexing individually works: build the query with "new IndexQueryBuilder().withId(some_id_value)" and pass it to the single-document index method. Calling the "bulkIndex" method with the same queries, however, ignores the _id value that was set.

Here's the code that ignores the ".withId" call:

package <omitted for safety>;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.IndexOperations;
import org.springframework.data.elasticsearch.core.mapping.IndexCoordinates;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;

@Service
public class ESService {
    
    @Autowired
    private ElasticsearchOperations esOperations;

    public void index(String baseName, Map<Integer, String> jsonDocuments, String indexName, Long exp_time) {

        IndexCoordinates indexCoordinates = IndexCoordinates.of(indexName);
        
        IndexOperations indexOps = esOperations.indexOps(indexCoordinates);
        if(!indexOps.exists()) {
            indexOps.create();
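            try {
                // Pause briefly after creating the index.
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still observe it.
                Thread.currentThread().interrupt();
            }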
        }

        List<IndexQuery> indexQueries = jsonDocuments.keySet().stream()
            .map(id -> new IndexQueryBuilder()
                .withSource(jsonDocuments.get(id))
                .withId(id.toString()) // HERE IS THE IGNORED CALL
                .withIndex(indexName)
                .build())
            .collect(Collectors.toList());

        try {
            esOperations.bulkIndex(indexQueries, indexCoordinates);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

It would be useful (if not essential) for the user to be able to set the _id of each individual record sent in a bulk request.
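For reference, the Elasticsearch _bulk REST endpoint itself accepts a per-document _id: each record is sent as an action line carrying _index and _id, followed by the source line. The sketch below only builds such a newline-delimited body from the same Map<Integer, String> used above. It is not what Spring Data Elasticsearch does internally; the class and method names are hypothetical, and it assumes each JSON document is already serialized on a single line and that the index name needs no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the newline-delimited body accepted by the
// Elasticsearch _bulk endpoint. Class and method names are hypothetical.
final class BulkBodySketch {

    // Each document becomes two lines: an action line carrying _index and
    // _id, then the raw JSON source (assumed to contain no line breaks).
    static String toBulkBody(String indexName, Map<Integer, String> jsonDocuments) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<Integer, String> entry : jsonDocuments.entrySet()) {
            body.append("{\"index\":{\"_index\":\"")
                .append(indexName)
                .append("\",\"_id\":\"")
                .append(entry.getKey())
                .append("\"}}\n");
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "{\"name\":\"first\"}");
        docs.put(2, "{\"name\":\"second\"}");
        System.out.print(toBulkBody("my-index", docs));
    }
}
```

This is the shape of request in which the custom _id would be expected to appear once "bulkIndex" honors ".withId".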

As a workaround, I looped over the individual IndexQuery objects and sent them to ES one by one. That sets the _id value correctly but increases processing time a lot: in my scenario of ~2 million JSON records, elapsed time goes from 15-20 minutes (in batches of 2000 records) to ~3 hours.
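The batching mentioned above can be sketched with a small generic helper that splits a list into fixed-size chunks; each chunk would then go to a single "bulkIndex" call. The class and method names here are hypothetical, and Integer values stand in for the IndexQuery objects so the example runs without any Spring dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits a list into consecutive batches of at most batchSize
// elements. Names are hypothetical and not part of Spring Data Elasticsearch.
final class BatchSketch {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 4500; i++) {
            ids.add(i);
        }
        // 4500 items in batches of 2000 gives sizes 2000, 2000 and 500.
        for (List<Integer> batch : partition(ids, 2000)) {
            System.out.println(batch.size());
        }
    }
}
```

With ~2 million records and a batch size of 2000, this yields about 1000 bulk requests instead of ~2 million single-document requests.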