Description
I tried to bulk index a batch of raw JSON records into ES and needed to set a custom _id value for each of them. Indexing documents individually works: building the query with "new IndexQueryBuilder().withId(some_id_value)" and passing it to the single-document index method stores the document under that _id. The "bulkIndex" method, however, ignores the _id set on each query.
Here's the code in which the ".withId" call is ignored:
package <omitted for safety>;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.IndexOperations;
import org.springframework.data.elasticsearch.core.mapping.IndexCoordinates;
import org.springframework.data.elasticsearch.core.query.IndexQuery;
import org.springframework.data.elasticsearch.core.query.IndexQueryBuilder;
import org.springframework.stereotype.Service;
@Service
public class ESService {

    @Autowired
    private ElasticsearchOperations esOperations;

    public void index(String baseName, Map<Integer, String> jsonDocuments, String indexName, Long exp_time) {
        IndexCoordinates indexCoordinates = IndexCoordinates.of(indexName);
        IndexOperations indexOps = esOperations.indexOps(indexCoordinates);
        if (!indexOps.exists()) {
            indexOps.create();
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        List<IndexQuery> indexQueries = jsonDocuments.keySet().stream()
                .map(id -> new IndexQueryBuilder()
                        .withSource(jsonDocuments.get(id))
                        .withId(id.toString()) // HERE IS THE IGNORED CALL
                        .withIndex(indexName)
                        .build())
                .collect(Collectors.toList());
        try {
            esOperations.bulkIndex(indexQueries, indexCoordinates);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
It would be useful (if not essential) for the user to be able to set the _id of each individual record sent in a bulk request.
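To make the expected behavior concrete, here is a minimal sketch of the request body I would expect a bulk call to produce, based on the Elasticsearch _bulk API format (one action line carrying "_index" and "_id", followed by the document source on the next line). The class `BulkBodySketch` and the method `buildBulkBody` are hypothetical names used only for illustration; they are not part of Spring Data Elasticsearch, and this says nothing about how the library builds its requests internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BulkBodySketch {

    // Builds an NDJSON _bulk body in which every action line carries an explicit _id.
    // Assumptions: each document is single-line JSON, and the index name needs no
    // JSON escaping. Hypothetical helper, not a Spring Data Elasticsearch API.
    static String buildBulkBody(String indexName, Map<Integer, String> jsonDocuments) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<Integer, String> entry : jsonDocuments.entrySet()) {
            // Action line: tells ES which index and which _id to use.
            body.append("{\"index\":{\"_index\":\"")
                .append(indexName)
                .append("\",\"_id\":\"")
                .append(entry.getKey())
                .append("\"}}\n");
            // Source line: the raw JSON document itself.
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(42, "{\"name\":\"first\"}");
        docs.put(43, "{\"name\":\"second\"}");
        System.out.print(buildBulkBody("my-index", docs));
    }
}
```

Each pair of lines in the output corresponds to one record, so a bulk request built this way would store every document under the _id passed to ".withId".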
As a workaround, I can loop over the IndexQuery objects and send them to ES one by one, which sets the _id correctly but is much slower: for my ~2 million JSON records, elapsed time grows from 15-20 minutes (bulk indexing in batches of 2000 records) to ~3 hours.
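For reference, the batching mentioned above can be sketched roughly as follows. This is a simplified, library-free illustration, not my exact code: `BatchingSketch` and `forEachBatch` are hypothetical names, and the `sender` callback stands in for a call such as `batch -> esOperations.bulkIndex(batch, indexCoordinates)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingSketch {

    // Splits items into consecutive batches of at most batchSize elements and
    // hands each batch to sender. Returns the number of batches sent.
    // Hypothetical helper; sender is a placeholder for the real bulk call.
    static <T> int forEachBatch(List<T> items, int batchSize, Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so the callback does not hold a view of the full list.
            sender.accept(new ArrayList<>(items.subList(start, end)));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            ids.add(i);
        }
        // With the real service, the lambda would call bulkIndex instead of printing.
        int sent = forEachBatch(ids, 2000,
                batch -> System.out.println("would send " + batch.size() + " queries"));
        System.out.println("batches: " + sent);
    }
}
```

With a batch size of 2000, ~2 million records translate into roughly 1000 bulk requests instead of ~2 million single-document requests, which is where the difference in elapsed time comes from.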