Add additional information to errors when query execution failed #368

Open
@dkropachev

We have had an issue with java-driver reaching the end of the execution plan and throwing NoNodeAvailableException.
The problem is that when a user gets this error, it carries no information besides the fact that the end of the execution plan has been reached.

Most PROD environments have log-rate reduction techniques in place, such as log sampling, filtering, deduplication, and suppression.
Because of that, it is often impossible to figure out why exactly this exception was thrown just by looking at the error message and/or the logs.
This causes extra load on the customer's engineering team as well as on our support and engineering teams.
To mitigate this in all the drivers, the proposal is to enrich the error/exception with the following information (pick only what is relevant for the given error):

  1. List of the nodes in the cluster (including their status, dc, rack)
  2. List of connections to the replicas (including host, rack, dc, shard)
  3. List of prior errors (if the query was attempted on one host and then switched to another because of an error, show all of these errors when the end of the execution plan is reached)
  4. History of topology changes (nodes going UP/DOWN, with timestamps)
  5. Replica set information source (tablet/vnode/other)
  6. Node/connection overload status (the status itself, if present, or the number of queries in flight)
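
As a rough illustration of the list above, here is a minimal Java sketch of an exception that carries such context. Every type name here (`AttemptError`, `QueryDiagnostics`, `EnrichedQueryException`) is hypothetical and does not exist in any driver. The sketch covers only the replica source, node statuses, and prior errors, and assumes node statuses arrive as preformatted strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration only; none of these exist in the driver.
final class AttemptError {
    final String host;    // node the failed attempt was sent to
    final String message; // why that attempt failed

    AttemptError(String host, String message) {
        this.host = host;
        this.message = message;
    }
}

final class QueryDiagnostics {
    final String replicaSource;           // "tablet", "vnode" or "other"
    final List<String> nodeStatuses;      // e.g. "10.0.0.1 dc1/rack1 DOWN"
    final List<AttemptError> priorErrors; // errors from earlier attempts, in order

    QueryDiagnostics(String replicaSource, List<String> nodeStatuses,
                     List<AttemptError> priorErrors) {
        this.replicaSource = replicaSource;
        // Defensive copies so the snapshot cannot change after the error is built.
        this.nodeStatuses = Collections.unmodifiableList(new ArrayList<>(nodeStatuses));
        this.priorErrors = Collections.unmodifiableList(new ArrayList<>(priorErrors));
    }

    // Renders the context as a single line suitable for an exception message.
    String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("replica source: ").append(replicaSource);
        sb.append("; nodes: ").append(String.join(", ", nodeStatuses));
        sb.append("; prior errors: ");
        if (priorErrors.isEmpty()) {
            sb.append("none");
        }
        for (int i = 0; i < priorErrors.size(); i++) {
            AttemptError e = priorErrors.get(i);
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(e.host).append(" -> ").append(e.message);
        }
        return sb.toString();
    }
}

final class EnrichedQueryException extends RuntimeException {
    final QueryDiagnostics diagnostics;

    EnrichedQueryException(String message, QueryDiagnostics diagnostics) {
        // The context goes into the message so it survives log sampling and filtering.
        super(message + " [" + diagnostics.describe() + "]");
        this.diagnostics = diagnostics;
    }
}
```

Putting the context into the message itself, rather than only into a field, means it is still visible when just the exception text reaches the logs.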

We can include that information in any query error, or only in specific errors, such as timeouts, empty execution plan errors, end of execution plan errors, or no connections available errors.

While doing that, we should be aware that clusters can have many nodes (>60), so node status information should be reduced using the following logic:

  1. Add individual status for nodes that are relevant to the query (based on replica set, dc, rack)
  2. Group the status of the rest of the cluster by dc/rack/node status (UP/DOWN)
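
The reduction above could look roughly like the following Java sketch. `NodeInfo` and `NodeStatusSummary` are hypothetical names, not driver types. For simplicity the sketch treats only replica hosts as "relevant"; a real implementation might also consider the local dc and rack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical node view for illustration; not a driver type.
final class NodeInfo {
    final String host;
    final String dc;
    final String rack;
    final boolean up;

    NodeInfo(String host, String dc, String rack, boolean up) {
        this.host = host;
        this.dc = dc;
        this.rack = rack;
        this.up = up;
    }

    String status() {
        return up ? "UP" : "DOWN";
    }
}

final class NodeStatusSummary {
    // Lists replicas individually and collapses all other nodes
    // into one "dc/rack STATUS xN" line per group.
    static List<String> summarize(List<NodeInfo> nodes, Set<String> replicaHosts) {
        List<String> lines = new ArrayList<>();
        Map<String, Integer> groups = new TreeMap<>(); // sorted keys give stable output
        for (NodeInfo n : nodes) {
            String location = n.dc + "/" + n.rack + " " + n.status();
            if (replicaHosts.contains(n.host)) {
                lines.add(n.host + " " + location);
            } else {
                groups.merge(location, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : groups.entrySet()) {
            lines.add(e.getKey() + " x" + e.getValue());
        }
        return lines;
    }
}
```

With this shape the summary grows with the number of replicas plus the number of dc/rack/status combinations, not with the total node count.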

To avoid excessive load, we might want throttling logic, for example including that information only once a minute.
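
One possible way to implement the "once a minute" throttling is a small lock-free gate, sketched below in Java. `DiagnosticsThrottle` is a hypothetical helper, not part of any driver. The clock is injected as an assumption to keep the sketch testable; production code would likely pass `System::nanoTime`.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper for illustration; not part of any driver.
final class DiagnosticsThrottle {
    private final long intervalNanos;
    private final LongSupplier clock;     // nanosecond clock, e.g. System::nanoTime
    private final AtomicLong nextAllowed; // earliest time the next enrichment may happen

    DiagnosticsThrottle(long interval, TimeUnit unit, LongSupplier clock) {
        this.intervalNanos = unit.toNanos(interval);
        this.clock = clock;
        this.nextAllowed = new AtomicLong(clock.getAsLong());
    }

    // Returns true at most once per interval, even when called from many threads.
    boolean tryAcquire() {
        long now = clock.getAsLong();
        long next = nextAllowed.get();
        // Compare via subtraction so that nanoTime wrap-around is handled correctly.
        // Only the thread that wins the compare-and-set gets to attach diagnostics.
        return now - next >= 0 && nextAllowed.compareAndSet(next, now + intervalNanos);
    }
}
```

A caller would attach the full diagnostics only when `tryAcquire()` returns true and fall back to the plain error otherwise.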
