Skip to content

PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields #1976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 12, 2024

Conversation

NoahStapp
Copy link
Contributor

No description provided.

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this change. For one it adds a likely non-trivial perf cost for documents with many top-level fields. We also already reorder the _id to be the first field when encoding a document to BSON, see:

if top_level and "_id" in doc:
elements.append(_name_value_to_bson(b"_id\x00", doc["_id"], check_keys, opts))

Instead we should add a MockupDB test to assert that we actually send the _id field first, if such a test doesn't already exist.

@NoahStapp
Copy link
Contributor Author

We don't need this change. For one it adds a likely non-trivial perf cost for documents with many top-level fields. We also already reorder the _id to be the first field when encoding a document to BSON, see:

if top_level and "_id" in doc:
elements.append(_name_value_to_bson(b"_id\x00", doc["_id"], check_keys, opts))

Instead we should add a MockupDB test to assert that we actually send the _id field first, if such a test doesn't already exist.

insert operations and collection-level bulkWrite correctly order their BSON documents, but client-level bulkWrite does not:

client.bulk_write([InsertOne(namespace="db.coll", document={'x2': 1})]) 

# Sends this to the server
{"bulkWrite": 1, "errorsOnly": true, "ordered": true, "$db": "admin", "ops": [{"insert": 0, "document": {"x2": 1, "_id": {"$oid": "671ff1d0fe637fbcdaa64254"}}}], "nsInfo": [{"ns": "db.coll"}]}

This is because the document is embedded within the operation mapping, so it does not put the _id field first.

@ShaneHarvey
Copy link
Member

Ah good find. Then we need to add the change back for client.bulkWrite. An alternative to copying the data is to use ChainMap:

>>> from collections import ChainMap
>>> bson.encode({'subdoc': ChainMap({'_id': 1}, {'d': 'doc'})})
b'&\x00\x00\x00\x03subdoc\x00\x19\x00\x00\x00\x02d\x00\x04\x00\x00\x00doc\x00\x10_id\x00\x01\x00\x00\x00\x00\x00'

Although we may need to document this interaction with command monitoring. We'll also still need to add _id to the input for backwards compat.

@NoahStapp NoahStapp requested a review from ShaneHarvey October 29, 2024 14:21
@@ -0,0 +1,52 @@
from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the boilerplate License comment.

pytestmark = pytest.mark.mockupdb


class TestIdOrdering(PyMongoTestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to the crud spec that describes this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the spec is merged, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the spec merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. added!

new_document = {"_id": ObjectId()}
new_document.update(document)
document.clear()
document.update(new_document)
Copy link
Member

@ShaneHarvey ShaneHarvey Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on the perf implications of this vs ChainMap? Yet another way is to encode the documents to RawBSONDocuments thus relying on the bson layer to reorder the id field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ChainMap is cleaner, encoding to RawBSONDocuments might have additional performance costs.

document["_id"] = ObjectId()
new_document = {"_id": ObjectId()}
new_document.update(document)
document.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it the more I think it's problematic to call clear() and update() here. Those methods could have unintentional side effects aside from the perf problems. For example consider a user passing a custom mapping class which overrides clear()/update().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ChainMap makes sense, agreed. Explicitly modifying a user-supplied mapping will always carry some risks unfortunately, using the least amount of APIs as possible seems like a safer bet here.

Copy link
Contributor Author

@NoahStapp NoahStapp Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some more thinking: is the added complexity and changing of the type to ChainMap here worth it over this much simpler approach:

  if "_id" in document:
      document = {"_id": document["_id"]} | document
  else:
      id = ObjectId()
      document["_id"] = id
      document = {"_id": id} | document

If the original document already had an _id field, this doesn't modify it, which is consistent with our other insert code paths. If we generated an _id field, we still add it to the original document, but we don't worry about the order of the original document.

This also resolves the doctest error we're seeing due to using ChainMap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler yes, but it's not performant:

$ python -m timeit -s 'd={str(k):k for k in range(10000)};from collections import ChainMap' 'ChainMap(d,{"_id":1})'
2000000 loops, best of 5: 163 nsec per loop
$ python -m timeit -s 'd={str(k):k for k in range(10000)}' '{"_id":1}|d'
2000 loops, best of 5: 143 usec per loop

Copy link
Member

@ShaneHarvey ShaneHarvey Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the above, the slow approach adds 2 milliseconds per 100,000 fields copied on my machine. That's significant enough to warrant the complexity.

For the doc test, we probably want to unwrap the ChainMap (via .maps) before exposing it back to the user in bulk errors and possibly even command monitoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unwrapping it back into the original map makes sense.

@@ -133,7 +133,10 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
validate_is_document_type("document", document)
# Generate ObjectId client side.
if not (isinstance(document, RawBSONDocument) or "_id" in document):
document["_id"] = ObjectId()
new_document = {"_id": ObjectId()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is also incomplete because it does not put the _id field first if the user supplies it. For example when inserting {"a": 1, "_id": 2}. We should add tests for this case for insert/bulk/clientBulk as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've decided to make re-ordering user-supplied _id fields optional due to the complexity of doing so across different driver implementations. We can do it in PyMongo if we want, but it won't be standard across all drivers.

@NoahStapp NoahStapp requested a review from ShaneHarvey October 30, 2024 13:53
@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
validate_is_document_type("document", document)
# Generate ObjectId client side.
if not (isinstance(document, RawBSONDocument) or "_id" in document):
document["_id"] = ObjectId()
document = ChainMap(document, {"_id": ObjectId()})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to add id to the input document here. Can you also add a test for that?

@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
validate_is_document_type("document", document)
# Generate ObjectId client side.
if not (isinstance(document, RawBSONDocument) or "_id" in document):
document["_id"] = ObjectId()
document = ChainMap(document, {"_id": ObjectId()})
elif not isinstance(document, RawBSONDocument) and "_id" in document:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you refactor these two if statements to avoid repeating the checks?

if not isinstance(document, RawBSONDocument):
    if "_id" in document:
        ...
    else:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be worthwhile to add a comment explaining the ChainMap usage and why we have to use here but not in other insert code paths.

@NoahStapp NoahStapp requested a review from ShaneHarvey October 30, 2024 21:09
@@ -16,7 +16,7 @@
"""Constants, types, and classes shared across Client Bulk Write API implementations."""
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Mapping, MutableMapping, NoReturn
from typing import TYPE_CHECKING, Any, ChainMap, Mapping, MutableMapping, NoReturn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be from collections import ChainMap

@@ -190,7 +190,7 @@ def connection_checked_in(self, event):

import datetime
from collections import abc, namedtuple
from typing import TYPE_CHECKING, Any, Mapping, Optional, Sequence
from typing import TYPE_CHECKING, Any, ChainMap, Mapping, Optional, Sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from collections import ChainMap

else:
id = ObjectId()
document["_id"] = id
document = ChainMap(document, {"_id": id})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized this but what do you think about pushing this id-reordering logic down into _client_batched_op_msg_impl? That way we don't need to expose ChainMap anywhere and we don't need to unwrap it either.

        # Encode current operation doc and, if newly added, namespace doc.
        if real_op_type == "insert":
           op_doc = ... # ChainMap stuff
        op_doc_encoded = _dict_to_bson(op_doc, False, opts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd still need to unwrap it, even with this change:

# Started events
[{'bulkWrite': 1, 'errorsOnly': True, 'ordered': False, 'lsid': {'id': Binary(b'\xa3\xc7\x80\xdd\x07\x98L\x13\x81\xb8\xbcY\xe8\xa0\x04\xf3', 4)}, '$db': 'admin', 'ops': [{'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}, {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}, {'delete': 0, 'filter': {'foo': 'bar', '_id': 5}, 'multi': False}], 'nsInfo': [{'ns': 'db.test_five'}, {'ns': 'db.test_six'}]}]

# Bulk write error
batch op errors occurred, full error: {'anySuccessful': True, 'error': None, 'writeErrors': [{'ok': 0.0, 'idx': 1, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 6 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 6}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}}, {'ok': 0.0, 'idx': 2, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_five index: _id_ dup key: { _id: 5 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 5}, 'n': 0, 'op': {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}}, {'ok': 0.0, 'idx': 3, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 7 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 7}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}}], 'writeConcernErrors': [], 'nInserted': 1, 'nUpserted': 0, 'nMatched': 0, 'nModified': 0, 'nDeleted': 1, 'insertResults': {}, 'updateResults': {}, 'deleteResults': {}}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still like moving the ChainMap logic into _client_batched_op_msg_impl to make how we add _id fields consistent across insert methods.

# it won't be automatically re-ordered by the BSON conversion.
# We use ChainMap here to make the _id field the first field instead.
if real_op_type == "insert":
op_doc["document"] = ChainMap(op_doc["document"], {"_id": op_doc["document"]["_id"]}) # type: ignore[index]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason we need to unwrap ChainMap later is because we're mutating op_doc. Instead we can do this:

        doc_to_encode = op_doc
        if real_op_type == "insert":
            doc = op_doc["document"]
            if not isinstance(doc, RawBSONDocument):
                doc_to_encode = op_doc.copy()  # Shallow copy
                doc_to_encode["document"] = ChainMap(doc, {"_id": doc["_id"]})  # type: ignore[index]
        op_doc_encoded = _dict_to_bson(doc_to_encode, False, opts)

We also still need the RawBSONDocument here. We should add a test to ensure RawBSONDocument is not inflated after a call to bulk_write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see! That's a slick solution.

@NoahStapp NoahStapp requested a review from ShaneHarvey October 31, 2024 20:44
@@ -18,6 +18,8 @@
import os
import sys

from bson import RawBSONDocument, encode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawBSONDocument needs to be imported from bson.raw_bson

await self.client.bulk_write(models=models)

self.assertIsNone(doc._RawBSONDocument__inflated_doc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@NoahStapp NoahStapp requested a review from ShaneHarvey November 1, 2024 13:17
@ShaneHarvey
Copy link
Member

When merging can you update the commit message to mention the client.bulk_write change?

@NoahStapp NoahStapp changed the title PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields Nov 12, 2024
@NoahStapp NoahStapp merged commit 72a5109 into mongodb:master Nov 12, 2024
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants