PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields #1976

NoahStapp · 2024-10-28T16:01:14Z

No description provided.

ShaneHarvey

We don't need this change. For one, it adds a likely non-trivial perf cost for documents with many top-level fields. We also already reorder the _id to be the first field when encoding a document to BSON, see:

mongo-python-driver/bson/__init__.py

Lines 1005 to 1006 in 3ef565f

    
    if top_level and "_id" in doc:
        elements.append(_name_value_to_bson(b"_id\x00", doc["_id"], check_keys, opts))

Instead we should add a MockupDB test to assert that we actually send the _id field first, if such a test doesn't already exist.

NoahStapp · 2024-10-28T20:20:25Z

Insert operations and collection-level bulkWrite correctly order their BSON documents, but client-level bulkWrite does not:

client.bulk_write([InsertOne(namespace="db.coll", document={'x2': 1})]) 

# Sends this to the server
{"bulkWrite": 1, "errorsOnly": true, "ordered": true, "$db": "admin", "ops": [{"insert": 0, "document": {"x2": 1, "_id": {"$oid": "671ff1d0fe637fbcdaa64254"}}}], "nsInfo": [{"ns": "db.coll"}]}

This is because the document is embedded within the operation mapping rather than being the top-level document, so the encoder's top-level reordering does not apply and the _id field is not put first.
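To make the top-level distinction concrete, here is a toy model of the ordering rule quoted above. It is only a sketch: `encoded_key_order` is a hypothetical helper that tracks key order, not the real `bson` encoder, and it assumes nested documents are plain dicts.

```python
def encoded_key_order(doc, top_level=True):
    """Return the key order a toy encoder would emit for `doc`.

    Mirrors the quoted rule: `_id` is hoisted to the front only when
    the document being encoded is the top-level one.
    """
    keys = []
    if top_level and "_id" in doc:
        keys.append("_id")
    for key, value in doc.items():
        if top_level and key == "_id":
            continue  # already emitted first
        if isinstance(value, dict):
            # Nested documents are encoded with top_level=False.
            keys.append((key, encoded_key_order(value, top_level=False)))
        else:
            keys.append(key)
    return keys


insert_doc = {"x2": 1, "_id": 7}

# A plain insert encodes the document itself as the top-level document.
print(encoded_key_order(insert_doc))  # ['_id', 'x2']

# client.bulk_write nests the document inside an operation mapping.
bulk_op = {"insert": 0, "document": insert_doc}
print(encoded_key_order(bulk_op))  # ['insert', ('document', ['x2', '_id'])]
```

In the second call the document is no longer top-level, so `_id` stays wherever it sits in the user's mapping.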

ShaneHarvey · 2024-10-28T23:14:48Z

Ah good find. Then we need to add the change back for client.bulkWrite. An alternative to copying the data is to use ChainMap:

>>> from collections import ChainMap
>>> bson.encode({'subdoc': ChainMap({'_id': 1}, {'d': 'doc'})})
b'&\x00\x00\x00\x03subdoc\x00\x19\x00\x00\x00\x02d\x00\x04\x00\x00\x00doc\x00\x10_id\x00\x01\x00\x00\x00\x00\x00'

Although we may need to document this interaction with command monitoring. We'll also still need to add _id to the input for backwards compat.
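For reference, a small stdlib-only sketch of the ChainMap behavior this relies on: iteration visits the keys of the last map first, while lookups consult the first map first. The document contents below are made up for illustration.

```python
from collections import ChainMap

document = {"x2": 1, "_id": 7}

# Put a one-key mapping holding _id at the end of the chain.
chained = ChainMap(document, {"_id": document["_id"]})

# Iteration starts with the last map, so "_id" comes out first
# and is not repeated when the first map is visited.
key_order = list(chained)
print(key_order)  # ['_id', 'x2']

# Lookups search the maps front to back and return the user's value.
print(chained["_id"])  # 7

# Nothing was copied or reordered in the user's mapping.
print(list(document))  # ['x2', '_id']
```

Anything that encodes a mapping by iterating its items would therefore see `_id` first without the original dict being touched.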

ShaneHarvey · 2024-10-29T22:08:01Z

test/mockupdb/test_id_ordering.py

@@ -0,0 +1,52 @@
+from __future__ import annotations


Let's add the boilerplate License comment.

ShaneHarvey · 2024-10-29T22:08:45Z

test/mockupdb/test_id_ordering.py

+pytestmark = pytest.mark.mockupdb
+
+
+class TestIdOrdering(PyMongoTestCase):


Can you add a link to the crud spec that describes this test?

Once the spec is merged, yes.

Was the spec merged?

Yes, added!

ShaneHarvey · 2024-10-29T22:10:44Z

pymongo/asynchronous/client_bulk.py

+            new_document = {"_id": ObjectId()}
+            new_document.update(document)
+            document.clear()
+            document.update(new_document)


Thoughts on the perf implications of this vs ChainMap? Yet another way is to encode the documents to RawBSONDocuments, thus relying on the bson layer to reorder the _id field.

Using ChainMap is cleaner; encoding to RawBSONDocuments might have additional performance costs.

ShaneHarvey · 2024-10-29T22:13:17Z

pymongo/asynchronous/client_bulk.py

-            document["_id"] = ObjectId()
+            new_document = {"_id": ObjectId()}
+            new_document.update(document)
+            document.clear()


The more I think about it, the more I think it's problematic to call clear() and update() here. Those methods could have unintentional side effects aside from the perf problems. For example, consider a user passing a custom mapping class which overrides clear()/update().

Using ChainMap makes sense, agreed. Explicitly modifying a user-supplied mapping will always carry some risks, unfortunately; using as few APIs as possible seems like a safer bet here.

Did some more thinking: is the added complexity and changing of the type to ChainMap here worth it over this much simpler approach:

    if "_id" in document:
        document = {"_id": document["_id"]} | document
    else:
        id = ObjectId()
        document["_id"] = id
        document = {"_id": id} | document

If the original document already had an _id field, this doesn't modify it, which is consistent with our other insert code paths. If we generated an _id field, we still add it to the original document, but we don't worry about the order of the original document.

This also resolves the doctest error we're seeing due to using ChainMap.

Simpler yes, but it's not performant:

    $ python -m timeit -s 'd={str(k):k for k in range(10000)};from collections import ChainMap' 'ChainMap(d,{"_id":1})'
    2000000 loops, best of 5: 163 nsec per loop
    $ python -m timeit -s 'd={str(k):k for k in range(10000)}' '{"_id":1}|d'
    2000 loops, best of 5: 143 usec per loop

Based on the above, the slow approach adds roughly 1.4 milliseconds per 100,000 fields copied on my machine (143 usec per 10,000 fields), while wrapping in a ChainMap stays at a constant ~163 nsec. That's significant enough to warrant the complexity.
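The same comparison can be reproduced from a script rather than the command line. This is a rough sketch using the stdlib `timeit` module; the document size and iteration count are arbitrary choices, absolute numbers will vary by machine, and the dict `|` operator needs Python 3.9 or newer.

```python
import timeit
from collections import ChainMap

# A document with many top-level fields, as in the shell benchmark.
big_doc = {str(k): k for k in range(10_000)}
iterations = 1_000

# Wrapping: constant-time, no fields are copied.
wrap_seconds = timeit.timeit(
    lambda: ChainMap(big_doc, {"_id": 1}), number=iterations
) / iterations

# Copying: builds a new dict, so cost grows with the field count.
copy_seconds = timeit.timeit(
    lambda: {"_id": 1} | big_doc, number=iterations
) / iterations

print(f"ChainMap wrap: {wrap_seconds * 1e9:.0f} nsec per call")
print(f"dict union copy: {copy_seconds * 1e6:.0f} usec per call")

# Both approaches yield "_id" as the first key when iterated.
first_key_wrapped = next(iter(ChainMap(big_doc, {"_id": 1})))
first_key_copied = next(iter({"_id": 1} | big_doc))
```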

For the doc test, we probably want to unwrap the ChainMap (via .maps) before exposing it back to the user in bulk errors and possibly even command monitoring.

Excellent point!

Yeah unwrapping it back into the original map makes sense.
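A minimal sketch of what unwrapping via `.maps` could look like. `unwrap_document` is a hypothetical helper name, not something in PyMongo, and it assumes the user's mapping is always the first map in the chain.

```python
from collections import ChainMap


def unwrap_document(doc):
    """Return the user's original mapping if `doc` is a ChainMap wrapper."""
    if isinstance(doc, ChainMap):
        # The wrapper is built as ChainMap(original, {"_id": ...}),
        # so the original mapping is the first entry in .maps.
        return doc.maps[0]
    return doc


original = {"foo": "bar", "_id": 5}
op = {"insert": 0, "document": ChainMap(original, {"_id": 5})}

# Build the view that would be exposed in errors or monitoring events.
reported = {**op, "document": unwrap_document(op["document"])}
print(reported)  # {'insert': 0, 'document': {'foo': 'bar', '_id': 5}}
```

Because `.maps[0]` is the same object the user passed in, the unwrapped view involves no copying.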

ShaneHarvey · 2024-10-29T22:47:10Z

pymongo/asynchronous/client_bulk.py

@@ -133,7 +133,10 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
        validate_is_document_type("document", document)
        # Generate ObjectId client side.
        if not (isinstance(document, RawBSONDocument) or "_id" in document):
-            document["_id"] = ObjectId()
+            new_document = {"_id": ObjectId()}


This implementation is also incomplete because it does not put the _id field first if the user supplies it. For example when inserting {"a": 1, "_id": 2}. We should add tests for this case for insert/bulk/clientBulk as well.

We've decided to make re-ordering user-supplied _id fields optional due to the complexity of doing so across different driver implementations. We can do it in PyMongo if we want, but it won't be standard across all drivers.

ShaneHarvey · 2024-10-30T18:01:08Z

pymongo/asynchronous/client_bulk.py

@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
        validate_is_document_type("document", document)
        # Generate ObjectId client side.
        if not (isinstance(document, RawBSONDocument) or "_id" in document):
-            document["_id"] = ObjectId()
+            document = ChainMap(document, {"_id": ObjectId()})


We still need to add _id to the input document here. Can you also add a test for that?

ShaneHarvey · 2024-10-30T18:06:33Z

pymongo/asynchronous/client_bulk.py

@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None:
        validate_is_document_type("document", document)
        # Generate ObjectId client side.
        if not (isinstance(document, RawBSONDocument) or "_id" in document):
-            document["_id"] = ObjectId()
+            document = ChainMap(document, {"_id": ObjectId()})
+        elif not isinstance(document, RawBSONDocument) and "_id" in document:


Could you refactor these two if statements to avoid repeating the checks?

    if not isinstance(document, RawBSONDocument):
        if "_id" in document:
            ...
        else:

It would also be worthwhile to add a comment explaining the ChainMap usage and why we have to use it here but not in other insert code paths.

ShaneHarvey · 2024-10-30T21:53:19Z

pymongo/_client_bulk_shared.py

@@ -16,7 +16,7 @@
 """Constants, types, and classes shared across Client Bulk Write API implementations."""
 from __future__ import annotations

-from typing import TYPE_CHECKING, Any, Mapping, MutableMapping, NoReturn
+from typing import TYPE_CHECKING, Any, ChainMap, Mapping, MutableMapping, NoReturn


This should be from collections import ChainMap

ShaneHarvey · 2024-10-30T21:53:23Z

pymongo/monitoring.py

@@ -190,7 +190,7 @@ def connection_checked_in(self, event):

 import datetime
 from collections import abc, namedtuple
-from typing import TYPE_CHECKING, Any, Mapping, Optional, Sequence
+from typing import TYPE_CHECKING, Any, ChainMap, Mapping, Optional, Sequence


from collections import ChainMap

ShaneHarvey · 2024-10-30T21:58:09Z

pymongo/asynchronous/client_bulk.py

+            else:
+                id = ObjectId()
+                document["_id"] = id
+                document = ChainMap(document, {"_id": id})


I just realized this but what do you think about pushing this id-reordering logic down into _client_batched_op_msg_impl? That way we don't need to expose ChainMap anywhere and we don't need to unwrap it either.

    # Encode current operation doc and, if newly added, namespace doc.
    if real_op_type == "insert":
        op_doc = ...  # ChainMap stuff
    op_doc_encoded = _dict_to_bson(op_doc, False, opts)

We'd still need to unwrap it, even with this change:

# Started events
[{'bulkWrite': 1, 'errorsOnly': True, 'ordered': False, 'lsid': {'id': Binary(b'\xa3\xc7\x80\xdd\x07\x98L\x13\x81\xb8\xbcY\xe8\xa0\x04\xf3', 4)}, '$db': 'admin', 'ops': [{'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}, {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}, {'delete': 0, 'filter': {'foo': 'bar', '_id': 5}, 'multi': False}], 'nsInfo': [{'ns': 'db.test_five'}, {'ns': 'db.test_six'}]}]

# Bulk write error
batch op errors occurred, full error: {'anySuccessful': True, 'error': None, 'writeErrors': [{'ok': 0.0, 'idx': 1, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 6 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 6}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}}, {'ok': 0.0, 'idx': 2, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_five index: _id_ dup key: { _id: 5 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 5}, 'n': 0, 'op': {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}}, {'ok': 0.0, 'idx': 3, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 7 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 7}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}}], 'writeConcernErrors': [], 'nInserted': 1, 'nUpserted': 0, 'nMatched': 0, 'nModified': 0, 'nDeleted': 1, 'insertResults': {}, 'updateResults': {}, 'deleteResults': {}}

I still like moving the ChainMap logic into _client_batched_op_msg_impl to make how we add _id fields consistent across insert methods.

ShaneHarvey · 2024-10-31T17:47:42Z

pymongo/message.py

+        # it won't be automatically re-ordered by the BSON conversion.
+        # We use ChainMap here to make the _id field the first field instead.
+        if real_op_type == "insert":
+            op_doc["document"] = ChainMap(op_doc["document"], {"_id": op_doc["document"]["_id"]})  # type: ignore[index]


The only reason we need to unwrap ChainMap later is because we're mutating op_doc. Instead we can do this:

    doc_to_encode = op_doc
    if real_op_type == "insert":
        doc = op_doc["document"]
        if not isinstance(doc, RawBSONDocument):
            doc_to_encode = op_doc.copy()  # Shallow copy
            doc_to_encode["document"] = ChainMap(doc, {"_id": doc["_id"]})  # type: ignore[index]
    op_doc_encoded = _dict_to_bson(doc_to_encode, False, opts)

We also still need the RawBSONDocument check here. We should add a test to ensure RawBSONDocument is not inflated after a call to bulk_write.

Ah, I see! That's a slick solution.
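To illustrate why the shallow copy removes the need to unwrap anything, here is a stdlib-only sketch of the idea. `doc_for_encoding` is a hypothetical stand-in for the relevant step; it omits the RawBSONDocument check and the actual BSON encoding, and assumes the insert document already contains an `_id`.

```python
from collections import ChainMap


def doc_for_encoding(op_doc, op_type):
    """Return the mapping to hand to the encoder, leaving `op_doc` untouched."""
    if op_type != "insert":
        return op_doc
    doc = op_doc["document"]
    to_encode = op_doc.copy()  # shallow copy of the operation mapping only
    # Only the copy sees the ChainMap; iteration yields "_id" first.
    to_encode["document"] = ChainMap(doc, {"_id": doc["_id"]})
    return to_encode


user_doc = {"foo": "bar", "_id": 5}
op_doc = {"insert": 0, "document": user_doc}

to_encode = doc_for_encoding(op_doc, "insert")

print(list(to_encode["document"]))  # ['_id', 'foo']
print(op_doc)  # {'insert': 0, 'document': {'foo': 'bar', '_id': 5}}
```

Since `op_doc` is what later shows up in monitoring events and bulk write errors, keeping the ChainMap confined to the copy means those paths never see the wrapper.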

ShaneHarvey · 2024-10-31T21:08:08Z

test/asynchronous/test_client_bulk_write.py

@@ -18,6 +18,8 @@
 import os
 import sys

+from bson import RawBSONDocument, encode


RawBSONDocument needs to be imported from bson.raw_bson

ShaneHarvey · 2024-10-31T21:08:30Z

test/asynchronous/test_client_bulk_write.py

+        await self.client.bulk_write(models=models)
+
+        self.assertIsNone(doc._RawBSONDocument__inflated_doc)
+


ShaneHarvey · 2024-11-11T23:18:38Z

When merging can you update the commit message to mention the client.bulk_write change?

NoahStapp mentioned this pull request Oct 28, 2024

DRIVERS-1408 - Add guidance on adding _id fields to documents to CRUD spec mongodb/specifications#1688

Merged

NoahStapp requested a review from ShaneHarvey October 28, 2024 16:09

ShaneHarvey requested changes Oct 28, 2024

PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD …

8906e84

…spec

NoahStapp force-pushed the PYTHON-4915 branch from 75aef8e to 8906e84 Compare October 28, 2024 18:40

Reorder client.bulkWrite _id

425cd1b

NoahStapp requested a review from ShaneHarvey October 29, 2024 14:21

ShaneHarvey requested changes Oct 29, 2024

ShaneHarvey reviewed Oct 29, 2024

NoahStapp added 2 commits October 30, 2024 09:52

Use ChainMap

1b3df52

Add license

36187bb

NoahStapp requested a review from ShaneHarvey October 30, 2024 13:53

ShaneHarvey requested changes Oct 30, 2024

Unwrap ChainMaps during logging

0e07e18

NoahStapp requested a review from ShaneHarvey October 30, 2024 21:09

ShaneHarvey reviewed Oct 30, 2024

Move re-order logic for client bulkWrite into _client_batched_op_msg_…

13568b1

…impl

ShaneHarvey requested changes Oct 31, 2024

Isolate ChainMap + add RawBSONDocument not inflated test

da83afc

NoahStapp requested a review from ShaneHarvey October 31, 2024 20:44

typing fixes

8894f23

ShaneHarvey reviewed Oct 31, 2024

Cleanup

b2dede3

NoahStapp requested a review from ShaneHarvey November 1, 2024 13:17

ShaneHarvey approved these changes Nov 11, 2024

NoahStapp changed the title ~~PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec~~ PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields Nov 12, 2024

NoahStapp merged commit 72a5109 into mongodb:master Nov 12, 2024
36 checks passed

NoahStapp mentioned this pull request Jan 10, 2025

Revert "PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields" #2055

Merged
