-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PYTHON-4915 - Add guidance on adding _id fields to documents to CRUD spec, reorder client.bulk_write generated _id fields #1976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this change. For one it adds a likely non-trivial perf cost for documents with many top-level fields. We also already reorder the _id to be the first field when encoding a document to BSON, see:
mongo-python-driver/bson/__init__.py
Lines 1005 to 1006 in 3ef565f
if top_level and "_id" in doc: | |
elements.append(_name_value_to_bson(b"_id\x00", doc["_id"], check_keys, opts)) |
Instead we should add a MockupDB test to assert that we actually send the _id
field first, if such a test doesn't already exist.
75aef8e
to
8906e84
Compare
This is because the |
Ah good find. Then we need to add the change back for client.bulkWrite. An alternative to copying the data is to use ChainMap: >>> from collections import ChainMap
>>> bson.encode({'subdoc': ChainMap({'_id': 1}, {'d': 'doc'})})
b'&\x00\x00\x00\x03subdoc\x00\x19\x00\x00\x00\x02d\x00\x04\x00\x00\x00doc\x00\x10_id\x00\x01\x00\x00\x00\x00\x00' Although we may need to document this interaction with command monitoring. We'll also still need to add |
@@ -0,0 +1,52 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the boilerplate License comment.
pytestmark = pytest.mark.mockupdb | ||
|
||
|
||
class TestIdOrdering(PyMongoTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a link to the crud spec that describes this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the spec is merged, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the spec merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. added!
pymongo/asynchronous/client_bulk.py
Outdated
new_document = {"_id": ObjectId()} | ||
new_document.update(document) | ||
document.clear() | ||
document.update(new_document) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on the perf implications of this vs ChainMap? Yet another way is to encode the documents to RawBSONDocuments thus relying on the bson layer to reorder the id field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ChainMap is cleaner, encoding to RawBSONDocuments might have additional performance costs.
pymongo/asynchronous/client_bulk.py
Outdated
document["_id"] = ObjectId() | ||
new_document = {"_id": ObjectId()} | ||
new_document.update(document) | ||
document.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about it the more I think it's problematic to call clear() and update() here. Those methods could have unintentional side effects aside from the perf problems. For example consider a user passing a custom mapping class which overrides clear()/update().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ChainMap makes sense, agreed. Explicitly modifying a user-supplied mapping will always carry some risks unfortunately, using the least amount of APIs as possible seems like a safer bet here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some more thinking: is the added complexity and changing of the type to ChainMap
here worth it over this much simpler approach:
if "_id" in document:
document = {"_id": document["_id"]} | document
else:
id = ObjectId()
document["_id"] = id
document = {"_id": id} | document
If the original document already had an _id
field, this doesn't modify it, which is consistent with our other insert code paths. If we generated an _id
field, we still add it to the original document, but we don't worry about the order of the original document.
This also resolves the doctest error we're seeing due to using ChainMap
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler yes, but it's not performant:
$ python -m timeit -s 'd={str(k):k for k in range(10000)};from collections import ChainMap' 'ChainMap(d,{"_id":1})'
2000000 loops, best of 5: 163 nsec per loop
$ python -m timeit -s 'd={str(k):k for k in range(10000)}' '{"_id":1}|d'
2000 loops, best of 5: 143 usec per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the above, the slow approach adds 2 milliseconds per 100,000 fields copied on my machine. That's significant enough to warrant the complexity.
For the doc test, we probably want to unwrap the ChainMap (via .maps
) before exposing it back to the user in bulk errors and possibly even command monitoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah unwrapping it back into the original map makes sense.
pymongo/asynchronous/client_bulk.py
Outdated
@@ -133,7 +133,10 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None: | |||
validate_is_document_type("document", document) | |||
# Generate ObjectId client side. | |||
if not (isinstance(document, RawBSONDocument) or "_id" in document): | |||
document["_id"] = ObjectId() | |||
new_document = {"_id": ObjectId()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is also incomplete because it does not put the _id
field first if the user supplies it. For example when inserting {"a": 1, "_id": 2}
. We should add tests for this case for insert/bulk/clientBulk as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've decided to make re-ordering user-supplied _id
fields optional due to the complexity of doing so across different driver implementations. We can do it in PyMongo if we want, but it won't be standard across all drivers.
pymongo/asynchronous/client_bulk.py
Outdated
@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None: | |||
validate_is_document_type("document", document) | |||
# Generate ObjectId client side. | |||
if not (isinstance(document, RawBSONDocument) or "_id" in document): | |||
document["_id"] = ObjectId() | |||
document = ChainMap(document, {"_id": ObjectId()}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to add id to the input document here. Can you also add a test for that?
pymongo/asynchronous/client_bulk.py
Outdated
@@ -133,7 +134,9 @@ def add_insert(self, namespace: str, document: _DocumentOut) -> None: | |||
validate_is_document_type("document", document) | |||
# Generate ObjectId client side. | |||
if not (isinstance(document, RawBSONDocument) or "_id" in document): | |||
document["_id"] = ObjectId() | |||
document = ChainMap(document, {"_id": ObjectId()}) | |||
elif not isinstance(document, RawBSONDocument) and "_id" in document: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you refactor these two if statements to avoid repeating the checks?
if not isinstance(document, RawBSONDocument):
if "_id" in document:
...
else:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be worthwhile to add a comment explaining the ChainMap usage and why we have to use here but not in other insert code paths.
pymongo/_client_bulk_shared.py
Outdated
@@ -16,7 +16,7 @@ | |||
"""Constants, types, and classes shared across Client Bulk Write API implementations.""" | |||
from __future__ import annotations | |||
|
|||
from typing import TYPE_CHECKING, Any, Mapping, MutableMapping, NoReturn | |||
from typing import TYPE_CHECKING, Any, ChainMap, Mapping, MutableMapping, NoReturn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be from collections import ChainMap
pymongo/monitoring.py
Outdated
@@ -190,7 +190,7 @@ def connection_checked_in(self, event): | |||
|
|||
import datetime | |||
from collections import abc, namedtuple | |||
from typing import TYPE_CHECKING, Any, Mapping, Optional, Sequence | |||
from typing import TYPE_CHECKING, Any, ChainMap, Mapping, Optional, Sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from collections import ChainMap
pymongo/asynchronous/client_bulk.py
Outdated
else: | ||
id = ObjectId() | ||
document["_id"] = id | ||
document = ChainMap(document, {"_id": id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized this but what do you think about pushing this id-reordering logic down into _client_batched_op_msg_impl
? That way we don't need to expose ChainMap anywhere and we don't need to unwrap it either.
# Encode current operation doc and, if newly added, namespace doc.
if real_op_type == "insert":
op_doc = ... # ChainMap stuff
op_doc_encoded = _dict_to_bson(op_doc, False, opts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd still need to unwrap it, even with this change:
# Started events
[{'bulkWrite': 1, 'errorsOnly': True, 'ordered': False, 'lsid': {'id': Binary(b'\xa3\xc7\x80\xdd\x07\x98L\x13\x81\xb8\xbcY\xe8\xa0\x04\xf3', 4)}, '$db': 'admin', 'ops': [{'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}, {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}, {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}, {'delete': 0, 'filter': {'foo': 'bar', '_id': 5}, 'multi': False}], 'nsInfo': [{'ns': 'db.test_five'}, {'ns': 'db.test_six'}]}]
# Bulk write error
batch op errors occurred, full error: {'anySuccessful': True, 'error': None, 'writeErrors': [{'ok': 0.0, 'idx': 1, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 6 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 6}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 6}, {'_id': 6})}}, {'ok': 0.0, 'idx': 2, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_five index: _id_ dup key: { _id: 5 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 5}, 'n': 0, 'op': {'insert': 0, 'document': ChainMap({'foo': 'bar', '_id': 5}, {'_id': 5})}}, {'ok': 0.0, 'idx': 3, 'code': 11000, 'errmsg': 'E11000 duplicate key error collection: db.test_six index: _id_ dup key: { _id: 7 }', 'keyPattern': {'_id': 1}, 'keyValue': {'_id': 7}, 'n': 0, 'op': {'insert': 1, 'document': ChainMap({'foo': 'bar', '_id': 7}, {'_id': 7})}}], 'writeConcernErrors': [], 'nInserted': 1, 'nUpserted': 0, 'nMatched': 0, 'nModified': 0, 'nDeleted': 1, 'insertResults': {}, 'updateResults': {}, 'deleteResults': {}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still like moving the ChainMap logic into _client_batched_op_msg_impl
to make how we add _id
fields consistent across insert methods.
pymongo/message.py
Outdated
# it won't be automatically re-ordered by the BSON conversion. | ||
# We use ChainMap here to make the _id field the first field instead. | ||
if real_op_type == "insert": | ||
op_doc["document"] = ChainMap(op_doc["document"], {"_id": op_doc["document"]["_id"]}) # type: ignore[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason we need to unwrap ChainMap later is because we're mutating op_doc. Instead we can do this:
doc_to_encode = op_doc
if real_op_type == "insert":
doc = op_doc["document"]
if not isinstance(doc, RawBSONDocument):
doc_to_encode = op_doc.copy() # Shallow copy
doc_to_encode["document"] = ChainMap(doc, {"_id": doc["_id"]}) # type: ignore[index]
op_doc_encoded = _dict_to_bson(doc_to_encode, False, opts)
We also still need the RawBSONDocument here. We should add a test to ensure RawBSONDocument is not inflated after a call to bulk_write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see! That's a slick solution.
@@ -18,6 +18,8 @@ | |||
import os | |||
import sys | |||
|
|||
from bson import RawBSONDocument, encode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RawBSONDocument needs to be imported from bson.raw_bson
await self.client.bulk_write(models=models) | ||
|
||
self.assertIsNone(doc._RawBSONDocument__inflated_doc) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
When merging can you update the commit message to mention the client.bulk_write change? |
No description provided.