-
Notifications
You must be signed in to change notification settings - Fork 82
feat: support union type for basic types #510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This reverts commit f8eb3cc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Really appreciate the effort!
Please see the bunch of comments. Welcome for any further discussion!
src/base/value.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the implementations, I realized I overlooked one thing when creating the original issue: value -> JSON -> value may get a different value after a roundtrip, under union type (the value representation has stronger type information and can uniquely identify the type for basic types, but JSON are weaker typed).
This can be a problem, meaning a ser-deser roundtrip is no longer transparent (e.g. we cache intermediate computation results for reuse).
Now I'm thinking maybe we can do the following to aid:
- For union typed values, add a number tag to indicate the branch ID, e.g.
UnionVariant(tag_id, value)
, and serialize it as a 2-element array. - This means the type representation (
UnionType
) needs to keep aVec
, so thetag_id
can more easily work with it. I still like the idea of sorting possible types though - consider doing the sorting when creating theUnionType
. - Note that we have another
TypedValue
serialization logic:Lines 1049 to 1057 in 95c7ece
impl Serialize for TypedValue<'_> { fn serialize<S: serde::Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> { match (self.t, self.v) { (_, Value::Null) => serializer.serialize_none(), (ValueType::Basic(_), v) => v.serialize(serializer), (ValueType::Struct(s), Value::Struct(field_values)) => TypedFieldsValue { schema: &s.fields, values_iter: field_values.fields.iter(), }
The difference is that these serializations are more friendly for being consumed outside cocoindex without cocoindex type info (they are used in export / dump format now), and not for another deserialization. So for these, we don't need to carrytag_id
along.
Welcome to any further discussions on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that transparency can be a problem. My current implementation includes type guessing that may cause inconsistency.
So, my understanding is that the tag_id
is for the current type of the value, and it's serialized altogether with the value as a JSON array. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct.
(sorry missed this question yesterday)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the code! Looked at the latest in-progress implementation. This is still a little bit different from what I thought.
UnionVariant(tag_id, value)
I suggested is a branch of BasicValue
. Writing it out, may be something like this:
enum BasicValue {
...
UnionVariant{tag_id: usize, value: Box<BasicValue>},
}
e.g. if a union type is Str | Int64
, Str
has tag ID 0, Int64
has tag ID 1.
Both serialization and deserialization will be more straightforward:
- To serialize: directly convert it into array
[tag_id, serialize(value)]
- To deserialize (
from_json()
method): with the tag_id, we can directly get the specific type from the vector inUnionType
, then use it to deserialize value.
This will be simpler and more robust: we don't need to guess the BasicType
from the BasicValue
, which may not be always possible (e.g. zero-length vectors, or a type like Str | Vector[Str | Int64]
).
Besides, I realized there're one more thing may need discuss - I didn't realize last time.
For deserialization, there're two types of use cases:
- Deserialize what we serialized. It will guarantee we get the same thing in a roundtrip.
- Deserialize JSON values created by some external systems. Currently the only situation is
ValueExtractor
injson_schema.rs
(e.g. values are created by LLM for the case), which we don't expect they put the tag ID. On the other hand, we don't expect the value type preserved in a roundtrip for this case too. So we can guess by trying each possible types during deserialization.
To distinguish these two, we can add a new input to from_json()
method, to indicate which mode, e.g. we can have another enum like:
enum DeserializationMode {
/// Deserialize what we serialized. Guaranteed to preserve original type.
Internal,
/// Deserialize JSON values produced by external systems.
External,
}
Thanks for your patience! Let me know if you have any other thoughts. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the misunderstanding. I was actually concerned by serializing the element type.
It is a simple way to eliminate ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I was initially thinking about serialize and deserialize BasicValue
without a UnionVariant
branch.
The serialization creates the tag_id
when BasicValueType
is Union
:
E.g.
BasicValue::Int64(10)
with typeStr | Int
->{ tag_id: 1, value: 10 }
And deser uses the BasicValueType
and tag_id
to construct BasicValue
:
E.g.
{ tag_id: 1, value: 10 }
with typeStr | Int
->BasicValue::Int64(10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thought! This may have the following difficulties though:
- During serialization we don't have the
BasicValueType
. The serialization is designed in a way that can work standlone with theValue
(otherwise we need to do it by a serialization method not standard forserde
, and a lot things need to be changed, e.g. other serializable structs usingValue
as fields). - Even if we pass the type down, we still need to guess the type based on the value. It's doable for most cases, but there're still some difficult cases. e.g. when there's a vector, we need to look into the element, which can be union type too. We'll introduce more times in the future (e.g.
Enum
, [FEATURE] Support Enum as a basic type #523), which may add more complexity (e.g. how aboutVector[Str] | Vector[Enum1]
)?
I prefer avoiding these complexities, to avoid hitting a case that we cannot handle some day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we also need to update make_engine_value_decoder()
in convert.py
.
src/utils/union.rs
Outdated
|
||
/// Move an iterable and parse it into a union type. | ||
/// If there is only one single unique type, it returns a single `BasicValueType`. | ||
pub fn parse_from<T>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like both parse_from()
and coerce_from()
are only used by tests. Then just wonder what do we lose if we directly construct the UnionType
in tests?
I can see we may seed something similar for the following purposes:
- When we get a type encoded from Python code, we may want to canonicalize it, e.g. make sure no duplicated code, no nested union, etc. This is the place we need this:
cocoindex/src/ops/py_factory.rs
Lines 171 to 172 in 95c7ece
let (result_type, executor) = result .extract::<(crate::py::Pythonized<schema::EnrichedValueType>, Py<PyAny>)>(py)?; - When building a type in some operations (like this) some helpers may also be needed.
For 1, in my mind a function like this may be ideal:
fn canonicalize_type(t: EnrichedValueType): EnrichedValueType
For 2, what we need may be closer to this parse_from
, while in my mind the util may directly return a BasicValueType
for ease of use.
(since we don't have a op built in Rust needs this now, we don't need to worry about 2 for now)
Thanks! I'll see what I can do |
bb693e8
to
39b6039
Compare
Here is my solution on the union type support. It includes basic conversions and representations for the union type.
If there are any problems with the change, feel free to speak out for revision or reject.
Context
This PR is an attempt to solve Issue #436. It includes descriptions and conceptual works about union type support.
Changes
UnionType
.BasicValueType
.UnionType
includes aBTreeSet
to store the types, with extra features.int | str | int
isint | str
.BasicValueType
. It might be changed to explicit ranks.int | (int | str) | str | float
isint | str | float
LocalDateTime
,OffsetDateTime
,Date
, andTime
), JSON, and String appear in a union at the same time, the parser will try parsing the value against each possible type in the reversed order of the definition (might be changed to explicit ranks).oneOf
.Union[type1 | type2 | ...]
.