Skip to content

feat: support union type for basic types #510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 57 commits into
base: main
Choose a base branch
from

Conversation

chardoncs
Copy link

Here is my solution on the union type support. It includes basic conversions and representations for the union type.

If there are any problems with the change, feel free to speak out for revision or reject.

Context

This PR is an attempt to solve Issue #436. It includes descriptions and conceptual works about union type support.

Changes

  • Added union in the basic value type, including a new util struct UnionType.
    • [Breaking] Added the union option to the BasicValueType.
    • UnionType includes a BTreeSet to store the types, with extra features.
      • Deduplication: E.g., int | str | int is int | str.
      • Auto-sorting. The current sorting strategy is using the same order of the definition in BasicValueType. It might be changed to explicit ranks.
      • Auto-flattening: E.g., int | (int | str) | str | float is int | str | float
      • If there is only one type, the parser returns the exact type.
      • String guessing: If some of UUID, DateTime (LocalDateTime, OffsetDateTime, Date, and Time), JSON, and String appear in a union at the same time, the parser will try parsing the value against each possible type in the reversed order of the definition (might be changed to explicit ranks).
  • Union type conversion for Qdrant.
  • Union type conversion for PostgreSQL.
  • JSON Schema for union type, using oneOf.
  • Format display: Union[type1 | type2 | ...].
  • [Breaking] Added union support for type encoder in Python API.

Copy link
Member

@badmonster0 badmonster0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Really appreciate the effort!

Please see the bunch of comments. Welcome for any further discussion!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the implementations, I realized I overlooked one thing when creating the original issue: value -> JSON -> value may get a different value after a roundtrip, under union type (the value representation has stronger type information and can uniquely identify the type for basic types, but JSON are weaker typed).

This can be a problem, meaning a ser-deser roundtrip is no longer transparent (e.g. we cache intermediate computation results for reuse).

Now I'm thinking maybe we can do the following to aid:

  • For union typed values, add a number tag to indicate the branch ID, e.g. UnionVariant(tag_id, value), and serialize it as a 2-element array.
  • This means the type representation (UnionType) needs to keep a Vec, so the tag_id can more easily work with it. I still like the idea of sorting possible types though - consider doing the sorting when creating the UnionType.
  • Note that we have another TypedValue serialization logic:

    cocoindex/src/base/value.rs

    Lines 1049 to 1057 in 95c7ece

    impl Serialize for TypedValue<'_> {
    fn serialize<S: serde::Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
    match (self.t, self.v) {
    (_, Value::Null) => serializer.serialize_none(),
    (ValueType::Basic(_), v) => v.serialize(serializer),
    (ValueType::Struct(s), Value::Struct(field_values)) => TypedFieldsValue {
    schema: &s.fields,
    values_iter: field_values.fields.iter(),
    }

    The difference is that these serializations are more friendly for being consumed outside cocoindex without cocoindex type info (they are used in export / dump format now), and not for another deserialization. So for these, we don't need to carry tag_id along.

Welcome to any further discussions on this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that transparency can be a problem. My current implementation includes type guessing that may cause inconsistency.

So, my understanding is that the tag_id is for the current type of the value, and it's serialized altogether with the value as a JSON array. Is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct.
(sorry missed this question yesterday)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the code! Looked at the latest in-progress implementation. This is still a little bit different from what I thought.

UnionVariant(tag_id, value) I suggested is a branch of BasicValue. Writing it out, may be something like this:

enum BasicValue {
  ...
  UnionVariant{tag_id: usize, value: Box<BasicValue>},
}

e.g. if a union type is Str | Int64, Str has tag ID 0, Int64 has tag ID 1.

Both serialization and deserialization will be more straightforward:

  • To serialize: directly convert it into array [tag_id, serialize(value)]
  • To deserialize (from_json() method): with the tag_id, we can directly get the specific type from the vector in UnionType, then use it to deserialize value.

This will be simpler and more robust: we don't need to guess the BasicType from the BasicValue, which may not be always possible (e.g. zero-length vectors, or a type like Str | Vector[Str | Int64]).


Besides, I realized there're one more thing may need discuss - I didn't realize last time.

For deserialization, there're two types of use cases:

  1. Deserialize what we serialized. It will guarantee we get the same thing in a roundtrip.
  2. Deserialize JSON values created by some external systems. Currently the only situation is ValueExtractor in json_schema.rs (e.g. values are created by LLM for the case), which we don't expect they put the tag ID. On the other hand, we don't expect the value type preserved in a roundtrip for this case too. So we can guess by trying each possible types during deserialization.

To distinguish these two, we can add a new input to from_json() method, to indicate which mode, e.g. we can have another enum like:

enum DeserializationMode {
  /// Deserialize what we serialized. Guaranteed to preserve original type.
  Internal,
  /// Deserialize JSON values produced by external systems. 
  External,
}

Thanks for your patience! Let me know if you have any other thoughts. Thanks!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the misunderstanding. I was actually concerned by serializing the element type.

It is a simple way to eliminate ambiguity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I was initially thinking about serialize and deserialize BasicValue without a UnionVariant branch.

The serialization creates the tag_id when BasicValueType is Union:

E.g. BasicValue::Int64(10) with type Str | Int -> { tag_id: 1, value: 10 }

And deser uses the BasicValueType and tag_id to construct BasicValue:

E.g. { tag_id: 1, value: 10 } with type Str | Int -> BasicValue::Int64(10)

Copy link
Member

@badmonster0 badmonster0 May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thought! This may have the following difficulties though:

  • During serialization we don't have the BasicValueType. The serialization is designed in a way that can work standlone with the Value (otherwise we need to do it by a serialization method not standard for serde, and a lot things need to be changed, e.g. other serializable structs using Value as fields).
  • Even if we pass the type down, we still need to guess the type based on the value. It's doable for most cases, but there're still some difficult cases. e.g. when there's a vector, we need to look into the element, which can be union type too. We'll introduce more times in the future (e.g. Enum, [FEATURE] Support Enum as a basic type #523), which may add more complexity (e.g. how about Vector[Str] | Vector[Enum1])?

I prefer avoiding these complexities, to avoid hitting a case that we cannot handle some day.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we also need to update make_engine_value_decoder() in convert.py.


/// Move an iterable and parse it into a union type.
/// If there is only one single unique type, it returns a single `BasicValueType`.
pub fn parse_from<T>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like both parse_from() and coerce_from() are only used by tests. Then just wonder what do we lose if we directly construct the UnionType in tests?

I can see we may seed something similar for the following purposes:

  1. When we get a type encoded from Python code, we may want to canonicalize it, e.g. make sure no duplicated code, no nested union, etc. This is the place we need this:
    let (result_type, executor) = result
    .extract::<(crate::py::Pythonized<schema::EnrichedValueType>, Py<PyAny>)>(py)?;
  2. When building a type in some operations (like this) some helpers may also be needed.

For 1, in my mind a function like this may be ideal:

fn canonicalize_type(t: EnrichedValueType): EnrichedValueType 

For 2, what we need may be closer to this parse_from, while in my mind the util may directly return a BasicValueType for ease of use.

(since we don't have a op built in Rust needs this now, we don't need to worry about 2 for now)

@chardoncs
Copy link
Author

chardoncs commented May 19, 2025

Thanks! I'll see what I can do

@chardoncs chardoncs force-pushed the expr-union-type-impl branch from bb693e8 to 39b6039 Compare May 21, 2025 06:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants