Commit 39b6039

authored
Merge branch 'cocoindex-io:main' into expr-union-type-impl
2 parents beaa1c1 + 6a0bbbc commit 39b6039

9 files changed

+131
-32
lines changed

docs/docs/core/basics.md

Lines changed: 10 additions & 19 deletions
Original file line number | Diff line number | Diff line change
@@ -1,17 +1,17 @@
11
---
2-
title: Basics
3-
description: "CocoIndex basic concepts: indexing flow, data, operations, data updates, etc."
2+
title: Indexing Basics
3+
description: "CocoIndex basic concepts for indexing: indexing flow, data, operations, data updates, etc."
44
---
55

6-
# CocoIndex Basics
6+
# CocoIndex Indexing Basics
77

88
An **index** is a collection of data stored in a way that is easy for retrieval.
99

10-
CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. indexing. It also offers utilities for users to retrieve data from the indexes.
10+
CocoIndex is an ETL framework for building indexes from specified data sources, a.k.a. **indexing**. It also offers utilities for users to retrieve data from the indexes.
1111

12-
## Indexing flow
12+
An **indexing flow** extracts data from specified data sources, upon specified transformations, and puts the transformed data into specified storage for later retrieval.
1313

14-
An indexing flow extracts data from specified data sources, upon specified transformations, and puts the transformed data into specified storage for later retrieval.
14+
## Indexing flow elements
1515

1616
An indexing flow has two aspects: data and operations on data.
1717

@@ -42,7 +42,7 @@ An **operation** in an indexing flow defines a step in the flow. An operation is
4242

4343
"import" and "transform" operations produce output data, whose data type is determined based on the operation spec and data types of input data (for "transform" operation only).
4444

45-
### Example
45+
## An indexing flow example
4646

4747
For the example shown in the [Quickstart](../getting_started/quickstart) section, the indexing flow is as follows:
4848

@@ -60,7 +60,7 @@ This shows schema and example data for the indexing flow:
6060

6161
![Data Example](data_example.svg)
6262

63-
### Life cycle of an indexing flow
63+
## Life cycle of an indexing flow
6464

6565
An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:
6666

@@ -95,19 +95,10 @@ CocoIndex works the same way, but with more powerful capabilities:
9595

9696
This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.
9797

98-
### Internal storage
98+
## Internal storage
9999

100100
As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states.
101101
CocoIndex uses internal storage for this purpose.
102102

103103
Currently, CocoIndex uses Postgres database as the internal storage.
104-
See [Initialization](initialization) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.
105-
106-
## Retrieval
107-
108-
There are two ways to retrieve data from target storage built by an indexing flow:
109-
110-
* Query the underlying target storage directly for maximum flexibility.
111-
* Use CocoIndex *query handlers* for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.
112-
113-
Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
104+
See [Initialization](initialization) for configuring its location, and `cocoindex setup` CLI command (see [CocoIndex CLI](cli)) creates tables for the internal storage.

docs/docs/core/flow_def.mdx

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
11
---
22
title: Flow Definition
33
description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
4-
toc_max_heading_level: 4
54
---
65

76
import Tabs from '@theme/Tabs';

docs/docs/getting_started/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -132,7 +132,7 @@ if __name__ == "__main__":
132132
133133
The `@cocoindex.main_fn` declares a function as the main function for an indexing application. This achieves the following effects:
134134
135-
* Initialize the CocoIndex librart states. Settings (e.g. database URL) are loaded from environment variables by default.
135+
* Initialize the CocoIndex library states. Settings (e.g. database URL) are loaded from environment variables by default.
136136
* When the CLI is invoked with `cocoindex` subcommand, `cocoindex CLI` takes over the control, which provides convenient ways to manage the index. See the next step for more details.
137137
138138
## Step 3: Run the indexing pipeline and queries

docs/docs/query.mdx

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
1+
---
2+
title: Query Support
3+
description: CocoIndex supports vector search and text search.
4+
---
5+
6+
import Tabs from '@theme/Tabs';
7+
import TabItem from '@theme/TabItem';
8+
9+
# CocoIndex Query Support
10+
11+
The main functionality of CocoIndex is indexing.
12+
The goal of indexing is to enable efficient querying against your data.
13+
You can use any libraries or frameworks of your choice to perform queries.
14+
At the same time, CocoIndex provides seamless integration between indexing and querying workflows.
15+
For example, you can share transformations between indexing and querying, and easily retrieve table names when using CocoIndex's default naming conventions.
16+
17+
## Transform Flow
18+
19+
Sometimes a part of the transformation logic needs to be shared between indexing and querying,
20+
e.g. when we build a vector index and query against it, the embedding computation needs to be consistent between indexing and querying.
21+
22+
In this case, you can:
23+
24+
1. Extract a sub-flow with the shared transformation logic into a standalone function.
25+
* It takes one or more data slices as input.
26+
* It returns one data slice as output.
27+
* You need to annotate data types for both inputs and outputs as type parameter for `cocoindex.DataSlice[T]`. See [data types](./core/data_types.mdx) for more details about supported data types.
28+
29+
2. When you're defining your indexing flow, you can directly call the function.
30+
The body will be executed, so that the transformation logic will be added as part of the indexing flow.
31+
32+
3. At query time, you usually want to run the function directly with specific input data, instead of having it called as part of a long-lived indexing flow.
33+
To do this, declare the function as a *transform flow*, by decorating it with `@cocoindex.transform_flow()`.
34+
This adds an `eval()` method to the function, so that you can call it directly with specific input data.
35+
36+
37+
<Tabs>
38+
<TabItem value="python" label="Python">
39+
40+
The [quickstart](getting_started/quickstart#step-41-extract-common-transformations) shows an example:
41+
42+
```python
43+
@cocoindex.transform_flow()
44+
def text_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
45+
    return text.transform(
46+
        cocoindex.functions.SentenceTransformerEmbed(
47+
            model="sentence-transformers/all-MiniLM-L6-v2"))
48+
```
49+
50+
When you're defining your indexing flow, you can directly call the function:
51+
52+
```python
53+
with doc["chunks"].row() as chunk:
54+
    chunk["embedding"] = text_to_embedding(chunk["text"])
55+
```
56+
57+
Alternatively, use the `call()` method on the first argument and pass the transform flow to it, which makes operations chainable:
58+
59+
```python
60+
with doc["chunks"].row() as chunk:
61+
    chunk["embedding"] = chunk["text"].call(text_to_embedding)
62+
```
63+
64+
At any time, you can call the `eval()` method with a specific string, which returns a `list[float]`:
65+
66+
```python
67+
print(text_to_embedding.eval("Hello, world!"))
68+
```
69+
70+
</TabItem>
71+
</Tabs>
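
The reason for sharing one transformation between indexing and querying can be sketched without CocoIndex at all. The block below is only an illustration under stated assumptions: `text_to_embedding` here is a toy trigram-bucket embedding (not `SentenceTransformerEmbed`), and `build_index` and `search` are hypothetical helpers, not CocoIndex APIs. The point is that both paths call the same function, so query vectors and stored vectors live in the same space.

```python
import math


def text_to_embedding(text: str, dim: int = 16) -> list[float]:
    """Toy embedding (an assumption for illustration): bucket character
    trigrams into a fixed-size vector and L2-normalize it."""
    counts = [0.0] * dim
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        trigram = lowered[i:i + 3]
        # Sum of code points is deterministic across runs, unlike hash().
        counts[sum(map(ord, trigram)) % dim] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_index(docs: dict[str, str]) -> dict[str, list[float]]:
    """Indexing path: embed every document with the shared function."""
    return {name: text_to_embedding(body) for name, body in docs.items()}


def search(index: dict[str, list[float]], query: str, top_k: int = 1) -> list[str]:
    """Query path: embed the query with the SAME function, then rank
    documents by cosine similarity (vectors are already normalized)."""
    q = text_to_embedding(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), name)
        for name, vec in index.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```

If the query path used a different embedding than the indexing path, the similarity scores would be meaningless, which is the problem a transform flow is meant to avoid.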
72+
73+
## Get Target Native Names
74+
75+
In your indexing flow, when you export data to a target, you can specify the target name (e.g. a database table name, a collection name, the node label in property graph databases, etc.) explicitly,
76+
or for some backends you can also omit it and let CocoIndex generate a default name for you.
77+
For the latter case, CocoIndex provides a utility function `cocoindex.utils.get_target_storage_default_name()` to get the default name.
78+
It takes the following arguments:
79+
80+
* `flow` (type: `cocoindex.Flow`): The flow to get the default name for.
81+
* `target_name` (type: `str`): The export target name, as it appears in the `export()` call.
82+
83+
For example:
84+
85+
<Tabs>
86+
<TabItem value="python" label="Python">
87+
88+
```python
89+
table_name = cocoindex.utils.get_target_storage_default_name(text_embedding_flow, "doc_embeddings")
90+
query = f"SELECT filename, text FROM {table_name} ORDER BY embedding <=> %s::vector LIMIT 5"
91+
...
92+
```
93+
94+
</TabItem>
95+
</Tabs>
96+

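
As a rough sketch of how the retrieved table name might be used, the helpers below build a top-k similarity query and a vector literal for a Postgres table with a pgvector column. Both `build_top_k_query` and `to_vector_literal` are hypothetical names introduced here, and the column names (`filename`, `text`, `embedding`) are taken from the snippet in `docs/docs/query.mdx`. `<=>` is pgvector's cosine distance operator, so a smaller value means a closer match and the default ascending order returns the nearest rows first.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. "[0.5,1.0]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_top_k_query(table_name: str) -> str:
    """Build a parameterized top-k query (hypothetical helper).

    The table name is interpolated into the SQL text, so it must come from
    a trusted source (for example the default name reported by CocoIndex),
    never from user input. The vector and the limit stay as %s placeholders
    for the database driver to bind.
    """
    return (
        f"SELECT filename, text, embedding <=> %s::vector AS distance "
        f"FROM {table_name} ORDER BY distance LIMIT %s"
    )
```

With a driver that uses `%s` placeholders, the query string would be executed with `(to_vector_literal(query_vector), top_k)` as its parameters.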
docs/sidebars.ts

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,11 @@ const sidebars: SidebarsConfig = {
4444
'ai/llm',
4545
],
4646
},
47+
{
48+
type: 'doc',
49+
id: 'query',
50+
label: 'Query Support',
51+
},
4752
{
4853
type: 'category',
4954
label: 'About',

examples/image_search_example/main.py

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ def image_object_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope:
9494
# --- CocoIndex initialization on startup ---
9595
@app.on_event("startup")
9696
def startup_event():
97-
settings = cocoindex.setting.Settings.from_env()
97+
settings = cocoindex.Settings.from_env()
9898
cocoindex.init(settings)
9999
app.state.query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
100100
name="ImageObjectSearch",

examples/product_recommendation/main.py

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ class ProductTaxonomyInfo:
6161
complementary_taxonomies: list[ProductTaxonomy]
6262

6363
@cocoindex.op.function(behavior_version=2)
64-
def extract_product_info(product: cocoindex.typing.Json, filename: str) -> ProductInfo:
64+
def extract_product_info(product: cocoindex.Json, filename: str) -> ProductInfo:
6565
# Print markdown for LLM to extract the taxonomy and complimentary taxonomy
6666
return ProductInfo(
6767
id=f"{filename.removesuffix('.json')}",

python/cocoindex/__init__.py

Lines changed: 5 additions & 4 deletions
@@ -2,14 +2,15 @@
22
Cocoindex is a framework for building and running indexing pipelines.
33
"""
44
from . import functions, query, sources, storages, cli, utils
5-
from .flow import FlowBuilder, DataScope, DataSlice, Flow, flow_def, transform_flow
5+
6+
from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
7+
from .flow import FlowBuilder, DataScope, DataSlice, Flow, transform_flow
8+
from .flow import flow_def, flow_def as flow
69
from .flow import EvaluateAndDumpOptions, GeneratedField
710
from .flow import update_all_flows_async, FlowLiveUpdater, FlowLiveUpdaterOptions
11+
from .lib import init, start_server, stop, main_fn
812
from .llm import LlmSpec, LlmApiType
913
from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
10-
from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
11-
from .lib import *
1214
from .setting import DatabaseConnectionSpec, Settings, ServerSettings
1315
from .setting import get_app_namespace
14-
from ._engine import OpArgSchema
1516
from .typing import Float32, Float64, LocalDateTime, OffsetDateTime, Range, Vector, Json

python/cocoindex/flow.py

Lines changed: 12 additions & 5 deletions
@@ -718,9 +718,11 @@ async def _build_flow_info_async(self) -> TransformFlowInfo:
718718
for (param_name, param), param_type in zip(sig.parameters.items(), self._flow_arg_types):
719719
if param.kind not in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
720720
inspect.Parameter.KEYWORD_ONLY):
721-
raise ValueError(f"Parameter {param_name} is not a parameter can be passed by name")
722-
engine_ds = flow_builder_state.engine_flow_builder.add_direct_input(
723-
param_name, encode_enriched_type(param_type))
721+
raise ValueError(f"Parameter `{param_name}` is not a parameter can be passed by name")
722+
encoded_type = encode_enriched_type(param_type)
723+
if encoded_type is None:
724+
raise ValueError(f"Parameter `{param_name}` has no type annotation")
725+
engine_ds = flow_builder_state.engine_flow_builder.add_direct_input(param_name, encoded_type)
724726
kwargs[param_name] = DataSlice(_DataSliceState(flow_builder_state, engine_ds))
725727

726728
output = self._flow_fn(**kwargs)
@@ -780,8 +782,13 @@ def _transform_flow_wrapper(fn: Callable[..., DataSlice[T]]):
780782
for (param_name, param) in sig.parameters.items():
781783
if param.kind not in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
782784
inspect.Parameter.KEYWORD_ONLY):
783-
raise ValueError(f"Parameter {param_name} is not a parameter can be passed by name")
784-
arg_types.append(_get_data_slice_annotation_type(param.annotation))
785+
raise ValueError(f"Parameter `{param_name}` is not a parameter can be passed by name")
786+
value_type_annotation = _get_data_slice_annotation_type(param.annotation)
787+
if value_type_annotation is None:
788+
raise ValueError(
789+
f"Parameter `{param_name}` for {fn} has no value type annotation. "
790+
"Please use `cocoindex.DataSlice[T]` where T is the type of the value.")
791+
arg_types.append(value_type_annotation)
785792

786793
_transform_flow = TransformFlow(fn, arg_types)
787794
functools.update_wrapper(_transform_flow, fn)
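
The validation added in the hunk above can be approximated in isolation. The sketch below is not the CocoIndex implementation: `DataSlice` is a stand-in generic class, and `value_type_of` and `collect_value_types` are hypothetical names that only mimic the two checks shown in the diff (the parameter must be passable by name, and its annotation must be `DataSlice[T]` with a concrete `T`).

```python
import inspect
from typing import Callable, Generic, TypeVar, get_args, get_origin

T = TypeVar("T")


class DataSlice(Generic[T]):
    """Stand-in for cocoindex.DataSlice, used only for this sketch."""


def value_type_of(annotation: object) -> object:
    """Return T for an annotation of the form DataSlice[T], else None."""
    if get_origin(annotation) is DataSlice:
        args = get_args(annotation)
        return args[0] if args else None
    return None


def collect_value_types(fn: Callable[..., object]) -> list[object]:
    """Check every parameter of fn and collect its value type."""
    value_types = []
    for name, param in inspect.signature(fn).parameters.items():
        # *args, **kwargs and positional-only parameters cannot be bound by name.
        if param.kind not in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                              inspect.Parameter.KEYWORD_ONLY):
            raise ValueError(f"Parameter `{name}` cannot be passed by name")
        value_type = value_type_of(param.annotation)
        # Missing annotations and bare DataSlice both end up here.
        if value_type is None:
            raise ValueError(
                f"Parameter `{name}` of {fn.__name__} has no value type "
                "annotation; use DataSlice[T]")
        value_types.append(value_type)
    return value_types
```

Failing early with the parameter name in the message, as the commit does, points the user at the exact annotation to fix instead of surfacing a less specific error later in flow construction.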
