diff --git a/site/content/3.13/about-arangodb/features/community-edition.md b/site/content/3.13/about-arangodb/features/community-edition.md index 9ee6ac647d..1d8f118af9 100644 --- a/site/content/3.13/about-arangodb/features/community-edition.md +++ b/site/content/3.13/about-arangodb/features/community-edition.md @@ -149,7 +149,7 @@ see [arangodb.com/community-server/](https://www.arangodb.com/community-server/) {{% /comment %}} {{% comment %}} Experimental feature -- [**Vector search**](#TODO): +- [**Vector search**](../../index-and-search/indexing/working-with-indexes/vector-indexes.md): Find items with similar properties by comparing vector embeddings generated by machine learning models. {{% /comment %}} diff --git a/site/content/3.13/aql/functions/vector.md b/site/content/3.13/aql/functions/vector.md new file mode 100644 index 0000000000..a34f54d0da --- /dev/null +++ b/site/content/3.13/aql/functions/vector.md @@ -0,0 +1,95 @@ +--- +title: Vector search functions in AQL +menuTitle: Vector +weight: 60 +description: >- + The functions for vector search let you utilize indexed vector embeddings to + quickly find semantically similar documents +--- +To use vector search, you need to have vector embeddings stored in documents +and the attribute that stores them needs to be indexed by a +[vector index](../../index-and-search/indexing/working-with-indexes/vector-indexes.md). + +{{< warning >}} +The vector index is an experimental feature that you need to enable for the +ArangoDB server with the `--experimental-vector-index` startup option. +Once enabled for a deployment, it cannot be disabled anymore because it +permanently changes how the data is managed by the RocksDB storage engine +(it adds an additional column family). +{{< /warning >}} + +{{< comment >}}TODO: Add DSS docs or already mention because of ArangoGraph with ML? +You can calculate vector embeddings using ArangoDB's GraphML capabilities or +external tools. 
+{{< /comment >}} + +## Distance functions + +In order to utilize a vector index, you need to use one of the following +vector distance functions in a query, sort by this distance, and specify the +maximum number of similar documents to retrieve with a `LIMIT` operation. +Example: + +```aql +FOR doc IN coll + SORT APPROX_NEAR_L2(doc.vector, @q) + LIMIT 5 + RETURN doc +``` + +The `@q` bind variable needs to be a vector (array of numbers) with the dimension +as specified in the vector index. It defines the point at which to look for +neighbors. The `LIMIT` operation sets how many neighbors to retrieve (`5` in this case). + +The sorting order needs to be **ascending for the L2 metric** (shown above) and +**descending for the cosine metric**: + +```aql +FOR doc IN coll + SORT APPROX_NEAR_COSINE(doc.vector, @q) DESC + LIMIT 5 + RETURN doc +``` + +### APPROX_NEAR_COSINE() + +`APPROX_NEAR_COSINE(vector1, vector2, options) → dist` + +Retrieve the approximate distance using the cosine metric, accelerated by a +matching vector index. + +- **vector1** (array of numbers): The first vector. Either this parameter or + `vector2` needs to reference a stored attribute holding the vector embedding, + like `doc.vector`. +- **vector2** (array of numbers): The second vector. Either this parameter or + `vector1` needs to reference a stored attribute holding the vector embedding. +- **options** (object, _optional_): + - **nProbe** (number, _optional_): How many neighboring centroids to consider + for the search results. The larger the number, the slower the search but the + better the search results. If not specified, the `defaultNProbe` value of + the vector index is used. +- returns **dist** (number): The approximate cosine distance between both vectors. + + + +### APPROX_NEAR_L2() + +`APPROX_NEAR_L2(vector1, vector2, options) → dist` + +Retrieve the approximate distance using the L2 (Euclidean) metric, accelerated +by a matching vector index. 
+ +- **vector1** (array of numbers): The first vector. Either this parameter or + `vector2` needs to reference a stored attribute holding the vector embedding, + like `doc.vector`. +- **vector2** (array of numbers): The second vector. Either this parameter or + `vector1` needs to reference a stored attribute holding the vector embedding. +- **options** (object, _optional_): + - **nProbe** (number, _optional_): How many neighboring centroids to consider + for the search results. The larger the number, the slower the search but the + better the search results. If not specified, the `defaultNProbe` value of + the vector index is used. +- returns **dist** (number): The approximate L2 (Euclidean) distance between + both vectors. + + diff --git a/site/content/3.13/index-and-search/indexing/basics.md b/site/content/3.13/index-and-search/indexing/basics.md index d43d950350..afb4b86925 100644 --- a/site/content/3.13/index-and-search/indexing/basics.md +++ b/site/content/3.13/index-and-search/indexing/basics.md @@ -369,6 +369,16 @@ the `GEO_DISTANCE()` function, or if `FILTER` conditions with `GEO_CONTAINS()` or `GEO_INTERSECTS()` are used. It will not be used for other types of queries or conditions. +## Vector Index + +Vector indexes let you index vector embeddings stored in documents. Such +vectors are arrays of numbers that represent the meaning and relationships of +data numerically. You can quickly find a given number of semantically +similar documents by searching for close neighbors in a high-dimensional +vector space. + +See [Vector Indexes](working-with-indexes/vector-indexes.md) for details. 
+ ## Fulltext Index {{< warning >}} diff --git a/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md b/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md index fc97fc3c92..97f3d8206f 100644 --- a/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md +++ b/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md @@ -106,6 +106,18 @@ different usage scenarios: of the Earth. It supports points, lines, and polygons. See [Geo-Spatial Indexes](working-with-indexes/geo-spatial-indexes.md). +- **Vector index**: You can find semantically similar documents quickly with + vector indexes. It is required to calculate and store vector embeddings first, + and you may need to update the embeddings when adding new documents. + Vector indexes cannot be used for other types of searches, like equality and + range queries or full-text search. + + Vector indexes are utilized via special distance functions, in combination with + a `SORT` operation to sort by the distance, and a `LIMIT` operation to define + how many similar documents to retrieve. + + See [Vector indexes](working-with-indexes/vector-indexes.md) for details. + - **fulltext index**: a fulltext index can be used to index all words contained in a specific attribute of all documents in a collection. Only words with a (specifiable) minimum length are indexed. 
Word tokenization is done using diff --git a/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md b/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md new file mode 100644 index 0000000000..aa503fc882 --- /dev/null +++ b/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md @@ -0,0 +1,156 @@ +--- +title: Vector indexes +menuTitle: Vector Indexes +weight: 40 +description: >- + You can index vector embeddings to allow queries to quickly find semantically + similar documents +--- +Vector indexes let you index vector embeddings stored in documents. Such +vectors are arrays of numbers that represent the meaning and relationships of +data numerically. You can quickly find a given number of semantically +similar documents by searching for close neighbors in a high-dimensional +vector space. + +The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/) +to support L2 and cosine metrics. The index used is `IndexIVFFlat`, the quantizer +for L2 is `IndexFlatL2`, and the cosine metric uses `IndexFlatIP`, where vectors are +normalized before insertion and search. + +If no relevant data is found in the list, Faiss might not produce all of the +top K requested results. In that case, only the results that were found are returned. + +{{< warning >}} +The vector index is an experimental feature that you need to enable for the +ArangoDB server with the `--experimental-vector-index` startup option. +Once enabled for a deployment, it cannot be disabled anymore because it +permanently changes how the data is managed by the RocksDB storage engine +(it adds an additional column family). +{{< /warning >}} + +### How to use vector indexes + +Creating a vector index triggers training of the index on the existing data. This is a limitation: the attribute to index must already contain vector data when you create the index. 
+The number of training points depends on the `nLists` parameter. A bigger `nLists` value produces more accurate results but increases the training time necessary to build the index. + + +## Vector index properties + +- **name** (_optional_): A user-defined name for the index for easier + identification. If not specified, a name is automatically generated. +- **type**: The index type. Needs to be `"vector"`. +- **fields** (array of strings): A list with a single attribute path to specify + where the vector embedding is stored in each document. The vector data needs + to be populated before creating the index. + + If you want to index another vector embedding attribute, you need to create a + separate vector index. +- **params**: The parameters as used by the Faiss library. + - **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation. + - **dimension** (number): The vector dimension. The attribute to index needs to + have this many elements in the array that stores the vector embedding. + - **nLists** (number): The number of centroids in the index. What value to choose + depends on the data distribution and chosen metric. According to + [the Faiss library paper](https://arxiv.org/abs/2401.08281), it should be + around `15 * N` where `N` is the number of documents in the collection, + or the number of documents in the shard for cluster deployments. + - **defaultNProbe** (number, _optional_): How many neighboring centroids to + consider for the search results by default. The larger the number, the slower + the search but the better the search results. The default is `1`. + - **trainingIterations** (number, _optional_): The number of iterations in the + training process. The default is `25`. Smaller values lead to a faster index + creation but may yield worse search results. 
+ - **factory** (string, _optional_): You can specify a factory string to pass + through to the underlying Faiss library, allowing you to combine different + options, for example: + - `"IVF100_HNSW10,Flat"` + - `"IVF100,SQ4"` + - `"IVF10_HNSW5,Flat"` + - `"IVF100_HNSW5,PQ256x16"` + The base index must be an IVF index to work with ArangoDB. For more information on + how to create these custom indexes, see the + [Faiss Wiki](https://github.com/facebookresearch/faiss/wiki/The-index-factory). + +## Interfaces + +### Create a vector index + +{{< tabs "interfaces" >}} + +{{< tab "Web interface" >}} +1. In the **Collections** section, click the name or row of the desired collection. +2. Go to the **Indexes** tab. +3. Click **Add index**. +4. Select **Vector** as the **Type**. +5. Enter the name of the attribute that holds the vector embeddings into **Fields**. +6. Set the parameters for the vector index, see [Vector index properties](#vector-index-properties). +7. Optionally give the index a user-defined name. +8. Click **Create**. +{{< /tab >}} + +{{< tab "arangosh" >}} +```js +db.coll.ensureIndex({ + name: "vector_l2", + type: "vector", + fields: ["embedding"], + params: { + metric: "l2", + dimension: 544, + nLists: 100, + defaultNProbe: 1, + trainingIterations: 25 + } +}); +``` +{{< /tab >}} + +{{< tab "cURL" >}} +```sh +curl -d '{"name":"vector_l2","type":"vector","fields":["embedding"],"params":{"metric":"l2","dimension":544,"nLists":100,"defaultNProbe":1,"trainingIterations":25}}' "http://localhost:8529/_db/mydb/_api/index?collection=coll" +``` +{{< /tab >}} + +{{< tab "JavaScript" >}} +```js +const info = await coll.ensureIndex({ + name: "vector_l2", + type: "vector", + fields: ["embedding"], + params: { + metric: "l2", + dimension: 544, + nLists: 100, + defaultNProbe: 1, + trainingIterations: 25 + } +}); +``` +{{< /tab >}} + +{{< tab "Go" >}} +The Go driver does not support vector indexes yet. 
+{{< /tab >}} + +{{< tab "Java" >}} +The Java driver does not support vector indexes yet. +{{< /tab >}} + +{{< tab "Python" >}} +```py +info = coll.add_index({ + "name": "vector_l2", + "type": "vector", + "fields": ["embedding"], + "params": { + "metric": "l2", + "dimension": 544, + "nLists": 100, + "defaultNProbe": 1, + "trainingIterations": 25 + } +}) +``` +{{< /tab >}} + +{{< /tabs >}}
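
The sorting rule described for the distance functions (ascending for the L2 metric, descending for the cosine metric) can be illustrated with an exact, brute-force computation. The following is only a sketch for intuition. It does not use ArangoDB or Faiss, the helper names (`l2_distance`, `cosine_similarity`, `nearest_l2`, `nearest_cosine`) and the sample documents are hypothetical, and a real vector index returns approximate rather than exact results.

```python
import math

def l2_distance(a, b):
    # Euclidean distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Inner product of the normalized vectors: larger means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_l2(docs, query, limit):
    # Exact counterpart of: SORT <L2 distance> (ascending) LIMIT limit
    return sorted(docs, key=lambda d: l2_distance(d["vector"], query))[:limit]

def nearest_cosine(docs, query, limit):
    # Exact counterpart of: SORT <cosine value> DESC LIMIT limit
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(d["vector"], query),
        reverse=True,
    )
    return ranked[:limit]

# Hypothetical sample data: three documents with 2-dimensional embeddings.
docs = [
    {"_key": "a", "vector": [1.0, 0.0]},
    {"_key": "b", "vector": [0.0, 1.0]},
    {"_key": "c", "vector": [3.0, 0.5]},
]
query = [1.0, 0.1]

l2_keys = [d["_key"] for d in nearest_l2(docs, query, 2)]
cosine_keys = [d["_key"] for d in nearest_cosine(docs, query, 2)]
print(l2_keys, cosine_keys)
```

With this sample data, the two metrics rank the documents differently: document `c` points in almost the same direction as the query but is far away from it, so it ranks first for cosine and last for L2. This is one reason why the metric of the index and the distance function in the query need to match.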