Live Demo: https://do-me.github.io/semantic-segment-explorer/
In-browser tool to explore semantic similarity by generating overlapping text segments or N-grams and querying them using Transformers.js. This application allows you to input a source text, which is then broken down into numerous overlapping segments. Each unique segment is embedded using minishlab/potion-retrieval-32M
, and you can then query these segments to find those most semantically similar to your query.
The main motivation behind this app is to experiment with different text chunking/segmentation strategies and observe how the semantic similarity results vary, especially with segments of different lengths.
- Flexible Text Input: Paste any source text to analyze.
- Advanced Segmentation:
- With Sentence Boundaries (Default): Text is split into sentences, and segments (all contiguous word combinations) are generated within each sentence.
- Without Sentence Boundaries: Segments are generated from all contiguous word combinations across the entire text.
- In-Browser Embeddings: Uses Hugging Face Transformers.js with the minishlab/potion-retrieval-32M model to generate embeddings directly in the user's browser. No server-side processing needed for the core AI!
- Semantic Querying: Find text segments most semantically similar to your input query.
- Query-As-You-Type: (Optional) Get instant search results as you type your query.
- Input Text: The user provides a source text.
- Segmentation:
- The text is processed based on the "Use Sentence Boundaries" setting.
- With Sentence Boundaries: The text is first split into individual sentences. Then, for each sentence, all possible contiguous sub-sequences of words are generated as segments.
- Without Sentence Boundaries: All possible contiguous sub-sequences of words are generated from the entire input text.
- Duplicate segments are removed.
- Embedding: Each unique segment is converted into a numerical vector (embedding) using the minishlab/potion-retrieval-32M model running via Transformers.js in the browser.
- Indexing: These embeddings are stored locally in the browser's memory.
- Querying:
- The user inputs a query.
- The query is also embedded using the same model.
- The cosine similarity between the query embedding and all indexed segment embeddings is calculated.
- The top N most similar segments are displayed as results.
- With Sentence Boundaries (default): The text is first split into S sentences. Segments are generated within each sentence. If Ns is the average number of words per sentence, the number of unique segments is roughly Σ(Ns,i*(Ns,i+1)/2) for each sentence i (though duplicates across sentences or within are removed). This is generally less than O(N2) for the total number of words N.
- Without Sentence Boundaries: The number of potential segments grows quadratically with the total number of words (N) in your source text, approximately N * (N+1) / 2. After removing duplicates, the actual count may be lower but can still be substantial. This is O(N2) in terms of combinations generated.
- Warning: For long texts (e.g., >1000 words), using this option can lead to a very large number of segments, potentially crashing the browser tab due to memory or processing limits.
- Frontend: HTML, Tailwind CSS
- JavaScript: Vanilla JS (ES Modules)
- Machine Learning: Hugging Face Transformers.js
- Embedding Model: minishlab/potion-retrieval-32M (a compact and efficient model suitable for in-browser use)
- Visit the Live Demo: https://do-me.github.io/semantic-segment-explorer/
- Input Source Text: Paste your text into the main textarea.
- Choose Segmentation Strategy: Decide whether to use sentence boundaries (checked by default, recommended for longer texts).
- Generate & Index: Click the "Generate & Index Segments" button. Wait for the processing to complete (progress will be shown).
- Query: Once indexing is done, the query section will appear. Type your search query.
- Search: Click "Search" or enable "Query-As-You-Type" for instant results.
- View Results: Semantically similar segments from your source text will be displayed.
This project was primarily created to experiment with different text segmentation (chunking) techniques for semantic search. Segment length and boundaries do affect retrieval quality obviously but it's cool to see how sometimes longer segments are more similar to a query.
If you're interested in semantic search or similar in-browser AI applications, you might also like:
- SemanticFinder: Another in-browser semantic search tool focusing on sentence-level similarity.
- Guerilla Semantic Search Tutorial: An article discussing approaches to client-side semantic search.
- My other semantic search projects on GitHub: A collection of related experiments and tools.
This demo is mainly experimental and I don't intend on developing it much further. Still, your contributions are more than welcome!
Distributed under the MIT License.
- Xenova and Hugging Face for their incredible
transformers.js
library and model hosting. - The creators of the minishlab/potion-retrieval-32M for creating super fast static embeddings
- Tailwind CSS
- Gemini 2.5 Pro for quickly creating the UI skeleton. The inferencing code actually stems from here: https://www.reddit.com/r/LocalLLaMA/comments/1glwbsq/using_highthroughput_model2vecpotion_embedding/ & here: MinishLab/model2vec#75. Without these snippets Gemini did not succeed in creating a fully working app.