Semantic Segment Explorer

Live Demo: https://do-me.github.io/semantic-segment-explorer/

In-browser tool to explore semantic similarity by generating overlapping text segments or N-grams and querying them using Transformers.js. This application allows you to input a source text, which is then broken down into numerous overlapping segments. Each unique segment is embedded using minishlab/potion-retrieval-32M, and you can then query these segments to find those most semantically similar to your query.

The main motivation behind this app is to experiment with different text chunking/segmentation strategies and observe how the semantic similarity results vary, especially with segments of different lengths.

📑 Key Features

Flexible Text Input: Paste any source text to analyze.
Advanced Segmentation:
- With Sentence Boundaries (Default): Text is split into sentences, and segments (all contiguous word combinations) are generated within each sentence.
- Without Sentence Boundaries: Segments are generated from all contiguous word combinations across the entire text.
In-Browser Embeddings: Uses Hugging Face Transformers.js with the minishlab/potion-retrieval-32M model to generate embeddings directly in the user's browser. No server-side processing needed for the core AI!
Semantic Querying: Find text segments most semantically similar to your input query.
Query-As-You-Type: (Optional) Get instant search results as you type your query.
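
The two segmentation modes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in main.js: the function names (`splitWords`, `splitSentences`, `contiguousSegments`, `generateSegments`) are hypothetical, and the sentence splitter is a deliberately naive assumption (split after `.`, `!` or `?`).

```javascript
// Hypothetical helpers for illustration; names are not taken from main.js.

// Split a string into words on whitespace.
function splitWords(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Naive sentence splitter (assumption): a sentence is a run of characters
// that ends with one or more of ". ! ?" or with the end of the text.
function splitSentences(text) {
  return (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter(Boolean);
}

// All contiguous word sub-sequences of a word list, joined by single spaces.
function contiguousSegments(words) {
  const out = [];
  for (let start = 0; start < words.length; start++) {
    for (let end = start + 1; end <= words.length; end++) {
      out.push(words.slice(start, end).join(" "));
    }
  }
  return out;
}

// Generate unique segments, with or without sentence boundaries.
function generateSegments(text, useSentenceBoundaries = true) {
  const units = useSentenceBoundaries ? splitSentences(text) : [text];
  const unique = new Set(); // a Set drops duplicate segments
  for (const unit of units) {
    for (const segment of contiguousSegments(splitWords(unit))) {
      unique.add(segment);
    }
  }
  return [...unique];
}
```

For the input `"a b. c d"`, this sketch yields 6 segments with sentence boundaries and 10 without, because the second mode also produces segments that span the sentence break, such as `"b. c"`.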

🤔 How It Works

Input Text: The user provides a source text.
Segmentation:
- The text is processed based on the "Use Sentence Boundaries" setting.
- With Sentence Boundaries: The text is first split into individual sentences. Then, for each sentence, all possible contiguous sub-sequences of words are generated as segments.
- Without Sentence Boundaries: All possible contiguous sub-sequences of words are generated from the entire input text.
- Duplicate segments are removed.
Embedding: Each unique segment is converted into a numerical vector (embedding) using the minishlab/potion-retrieval-32M model running via Transformers.js in the browser.
Indexing: These embeddings are stored locally in the browser's memory.
Querying:
- The user inputs a query.
- The query is also embedded using the same model.
- The cosine similarity between the query embedding and all indexed segment embeddings is calculated.
- The top N most similar segments are displayed as results.
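
The querying step can be sketched as below, assuming the embeddings are already available as plain numeric arrays. The `index` shape (`{ text, embedding }` objects) and the function names are hypothetical and chosen for illustration only; the app's actual data structures may differ.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns 0 if either vector has zero length (norm), to avoid dividing by 0.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every indexed segment against the query embedding and
// return the topN highest-scoring segments, best first.
// `index` is a hypothetical array of { text, embedding } objects.
function topSegments(queryEmbedding, index, topN = 10) {
  return index
    .map(({ text, embedding }) => ({
      text,
      score: cosineSimilarity(queryEmbedding, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topN);
}
```

This is a brute-force scan over all segments, which is simple but scales linearly with the number of indexed segments per query.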

Segment Generation Complexity:

With Sentence Boundaries (default): The text is first split into S sentences, and segments are generated within each sentence. If N_s,i is the number of words in sentence i, the number of segments is at most Σ_i N_s,i * (N_s,i + 1) / 2, summed over all S sentences; duplicates within or across sentences are removed, so the actual count may be lower. The count is still quadratic in the length of each individual sentence, but for a text with many sentences it is far smaller than N * (N+1) / 2 for the total number of words N.
Without Sentence Boundaries: The number of potential segments grows quadratically with the total number of words (N) in your source text, approximately N * (N+1) / 2. After removing duplicates, the actual count may be lower but can still be substantial. This is O(N²) in terms of combinations generated.
- Warning: For long texts (e.g., >1000 words), using this option can lead to a very large number of segments, potentially crashing the browser tab due to memory or processing limits.
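
To make the difference concrete, the two upper bounds can be computed as in this small sketch. The function names are hypothetical, and the numbers are counts before duplicate removal.

```javascript
// Number of contiguous word sub-sequences in a sequence of n words: n(n+1)/2.
function segmentCount(n) {
  return (n * (n + 1)) / 2;
}

// Upper bound with sentence boundaries: sum the per-sentence counts.
// `sentenceLengths` is an array with the word count of each sentence.
function segmentCountWithBoundaries(sentenceLengths) {
  return sentenceLengths.reduce((sum, n) => sum + segmentCount(n), 0);
}
```

For a 1000-word text, `segmentCount(1000)` gives 500500 candidate segments without sentence boundaries. If the same text consists of 50 sentences of 20 words each, `segmentCountWithBoundaries(Array(50).fill(20))` gives 10500, roughly 48 times fewer.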

🛠️ Technical Stack

Frontend: HTML, Tailwind CSS
JavaScript: Vanilla JS (ES Modules)
Machine Learning: Hugging Face Transformers.js
Embedding Model: minishlab/potion-retrieval-32M (a compact and efficient model suitable for in-browser use)

🚀 Getting Started / How to Use

Visit the Live Demo: https://do-me.github.io/semantic-segment-explorer/
Input Source Text: Paste your text into the main textarea.
Choose Segmentation Strategy: Decide whether to use sentence boundaries (checked by default, recommended for longer texts).
Generate & Index: Click the "Generate & Index Segments" button. Wait for the processing to complete (progress will be shown).
Query: Once indexing is done, the query section will appear. Type your search query.
Search: Click "Search" or enable "Query-As-You-Type" for instant results.
View Results: Semantically similar segments from your source text will be displayed.

💡 Motivation & Related Work

This project was primarily created to experiment with different text segmentation (chunking) techniques for semantic search. Segment length and boundaries clearly affect retrieval quality, but it's interesting to see that longer segments are sometimes more similar to a query.

If you're interested in semantic search or similar in-browser AI applications, you might also like:

SemanticFinder: Another in-browser semantic search tool focusing on sentence-level similarity.
Guerilla Semantic Search Tutorial: An article discussing approaches to client-side semantic search.
My other semantic search projects on GitHub: A collection of related experiments and tools.

🤝 Contributing

This demo is mainly experimental and I don't intend to develop it much further. Still, your contributions are more than welcome!

📜 License

Distributed under the MIT License.

🙏 Acknowledgements

Xenova and Hugging Face for their incredible transformers.js library and model hosting.
The creators of minishlab/potion-retrieval-32M for their super-fast static embeddings.
Tailwind CSS
Gemini 2.5 Pro for quickly creating the UI skeleton. The inference code stems from https://www.reddit.com/r/LocalLLaMA/comments/1glwbsq/using_highthroughput_model2vecpotion_embedding/ and MinishLab/model2vec#75. Without these snippets, Gemini did not succeed in creating a fully working app.
