Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
🦙 echoOLlama: A real-time voice AI platform powered by local LLMs. Features WebSocket streaming, voice interactions, and OpenAI API compatibility. Built with FastAPI, Redis, and PostgreSQL. Perfect for private AI conversations and custom voice assistants.
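For orientation, here is a minimal sketch of the kind of FastAPI WebSocket streaming endpoint such a platform is built around; the `/ws` path, message format, and chunked echo are illustrative assumptions, not echoOLlama's actual code:

```python
# Minimal sketch of a FastAPI WebSocket endpoint that streams text back in chunks.
# Hypothetical: the /ws route and chunked echo stand in for a real local-LLM call.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws")
async def chat_stream(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            prompt = await websocket.receive_text()
            # A real server would stream tokens from a local LLM here.
            for token in prompt.split():
                await websocket.send_text(token + " ")
            await websocket.send_text("[DONE]")
    except WebSocketDisconnect:
        pass  # client closed the connection
```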
Build a simple, basic multimodal large model from scratch. 🤖
A multimodal image search engine built on the GME model, capable of handling diverse input types. Whether you're querying with text, images, or both, it provides powerful and flexible image retrieval for arbitrary inputs. Perfect for research and demos.
🎉 [ACL 2025] The code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning" in PyTorch.
Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
This repository showcases a collection of innovative projects by Charan H U, focusing on cutting-edge technologies such as facial emotion recognition, fitness tracking, and multi-model applications. Each project demonstrates practical implementations of advanced AI/ML techniques, making it a valuable resource for developers and researchers.
An AI multi-model application using RAG and LangChain.
Evaluating ‘Graphical Perception’ with Multimodal Large Language Models
Elarova — A smart, multimodal research assistant designed to help students by combining speech, text, and other input modes for efficient academic research and study support. Powered by state-of-the-art speech recognition, text-to-speech, and AI models, including meta-llama/llama-4-scout-17b-16e-instruct, with an easy-to-use Gradio web interface.
ElaMath is a smart, voice-enabled math assistant that helps students solve and understand math problems using both spoken questions and images. It is powered by the multimodal meta-llama/llama-4-scout-17b-16e-instruct model via the Groq API, combined with Whisper for speech recognition and ElevenLabs/gTTS for natural voice responses.
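For context, a minimal sketch of a multimodal (text + image) request to that model through the Groq Python SDK; the prompt and image URL are placeholders, and this is not ElaMath's actual code:

```python
# Minimal sketch of a text+image chat completion via the Groq SDK (pip install groq).
# Assumes GROQ_API_KEY is set; the image URL below is a hypothetical placeholder.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Solve the math problem in this image, step by step."},
            {"type": "image_url", "image_url": {"url": "https://example.com/problem.png"}},
        ],
    }],
)
print(completion.choices[0].message.content)
```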
Uses the MAIRA-2 multimodal transformer to generate grounded or non-grounded radiology reports from chest X-rays.
Multi-Modal Healthcare Assistant
A tool that uses a multimodal LLM to generate testing instructions for any digital product's features, based on screenshots.
This repo contains an integration of LangChain with the Google Gemini LLM.
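As a rough illustration of what such an integration typically looks like, a minimal sketch using the langchain-google-genai package; the model name and prompt are assumptions, not taken from the repo:

```python
# Minimal sketch of calling Google Gemini through LangChain
# (pip install langchain-google-genai; assumes GOOGLE_API_KEY is set).
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # hypothetical model choice

response = llm.invoke("Summarize the benefits of multimodal LLMs in two sentences.")
print(response.content)
```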