Description
Discussed in #4350
Originally posted by cmp-nct December 7, 2023
I've just seen CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR ability and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details and background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before.
I believe this is what we need. It has similarities to llava but adds an additional expert model, so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
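On the expert model mentioned above: as far as I understand the paper, each layer gets a second set of weights that is used only for image tokens, while text tokens keep the original LLM weights. The toy sketch below shows just that per-token routing for a single projection. It is pure Python, all names are hypothetical, and it is not the CogVLM or llama.cpp code.

```python
def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def routed_projection(hidden, is_image, w_text, w_image):
    """Project each token with the image weights or the text weights."""
    return [
        matvec(w_image if img else w_text, h)
        for h, img in zip(hidden, is_image)
    ]


# Toy input: 2 image tokens followed by 1 text token, hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]
w_text = [[1.0, 0.0], [0.0, 1.0]]   # identity, stands in for the original LLM weights
w_image = [[2.0, 0.0], [0.0, 2.0]]  # stands in for the extra expert weights

out = routed_projection(hidden, is_image, w_text, w_image)
print(out)  # [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

The point for an implementation is that the same layer needs two weight sets plus a token-type mask, which a plain llava-style graph does not have.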
Definitely worth a close look
URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf
Look at this example: I asked for a JSON representation. It's not cherry-picked; it can actually extract all of the content with minimal errors:
The image appears to be a driver's license from California. Here's a JSON representation:
{
  "License_Number": "DL 11234568",
  "Expiration_Date": "08/31/2014",
  "Class": "C",
  "License_Holder": "Iva Cardholder",
  "Address": "2570 24th Street, Anytown, CA 95818",
  "Veteran": "YES",
  "Sex": "F",
  "Hair_Color": "Brown",
  "Eyes": "Brown",
  "Height": "125 lb",
  "Weight": "125 lb",
  "Issue_Date": "08/31/2009",
  "Birth_Date": "08/31/1977",
  "VISUALS": {
    "left": "Iva Cardholder",
    "right": "Iva Cardholder",
    "top": "California Bear",
    "bottom": "Gold Rush Miner"
  }
}
Here is what QWEN-VL does:
{
  "id": "123456",
  "issue_date": "08/31/2014",
  "expiration_date": "08/31/2024",
  "cardholder": {
    "name": "Ina Cordero",
    "gender": "F",
    "race": "A",
    "eye_color": "B",
    "hair_color": "R",
    "eyebrow_color": "N",
    "height": "5'6"",
    "weight": "126 lb"
  },
  "class": "C",
  "type": "DRIVER LICENSE",
  "state": "CA",
  "country": "USA"
}
Here is llava1.5-13B:
{
  "image": "https://i.imgur.com/39vZv.jpg",
  "description": "A California driver's license with a woman's picture on it. The license is blue and white and has a picture of a bear on it. The license number is 11324567890."
}
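One cheap way to make this comparison a bit more objective is to check whether each reply even parses as JSON. Below is a minimal sketch using only Python's standard `json` module; `check_json_reply` is a hypothetical helper, and the two strings are shortened excerpts of the replies quoted above. The Qwen-VL excerpt keeps the stray quote after `5'6`, which makes it invalid.

```python
import json


def check_json_reply(reply: str):
    """Return (True, parsed_object) or (False, error_message)."""
    # Replies often wrap the JSON in prose, so cut from the first '{' to the last '}'.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return False, "no JSON object found"
    try:
        return True, json.loads(reply[start:end + 1])
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"


# Shortened excerpts of the replies quoted above.
cogvlm_reply = """The image appears to be a driver's license from California. Here's a JSON representation:
{"License_Number": "DL 11234568", "Class": "C", "VISUALS": {"top": "California Bear"}}"""
qwen_reply = """{"id": "123456", "cardholder": {"height": "5'6"", "weight": "126 lb"}}"""

for name, reply in [("CogVLM", cogvlm_reply), ("Qwen-VL", qwen_reply)]:
    ok, detail = check_json_reply(reply)
    print(name, "valid" if ok else f"INVALID ({detail})")
```

This only checks syntax. Whether the extracted fields are correct still has to be compared against the image by hand.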
I've not yet looked into the architectural challenges, but this is literally a game changer.
That's seriously good OCR, and its image detection abilities are beyond anything I've remotely seen from llava 1.5/ShareGPT4V.