Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi generational step forward.

### Discussed in https://github.com/ggerganov/llama.cpp/discussions/4350

<div type='discussions-op-text'>

<sup>Originally posted by **cmp-nct** December  7, 2023</sup>
I've just come across CogVLM, which is a Vicuna 7B language model behind a 9B vision tower (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k) under an open-source license.
I've compared it with llava-1.5 (not even comparable) and Qwen-VL, and it beats Qwen-VL by a margin in OCR ability and detection of details, with no or almost no hallucinations.
It understands handwritten as well as typed letters, context, fine details and background graphics.
It can also locate tiny visual targets with pixel coordinates.
I'm quite blown away that I didn't know about it before.

I believe this is what we need. It has similarities to llava but adds an additional expert model, so it's not super quick to implement.
In addition, the ViT needs K-type quantization support.
Definitely worth a close look.
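
To make the "additional expert" part a bit more concrete, here is a toy sketch of how I read the CogVLM paper: image tokens and text tokens share one sequence, but image tokens go through their own projection weights. Everything here is an assumption for illustration (the names `route_projection`, `w_text`, `w_image` and the 2x2 toy weights are made up); it is not CogVLM's or llama.cpp's actual code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def route_projection(hidden: List[Vector], is_image: List[bool],
                     w_text: Matrix, w_image: Matrix) -> List[Vector]:
    """Project each token with the weights that match its modality."""
    return [matvec(w_image if img else w_text, h)
            for h, img in zip(hidden, is_image)]


# Toy sequence: two image tokens followed by one text token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]
w_text = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical text weights (identity)
w_image = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical expert weights (scale by 2)

projected = route_projection(hidden, is_image, w_text, w_image)
print(projected)  # [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0]]
```

The point for an implementation is that each layer would need a per-token modality mask and a second set of weights, which llava-style models do not have.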

URL: https://github.com/THUDM/CogVLM
Webdemo: http://36.103.203.44:7861/
Paper: https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf

Look at this example: I asked for a JSON representation (not cherry-picked), and it can actually extract all of the content with minimal errors:
![image](https://github.com/ggerganov/llama.cpp/assets/78893154/4834e936-4bd5-4997-bde5-c58283556772)
```
The image appears to be a driver's license from California. Here's a JSON representation:
{
  "License_Number": "DL 11234568",
  "Expiration_Date": "08/31/2014",
  "Class": "C",
  "License_Holder": "Iva Cardholder",
  "Address": "2570 24th Street, Anytown, CA 95818",
  "Veteran": "YES",
  "Sex": "F",
  "Hair_Color": "Brown",
  "Eyes": "Brown",
  "Height": "125 lb",
  "Weight": "125 lb",
  "Issue_Date": "08/31/2009",
  "Birth_Date": "08/31/1977",
  "VISUALS": {
      "left": "Iva Cardholder",
      "right": "Iva Cardholder",
      "top": "California Bear",
      "bottom": "Gold Rush Miner"
    }
}
```


Here is what Qwen-VL does:
```
{
  "id": "123456",
  "issue_date": "08/31/2014",
  "expiration_date": "08/31/2024",
  "cardholder": {
    "name": "Ina Cordero",
    "gender": "F",
    "race": "A",
    "eye_color": "B",
    "hair_color": "R",
    "eyebrow_color": "N",
    "height": "5'6"",
    "weight": "126 lb"
  },
  "class": "C",
  "type": "DRIVER LICENSE",
  "state": "CA",
  "country": "USA"
}
```

Here is llava1.5-13B:
```
{
"image": "https://i.imgur.com/39vZv.jpg",
"description": "A California driver's license with a woman's picture on it. The license is blue and white and has a picture of a bear on it. The license number is 11324567890."
}
```
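
To compare replies like the ones above mechanically, a small helper can pull the JSON object out of the surrounding prose and check whether it parses at all. This is only a minimal sketch using Python's standard `json` module; `extract_json` is a hypothetical helper name, and the two strings are shortened stand-ins for the real replies. The Qwen-VL reply, for instance, would not parse as written because of the unescaped quote in `"5'6""`.

```python
import json
from typing import Any, Optional


def extract_json(reply: str) -> Optional[Any]:
    """Return the JSON value starting at the first '{' in reply, or None."""
    start = reply.find("{")
    if start == -1:
        return None  # no JSON object in the reply at all
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _end = json.JSONDecoder().raw_decode(reply[start:])
    except json.JSONDecodeError:
        return None  # the model produced something that is not valid JSON
    return obj


# Shortened stand-ins for the replies above (prose prefix, then JSON).
good_reply = 'Here\'s a JSON representation:\n{"Class": "C", "Sex": "F"}'
bad_reply = '{"height": "5\'6"", "weight": "126 lb"}'

parsed = extract_json(good_reply)
broken = extract_json(bad_reply)
print(parsed)  # {'Class': 'C', 'Sex': 'F'}
print(broken)  # None
```

Parsing only says whether the reply is well-formed; checking the extracted values against the image is still a manual step.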

I've not yet looked into the architectural challenges, but this is literally a game changer.
That's seriously good OCR, and its image detection abilities are beyond anything I've remotely seen from llava 1.5/ShareGPT4V.

@monatis @FSSRepo 


</div>

Enhancement: We need CogVLM support - extremely good image and text analysis, feels like a multi generational step forward. #4387
