Description
The outward symptom is that prompt processing / inference spins up the GPU and churns through a ton of busy work, but no tokens ever come out (at least no printable ones; I have seen a long string of \x1C before it stops responding entirely). It doesn't really "hang" forever, since it does eventually stop generating. It may happen immediately during initial prompt processing or later during chat interaction, but once things go sour it does not appear to recover with further input.
Under the hood, I see GPU usage spike but no tokens get produced. ggml_metal_graph_compute() starts encoding a ton of "stuff" (the queue is flooded with far more nodes to process than seems appropriate), but ggml_metal_get_tensor() never extracts anything meaningful. My guess is that something in the context is getting trashed. Unfortunately, setting threads to 1 does not avoid it. Moreover, ALL threads in the pool suddenly get very busy, not just one.
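For reference, this is roughly the kind of instrumentation I've been using to watch the node counts. The wrapper function and call site are my own additions, not llama.cpp code, and the exact signature/fields may differ slightly between commits:

```c
// Rough diagnostic sketch (my own wrapper, not part of llama.cpp): log how
// many nodes each graph hands to the Metal backend before encoding.
// Assumes the ggml-metal API around these commits, i.e.
// ggml_metal_graph_compute(ctx, gf) and a public gf->n_nodes field.
#include <stdio.h>

#include "ggml.h"
#include "ggml-metal.h"

static void metal_graph_compute_logged(struct ggml_metal_context * ctx,
                                       struct ggml_cgraph        * gf) {
    // In the bad state this count balloons far beyond what a single eval
    // should need -- the "queue flooded with nodes" symptom described above.
    fprintf(stderr, "metal: encoding graph with %d nodes\n", gf->n_nodes);
    ggml_metal_graph_compute(ctx, gf);
}
```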
UPDATE: The temporary fix in #2686 doesn't appear to solve the issue; it just reduces the thread churn.
For me this bug shows up most obviously after the Aug 16 commit bf83bff (see discussion), since 3ebb009 seems quite solid.
Note that matrix multiplication was moved to Metal/GPU at the beginning of August as a way to speed up prompt processing (#2428), but Metal was then slower with LLaMA 2 (GQA), so a custom matrix multiplication solution was developed (#2615). I'm way out of my depth here and probably not accurately describing the intent of these PRs.
I am using a 64 GB M1 with a longer prompt (about 400 tokens). The model file I used to test with is upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin. I am not using MPS.
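For completeness, the failure shows up with an ordinary interactive run along these lines (the exact thread count, context size, and prompt are placeholders for my setup; Metal offload is enabled via -ngl as usual):

```sh
./main -m ./models/upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin \
       -ngl 1 -c 4096 -t 8 \
       -p "<~400 token prompt here>"
```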