Description
The outward symptom is that prompt processing / inference spins up the GPU and churns through a ton of busy work, but no tokens ever come out (at least no printable ones; I have seen a long string of \x1C before it stops responding entirely). It doesn't really "hang" forever, since it does eventually stop generating. It may happen immediately during initial prompt processing or later during chat interaction, but once things go sour it does not appear to recover with further input.
Under the hood, I see GPU usage spike but no tokens get produced. ggml_metal_graph_compute() starts encoding a ton of "stuff" (the queue is flooded with far more nodes to process than seems appropriate), but ggml_metal_get_tensor() never extracts anything meaningful. My guess is that something in the context is getting trashed. Unfortunately, setting threads to 1 does not avoid it. Moreover, ALL threads in the pool suddenly get very busy, not just one.
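For reference, this is roughly the kind of instrumentation I've been using to watch the node counts. The wrapper function and call site are my own additions, not llama.cpp code, and the exact signature/fields may differ slightly between commits:

```c
// Rough diagnostic sketch (my own wrapper, not part of llama.cpp): log how
// many nodes each graph hands to the Metal backend before encoding.
// Assumes the ggml-metal API around these commits, i.e.
// ggml_metal_graph_compute(ctx, gf) and a public gf->n_nodes field.
#include <stdio.h>

#include "ggml.h"
#include "ggml-metal.h"

static void metal_graph_compute_logged(struct ggml_metal_context * ctx,
                                       struct ggml_cgraph        * gf) {
    // In the bad state this count balloons far beyond what a single eval
    // should need -- the "queue flooded with nodes" symptom described above.
    fprintf(stderr, "metal: encoding graph with %d nodes\n", gf->n_nodes);
    ggml_metal_graph_compute(ctx, gf);
}
```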
UPDATE: The temporary fix in #2686 doesn't appear to solve the issue; it just reduces the thread churn.
For me this bug shows up most obviously after the Aug 16 commit bf83bff (see discussion), since 3ebb009 seems quite solid.
Note that matrix multiplication was moved to Metal/GPU at the beginning of August as a way to speed up prompt processing (#2428), but Metal was then slower with LLaMA 2 (GQA), so a custom matrix multiplication solution was developed (#2615). I'm way out of my depth here and probably not accurately describing the intent of these PRs.
I am using a 64 GB M1 with a longer prompt (about 400 tokens). The model file I used to test with is upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin. I am not using MPS.
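For completeness, the failure shows up with an ordinary interactive run along these lines (the exact thread count, context size, and prompt are placeholders for my setup; Metal offload is enabled via -ngl as usual):

```sh
./main -m ./models/upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin \
       -ngl 1 -c 4096 -t 8 \
       -p "<~400 token prompt here>"
```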