
Metal prompt processing / inference intermittently spins but doesn't produce output #2678

Closed
@ProjectAtlantis-dev

Description


The outward symptom is that prompt processing / inference spins up the GPU and churns through a ton of busy work, but no tokens ever come out (at least no printable ones; I have seen a long string of \x1C before it stops responding entirely). It doesn't truly "hang" forever, since it eventually stops generating. It can happen immediately during initial prompt processing or later during chat interaction, but once things go sour it does not appear to recover with further input.
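For what it's worth, the "long string of \x1C" symptom is easy to detect mechanically. Below is a minimal sketch (not from llama.cpp; `looks_garbled` is a hypothetical helper name) of the kind of check one could bolt onto the generation loop to catch this failure mode early:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: flag a generation as garbled when more than half
// of its bytes are non-printable control characters (e.g. runs of \x1C),
// ignoring ordinary whitespace.
bool looks_garbled(const std::string &out) {
    if (out.empty()) return false;
    size_t ctrl = 0;
    for (unsigned char c : out) {
        if (c < 0x20 && c != '\n' && c != '\t' && c != '\r') ctrl++;
    }
    return ctrl * 2 > out.size();
}
```

A harness could call this on the accumulated output every N tokens and abort the session instead of letting the GPU spin.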

Under the hood, I see GPU usage spike but no tokens get produced. ggml_metal_graph_compute() starts encoding a ton of "stuff" (the queue is flooded with far more nodes than seems appropriate), but ggml_metal_get_tensor() never extracts anything meaningful. My guess is that something in the context is getting trashed. Unfortunately, setting threads to 1 does not avoid it; moreover, ALL threads in the pool suddenly get very busy, not just one.
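A trashed context usually shows up in the buffers read back from the GPU as NaN/Inf values or an all-zero vector. As a diagnostic sketch (independent of the ggml API; `logits_look_sane` is a hypothetical helper, assuming you can get at the float buffer that ggml_metal_get_tensor() copies back):

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sanity check for a logits buffer read back from the GPU:
// reject it if any value is NaN/Inf, or if every value is exactly zero
// (i.e. the readback produced nothing meaningful).
bool logits_look_sane(const float *logits, size_t n) {
    bool any_nonzero = false;
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(logits[i])) return false;
        if (logits[i] != 0.0f) any_nonzero = true;
    }
    return any_nonzero;
}
```

Running a check like this right after the graph compute would at least narrow down whether the corruption happens on the GPU side or later in sampling.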

UPDATE: The temporary fix in #2686 doesn't appear to solve the issue, just reduce thread churn.

For me this bug shows up most obviously after the Aug 16 commit bf83bff (see discussion), since 3ebb009 seems quite solid.

Note that matrix multiplication was moved to Metal/GPU at the beginning of August as a way to speed up prompt processing, but Metal then turned out to be slower with LLaMA 2 (GQA), so a custom matrix multiplication was developed. I'm way out of my depth here and probably not accurately describing the intent of these PRs:

- #2615
- Prompt processing: #2428

I am using a 64 GB M1 with a longer prompt (about 400 tokens); the model file I tested with is upstage-llama-2-70b-instruct-v2.ggmlv3.q5_K_M.bin. I am not using MPS.


Labels: bug (Something isn't working), high priority (Very important issue)
