
Misc. bug: GGML_ASSERT(n <= tokens.size()) failed - Memory in use ('/completion' endpoint and 'cache_prompt=false') #13484

Closed

Description

@broadbit-hu

Name and Version

llama.cpp version: b5359 (compiled with -DGGML_RPC=ON)

Model: Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf

Command line arguments:

--flash-attn --temp 0 --seed 1 -c 22000 -ngl 99 --mlock --chat-template mistral-v3-tekken

Error: GGML_ASSERT(n <= tokens.size()) failed

• Memory critical error by agent node-0 (Agent handle: 0x59fc5fabc930) on address 0x7cbd6cc00000. Reason: Memory in use.

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        ROCm0 model buffer size = 11731.58 MiB
load_tensors:   CPU_Mapped model buffer size =   680.00 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 22000
llama_context: n_ctx_per_seq = 22000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (22000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.50 MiB
llama_kv_cache_unified: kv_size = 22016, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 256
llama_kv_cache_unified:      ROCm0 KV buffer size =  3440.00 MiB
llama_kv_cache_unified: KV self size  = 3440.00 MiB, K (f16): 1720.00 MiB, V (f16): 1720.00 MiB
llama_context:      ROCm0 compute buffer size =   266.00 MiB
llama_context:  ROCm_Host compute buffer size =    53.01 MiB
llama_context: graph nodes  = 1207
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 22016
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 22016
main: model loaded
main: chat template, chat_template: mistral-v3-tekken, example_format: '[INST]You are a helpful assistant

Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:18080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 192.168.253.130 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 22016, n_keep = 0, n_prompt_tokens = 8241
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.248514
/opt/text/llama.cpp/tools/server/utils.hpp:1157: GGML_ASSERT(n <= tokens.size()) failed
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)

Memory critical error by agent node-0 (Agent handle: 0x59fc5fabc930) on address 0x7cbd6cc00000. Reason: Memory in use. 
Aborted (core dumped)

Last working version: b5329

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        ROCm0 model buffer size = 11731.58 MiB
load_tensors:   CPU_Mapped model buffer size =   680.00 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 22000
llama_context: n_ctx_per_seq = 22000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (22000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.50 MiB
llama_kv_cache_unified: kv_size = 22016, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 256
llama_kv_cache_unified:      ROCm0 KV buffer size =  3440.00 MiB
llama_kv_cache_unified: KV self size  = 3440.00 MiB, K (f16): 1720.00 MiB, V (f16): 1720.00 MiB
llama_context:      ROCm0 compute buffer size =   266.00 MiB
llama_context:  ROCm_Host compute buffer size =    53.01 MiB
llama_context: graph nodes  = 1207
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 22016
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 22016
main: model loaded
main: chat template, chat_template: mistral-v3-tekken, example_format: '[INST]You are a helpful assistant

Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:18080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 192.168.253.130 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 22016, n_keep = 0, n_prompt_tokens = 8241
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.248514
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.497027
slot update_slots: id  0 | task 0 | kv cache rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.745541
slot update_slots: id  0 | task 0 | kv cache rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.994054
slot update_slots: id  0 | task 0 | kv cache rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8241, n_tokens = 49, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 8241, n_tokens = 49
slot      release: id  0 | task 0 | stop processing: n_past = 8638, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =    5977.53 ms /  8241 tokens (    0.73 ms per token,  1378.66 tokens per second)
       eval time =   12231.95 ms /   398 tokens (   30.73 ms per token,    32.54 tokens per second)
      total time =   18209.48 ms /  8639 tokens
srv  update_slots: all slots are idle

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf --flash-attn --temp 0 --seed 1 -c 22000 -ngl 99 --mlock --chat-template mistral-v3-tekken

Problem description & steps to reproduce

The error GGML_ASSERT(n <= tokens.size()) failed is raised in slot update_slots when the input text is long (8241 prompt tokens with a 22000-token context size).
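
For reference, a minimal repro sketch of the kind of request that triggers the crash, assuming the server started with the command line above is listening on http://0.0.0.0:18080 and that the client posts a long prompt to the /completion endpoint with cache_prompt=false (as in the issue title). The prompt text, n_predict value, and the "content" field read from the response are placeholders/assumptions, not taken from the original report:

    # Hypothetical repro sketch: POST a long prompt to llama-server's /completion
    # endpoint with prompt caching disabled. Prompt content and n_predict are
    # placeholders; adjust until the prompt exceeds ~8000 tokens.
    import json
    import urllib.request

    SERVER = "http://0.0.0.0:18080"                  # assumed: host/port from the log
    prompt = "Lorem ipsum dolor sit amet. " * 2000   # long filler text

    payload = {
        "prompt": prompt,
        "n_predict": 400,        # placeholder generation length
        "temperature": 0,
        "cache_prompt": False,   # prompt caching disabled, as in the issue title
    }

    req = urllib.request.Request(
        SERVER + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()).get("content"))

With b5359 the server aborts during prompt processing of such a request (around the second 2048-token batch in the log above); with b5329 the same request completes normally.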

First Bad Commit

33eff40

Relevant log output

(Same as the b5359 error log shown in the "Error" section above.)
