Closed
Description
Name and Version
llama.cpp version: b5359 (compiled with -DGGML_RPC=ON)
Model: Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf
Command line arguments:
--flash-attn --temp 0 --seed 1 -c 22000 -ngl 99 --mlock --chat-template mistral-v3-tekken
Error: GGML_ASSERT(n <= tokens.size()) failed, followed by:
Memory critical error by agent node-0 (Agent handle: 0x59fc5fabc930) on address 0x7cbd6cc00000. Reason: Memory in use.
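For context, the assertion at tools/server/utils.hpp:1157 appears to guard a per-slot token cache: a request to keep or reuse the first n cached tokens must not exceed the number of tokens the slot actually has stored. The C++ below is only a minimal sketch of that invariant under this assumption; the names slot_token_cache and keep_first are illustrative, not the actual server code.

```cpp
// Minimal sketch (assumed semantics, not the real utils.hpp): a per-slot token
// cache where asking to keep the first n cached tokens must never exceed the
// number of tokens actually cached -- the condition the GGML_ASSERT enforces.
#include <cassert>
#include <cstdio>
#include <vector>

struct slot_token_cache {
    std::vector<int> tokens;            // tokens already committed to the KV cache

    // keep only the first n cached tokens (e.g. after "kv cache rm [n, end)")
    void keep_first(size_t n) {
        assert(n <= tokens.size());     // analogue of GGML_ASSERT(n <= tokens.size())
        tokens.resize(n);
    }
};

int main() {
    slot_token_cache cache;
    cache.tokens.assign(2048, 0);       // first 2048-token chunk has been processed
    cache.keep_first(2048);             // fine: n == tokens.size()
    // cache.keep_first(4096);          // would trip the assert: n > tokens.size()
    std::printf("cached tokens: %zu\n", cache.tokens.size());
    return 0;
}
```

The crash log below suggests the server reached such a call with n larger than the slot's cached token count at the start of the second prompt chunk.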
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: ROCm0 model buffer size = 11731.58 MiB
load_tensors: CPU_Mapped model buffer size = 680.00 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 22000
llama_context: n_ctx_per_seq = 22000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (22000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.50 MiB
llama_kv_cache_unified: kv_size = 22016, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 256
llama_kv_cache_unified: ROCm0 KV buffer size = 3440.00 MiB
llama_kv_cache_unified: KV self size = 3440.00 MiB, K (f16): 1720.00 MiB, V (f16): 1720.00 MiB
llama_context: ROCm0 compute buffer size = 266.00 MiB
llama_context: ROCm_Host compute buffer size = 53.01 MiB
llama_context: graph nodes = 1207
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 22016
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 22016
main: model loaded
main: chat template, chat_template: mistral-v3-tekken, example_format: '[INST]You are a helpful assistant
Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:18080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /props 192.168.253.130 200
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 22016, n_keep = 0, n_prompt_tokens = 8241
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.248514
/opt/text/llama.cpp/tools/server/utils.hpp:1157: GGML_ASSERT(n <= tokens.size()) failed
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
Memory critical error by agent node-0 (Agent handle: 0x59fc5fabc930) on address 0x7cbd6cc00000. Reason: Memory in use.
Aborted (core dumped)
Last working version: b5329
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: ROCm0 model buffer size = 11731.58 MiB
load_tensors: CPU_Mapped model buffer size = 680.00 MiB
...........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 22000
llama_context: n_ctx_per_seq = 22000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (22000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.50 MiB
llama_kv_cache_unified: kv_size = 22016, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 256
llama_kv_cache_unified: ROCm0 KV buffer size = 3440.00 MiB
llama_kv_cache_unified: KV self size = 3440.00 MiB, K (f16): 1720.00 MiB, V (f16): 1720.00 MiB
llama_context: ROCm0 compute buffer size = 266.00 MiB
llama_context: ROCm_Host compute buffer size = 53.01 MiB
llama_context: graph nodes = 1207
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 22016
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 22016
main: model loaded
main: chat template, chat_template: mistral-v3-tekken, example_format: '[INST]You are a helpful assistant
Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:18080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /props 192.168.253.130 200
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 22016, n_keep = 0, n_prompt_tokens = 8241
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.248514
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.497027
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.745541
slot update_slots: id 0 | task 0 | kv cache rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.994054
slot update_slots: id 0 | task 0 | kv cache rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8241, n_tokens = 49, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 8241, n_tokens = 49
slot release: id 0 | task 0 | stop processing: n_past = 8638, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 5977.53 ms / 8241 tokens ( 0.73 ms per token, 1378.66 tokens per second)
eval time = 12231.95 ms / 398 tokens ( 30.73 ms per token, 32.54 tokens per second)
total time = 18209.48 ms / 8639 tokens
srv update_slots: all slots are idle
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -m Mistral-Nemo-12B-Instruct-2407-Q8_0.gguf --flash-attn --temp 0 --seed 1 -c 22000 -ngl 99 --mlock --chat-template mistral-v3-tekken
Problem description & steps to reproduce
The server aborts with GGML_ASSERT(n <= tokens.size()) failed in slot update_slots when the input prompt is long (8241 tokens with a 22000-token context), i.e. when prompt processing has to be split across multiple batches (n_batch = 2048). The same request completes normally on b5329.
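To make the failure condition concrete: with n_batch = 2048 the 8241-token prompt is processed in five chunks, and on b5359 the assertion fires right after the first chunk, at "kv cache rm [2048, end)". The sketch below only reproduces that chunking schedule and the progress values visible in the logs (progress = n_past / n_prompt_tokens); it is an illustration of the arithmetic, not the server's actual update loop.

```cpp
// Reproduce the prompt-processing schedule implied by the logs:
// 8241 prompt tokens, processed in chunks of at most n_batch = 2048 tokens.
#include <algorithm>
#include <cstdio>

int main() {
    const int n_prompt_tokens = 8241;
    const int n_batch         = 2048;

    for (int n_past = 0; n_past < n_prompt_tokens; ) {
        const int n_tokens = std::min(n_batch, n_prompt_tokens - n_past);
        n_past += n_tokens;
        std::printf("n_past = %4d, n_tokens = %4d, progress = %f\n",
                    n_past, n_tokens, (double) n_past / n_prompt_tokens);
    }
    // b5329 completes all five chunks (progress 0.248514 ... 1.000000);
    // b5359 aborts at the start of the second chunk.
    return 0;
}
```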
First Bad Commit
Relevant log output
(The b5359 failing-run log is already included in the Description above.)