GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed #13689

Closed · opened by @slaren

Description

@slaren Using build 'b5404', I am encountering the same issue with the following configuration:

[user@system]$ export LLAMA_ARG_HF_REPO=nomic-ai/nomic-embed-text-v2-moe-GGUF:Q4_K_M \
LLAMA_ARG_EMBEDDINGS=1 \
LLAMA_ARG_ENDPOINT_METRICS=1 \
LLAMA_ARG_NO_WEBUI=1 \
LLAMA_ARG_HOST=0.0.0.0 \
LLAMA_ARG_N_PARALLEL=4 \
LLAMA_ARG_ALIAS=embeddings-multilingual \
LLAMA_ARG_PORT=80 \
LLAMA_ARG_CACHE_TYPE_K=f16 \
LLAMA_ARG_FLASH_ATTN=0 \
LLAMA_ARG_CTX_SIZE=2048 \
LLAMA_ARG_BATCH=448 \
LLAMA_ARG_BATCH=512 \
LLAMA_ARG_THREADS=1 \
LLAMA_ARG_N_PREDICT=-1 \
LLAMA_ARG_N_GPU_LAYERS=0 \
LLAMA_ARG_NUMA=distribute \
LLAMA_ARG_MLOCK=0 \
LLAMA_ARG_ENDPOINT_SLOTS=1 \
LLAMA_ARG_NO_CONTEXT_SHIFT=0 \
LLAMA_ARG_UBATCH=512
[user@system]$ llama-server --seed 0 --temp 0.0
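For reference, the traffic hitting the server is plain POSTs to the OpenAI-compatible /v1/embeddings endpoint. A minimal example request is shown below; the payload is illustrative only (not my actual input) and it assumes the server is reachable on localhost:80:

[user@system]$ curl -s http://localhost:80/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "embeddings-multilingual", "input": "Example document to embed"}'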
Full logs
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
curl_perform_with_retry: HEAD https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe-GGUF/resolve/main/nomic-embed-text-v2-moe.Q4_K_M.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /root/.cache/llama.cpp/nomic-ai_nomic-embed-text-v2-moe-GGUF_nomic-embed-text-v2-moe.Q4_K_M.gguf
build: 1 (faa0b9ba) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 8

system_info: n_threads = 1 (n_threads_batch = 1) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 80, http threads: 7
main: loading model
srv    load_model: loading model '/root/.cache/llama.cpp/nomic-ai_nomic-embed-text-v2-moe-GGUF_nomic-embed-text-v2-moe.Q4_K_M.gguf'
llama_model_loader: loaded meta data with 45 key-value pairs and 142 tensors from /root/.cache/llama.cpp/nomic-ai_nomic-embed-text-v2-moe-GGUF_nomic-embed-text-v2-moe.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert-moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = nomic-embed-text-v2-moe
llama_model_loader: - kv   3:                            general.version str              = 2048
llama_model_loader: - kv   4:                       general.organization str              = Nomic Ai
llama_model_loader: - kv   5:                           general.basename str              = nomic-xlm
llama_model_loader: - kv   6:                         general.size_label str              = 8x277M
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Nomic Embed Text v2 Moe Unsupervised
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Nomic Ai
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/nomic-ai/nomic...
llama_model_loader: - kv  12:                               general.tags arr[str,4]       = ["sentence-transformers", "sentence-s...
llama_model_loader: - kv  13:                          general.languages arr[str,101]     = ["en", "es", "fr", "de", "it", "pt", ...
llama_model_loader: - kv  14:                 nomic-bert-moe.block_count u32              = 12
llama_model_loader: - kv  15:              nomic-bert-moe.context_length u32              = 512
llama_model_loader: - kv  16:            nomic-bert-moe.embedding_length u32              = 768
llama_model_loader: - kv  17:         nomic-bert-moe.feed_forward_length u32              = 3072
llama_model_loader: - kv  18:        nomic-bert-moe.attention.head_count u32              = 12
llama_model_loader: - kv  19: nomic-bert-moe.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  20:            nomic-bert-moe.attention.causal bool             = false
llama_model_loader: - kv  21:                nomic-bert-moe.pooling_type u32              = 1
llama_model_loader: - kv  22:              nomic-bert-moe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  23:          nomic-bert-moe.moe_every_n_layers u32              = 2
llama_model_loader: - kv  24:                nomic-bert-moe.expert_count u32              = 8
llama_model_loader: - kv  25:           nomic-bert-moe.expert_used_count u32              = 2
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,250048]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  29:                      tokenizer.ggml.scores arr[f32,250048]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,250048]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  32:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  33:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  34:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  38:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  40:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   93 tensors
llama_model_loader: - type q4_K:   18 tensors
llama_model_loader: - type q5_K:   24 tensors
llama_model_loader: - type q6_K:    7 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 321.66 MiB (5.68 BPW) 
load: model vocab missing newline token, using special_pad_id instead
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 4
load: token to piece cache size = 2.1668 MB
print_info: arch             = nomic-bert-moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 512
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-05
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 8
print_info: n_expert_used    = 2
print_info: causal attn      = 0
print_info: pooling type     = 1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 512
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 475M
print_info: model params     = 475.29 M
print_info: general.name     = nomic-embed-text-v2-moe
print_info: vocab type       = UGM
print_info: n_vocab          = 250048
print_info: n_merges         = 0
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 250001 '[PAD250000]'
print_info: LF token         = 0 '<s>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:  CPU_AARCH64 model buffer size =   102.52 MiB
load_tensors:   CPU_Mapped model buffer size =   321.66 MiB
...........................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.00 MiB
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
decode: cannot decode batches with this context (use llama_encode() instead)
srv          init: initializing slots, n_slots = 4
slot         init: id  0 | task -1 | new slot n_ctx_slot = 512
slot         init: id  1 | task -1 | new slot n_ctx_slot = 512
slot         init: id  2 | task -1 | new slot n_ctx_slot = 512
slot         init: id  3 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
  {{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
  {{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:80 - starting the main loop
slot launch_slot_: id  0 | task 499 | processing task
slot update_slots: id  0 | task 499 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 512
slot update_slots: id  0 | task 499 | kv cache rm [0, end)
slot update_slots: id  0 | task 499 | prompt processing progress, n_past = 512, n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 499 | prompt done, n_past = 512, n_tokens = 512
slot      release: id  0 | task 499 | stop processing: n_past = 512, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.77 200
slot launch_slot_: id  0 | task 526 | processing task
slot update_slots: id  0 | task 526 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 512
slot update_slots: id  0 | task 526 | kv cache rm [0, end)
slot update_slots: id  0 | task 526 | prompt processing progress, n_past = 512, n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 526 | prompt done, n_past = 512, n_tokens = 512
slot      release: id  0 | task 526 | stop processing: n_past = 512, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.77 200
slot launch_slot_: id  0 | task 1047 | processing task
slot update_slots: id  0 | task 1047 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 512
slot update_slots: id  0 | task 1047 | kv cache rm [0, end)
slot update_slots: id  0 | task 1047 | prompt processing progress, n_past = 512, n_tokens = 512, progress = 1.000000
slot update_slots: id  0 | task 1047 | prompt done, n_past = 512, n_tokens = 512
slot      release: id  0 | task 1047 | stop processing: n_past = 512, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.104 200
slot launch_slot_: id  1 | task 1164 | processing task
slot update_slots: id  1 | task 1164 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 94
slot update_slots: id  1 | task 1164 | kv cache rm [0, end)
slot update_slots: id  1 | task 1164 | prompt processing progress, n_past = 94, n_tokens = 94, progress = 1.000000
slot update_slots: id  1 | task 1164 | prompt done, n_past = 94, n_tokens = 94
slot      release: id  1 | task 1164 | stop processing: n_past = 94, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.104 200
slot launch_slot_: id  1 | task 1171 | processing task
slot update_slots: id  1 | task 1171 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 94
slot update_slots: id  1 | task 1171 | kv cache rm [0, end)
slot update_slots: id  1 | task 1171 | prompt processing progress, n_past = 94, n_tokens = 94, progress = 1.000000
slot update_slots: id  1 | task 1171 | prompt done, n_past = 94, n_tokens = 94
slot      release: id  1 | task 1171 | stop processing: n_past = 94, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.104 200
slot launch_slot_: id  1 | task 1570 | processing task
slot update_slots: id  1 | task 1570 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 94
slot update_slots: id  1 | task 1570 | kv cache rm [0, end)
slot update_slots: id  1 | task 1570 | prompt processing progress, n_past = 94, n_tokens = 94, progress = 1.000000
slot update_slots: id  1 | task 1570 | prompt done, n_past = 94, n_tokens = 94
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  1 | task 1570 | stop processing: n_past = 94, truncated = 0
slot launch_slot_: id  2 | task 2487 | processing task
slot update_slots: id  2 | task 2487 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 23
slot update_slots: id  2 | task 2487 | kv cache rm [0, end)
slot update_slots: id  2 | task 2487 | prompt processing progress, n_past = 23, n_tokens = 23, progress = 1.000000
slot update_slots: id  2 | task 2487 | prompt done, n_past = 23, n_tokens = 23
slot      release: id  2 | task 2487 | stop processing: n_past = 23, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  3 | task 2546 | processing task
slot launch_slot_: id  0 | task 2547 | processing task
slot launch_slot_: id  1 | task 2548 | processing task
slot launch_slot_: id  2 | task 2549 | processing task
slot update_slots: id  0 | task 2547 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  0 | task 2547 | kv cache rm [0, end)
slot update_slots: id  0 | task 2547 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  0 | task 2547 | prompt done, n_past = 502, n_tokens = 502
slot update_slots: id  1 | task 2548 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  2 | task 2549 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  3 | task 2546 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 501
slot      release: id  0 | task 2547 | stop processing: n_past = 502, truncated = 0
slot launch_slot_: id  0 | task 2550 | processing task
slot update_slots: id  0 | task 2550 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  0 | task 2550 | kv cache rm [0, end)
slot update_slots: id  0 | task 2550 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 2550 | prompt done, n_past = 2, n_tokens = 2
slot update_slots: id  1 | task 2548 | kv cache rm [0, end)
slot update_slots: id  1 | task 2548 | prompt processing progress, n_past = 502, n_tokens = 504, progress = 1.000000
slot update_slots: id  1 | task 2548 | prompt done, n_past = 502, n_tokens = 504
slot      release: id  0 | task 2550 | stop processing: n_past = 2, truncated = 0
slot      release: id  1 | task 2548 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  2 | task 2549 | kv cache rm [0, end)
slot update_slots: id  2 | task 2549 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  2 | task 2549 | prompt done, n_past = 502, n_tokens = 502
slot      release: id  2 | task 2549 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  3 | task 2546 | kv cache rm [0, end)
slot update_slots: id  3 | task 2546 | prompt processing progress, n_past = 501, n_tokens = 501, progress = 1.000000
slot update_slots: id  3 | task 2546 | prompt done, n_past = 501, n_tokens = 501
slot      release: id  3 | task 2546 | stop processing: n_past = 501, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  0 | task 2557 | processing task
slot update_slots: id  0 | task 2557 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 24
slot update_slots: id  0 | task 2557 | kv cache rm [0, end)
slot update_slots: id  0 | task 2557 | prompt processing progress, n_past = 24, n_tokens = 24, progress = 1.000000
slot update_slots: id  0 | task 2557 | prompt done, n_past = 24, n_tokens = 24
slot      release: id  0 | task 2557 | stop processing: n_past = 24, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  1 | task 2633 | processing task
slot update_slots: id  1 | task 2633 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  1 | task 2633 | kv cache rm [0, end)
slot update_slots: id  1 | task 2633 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  1 | task 2633 | prompt done, n_past = 2, n_tokens = 2
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  1 | task 2633 | stop processing: n_past = 2, truncated = 0
slot launch_slot_: id  1 | task 2635 | processing task
slot update_slots: id  1 | task 2635 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  1 | task 2635 | kv cache rm [0, end)
slot update_slots: id  1 | task 2635 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  1 | task 2635 | prompt done, n_past = 2, n_tokens = 2
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  1 | task 2635 | stop processing: n_past = 2, truncated = 0
slot launch_slot_: id  2 | task 2637 | processing task
slot update_slots: id  2 | task 2637 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 21
slot update_slots: id  2 | task 2637 | kv cache rm [0, end)
slot update_slots: id  2 | task 2637 | prompt processing progress, n_past = 21, n_tokens = 21, progress = 1.000000
slot update_slots: id  2 | task 2637 | prompt done, n_past = 21, n_tokens = 21
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  2 | task 2637 | stop processing: n_past = 21, truncated = 0
slot launch_slot_: id  3 | task 11488 | processing task
slot update_slots: id  3 | task 11488 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 10
slot update_slots: id  3 | task 11488 | kv cache rm [0, end)
slot update_slots: id  3 | task 11488 | prompt processing progress, n_past = 10, n_tokens = 10, progress = 1.000000
slot update_slots: id  3 | task 11488 | prompt done, n_past = 10, n_tokens = 10
slot      release: id  3 | task 11488 | stop processing: n_past = 10, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  3 | task 11513 | processing task
slot update_slots: id  3 | task 11513 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 10
slot update_slots: id  3 | task 11513 | kv cache rm [0, end)
slot update_slots: id  3 | task 11513 | prompt processing progress, n_past = 10, n_tokens = 10, progress = 1.000000
slot update_slots: id  3 | task 11513 | prompt done, n_past = 10, n_tokens = 10
slot      release: id  3 | task 11513 | stop processing: n_past = 10, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  0 | task 11524 | processing task
slot update_slots: id  0 | task 11524 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id  0 | task 11524 | kv cache rm [0, end)
slot update_slots: id  0 | task 11524 | prompt processing progress, n_past = 12, n_tokens = 12, progress = 1.000000
slot update_slots: id  0 | task 11524 | prompt done, n_past = 12, n_tokens = 12
slot      release: id  0 | task 11524 | stop processing: n_past = 12, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  1 | task 11547 | processing task
slot launch_slot_: id  2 | task 11548 | processing task
slot update_slots: id  1 | task 11547 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  1 | task 11547 | kv cache rm [0, end)
slot update_slots: id  1 | task 11547 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  1 | task 11547 | prompt done, n_past = 502, n_tokens = 502
slot update_slots: id  2 | task 11548 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  2 | task 11548 | kv cache rm [0, end)
slot update_slots: id  2 | task 11548 | prompt processing progress, n_past = 2, n_tokens = 504, progress = 1.000000
slot update_slots: id  2 | task 11548 | prompt done, n_past = 2, n_tokens = 504
slot      release: id  1 | task 11547 | stop processing: n_past = 502, truncated = 0
slot      release: id  2 | task 11548 | stop processing: n_past = 2, truncated = 0
slot launch_slot_: id  3 | task 11550 | processing task
slot launch_slot_: id  0 | task 11551 | processing task
slot launch_slot_: id  1 | task 11552 | processing task
slot launch_slot_: id  2 | task 11553 | processing task
slot update_slots: id  0 | task 11551 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  0 | task 11551 | kv cache rm [0, end)
slot update_slots: id  0 | task 11551 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  0 | task 11551 | prompt done, n_past = 502, n_tokens = 502
slot update_slots: id  1 | task 11552 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  2 | task 11553 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 503
slot update_slots: id  3 | task 11550 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 501
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  0 | task 11551 | stop processing: n_past = 502, truncated = 0
slot launch_slot_: id  0 | task 11554 | processing task
slot update_slots: id  0 | task 11554 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  0 | task 11554 | kv cache rm [0, end)
slot update_slots: id  0 | task 11554 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  0 | task 11554 | prompt done, n_past = 502, n_tokens = 502
slot      release: id  0 | task 11554 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  1 | task 11552 | kv cache rm [0, end)
slot update_slots: id  1 | task 11552 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  1 | task 11552 | prompt done, n_past = 502, n_tokens = 502
slot      release: id  1 | task 11552 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  2 | task 11553 | kv cache rm [0, end)
slot update_slots: id  2 | task 11553 | prompt processing progress, n_past = 503, n_tokens = 503, progress = 1.000000
slot update_slots: id  2 | task 11553 | prompt done, n_past = 503, n_tokens = 503
slot      release: id  2 | task 11553 | stop processing: n_past = 503, truncated = 0
slot update_slots: id  3 | task 11550 | kv cache rm [0, end)
slot update_slots: id  3 | task 11550 | prompt processing progress, n_past = 501, n_tokens = 501, progress = 1.000000
slot update_slots: id  3 | task 11550 | prompt done, n_past = 501, n_tokens = 501
slot      release: id  3 | task 11550 | stop processing: n_past = 501, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  0 | task 11595 | processing task
slot update_slots: id  0 | task 11595 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  0 | task 11595 | kv cache rm [0, end)
slot update_slots: id  0 | task 11595 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 11595 | prompt done, n_past = 2, n_tokens = 2
slot      release: id  0 | task 11595 | stop processing: n_past = 2, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  3 | task 11597 | processing task
slot launch_slot_: id  1 | task 11598 | processing task
slot launch_slot_: id  2 | task 11599 | processing task
slot launch_slot_: id  0 | task 11600 | processing task
slot update_slots: id  0 | task 11600 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 503
slot update_slots: id  0 | task 11600 | kv cache rm [0, end)
slot update_slots: id  0 | task 11600 | prompt processing progress, n_past = 503, n_tokens = 503, progress = 1.000000
slot update_slots: id  0 | task 11600 | prompt done, n_past = 503, n_tokens = 503
slot update_slots: id  1 | task 11598 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  2 | task 11599 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  3 | task 11597 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 501
slot      release: id  0 | task 11600 | stop processing: n_past = 503, truncated = 0
slot launch_slot_: id  0 | task 11603 | processing task
slot update_slots: id  0 | task 11603 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  0 | task 11603 | kv cache rm [0, end)
slot update_slots: id  0 | task 11603 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  0 | task 11603 | prompt done, n_past = 502, n_tokens = 502
slot      release: id  0 | task 11603 | stop processing: n_past = 502, truncated = 0
slot launch_slot_: id  0 | task 11604 | processing task
slot update_slots: id  0 | task 11604 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  0 | task 11604 | kv cache rm [0, end)
slot update_slots: id  0 | task 11604 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 11604 | prompt done, n_past = 2, n_tokens = 2
slot update_slots: id  1 | task 11598 | kv cache rm [0, end)
slot update_slots: id  1 | task 11598 | prompt processing progress, n_past = 502, n_tokens = 504, progress = 1.000000
slot update_slots: id  1 | task 11598 | prompt done, n_past = 502, n_tokens = 504
slot      release: id  0 | task 11604 | stop processing: n_past = 2, truncated = 0
slot      release: id  1 | task 11598 | stop processing: n_past = 502, truncated = 0
slot launch_slot_: id  0 | task 11601 | processing task
slot update_slots: id  0 | task 11601 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 502
slot update_slots: id  0 | task 11601 | kv cache rm [0, end)
slot update_slots: id  0 | task 11601 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  0 | task 11601 | prompt done, n_past = 502, n_tokens = 502
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot      release: id  0 | task 11601 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  2 | task 11599 | kv cache rm [0, end)
slot update_slots: id  2 | task 11599 | prompt processing progress, n_past = 502, n_tokens = 502, progress = 1.000000
slot update_slots: id  2 | task 11599 | prompt done, n_past = 502, n_tokens = 502
slot      release: id  2 | task 11599 | stop processing: n_past = 502, truncated = 0
slot update_slots: id  3 | task 11597 | kv cache rm [0, end)
slot update_slots: id  3 | task 11597 | prompt processing progress, n_past = 501, n_tokens = 501, progress = 1.000000
slot update_slots: id  3 | task 11597 | prompt done, n_past = 501, n_tokens = 501
slot      release: id  3 | task 11597 | stop processing: n_past = 501, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  1 | task 11613 | processing task
slot launch_slot_: id  0 | task 11614 | processing task
slot launch_slot_: id  2 | task 11615 | processing task
slot update_slots: id  0 | task 11614 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 503
slot update_slots: id  0 | task 11614 | kv cache rm [0, end)
slot update_slots: id  0 | task 11614 | prompt processing progress, n_past = 503, n_tokens = 503, progress = 1.000000
slot update_slots: id  0 | task 11614 | prompt done, n_past = 503, n_tokens = 503
slot update_slots: id  1 | task 11613 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 501
slot update_slots: id  2 | task 11615 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  2 | task 11615 | kv cache rm [0, end)
slot update_slots: id  2 | task 11615 | prompt processing progress, n_past = 2, n_tokens = 505, progress = 1.000000
slot update_slots: id  2 | task 11615 | prompt done, n_past = 2, n_tokens = 505
slot      release: id  0 | task 11614 | stop processing: n_past = 503, truncated = 0
slot      release: id  2 | task 11615 | stop processing: n_past = 2, truncated = 0
slot update_slots: id  1 | task 11613 | kv cache rm [0, end)
slot update_slots: id  1 | task 11613 | prompt processing progress, n_past = 501, n_tokens = 501, progress = 1.000000
slot update_slots: id  1 | task 11613 | prompt done, n_past = 501, n_tokens = 501
slot      release: id  1 | task 11613 | stop processing: n_past = 501, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  3 | task 11619 | processing task
slot update_slots: id  3 | task 11619 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 12
slot update_slots: id  3 | task 11619 | kv cache rm [0, end)
slot update_slots: id  3 | task 11619 | prompt processing progress, n_past = 12, n_tokens = 12, progress = 1.000000
slot update_slots: id  3 | task 11619 | prompt done, n_past = 12, n_tokens = 12
slot      release: id  3 | task 11619 | stop processing: n_past = 12, truncated = 0
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200
slot launch_slot_: id  2 | task 11647 | processing task
slot update_slots: id  2 | task 11647 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  2 | task 11647 | kv cache rm [0, end)
slot update_slots: id  2 | task 11647 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  2 | task 11647 | prompt done, n_past = 2, n_tokens = 2
/app/src/llama-graph.cpp:185: GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed
srv  cancel_tasks: cancel task, id_task = 11647
srv  log_server_r: request: POST /v1/embeddings 10.2.0.132 200

Note: it is not deterministic, but it seems to happen more frequently when enough slots are in use. To reproduce, you may want to reduce LLAMA_ARG_N_PARALLEL to 2, for instance; a rough reproduction sketch follows below.
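My (unverified) reading of the assert is that with pooling_type == MEAN the seq_id is used to index a per-ubatch structure sized by n_tokens, so a very short prompt on a higher-numbered slot can trip the check; in the log above, the failing task is a 2-token prompt on slot 2. The script and payloads below are illustrative assumptions, not my actual client:

# Assumes the server from the command above is reachable on localhost:80
# and that the model alias is embeddings-multilingual.
# The long input aims to stay just under the 512-token slot context.
long=$(python3 -c "print('word ' * 250)")
for i in $(seq 1 200); do
  # Keep one long and one very short request in flight at the same time,
  # so that a tiny prompt can land on a higher-numbered slot.
  curl -s http://localhost:80/v1/embeddings \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"embeddings-multilingual\", \"input\": \"$long\"}" > /dev/null &
  curl -s http://localhost:80/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "embeddings-multilingual", "input": "hi"}' > /dev/null &
  wait
done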

Originally posted by @aviallon in #9000 (comment)

Labels: bug (Something isn't working), embeddings (embedding related topics), server