
CUDA error: invalid configuration argument for MoEs - --ubatch-size 8192 exceeds INT_MAX #13376

Closed
@danielhanchen

Tagging @JohannesGaessler for visibility!

TLDR:

I'm running imatrix.cpp (latest llama.cpp) with --ubatch-size 8192 and getting CUDA errors. My suspicion is that CUDA needs arguments to be no larger than INT_MAX (2^31 - 1), and that large physical batch sizes push an argument past that limit, causing CUDA launch errors for MoEs. --ubatch-size 8191 works fine; 8192 does not.

Long form:

I'm running imatrix.cpp with a large physical batch size (8192), but sadly I get this error:

CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_mul_mat_id at llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2062
  cudaGetLastError()

i.e. the error is raised here:

get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted,
        ne10, nb11, nb12, nb13,
        ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t),
        ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream);
CUDA_CHECK(cudaGetLastError());

Using --ubatch-size 8192 causes the error to occur on Qwen 3 30B MoE.

--ubatch-size 8191 works fine.

My suspicion is that CUDA requires arguments to be no larger than INT_MAX. Qwen has 128 experts and an input dim of 2048, so 8192 * 2048 * 128 = 2147483648 > 2147483647 (INT_MAX).

8191 * 2048 * 128 = 2147221504, which is less than INT_MAX.
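
To make the arithmetic above concrete, here is a minimal C++ sketch that does the multiplication in 64 bits and compares the result against INT_MAX. The helper names element_count and fits_in_int are hypothetical (they are not from llama.cpp), and the product of ubatch size, dim, and expert count is just the quantity suspected above, not a value confirmed to be passed to the kernel.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): multiply in 64 bits so the
// product itself cannot wrap for these inputs.
constexpr int64_t element_count(int64_t n_ubatch, int64_t dim, int64_t n_expert) {
    return n_ubatch * dim * n_expert;
}

// Hypothetical helper: true if the value is representable as a 32-bit int.
constexpr bool fits_in_int(int64_t v) {
    return v >= INT_MIN && v <= INT_MAX;
}

// Numbers from this report: dim 2048, 128 experts.
static_assert(element_count(8192, 2048, 128) == 2147483648LL, "exactly 2^31");
static_assert(!fits_in_int(element_count(8192, 2048, 128)), "one past INT_MAX");
static_assert(element_count(8191, 2048, 128) == 2147221504LL, "below 2^31");
static_assert(fits_in_int(element_count(8191, 2048, 128)), "still fits in int");
```

The static_assert lines are checked at compile time, so the file compiling at all confirms that 8192 lands exactly one past INT_MAX while 8191 stays below it.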

I.e. one of the arguments:

ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted

exceeds INT_MAX, which causes CUDA to error out.
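
One way to test this suspicion would be to check each of those size arguments against INT_MAX just before the get_rows_cuda call. The sketch below is only an illustration: first_arg_over_int_max is a hypothetical helper, not an existing llama.cpp or CUDA function, and it assumes the values are already available as size_t.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <initializer_list>

// Hypothetical diagnostic (not part of llama.cpp or CUDA): returns the
// zero-based position of the first value that does not fit in an int,
// or -1 if every value fits.
inline int first_arg_over_int_max(std::initializer_list<size_t> args) {
    int idx = 0;
    for (size_t v : args) {
        if (v > static_cast<size_t>(INT_MAX)) {
            return idx;
        }
        ++idx;
    }
    return -1;
}
```

Called with the three size expressions listed above, a return value of -1 would mean none of them exceeds INT_MAX, which would point to a different cause for the launch failure.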
