Description
Tagging @JohannesGaessler for visibility!
TLDR:
I'm running imatrix.cpp (latest llama.cpp) with --ubatch-size 8192, but am getting CUDA errors. My suspicion is that CUDA requires certain arguments to be < INT_MAX (2^31 - 1), so large physical batch sizes cause CUDA launch errors for MoEs. --ubatch-size 8191 works fine; 8192 does not.
Long form:
I'm running imatrix.cpp with a large physical batch size (8192), but sadly I get this error:
CUDA error: invalid configuration argument
current device: 0, in function ggml_cuda_mul_mat_id at llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2062
cudaGetLastError()
I.e., the error is raised here:
get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted,
    ne10, nb11, nb12, nb13,
    ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t),
    ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream);
CUDA_CHECK(cudaGetLastError());
Using --ubatch-size 8192 causes the error to occur on Qwen 3 30B MoE; --ubatch-size 8191 works fine.
My suspicion is that CUDA requires these arguments to be < INT_MAX. Qwen 3 30B has 128 experts and a hidden dim of 2048, so 8192 * 2048 * 128 = 2147483648 > 2147483647 (INT_MAX), whereas 8191 * 2048 * 128 = 2147221504 < INT_MAX.
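For concreteness, a minimal standalone sketch of that arithmetic (the 128 experts and 2048 dim are from above; accumulating the product in 64 bits and comparing against INT_MAX is just to illustrate the suspected 32-bit overflow):

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    // Qwen 3 30B MoE: 128 experts, hidden dim 2048.
    const int64_t n_expert = 128;
    const int64_t n_dim    = 2048;

    for (const int64_t ubatch : {8191LL, 8192LL}) {
        // Compute in 64 bits to see the true value, then check whether
        // it would still fit in a 32-bit int.
        const int64_t product = ubatch * n_dim * n_expert;
        printf("ubatch=%lld -> %lld (%s INT_MAX)\n",
               (long long) ubatch, (long long) product,
               product > INT_MAX ? "exceeds" : "fits under");
    }
    return 0;
}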
I.e., one of the arguments
ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted
is exceeding INT_MAX, thus causing CUDA to error out.
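If that is what's happening, a guard along these lines in ggml_cuda_mul_mat_id would flag the case before the launch (a sketch only: the names ne10, ne_get_rows, and ts_src1_sorted come from the snippet above, but the check itself is hypothetical, not existing llama.cpp code):

// Hypothetical guard: compute the byte strides in 64 bits and assert
// they fit in a 32-bit int before passing them to get_rows_cuda.
const int64_t nb_row   = ne10 * ts_src1_sorted;    // per-row byte stride
const int64_t nb_plane = ne_get_rows * nb_row;     // per-plane byte stride
GGML_ASSERT(nb_row   <= INT_MAX && "row stride overflows 32-bit int");
GGML_ASSERT(nb_plane <= INT_MAX && "plane stride overflows 32-bit int");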