WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found)

Solution: https://github.com/ggerganov/llama.cpp/issues/1230#issuecomment-1532562940

**UPD:
Confirmed working just fine on Windows.**

Issue below happened only on WSL.

https://github.com/ggerganov/llama.cpp/pull/1207


First i pull and clean

```
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# git pull
Already up to date.
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make clean
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
```

Build fresh with cuBLAS

```
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_mul_mat_use_blas’:
ggml.c:7921:36: warning: unused parameter ‘src0’ [-Wunused-parameter]
 7921 |         const struct ggml_tensor * src0,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:8520:31: warning: unused variable ‘y’ [-Wunused-variable]
 8520 |                 const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
      |                               ^
ggml.c: In function ‘ggml_compute_forward_alibi_f32’:
ggml.c:9104:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9104 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
ggml.c: In function ‘ggml_compute_forward_alibi_f16’:
ggml.c:9165:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9165 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
```
Trying to load model that worked before update

```
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap
main: seed = 1682770733
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 8740093.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
CUDA error 2 at ggml-cuda.cu:359: out of memory
```

I haven't updated my libllama.so for llama-cpp-python yet, so it uses previous version, and works with this very model just fine. Smth happened.

RTX 3050 8GB

UPD 2:
Issue persists on WSL. I did full clean and yet it doesn't work after being built with current version.

UPD 3:
I found some years old version llama.cpp and did exactly same thing and everything worked fine. So i guess it's not me be especially dumb today.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions