Skip to content

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Closed
@Priestru

Description

@Priestru

Solution: #1230 (comment)

UPD:
Confirmed working just fine on Windows.

Issue below happened only on WSL.

#1207

First i pull and clean

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# git pull
Already up to date.
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make clean
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'

Build fresh with cuBLAS

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_mul_mat_use_blas’:
ggml.c:7921:36: warning: unused parameter ‘src0’ [-Wunused-parameter]
 7921 |         const struct ggml_tensor * src0,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:8520:31: warning: unused variable ‘y’ [-Wunused-variable]
 8520 |                 const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
      |                               ^
ggml.c: In function ‘ggml_compute_forward_alibi_f32’:
ggml.c:9104:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9104 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
ggml.c: In function ‘ggml_compute_forward_alibi_f16’:
ggml.c:9165:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9165 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

Trying to load model that worked before update

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap
main: seed = 1682770733
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 8740093.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
CUDA error 2 at ggml-cuda.cu:359: out of memory

I haven't updated my libllama.so for llama-cpp-python yet, so it uses previous version, and works with this very model just fine. Smth happened.

RTX 3050 8GB

UPD 2:
Issue persists on WSL. I did full clean and yet it doesn't work after being built with current version.

UPD 3:
I found some years old version llama.cpp and did exactly same thing and everything worked fine. So i guess it's not me be especially dumb today.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions