Hi there, I was reading the documentation and saw:

> Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
But I'm curious: would it make sense to set `-DSD_FLASH_ATTN=ON` for the Mac, Linux, and other non-CUBLAS builds? For example, in this build matrix:
- build: "noavx"
defines: "-DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx2"
defines: "-DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx"
defines: "-DGGML_AVX2=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx512"
defines: "-DGGML_AVX512=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "cuda12"
Thanks!