Hi, thanks for submitting this package (and llama.cpp-cuda-f16 too)!

Would it be possible to add GGML_CUDA_FA_ALL_QUANTS=ON to the build options for this package and llama.cpp-cuda-f16? This option enables additional KV cache quantization types and combinations. The build does take longer, but in my experience not unreasonably so, and it would be a nice addition for users trying to reduce VRAM usage.
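For reference, a minimal sketch of how this could look in the PKGBUILD's build() step, assuming it configures llama.cpp via cmake with -D flags (the surrounding options are illustrative, not the package's actual flags; only GGML_CUDA_FA_ALL_QUANTS=ON is the requested change):

```sh
# Hypothetical cmake configure line inside build(); existing flags are assumptions.
cmake -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON  # build FlashAttention kernels for all KV cache quant combinations
```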
Pinned Comments
txtsd commented on 2024-10-26 20:17 (UTC) (edited on 2024-12-06 14:15 (UTC) by txtsd)
Alternate versions
llama.cpp
llama.cpp-vulkan
llama.cpp-sycl-fp16
llama.cpp-sycl-fp32
llama.cpp-cuda
llama.cpp-cuda-f16
llama.cpp-hip