| Git Clone URL: | https://aur.archlinux.org/llama.cpp.git (read-only) |
|---|---|
| Package Base: | llama.cpp |
| Description: | Port of Facebook's LLaMA model in C/C++ |
| Upstream URL: | https://github.com/ggerganov/llama.cpp |
| Licenses: | MIT |
| Submitter: | txtsd |
| Maintainer: | txtsd |
| Last Packager: | txtsd |
| Votes: | 9 |
| Popularity: | 1.51 |
| First Submitted: | 2024-10-26 15:38 (UTC) |
| Last Updated: | 2025-07-02 17:35 (UTC) |
This package now uses the system libggml, so it should work alongside whisper.cpp.
Building of tests and examples has been turned off.
kompute is removed.
Conflicts with e.g. the libggml-git AUR package (needed by the whisper.cpp package), as this includes its own version of libggml. Maybe separating the paths (e.g. moving this to /usr/local) could help if the internal libggml is required and the shared version cannot be used, or linking to libggml dynamically? For now, at least the conflicts section of the PKGBUILD should list libggml as a first measure, I think :)
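As a rough sketch of that first measure, the relevant PKGBUILD fields could look something like the following (the exact set of conflicting package names is an assumption, not the maintainer's actual change):

```sh
# Hypothetical PKGBUILD excerpt: declare the conflict explicitly so pacman
# refuses to install this package alongside another provider of libggml.
pkgname=llama.cpp
conflicts=('libggml' 'libggml-git')  # assumed names; adjust to whichever packages ship libggml
```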
Why build the tests? Switch them off by default.
@kelvie
(llama.cpp assertions via GGML_ASSERT are always enabled in all build types.)
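For reference, a minimal sketch of a configure step that skips tests and examples; the LLAMA_BUILD_TESTS / LLAMA_BUILD_EXAMPLES options come from upstream's CMake, while the rest of the invocation is illustrative rather than the PKGBUILD's actual build() function:

```sh
# Configure llama.cpp without building tests or examples; note that
# GGML_ASSERT assertions stay enabled regardless of the build type.
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF
cmake --build build
```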
@cfillion I got rid of OpenBLAS. Please let me know if it's working as expected.
Fair enough. Is OpenBLAS actually faster for most users though? It's unbearably slow here.
```
export TEXT="Arch Linux defines simplicity as without unnecessary additions or modifications. It ships software as released by the original developers (upstream) with minimal distribution-specific (downstream) changes: patches not accepted by upstream are avoided, and Arch's downstream patches consist almost entirely of backported bug fixes that are obsoleted by the project's next release."

# GGML_BLAS=ON GGML_BLAS_VENDOR=OpenBLAS
$ time llama-cli -m Mistral-Small-Instruct-2409-Q6_K_L.gguf -tb 16 -p "$TEXT" -n 1 -no-cnv
llama_perf_context_print: prompt eval time = 10825.25 ms / 76 tokens ( 142.44 ms per token, 7.02 tokens per second)
291.99s user 1.91s system 2544% cpu 11.551 total

# GGML_BLAS=OFF
$ time llama-cli -m Mistral-Small-Instruct-2409-Q6_K_L.gguf -tb 16 -p "$TEXT" -n 1 -no-cnv
llama_perf_context_print: prompt eval time = 2248.24 ms / 76 tokens ( 29.58 ms per token, 33.80 tokens per second)
40.49s user 0.54s system 1395% cpu 2.940 total

# GGML_BLAS=OFF (w/ multiprocessing to imitate OpenBLAS's behavior, ruling it out as a single cause)
$ time llama-cli -m Mistral-Small-Instruct-2409-Q6_K_L.gguf -tb 32 -p "$TEXT" -n 1 -no-cnv
llama_perf_context_print: prompt eval time = 2350.10 ms / 76 tokens ( 30.92 ms per token, 32.34 tokens per second)
85.28s user 0.78s system 2784% cpu 3.091 total
```
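For context, the two configurations above differ only in how ggml is built; a sketch of the corresponding configure steps (GGML_BLAS and GGML_BLAS_VENDOR are ggml's CMake switches; the exact invocation here is illustrative):

```sh
# Prompt processing routed through OpenBLAS (the slower case above):
cmake -B build-blas -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build-blas

# ggml's own CPU kernels only (the faster case above):
cmake -B build-cpu -DGGML_BLAS=OFF
cmake --build build-cpu
```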
@cfillion I'm already maintaining several variants of this package. I don't think I'll add another variant unless it's popularly requested.
Thanks for the feedback though!
2 questions:

GGML_BLAS=ON with OpenBLAS is significantly slower at prompt processing on my CPU (9950X w/ n_threads = n_threads_batch = 16) than OFF. Could there be a variant of the package without it?

Also, isn't the kompute submodule unnecessary/unused unless GGML_KOMPUTE=ON is set? (which would be for a separate llama.cpp-kompute package?)
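On the second question, a sketch of what is meant, assuming upstream's GGML_KOMPUTE CMake option: with the backend left off, the kompute submodule never needs to be fetched or built.

```sh
# Without -DGGML_KOMPUTE=ON the kompute sources are unused, so a plain
# checkout (no kompute submodule init) is enough for this package:
cmake -B build -DGGML_KOMPUTE=OFF
cmake --build build
```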
@abitrolly Because of the maintenance overhead. It's easier to maintain them as separate packages.
Pinned Comments
txtsd commented on 2024-10-26 20:14 (UTC) (edited on 2024-12-06 14:14 (UTC) by txtsd)
Alternate versions
llama.cpp
llama.cpp-vulkan
llama.cpp-sycl-fp16
llama.cpp-sycl-fp32
llama.cpp-cuda
llama.cpp-cuda-f16
llama.cpp-hip
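For anyone choosing one of these variants, the standard AUR workflow applies; a sketch using the vulkan variant as an example (substitute any package name from the list above; the clone URL follows the same pattern as the one in the table at the top):

```sh
# Build and install an alternate variant from the AUR
git clone https://aur.archlinux.org/llama.cpp-vulkan.git
cd llama.cpp-vulkan
makepkg -si
```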