| Age | Commit message | Author |
|
|
|
The text-only NVFP4 sibling of Qwen3.6-27B-VL ships without
preprocessor_config.json, which sglang's loader still expects. Add a
generic, idempotent cache-fixup helper invoked via ExecStartPre that
injects missing files from /usr/share/sglang/cache-fixups/ into the
matching HF Hub snapshot dirs, and ship the preprocessor_config.json
needed by Qwen3.6-27B-Text-NVFP4-MTP as the first consumer.
Also bumps pkgver to r12460.b6b9145c9.
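
A minimal sketch of the idempotent injection described above, not the packaged helper itself. It assumes a hypothetical layout in which each subdirectory of the fixups root is named after a Hub cache repo directory (for example `models--<org>--<name>`); the function name `apply_cache_fixups` is likewise illustrative. Only the `snapshots/<revision>/` layout comes from the Hugging Face Hub cache convention.

```python
import shutil
from pathlib import Path
from typing import List


def apply_cache_fixups(fixups_root: Path, hub_cache: Path) -> List[Path]:
    """Copy fixup files into every matching snapshot dir, skipping existing files.

    Assumed (hypothetical) layout: fixups_root/<repo-dir>/<file>, where <repo-dir>
    matches a directory name in the Hub cache, e.g. models--<org>--<name>.
    """
    copied = []
    for repo_fixups in sorted(p for p in fixups_root.iterdir() if p.is_dir()):
        snapshots = hub_cache / repo_fixups.name / "snapshots"
        if not snapshots.is_dir():
            continue  # model not downloaded yet, nothing to fix up
        for snapshot in sorted(p for p in snapshots.iterdir() if p.is_dir()):
            for src in sorted(p for p in repo_fixups.iterdir() if p.is_file()):
                dst = snapshot / src.name
                # Never overwrite: an existing file or symlink (even a dangling
                # one) means the snapshot already has an entry under this name.
                if dst.exists() or dst.is_symlink():
                    continue
                shutil.copy2(src, dst)
                copied.append(dst)
    return copied
```

Because existing entries are never overwritten, running the helper on every service start is harmless: a second run copies nothing.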
|
|
OpenAI review of the working setup flagged five packaging gaps. Apply
all five so the SM120 NVFP4 path survives fresh installs and AUR pulls:
1. Add SM120-NVFP4-NOTES.md and gemma_4_31b_nvfp4.env.example to
   source=() and install both under /usr/share/doc/sglang-git/.
   Operator copies the env example into /etc/sglang/ and fills in the
   HF snapshot hash; package upgrades don't clobber operator edits.
2. Add NVFP4-specific optdepends so fresh installs see the runtime
   prerequisites: cuda, gcc15, cutlass, python-nvidia-cudnn-frontend,
   python-compressed-tensors-git>=0.15.0.
3. Update prepare() comment to describe all four patch hunks (clamp,
   3D reshape across compressed-tensors / modelopt / jit kernel,
   /usr/include/cutlass discovery), not just the original 3D reshape.
4. Use patch -Np1 -F0 to prevent fuzzy misapply when upstream context
   drifts.
5. Document the Python-minor-version hardcode in the env example
   (LD_LIBRARY_PATH includes python3.14/site-packages/tvm_ffi/lib).
   Add fresh-box checklist + cutlass_fp4_gemm import sanity-check to
   the notes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
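
The Python-minor-version hardcode from item 5 can be checked with a short sketch like the one below. It assumes tvm_ffi is installed into the running interpreter's site-packages; both function names are hypothetical and do not appear in the package.

```python
import sysconfig
from pathlib import Path


def tvm_ffi_lib_dir() -> Path:
    """Directory where tvm_ffi's shared libraries would live for this interpreter.

    Assumption: tvm_ffi is installed under the interpreter's pure-Python
    site-packages directory, as the hardcoded path in the env example implies.
    """
    return Path(sysconfig.get_path("purelib")) / "tvm_ffi" / "lib"


def hardcode_is_stale(ld_library_path: str) -> bool:
    """True if no LD_LIBRARY_PATH entry matches this interpreter's tvm_ffi/lib."""
    expected = tvm_ffi_lib_dir()
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return all(Path(entry) != expected for entry in entries)
```

After a Python minor-version bump, `hardcode_is_stale` applied to the value from the env file would return True, which is the cue to update the path.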
|
|
Documents the full stack of fixes that got RedHatAI/gemma-4-31B-it-NVFP4
running coherently at ~47 tok/s on RTX 5090. Covers the 13 distinct
failure modes encountered, in chronological order, each with its root
cause and the specific resolution.
Companion env-file example (gemma_4_31b_nvfp4.env.example) shows the
working config: LD_PRELOAD libcuda, LD_LIBRARY_PATH for tvm_ffi,
--fp4-gemm-backend=cutlass (sglang JIT, not flashinfer_cutlass), and
the model snapshot's chat_template.jinja override.
Drop the patch + this doc when sglang upstream PR #22927 lands and the
JIT-compiled .so picks up libcuda link flags upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Two fixes for NVFP4 dense inference on RTX 5090 (SM120) with the
RedHatAI compressed-tensors NVFP4 path. Both are required to get
coherent output from RedHatAI/gemma-4-31B-it-NVFP4; either alone
produces broken output.
1. 3D-activation flatten for flashinfer.mm_fp4
   sglang's CUDA-graph-capture path passes [1, max_seq, K] activations
   to flashinfer.mm_fp4, which has a 2D contract. Flatten on entry,
   restore on return, in BOTH dispatch paths (ModelOpt and
   compressed-tensors). Switch final view to reshape since the local
   fp4_gemm wrapper does not enforce contiguity across backends.
2. SM120 E4M3 scale-byte NaN clamp
   CUTLASS FP4 GEMM kernels on SM120 produce NaN when an E4M3
   scale-factor byte equals 0x7f. Clamp scale bytes to 0x7e at the top
   of fp4_gemm() so BOTH cutlass and flashinfer dispatch paths receive
   sanitised scales. Tracks sglang upstream PR #22927 but extends to
   the flashinfer branch (we hit identical garbage on both backends
   without this).
Drop both patches once upstream sglang accepts equivalents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
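
Both fixes can be illustrated with a pure-Python sketch. This is not the patch: the real code operates on torch tensors inside sglang, and every function name here is a hypothetical stand-in. The byte-level facts are those of the E4M3 (fn) encoding: 0x7f decodes to NaN and 0x7e is the largest finite value, 448. The clamp is written as a replacement of exactly 0x7f, which is an assumption about how the patch treats other bytes.

```python
import math

E4M3_NAN_BYTE = 0x7F  # exponent 1111, mantissa 111: NaN in E4M3 (fn)
E4M3_MAX_BYTE = 0x7E  # largest finite positive value, 448.0


def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 (fn) byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def clamp_scale_bytes(scales: bytes) -> bytes:
    """Replace every 0x7f scale byte with 0x7e; other bytes pass through."""
    return bytes(E4M3_MAX_BYTE if b == E4M3_NAN_BYTE else b for b in scales)


def matmul_2d(a, b):
    """Stand-in for a GEMM with a strict 2D contract: [M, K] x [K, N] -> [M, N]."""
    if a and isinstance(a[0][0], list):
        raise ValueError("matmul_2d expects a 2D activation")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_3d_via_2d(x, w):
    """Flatten [B, S, K] to [B*S, K], call the 2D GEMM, restore [B, S, N]."""
    batch, seq = len(x), len(x[0])
    flat = [row for matrix in x for row in matrix]
    out = matmul_2d(flat, w)
    return [out[i * seq:(i + 1) * seq] for i in range(batch)]
```

Applying the clamp once, before the backend is chosen, is what lets both dispatch paths see the same sanitised scales.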
|
|
Tier-2 RedHatAI quants of Google's Gemma 4 family — fills the matrix
for 32 GB VRAM-class cards:
- gemma_4_26b_a4b_nvfp4 (RedHatAI) — ~16 GB, MoE 4B-active.
Recommended pick alongside Parakeet ASR on a 5090: MMLU-Pro 82.6,
GPQA 82.3, fast inference, comfortable memory margin.
- gemma_4_26b_a4b_fp8 (RedHatAI) — ~29 GB FP8-Dynamic, fits alone
on a 32 GB card, no headroom for co-residency.
- gemma_4_31b_nvfp4 (RedHatAI) — ~23 GB NVFP4 dense. Highest
quality (MMLU-Pro 85.2, GPQA 84.3, AIME 89.2). Tight with Parakeet
but viable; cap context to 16 K to leave KV-cache headroom.
All three carry RedHatAI's published recipe.yaml so the partial-quant
boundary (vision/embed/lm_head BF16) is auditable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Adds four model configs to fill the quant matrix:
- qwen3.6_27b_nvfp4 (mmangkad) — caveat: MLP-only quant per
maintainer, 30 GB on disk, 29.79 GB GPU measured on RTX 5090, OOMs
alongside Parakeet at default mem_fraction_static
- qwen3.6_35b_a3b_awq_int4 (QuantTrio)
- qwen3.6_35b_a3b_gptq_int4 (palmfuture)
- qwen3.6_35b_a3b_autoround_int4 (Intel — gated)
Rounds out the 27B and 35B-A3B quant matrices to mirror each other
(BF16 / FP8 / AWQ-Int4 / GPTQ-Int4 / AutoRound-Int4 / NVFP4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Adds Environment=NVCC_PREPEND_FLAGS=--compiler-bindir=/usr/bin/gcc-15
to sglang@.service so flashinfer / sgl_kernel JIT extensions stop
failing to build under Arch's gcc 16 default (nvcc 13.x can't parse
libstdc++ 16). The setting lives in the unit (not in a backup= conf) so
pacman can update it freely without requiring a .pacnew merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
DeepSeek-V4-Flash ships tokenizer_config.json without a chat_template
field, and sglang has no built-in deepseek-v4 chat-template. Without
one, /v1/chat/completions returns BadRequest. Install a chat-template
at /etc/sglang/deepseek_v4.jinja (lifted from DeepSeek-V3.2-Exp,
which uses the same special tokens) and wire it into the V4-Flash
service conf via --chat-template /etc/sglang/deepseek_v4.jinja.
pkgver bumped to r12032.a4f63b6ca to match the freshly built tree.
Mirrors the same change in the stable sglang AUR package.
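
A quick way to tell whether a downloaded snapshot has the problem described above is sketched below. The function name is hypothetical, and the check only inspects the `chat_template` field of `tokenizer_config.json`; it does not look for a separate template file in the snapshot.

```python
import json
from pathlib import Path


def has_chat_template(tokenizer_config: Path) -> bool:
    """True if tokenizer_config.json carries a non-empty chat_template field."""
    config = json.loads(tokenizer_config.read_text(encoding="utf-8"))
    return bool(config.get("chat_template"))
```

When this returns False for a model, the service conf needs an explicit `--chat-template` argument, as done here for V4-Flash.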
|
|
Includes v3.6.0 triton_kernels migration, gate_scal expert-sorted reorder,
empty-batch + padded-masking lifecycle guards, and re-added pow-2 padding
for the topk_forward kernel constraint that v3.6.0 left in place.
|
|
Add etc/sglang/<model>.conf for every entry in _models to backup= so
pacman preserves user edits at upgrade time and emits .pacnew for any
package-side changes. Previously per-model confs were silently
overwritten on every upgrade, which has bitten us repeatedly during
V4-Flash sm_120 work: the conf carried smoke-test env-vars, a
non-default --moe-runner-backend, etc., and was reset to the shipped
defaults on every install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Pull from the fork branch wjh/v4-flash-mxfp4-routed-experts which
carries the V4-Flash-on-sm_120 patches (MXFP4 routed-expert plumbing,
biasless triton-kernels MoE path, HashTopK+TopK pow-2 zero-gate
padding, V4 KV pool SWA-mapping wiring, set_swa_loc, is_prefill kwarg
upstream bug fix, mhc tilelang wg_wait removal). Branch tracks
upstream sgl-project/sglang amd/deepseek_v4 + 8 wjh patches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
sglang-git was missing nine confs that landed in the stable sglang AUR
package across the day's work:
- gemma_4_31b_fp8 (RedHatAI FP8-Dynamic, ~31 GB; 32 GB cards)
- qwen3.6_27b (BF16), qwen3.6_27b_fp8
- qwen3.6_27b_awq_int4 (cyankiwi) — ~14 GB
- qwen3.6_27b_gptq_int4 (groxaxo) — ~14 GB
- qwen3.6_27b_autoround_int4 (Lorbus) — ~14 GB
- qwen3.6_35b_a3b (BF16), qwen3.6_35b_a3b_fp8
- qwen3.6_35b_a3b_nvfp4 (RedHatAI) — ~18 GB, Blackwell-native
All Qwen confs use --reasoning-parser qwen3 --tool-call-parser qwen3_coder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Arch ships two confusingly-named multipart packages:
python-multipart -> defnull/multipart (wrong)
python-python-multipart -> Kludex/python-multipart (correct, what FastAPI uses)
Selecting the wrong one means FastAPI raises at endpoint registration:
RuntimeError: Form data requires "python-multipart" to be installed.
It seems you installed "multipart" instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
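
To see which of the two projects is actually installed, the distribution metadata can be queried directly. This is a sketch with a hypothetical function name. It relies on the PyPI distribution names `python-multipart` (Kludex) and `multipart` (defnull), which differ from the Arch package names above.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_multipart_distributions() -> Dict[str, Optional[str]]:
    """Map each PyPI distribution name to its installed version, or None."""
    found = {}
    for name in ("python-multipart", "multipart"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

A result where `python-multipart` is None and `multipart` has a version corresponds to the RuntimeError quoted above.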
|
|
Move proven-required modules (fastapi, starlette, openai, huggingface-hub,
pillow, packaging, psutil, scipy, sentencepiece, pyzmq, multipart, uvicorn,
flashinfer) from optdepends to depends. They are imported unconditionally
during sglang.launch_server bootstrap, so the package crash-loops at startup
without them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
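
A sanity check along these lines can confirm the modules resolve before the service is started. It is a sketch, not part of the package. The mapping from package names to import names (pillow to `PIL`, pyzmq to `zmq`, huggingface-hub to `huggingface_hub`) is supplied here, and `multipart` is left out because both multipart projects can provide a module of that name.

```python
from importlib.util import find_spec

# Import names for the modules moved to depends; mapping supplied for this sketch.
REQUIRED_MODULES = (
    "fastapi", "starlette", "openai", "huggingface_hub", "PIL", "packaging",
    "psutil", "scipy", "sentencepiece", "zmq", "uvicorn", "flashinfer",
)


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from `missing_modules()` means every listed module is resolvable; it does not prove that each one imports cleanly.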
|
|
Prevents scheduler busy-wait burning a CPU core at idle.
Also updates gemma configs to use gemma4 parser and bumps pkgver.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
- Replace sglang.service with sglang@.service template unit
- Add per-model config files for Gemma 4 and Qwen 3.5 variants
- Default to --sleep-on-idle to reduce CPU usage when idle
- Update sglang.conf as global config with SGLANG_OPTS/SGLANG_ARGS split
- Point source to JustinTong0323/sglang new-model-gg branch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
- Replace DynamicUser with static sglang user via sysusers.d/tmpfiles.d
(persistent HF_HOME at /var/lib/sglang survives restarts)
- Add sglang.env (mode 0600) for credentials, separate from sglang.conf
- Harden systemd service: NoNewPrivileges, PrivateTmp, ProtectSystem/Home
- Bind to 127.0.0.1:30000 instead of 0.0.0.0:8000
- Fix arch: any -> x86_64 (CUDA dependency)
- Fix python-python-multipart -> python-multipart
- Move config from /etc/sglang.conf to /etc/sglang/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Replace outdated Qwen2.5 model list with Qwen3.5 dense and MoE models,
including approximate BF16 and GPTQ-Int4 VRAM estimates. Add reasoning
and tool call parser documentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Add sglang.service and /etc/sglang.conf for systemd integration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|