Age | Commit message | Author
2 days | Regenerate .SRCINFO | Will Handley
2 days | Add qwen3.6_27b_text_nvfp4 config + HF-cache fixup mechanism | Will Handley
The text-only NVFP4 sibling of Qwen3.6-27B-VL ships without preprocessor_config.json, which sglang's loader still expects. Add a generic, idempotent cache-fixup helper invoked via ExecStartPre that injects missing files from /usr/share/sglang/cache-fixups/ into the matching HF Hub snapshot dirs, and ship the preprocessor_config.json needed by Qwen3.6-27B-Text-NVFP4-MTP as the first consumer. Also bumps pkgver to r12460.b6b9145c9.
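A minimal sketch of what an idempotent cache-fixup pass could look like, written in Python for illustration. The packaged helper itself is not shown in this log, so the function name `apply_cache_fixups` and the assumed layout (one `models--<org>--<name>` directory per model under the fixups root, mirroring the HF Hub cache directory names) are hypothetical.

```python
import shutil
from pathlib import Path


def apply_cache_fixups(fixups_root: Path, hub_cache: Path) -> list[Path]:
    """Copy fixup files that are missing from matching snapshot dirs.

    Assumed (hypothetical) layout:
      fixups_root/models--<org>--<name>/<file>
      hub_cache/models--<org>--<name>/snapshots/<hash>/
    Returns the list of files created on this run.
    """
    created: list[Path] = []
    if not fixups_root.is_dir():
        return created
    for model_dir in sorted(p for p in fixups_root.iterdir() if p.is_dir()):
        snapshots = hub_cache / model_dir.name / "snapshots"
        if not snapshots.is_dir():
            continue  # model not downloaded yet, nothing to fix
        for snapshot in sorted(p for p in snapshots.iterdir() if p.is_dir()):
            for src in sorted(p for p in model_dir.iterdir() if p.is_file()):
                dest = snapshot / src.name
                if dest.exists() or dest.is_symlink():
                    continue  # idempotent: never overwrite an existing entry
                shutil.copyfile(src, dest)
                created.append(dest)
    return created
```

Running it a second time creates nothing, which is the property that makes it safe to call from ExecStartPre on every service start.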
12 days | Package SM120 NVFP4 fixes durably for AUR pulls + upgrades | Will Handley
OpenAI review of the working setup flagged five packaging gaps. Apply all five so the SM120 NVFP4 path survives fresh installs and AUR pulls:

1. Add SM120-NVFP4-NOTES.md and gemma_4_31b_nvfp4.env.example to source=() and install both under /usr/share/doc/sglang-git/. Operator copies the env example into /etc/sglang/ and fills in the HF snapshot hash; package upgrades don't clobber operator edits.
2. Add NVFP4-specific optdepends so fresh installs see the runtime prerequisites: cuda, gcc15, cutlass, python-nvidia-cudnn-frontend, python-compressed-tensors-git>=0.15.0.
3. Update prepare() comment to describe all four patch hunks (clamp, 3D reshape across compressed-tensors / modelopt / jit kernel, /usr/include/cutlass discovery), not just the original 3D reshape.
4. Use patch -Np1 -F0 to prevent fuzzy misapply when upstream context drifts.
5. Document the Python-minor-version hardcode in the env example (LD_LIBRARY_PATH includes python3.14/site-packages/tvm_ffi/lib).

Add fresh-box checklist + cutlass_fp4_gemm import sanity-check to the notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 days | Add SM120 NVFP4 notes + working env example | Will Handley
Documents the full stack of fixes that got RedHatAI/gemma-4-31B-it-NVFP4 running coherently at ~47 tok/s on RTX 5090. 13 distinct failure modes encountered in chronological order, each with its root cause and the specific resolution. Companion env-file example (gemma_4_31b_nvfp4.env.example) shows the working config: LD_PRELOAD libcuda, LD_LIBRARY_PATH for tvm_ffi, --fp4-gemm-backend=cutlass (sglang JIT, not flashinfer_cutlass), and the model snapshot's chat_template.jinja override. Drop the patch + this doc when sglang upstream PR #22927 lands and the JIT-compiled .so picks up libcuda link flags upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 days | Patch fp4_gemm + ModelOptFp4LinearMethod for SM120 NVFP4 | Will Handley
Two fixes for NVFP4 dense inference on RTX 5090 (SM120) with the RedHatAI compressed-tensors NVFP4 path. Both are required to get coherent output from RedHatAI/gemma-4-31B-it-NVFP4; either alone produces broken output.

1. 3D-activation flatten for flashinfer.mm_fp4
   sglang's CUDA-graph-capture path passes [1, max_seq, K] activations to flashinfer.mm_fp4, which has a 2D contract. Flatten on entry, restore on return, in BOTH dispatch paths (ModelOpt and compressed-tensors). Switch final view to reshape since the local fp4_gemm wrapper does not enforce contiguity across backends.

2. SM120 E4M3 scale-byte NaN clamp
   CUTLASS FP4 GEMM kernels on SM120 produce NaN when an E4M3 scale-factor byte equals 0x7f. Clamp scale bytes to 0x7e at the top of fp4_gemm() so BOTH cutlass and flashinfer dispatch paths receive sanitised scales. Tracks sglang upstream PR #22927 but extends to the flashinfer branch (we hit identical garbage on both backends without this).

Drop both patches once upstream sglang accepts equivalents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
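The two fixes can be illustrated without a GPU. The sketch below is not the patch itself: it only models the shape bookkeeping for flatten-on-entry / restore-on-return and the byte-level scale sanitising, using plain Python. The helper names are hypothetical, and the clamp assumes scale factors are non-negative, so 0x7f is the only NaN byte that can occur and replacing it with 0x7e is equivalent to clamping.

```python
from math import prod


def flatten_leading_dims(shape: tuple[int, ...]) -> tuple[tuple[int, int], tuple[int, ...]]:
    """Collapse an activation shape [..., K] to the 2D shape [M, K].

    Returns the 2D shape plus the leading dims needed to restore it later.
    """
    *lead, k = shape
    return (prod(lead), k), tuple(lead)


def restore_leading_dims(lead: tuple[int, ...], out_2d: tuple[int, int]) -> tuple[int, ...]:
    """Expand a 2D GEMM output shape [M, N] back to [..., N]."""
    m, n = out_2d
    if m != prod(lead):
        raise ValueError("row count does not match the saved leading dims")
    return (*lead, n)


def sanitise_e4m3_scales(raw: bytes) -> bytes:
    """Replace the E4M3 byte 0x7f with 0x7e (assumes non-negative scales)."""
    return bytes(0x7E if b == 0x7F else b for b in raw)
```

For example, a [1, 4096, 512] activation is handed to the 2D kernel as [4096, 512], and a [4096, 256] result is returned to the caller as [1, 4096, 256]. In tensor code the restore step corresponds to the final reshape the commit mentions.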
14 days | Add gemma_4_26b_a4b_{fp8,nvfp4} + gemma_4_31b_nvfp4 configs | Will Handley
Tier-2 RedHatAI quants of Google's Gemma 4 family; fills the matrix for 32 GB VRAM-class cards:

- gemma_4_26b_a4b_nvfp4 (RedHatAI): ~16 GB, MoE 4B-active. Recommended pick alongside Parakeet ASR on a 5090: MMLU-Pro 82.6, GPQA 82.3, fast inference, comfortable memory margin.
- gemma_4_26b_a4b_fp8 (RedHatAI): ~29 GB FP8-Dynamic, fits alone on a 32 GB card, no headroom for co-residency.
- gemma_4_31b_nvfp4 (RedHatAI): ~23 GB NVFP4 dense. Highest quality (MMLU-Pro 85.2, GPQA 84.3, AIME 89.2). Tight with Parakeet but viable; cap context to 16 K to leave KV-cache headroom.

All three carry RedHatAI's published recipe.yaml so the partial-quant boundary (vision/embed/lm_head BF16) is auditable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
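The memory-margin claims above follow from simple subtraction. As a rough worked example, using only the approximate weight sizes quoted in this commit and a 32 GB card (the function name is illustrative, and the figures ignore KV cache, activations, and any co-resident model such as Parakeet):

```python
CARD_GB = 32  # VRAM class targeted by this commit

# Approximate weight footprints quoted in the commit message.
APPROX_WEIGHTS_GB = {
    "gemma_4_26b_a4b_nvfp4": 16,
    "gemma_4_26b_a4b_fp8": 29,
    "gemma_4_31b_nvfp4": 23,
}


def headroom_gb(card_gb: int, weights_gb: int) -> int:
    """VRAM left after loading weights, before KV cache and other users."""
    return card_gb - weights_gb


HEADROOM = {name: headroom_gb(CARD_GB, gb) for name, gb in APPROX_WEIGHTS_GB.items()}
```

That leaves roughly 16 GB, 3 GB, and 9 GB respectively, which is consistent with the "comfortable", "no headroom", and "tight but viable" descriptions.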
14 days | Bump pkgver to r12426.0588e9044 | Will Handley
14 days | Add qwen3.6_27b_nvfp4 + qwen3.6_35b_a3b INT4 trio configs | Will Handley
Adds four model configs to fill the quant matrix:

- qwen3.6_27b_nvfp4 (mmangkad). Caveat: MLP-only quant per maintainer, 30 GB on disk, 29.79 GB GPU measured on RTX 5090, OOMs alongside Parakeet at default mem_fraction_static
- qwen3.6_35b_a3b_awq_int4 (QuantTrio)
- qwen3.6_35b_a3b_gptq_int4 (palmfuture)
- qwen3.6_35b_a3b_autoround_int4 (Intel, gated)

Rounds out the 27B and 35B-A3B quant matrices to mirror each other (BF16 / FP8 / AWQ-Int4 / GPTQ-Int4 / AutoRound-Int4 / NVFP4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 days | sglang-git: bump to r12421.8ee8a8f92, force gcc-15 host compiler in unit | Will Handley
Adds Environment=NVCC_PREPEND_FLAGS=--compiler-bindir=/usr/bin/gcc-15 to sglang@.service so flashinfer / sgl_kernel JIT extensions stop failing to build under Arch's gcc 16 default (nvcc 13.x can't parse libstdc++ 16). Lives in the unit (not in backup= conf) so pacman freely updates without requiring a .pacnew merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 days | ship deepseek-v4 chat template + bump pkgver | Will Handley
DeepSeek-V4-Flash ships tokenizer_config.json without a chat_template field, and sglang has no built-in deepseek-v4 chat-template. Without one, /v1/chat/completions returns BadRequest. Install a chat-template at /etc/sglang/deepseek_v4.jinja (lifted from DeepSeek-V3.2-Exp, which uses the same special tokens) and wire it into the V4-Flash service conf via --chat-template /etc/sglang/deepseek_v4.jinja. pkgver bumped to r12032.a4f63b6ca to match the freshly built tree. Mirrors the same change in the stable sglang AUR package.
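The condition described above (a tokenizer_config.json with no chat_template field) is easy to detect before launch. The following is a sketch only: `chat_template_args` is a hypothetical helper, not part of sglang or this package, and it simply decides whether a `--chat-template` override needs to be appended for a given snapshot directory.

```python
import json
from pathlib import Path


def chat_template_args(snapshot_dir: Path, fallback_template: str) -> list[str]:
    """Return extra CLI args when the model ships no chat template.

    An empty list means tokenizer_config.json already carries a
    non-empty chat_template field and no override is needed.
    """
    config_path = snapshot_dir / "tokenizer_config.json"
    try:
        config = json.loads(config_path.read_text())
    except FileNotFoundError:
        config = {}  # treat a missing file like a missing field
    if config.get("chat_template"):
        return []
    return ["--chat-template", fallback_template]
```

For a V4-Flash snapshot this would yield `["--chat-template", "/etc/sglang/deepseek_v4.jinja"]`, matching the flag wired into the service conf by this commit.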
14 days | sglang-git: bump to r12014.86e3391fc | Will Handley
Includes v3.6.0 triton_kernels migration, gate_scal expert-sorted reorder, empty-batch + padded-masking lifecycle guards, and re-added pow-2 padding for the topk_forward kernel constraint that v3.6.0 left in place.
14 days | sglang-git: protect per-model conf files from pacman -U overwrite | Will Handley
Add etc/sglang/<model>.conf for every entry in _models to backup= so pacman preserves user edits at upgrade time and emits .pacnew for any package-side changes. Previously per-model confs were silently overwritten on every upgrade, which has bitten us repeatedly during V4-Flash sm_120 work — the conf carried smoke-test env-vars and non-default --moe-runner-backend, etc., and reset to the shipped defaults on every install. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
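The PKGBUILD itself is bash and is not reproduced in this log, so the mapping from `_models` to backup entries is shown here as a Python sketch under that caveat. The one fixed convention it relies on is that pacman backup entries are paths relative to the filesystem root, without a leading slash; the function name and the two sample model names (taken from elsewhere in this log) are illustrative.

```python
def backup_entries(models: list[str]) -> list[str]:
    """Build pacman backup= entries, one per-model conf per model name."""
    return [f"etc/sglang/{name}.conf" for name in models]


EXAMPLE = backup_entries(["deepseek_v4_flash", "gemma_4_31b_nvfp4"])
```

With these entries in place, pacman keeps an edited conf on upgrade and writes the packaged version alongside it as a .pacnew file.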
14 days | sglang-git: build from williamjameshandley/sglang V4-Flash fork | Will Handley
Pull from the fork branch wjh/v4-flash-mxfp4-routed-experts which carries the V4-Flash-on-sm_120 patches (MXFP4 routed-expert plumbing, biasless triton-kernels MoE path, HashTopK+TopK pow-2 zero-gate padding, V4 KV pool SWA-mapping wiring, set_swa_loc, is_prefill kwarg upstream bug fix, mhc tilelang wg_wait removal). Branch tracks upstream sgl-project/sglang amd/deepseek_v4 + 8 wjh patches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 | Sync confs from sglang: gemma_4_31b_fp8 + Qwen 3.6 lineup (BF16/FP8/INT4/NVFP4) | Will Handley
sglang-git was missing nine confs that landed in the stable sglang AUR package across the day's work:

- gemma_4_31b_fp8 (RedHatAI FP8-Dynamic, ~31 GB; 32 GB cards)
- qwen3.6_27b (BF16), qwen3.6_27b_fp8
- qwen3.6_27b_awq_int4 (cyankiwi), ~14 GB
- qwen3.6_27b_gptq_int4 (groxaxo), ~14 GB
- qwen3.6_27b_autoround_int4 (Lorbus), ~14 GB
- qwen3.6_35b_a3b (BF16), qwen3.6_35b_a3b_fp8
- qwen3.6_35b_a3b_nvfp4 (RedHatAI), ~18 GB, Blackwell-native

All Qwen confs use --reasoning-parser qwen3 --tool-call-parser qwen3_coder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 | Fix python-multipart dep: use python-python-multipart | Will Handley
Arch ships two confusingly-named multipart packages:

  python-multipart        -> defnull/multipart (wrong)
  python-python-multipart -> Kludex/python-multipart (correct, what FastAPI uses)

Selecting the wrong one means FastAPI raises at endpoint registration:

  RuntimeError: Form data requires "python-multipart" to be installed. It seems you installed "multipart" instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move proven-required modules (fastapi, starlette, openai, huggingface-hub, pillow, packaging, psutil, scipy, sentencepiece, pyzmq, multipart, uvicorn, flashinfer) from optdepends to depends. They are imported unconditionally during sglang.launch_server bootstrap, so the package crash-loops at startup without them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
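A quick pre-flight check for this class of failure is to ask the import system whether each module can be located, without importing it. The sketch below is an assumption-laden illustration: the import names are the usual ones for the listed packages (for example huggingface-hub imports as `huggingface_hub`, pillow as `PIL`, pyzmq as `zmq`), and `python_multipart` assumes a recent Kludex/python-multipart release.

```python
from importlib.util import find_spec

# Import names assumed for the packages promoted to depends.
REQUIRED_MODULES = [
    "fastapi", "starlette", "openai", "huggingface_hub", "PIL",
    "packaging", "psutil", "scipy", "sentencepiece", "zmq",
    "python_multipart", "uvicorn", "flashinfer",
]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` on an installed system should return an empty list; anything else names a dependency that would make the server crash-loop at startup.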
2026-04-27 | Ship deepseek_v4_flash.conf | Will Handley
2026-04-27 | Add python-tilelang optdep for DeepSeek V4 | Will Handley
2026-04-27 | Add python-soundfile, python-xgrammar to depends | Will Handley
2026-04-27 | Drop fork pin; track upstream sgl-project/sglang main | Will Handley
2026-04-14 | Bake --sleep-on-idle into service file | Will Handley
Prevents scheduler busy-wait burning a CPU core at idle. Also updates gemma configs to use gemma4 parser and bumps pkgver. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 | Template service, per-model configs, sleep-on-idle default | Will Handley
- Replace sglang.service with sglang@.service template unit
- Add per-model config files for Gemma 4 and Qwen 3.5 variants
- Default to --sleep-on-idle to reduce CPU usage when idle
- Update sglang.conf as global config with SGLANG_OPTS/SGLANG_ARGS split
- Point source to JustinTong0323/sglang new-model-gg branch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 | Add dedicated sglang user, systemd hardening, and packaging fixes | Will Handley
- Replace DynamicUser with static sglang user via sysusers.d/tmpfiles.d (persistent HF_HOME at /var/lib/sglang survives restarts)
- Add sglang.env (mode 0600) for credentials, separate from sglang.conf
- Harden systemd service: NoNewPrivileges, PrivateTmp, ProtectSystem/Home
- Bind to 127.0.0.1:30000 instead of 0.0.0.0:8000
- Fix arch: any -> x86_64 (CUDA dependency)
- Fix python-python-multipart -> python-multipart
- Move config from /etc/sglang.conf to /etc/sglang/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Update sglang.conf with Qwen3.5 models and parser options | Will Handley
Replace outdated Qwen2.5 model list with Qwen3.5 dense and MoE models, including approximate BF16 and GPTQ-Int4 VRAM estimates. Add reasoning and tool call parser documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Fix service: add CUDA_HOME and cache/home dirs for DynamicUser | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Add systemd service and config | Will Handley
Add sglang.service and /etc/sglang.conf for systemd integration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Depend on python-sgl-kernel-git for matching versions | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Fix version: filter for v* tags only | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Bump pkgrel to force AUR database update | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Fix version: use upstream tag for PEP 440 compliance | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 | Initial sglang-git package tracking main branch | Will Handley
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>