Portable Text-Generation-WebUI Builds
Fully self-contained Text-Generation-WebUI builds using the llama.cpp backend are now available: no installation needed, just unzip and run (a minimal run sketch follows the links). Builds cover CUDA and CPU on Windows/Linux, and ARM64/x86_64 on macOS. They bundle portable Python (python-build-standalone) and communicate with the model through llama-server. A Vulkan backend is not included; it requires manually replacing the bundled llama-server executable with a Vulkan build from the llama.cpp releases.
Links:
- https://github.com/oobabooga/text-generation-webui/releases/
- https://github.com/astral-sh/python-build-standalone
- https://github.com/ggml-org/llama.cpp/releases
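A minimal run sketch, assuming the Linux CUDA build; the archive and start-script names below are illustrative, not taken from the release page — use the actual asset names there:

```sh
# Download the build matching your OS/accelerator from the releases page,
# then unzip and launch. Filenames here are hypothetical examples.
unzip textgen-portable-linux-cuda.zip
cd textgen-portable
./start_linux.sh   # starts the web UI; llama-server runs behind it
```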
Accelerated CPU Prompt Processing for MoE Models
The ik_llama.cpp fork demonstrates significantly faster prompt processing on CPU than mainline llama.cpp, reaching ~44 t/s for Llama-4-Scout Q5_K_M prompt evaluation versus ~21 t/s. Generation speed remains similar, so the gain mainly benefits large-context processing on CPU-bound setups (see the benchmark sketch after the links).
Links:
- https://github.com/ikawrakow/ik_llama.cpp
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/tree/main/Q5_K_M
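A hedged sketch for reproducing the comparison; the build steps assume the fork mirrors mainline llama.cpp's CMake flow, and the GGUF path is illustrative:

```sh
# Build the fork, then measure prompt-processing (pp) and generation (tg)
# throughput with llama-bench. Run the same command against a mainline
# llama.cpp build to compare.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build && cmake --build build --config Release -j
./build/bin/llama-bench -m /path/to/Llama-4-Scout-Q5_K_M.gguf -p 512 -n 128
```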
Magi-1 Autoregressive Video Generation
Sand-AI released Magi-1, an open-source autoregressive diffusion video model. It generates video chunk by chunk, enabling unbounded extension with temporal continuity, and claims precise control over timing and motion dynamics. It supports text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) modes. Reported VRAM requirements are high (e.g., ~640 GB cited for running the full model).
Llama 4 MoE Models on Consumer Vulkan GPUs
Running Llama 4 Maverick on consumer GPUs with limited VRAM via llama.cpp Vulkan is possible by skipping warmup (`--no-warmup`) and offloading expert weights to CPU with the `-ot` parameter (`.ffn_.*_exps.=CPU`). This avoids VRAM allocation errors during initialization, keeping the shared weights and selected MoE layers on the GPU, as sketched below.
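A minimal command sketch, assuming a llama.cpp Vulkan build; the model filename and the `-ngl` value are illustrative additions, not from the source:

```sh
# Skip warmup to avoid the init-time VRAM allocation failure, and route MoE
# expert FFN tensors to CPU so the shared weights fit on the GPU.
# The GGUF filename and -ngl value are hypothetical; adjust to your setup.
./llama-server -m Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
    --no-warmup -ot ".ffn_.*_exps.=CPU" -ngl 99
```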
DistillKit for Model Distillation
DistillKit offers logit-based (KL divergence on teacher outputs) and hidden-state-based knowledge distillation methods. Hidden-state distillation aligns intermediate-layer representations, which enables cross-architecture transfer. Memory requirements are higher than for standard SFT, and scaling support for >70B models is under development (a reference formulation of the logit loss follows).
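For reference, the standard temperature-scaled formulation of logit-based distillation; whether DistillKit implements exactly this form is an assumption:

```latex
% Standard logit-based KD loss (Hinton-style). z_t and z_s are teacher and
% student logits, T is the softmax temperature; the T^2 factor keeps gradient
% magnitudes comparable across temperatures. Assumed, not taken from
% DistillKit's source.
\mathcal{L}_{\mathrm{KD}} =
  T^{2}\,\mathrm{KL}\!\left(
    \mathrm{softmax}(z_t/T)\;\middle\|\;\mathrm{softmax}(z_s/T)
  \right)
```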