Optimized MoE Inference for Large Models
Run large MoE models like Llama 4 (400B+ total parameters) efficiently on limited hardware. Use llama.cpp flags (-ngl, --override-tensor) to offload the dense non-expert layers to GPU while keeping the sparse expert weights on CPU/NVMe. Because only a few experts are active per token, this achieves usable throughput even when the model size exceeds total RAM+VRAM.
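A minimal sketch of the invocation pattern, assuming a GPU build of llama.cpp; the model filename and the tensor-name regex are illustrative and vary by quantization and architecture:

```bash
# Offload everything to GPU first (-ngl 99), then override the MoE expert
# tensors back to the CPU buffer. llama.cpp mmaps the GGUF file, so expert
# weights can be paged in from NVMe on demand rather than held in RAM.
# Model filename and regex are assumptions, not exact values.
./llama-server \
  -m Llama-4-Maverick-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps.*=CPU"
```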
Gemma 3 QAT Release & Low-Bit Performance
Google released Quantization-Aware Training (QAT) checkpoints for Gemma 3, minimizing quality loss at low bitrates. Community tests show the 27B QAT model holds up surprisingly well even at Q2_K (~10.5GB), opening the door to running larger models on consumer GPUs. Supported in major local-inference frameworks; a sample invocation follows the links below.
Links:
- https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/
- https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b
- https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-GGUF
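A sketch using the community GGUF conversion linked above, assuming a GPU build of llama.cpp; the exact quant filename is an assumption based on the repo's naming convention:

```bash
# Fetch the Q2_K quant of the QAT checkpoint and run it fully on GPU;
# at ~10.5GB it fits in 12GB of VRAM with headroom for context.
# Filename is assumed from the repo's naming pattern; verify before use.
huggingface-cli download bartowski/google_gemma-3-27b-it-qat-GGUF \
  --include "google_gemma-3-27b-it-qat-Q2_K.gguf" --local-dir .
./llama-cli -m google_gemma-3-27b-it-qat-Q2_K.gguf -ngl 99 \
  -p "Explain mixture-of-experts routing in two sentences."
```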
vLLM Integrates Transformers Backend
vLLM can now fall back to a new Transformers backend, letting it run potentially any Hugging Face Transformers model rather than only natively optimized architectures. This also lowers the bar for community contributions, since a model no longer needs a bespoke vLLM implementation to be served. Vision-Language Model support is anticipated; a sketch of the opt-in flag is below.
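A minimal sketch, assuming the backend is selected via the model_impl engine argument as described in the integration announcement (flag name and availability depend on your vLLM version; the model ID is illustrative):

```bash
# Force the Transformers fallback instead of a native vLLM implementation.
# The default ("auto") tries the native path first; "transformers" opts in
# to the new backend explicitly. Flag name assumed from the announcement.
vllm serve meta-llama/Llama-3.2-1B-Instruct --model-impl transformers
```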
Real-time Streaming & Fine-tuning for CSM TTS
The open-source CSM 1B text-to-speech model now supports real-time streaming inference. Fine-tuning support (both LoRA and full-parameter) has also been added, enabling voice customization. Together these make it a potent local option for low-latency, tailored speech synthesis applications.
VideoGameBench Challenges VLMs in Real-Time Gaming
A new benchmark, VideoGameBench, tests VLM capabilities by having them play classic games like DOOM II in real time. Initial results indicate that current SOTA models struggle significantly, revealing limitations in grounded reasoning, planning, and interaction under real-time constraints.