Gemma 3 QAT Performance & Stability
Recent benchmarks suggest Google's QAT Gemma 3 27B Q4 outperforms standard Q4 quants on GPQA Diamond while using less VRAM (16.4GB vs. 17.4GB+). However, the statistical significance of the results is questioned, and some LM Studio users report instability with QAT models, such as output loops and nonsensical words. A minimal loading sketch follows the links.
Links:
- https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-GGUF
- https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small
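For readers wanting to try the QAT build locally, a minimal sketch assuming llama-cpp-python (the filename and settings are illustrative; pick the actual quant file from the repos above):

```python
# Minimal sketch: load a QAT Q4_0 GGUF with llama-cpp-python.
# Filename is illustrative; exact files vary by repo.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-27b-it-qat-q4_0.gguf",  # ~16.4GB per the benchmarks
    n_gpu_layers=-1,  # offload all layers to GPU
    n_ctx=8192,
)
out = llm("Explain quantization-aware training in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```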
Windsurf System Prompt & Function Leaks
Leaked Windsurf system prompts (dated 2025-04-20) expose internal tool structures (JSON/XML function definitions) and parameters such as the "Yap score", which modulates response verbosity (up to 8,192 words). These details offer insight for reverse engineering or for building custom agents that leverage similar patterns; a sketch of the pattern follows the links.
Links:
- https://github.com/dontriskit/awesome-ai-system-prompts/blob/main/windsurf/system-2025-04-20.md
- https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools
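For builders experimenting with similar patterns, a hypothetical sketch of a system prompt with a tunable verbosity budget in the spirit of the Yap score (the function name, tag, and wording below are illustrative assumptions, not taken from the leaked prompt):

```python
# Hypothetical pattern only: a system prompt that injects a verbosity
# budget, loosely modeled on the leaked "Yap score" idea.
def build_system_prompt(yap_score: int = 8192) -> str:
    return (
        "You are a coding agent with access to the tools defined below.\n"
        f'<verbosity budget_words="{yap_score}"/>\n'
        "Call tools using the provided JSON schema and keep prose "
        "within the verbosity budget."
    )

print(build_system_prompt(1024))
```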
Llama 4 Slow Prefill with Partial Offload
Users report slow prompt processing (prefill) relative to generation speed when running Llama 4 Maverick on llama.cpp with MoE expert tensors offloaded to CPU via -ot ".*ffn_.*_exps.*=CPU". PCIe bandwidth limitations (Gen3 vs. Gen4) and tensor placement (e.g., keeping .*attn.*=GPU) are being investigated; a ktransformers fork is mentioned as faster. A sketch of a full invocation follows.
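For reference, a sketch of assembling such an invocation from Python (the binary path and model filename are placeholders; the -ot pattern is the one discussed above):

```python
import subprocess

# Illustrative llama.cpp launch: offload all layers to GPU by default,
# but pin MoE expert FFN tensors to CPU with an override-tensor pattern.
cmd = [
    "./llama-server",
    "-m", "Llama-4-Maverick-Q4_K_M.gguf",  # placeholder filename
    "-ngl", "99",                          # offload all layers...
    "-ot", ".*ffn_.*_exps.*=CPU",          # ...except the expert FFNs
]
subprocess.run(cmd, check=True)
```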
Hardware Choices for Local LLMs
Discussions weigh VRAM capacity against raw compute. A used RTX 3090 (24GB) is often preferred over a new AMD RX 7900 XTX (24GB) or lower-VRAM current-gen cards due to the maturity of the CUDA ecosystem. Multi-GPU setups (2x/3x RTX 3060 12GB) are viable budget options for >30B models; a rough VRAM estimate is sketched below. The M1 Max (64GB) is favored over the M4 (32GB) for its larger unified memory.
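As a back-of-envelope check on why these capacities line up, a rough estimate (the ~0.6 bytes/parameter figure approximates a Q4_K_M quant; the flat overhead term stands in for KV cache and runtime buffers, which vary with context length):

```python
# Ballpark VRAM estimate for a Q4-quantized dense model; treat as rough.
def est_vram_gb(params_billions: float,
                bytes_per_param: float = 0.6,    # ~Q4_K_M (assumption)
                overhead_gb: float = 2.0) -> float:  # KV cache etc. (assumption)
    return params_billions * bytes_per_param + overhead_gb

print(est_vram_gb(32))  # ~21.2 GB: one 24GB card, or 2x12GB with splitting
```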
LlamaIndex Docstore & Tool Calling
Experienced users question the role and necessity of the docstore component in LlamaIndex when nodes are already stored in the vector store. Separately, techniques are sought to enable tool calling for LLMs lacking native support (e.g., Snowflake Cortex), potentially via structured output parsing within LlamaIndex; a generic sketch follows.
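A generic sketch of the structured-output approach, independent of LlamaIndex's own abstractions (tool names and prompt wording are illustrative): instruct the model to reply with a JSON tool call, then parse and dispatch it:

```python
import json
import re

# Illustrative tool registry; real tools would wrap actual functions.
TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}

SYSTEM_PROMPT = (
    "When a tool is needed, reply with JSON only, e.g. "
    '{"tool": "get_weather", "args": {"city": "Paris"}}.'
)

def dispatch(model_output: str) -> str:
    """Extract a JSON tool call from model text and execute it."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        return model_output  # plain answer, no tool call detected
    call = json.loads(match.group(0))
    return TOOLS[call["tool"]](**call["args"])

print(dispatch('{"tool": "get_weather", "args": {"city": "Paris"}}'))
```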