Quantization Advances and Challenges
Quantization-Aware Training (QAT) for models like Gemma 3 shows mixed results: some users report better prompt following, while others see regressions compared to post-training quants such as Q4_K_M. New QAT GGUFs built with the ik_llama.cpp fork claim perplexity improvements over the official releases, and extreme-quantization experiments push toward 1.58-bit weights.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1jygj7m/gemma312bqat_bad/
- https://www.reddit.com/r/LocalLLaMA/comments/1k52r4r/ubergarmgemma327bitqatgguf/
- https://huggingface.co/blog/1_58_llm_extreme_quantization
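Since the competing claims above mostly come down to perplexity numbers, here is a minimal sketch of how such a score can be computed with Hugging Face transformers; the linked threads measure GGUF quants with llama.cpp's llama-perplexity tool instead, and the model id and text file below are placeholders.

```python
# Minimal sketch: chunked perplexity of a checkpoint over a text file, the metric
# the QAT-vs-PTQ comparisons above argue about. Model id and file path are
# placeholders, not taken from the linked threads.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # placeholder: swap in the checkpoint you want to score
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

ids = tok(open("wiki.test.raw").read(), return_tensors="pt").input_ids.to(model.device)

max_len, nll, counted = 2048, 0.0, 0
for start in range(0, ids.size(1), max_len):       # non-overlapping chunks
    chunk = ids[:, start : start + max_len]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss      # labels are shifted internally
    n = chunk.size(1) - 1                           # tokens actually scored in this chunk
    nll += loss.item() * n
    counted += n

print(f"perplexity over {counted} tokens: {math.exp(nll / counted):.3f}")
```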
Inference Engine Performance and Hardware
SGLang is compared against vLLM for production throughput, leveraging FlashInfer and data parallelism. Initial RTX 5080 benchmarks disappointed (roughly 3090-level), but revised tests place the 5070 Ti near the 4090, with the 5080 slightly faster still. NVMe-backed setups show viability for running large MoE models like Llama 4 Maverick.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1k4w86s/sglang_vs_vllm/
- https://www.reddit.com/r/LocalLLaMA/comments/1k0z43q/rtx_5080_is_about_a_3090_but_with_less_vram/
- https://www.reddit.com/r/LocalLLaMA/comments/1k28j02/llama_4_maverick_mlx_performance_on_m3_ultra/
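As a rough reference for how the throughput comparisons above can be probed from the client side, here is a sketch against an OpenAI-compatible endpoint, which both SGLang and vLLM serve. The URL, model name, prompt, and concurrency are placeholders; a real comparison would also sweep batch sizes, sequence lengths, and each engine's parallelism settings.

```python
# Crude client-side throughput probe for an OpenAI-compatible chat endpoint.
# All constants below are placeholders for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "Qwen/Qwen2.5-7B-Instruct"                 # placeholder model name

def one_request(_):
    r = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain KV caching in two paragraphs."}],
        "max_tokens": 256,
    }, timeout=300)
    r.raise_for_status()
    # completion token count comes from the response's usage field
    return r.json()["usage"]["completion_tokens"]

concurrency, total_requests = 32, 128
start = time.time()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    generated = sum(pool.map(one_request, range(total_requests)))
elapsed = time.time() - start
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s aggregate")
```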
Llama 4: Context, Performance, and VRAM
Achieving Llama 4's advertised 10M-token context window requires substantial VRAM for the KV cache alone, estimated at ~240GB with interleaved sliding-window attention (iSWA) or ~960GB (FP8) with standard full attention, excluding model weights. Initial performance issues were linked to inference bugs, now reportedly fixed in implementations like Unsloth and recent llama.cpp/vLLM versions.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1k1xd4b/how_much_vram_for_10_millions_context_tokens_with/
- https://www.reddit.com/r/LocalLLaMA/comments/1jta5vj/vram_requirement_for_10m_context/
- https://www.reddit.com/r/LocalLLaMA/comments/1k5arbu/llama_4_after_inferencing_bug_fixes_aftermath/
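A quick back-of-the-envelope check of those KV-cache figures; the architecture numbers used below (48 layers, 8 KV heads, head_dim 128, 8K sliding window, 1 in 4 layers with full attention) are illustrative assumptions rather than official specs.

```python
# Back-of-the-envelope check of the KV-cache estimates above.
def kv_cache_bytes(ctx_tokens, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elt=1):
    # one key and one value vector per layer, KV head, and token (FP8 -> 1 byte/element)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * ctx_tokens

CTX = 10_000_000
full_attn = kv_cache_bytes(CTX)  # every layer caches all 10M tokens
# iSWA-style split: 1 in 4 layers keeps the full context, the rest keep an 8K window
iswa = kv_cache_bytes(CTX, n_layers=12) + kv_cache_bytes(8192, n_layers=36)

print(f"full-attention FP8 KV cache: {full_attn / 1e9:.0f} GB")  # ~983 GB
print(f"iSWA FP8 KV cache:           {iswa / 1e9:.0f} GB")       # ~246 GB
# Both land in the same ballpark as the ~960GB / ~240GB figures quoted above.
```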
Hallucination Detection for RAG
LettuceDetect offers an open-source, encoder-based framework for token-level hallucination detection in RAG pipelines. Built on ModernBERT, it handles up to 4K tokens of context without requiring an LLM judge for verification, aiming for lower latency and cost than LLM-based detectors while achieving competitive F1 scores on the RAGTruth benchmark.
Links:
- https://github.com/KRLabsOrg/LettuceDetect
- https://huggingface.co/blog/adaamko/lettucedetect
- https://arxiv.org/abs/2502.17125
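A minimal sketch of the encoder-based, token-level idea described above, expressed as a plain transformers token-classification pipeline rather than the project's own interface (the repo ships a `lettucedetect` package for that). The checkpoint id and label convention below are assumptions; check the repo for the supported API.

```python
# Sketch: flag unsupported answer spans by running a token-classification model
# over context + question + answer. Checkpoint id and labels are assumptions.
from transformers import pipeline

detector = pipeline(
    "token-classification",
    model="KRLabsOrg/lettucedect-base-modernbert-en-v1",  # assumed HF model id; check the repo
    aggregation_strategy="simple",
)

context = "The Eiffel Tower was completed in 1889 and is about 330 metres tall."
question = "When was the Eiffel Tower completed, and how tall is it?"
answer = "It was completed in 1889 and stands 450 metres tall."

# The detector sees retrieval context, question, and generated answer together;
# the unsupported "450 metres" claim should come back labelled as hallucinated.
for span in detector(f"{context}\n{question}\n{answer}"):
    print(span["entity_group"], repr(span["word"]), round(float(span["score"]), 3))
```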
Training and Fine-tuning Developments
Methods for uncensored model fine-tuning include curating data from specific RP datasets and refusal-ablation ("abliteration") techniques. Efficient fine-tuning of smaller models (3B-8B) on large datasets benefits from cloud services like RunPod or Kaggle, or QLoRA on capable local GPUs. Notes on custom MLLM development flag potential gradient-flow issues in projection layers.
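As a concrete reference point for the QLoRA route mentioned above, here is a minimal 4-bit LoRA setup with transformers, bitsandbytes, and peft; the base model id, target modules, and hyperparameters are placeholders, not recommendations from the linked threads.

```python
# Minimal QLoRA setup sketch for a small (3B-8B) model. All names and
# hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, enables grad checkpointing

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check that only the adapters will train
```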
Links: