Quantization Advances and Challenges
Quantization-Aware Training (QAT) for models like Gemma 3 shows mixed results: some users report better prompt following, while others see regressions compared to post-training quants such as Q4_K_M. New QAT GGUFs built with the ik_llama.cpp fork claim perplexity improvements over the official releases, and extreme-quantization experiments push toward 1.58-bit weights.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1jygj7m/gemma312bqat_bad/
- https://www.reddit.com/r/LocalLLaMA/comments/1k52r4r/ubergarmgemma327bitqatgguf/
- https://huggingface.co/blog/1_58_llm_extreme_quantization
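Since the competing claims above mostly come down to perplexity numbers, here is a minimal sketch of how such a score can be computed with Hugging Face transformers; the linked threads measure GGUF quants with llama.cpp's llama-perplexity tool instead, and the model id and text file below are placeholders.

```python
# Minimal sketch: chunked perplexity of a checkpoint over a text file, the metric
# the QAT-vs-PTQ comparisons above argue about. Model id and file path are
# placeholders, not taken from the linked threads.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # placeholder: swap in the checkpoint you want to score
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

ids = tok(open("wiki.test.raw").read(), return_tensors="pt").input_ids.to(model.device)

max_len, nll, counted = 2048, 0.0, 0
for start in range(0, ids.size(1), max_len):       # non-overlapping chunks
    chunk = ids[:, start : start + max_len]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss      # labels are shifted internally
    n = chunk.size(1) - 1                           # tokens actually scored in this chunk
    nll += loss.item() * n
    counted += n

print(f"perplexity over {counted} tokens: {math.exp(nll / counted):.3f}")
```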
Inference Engine Performance and Hardware
SGLang is compared against vLLM for production throughput, leveraging FlashInfer and data parallelism. Initial RTX 5080 benchmarks disappointed (roughly 3090-level), but revised tests place the 5070 Ti near the 4090, with the 5080 slightly faster still. NVMe-backed setups show viability for running large MoE models like Llama 4 Maverick.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1k4w86s/sglang_vs_vllm/
- https://www.reddit.com/r/LocalLLaMA/comments/1k0z43q/rtx_5080_is_about_a_3090_but_with_less_vram/
- https://www.reddit.com/r/LocalLLaMA/comments/1k28j02/llama_4_maverick_mlx_performance_on_m3_ultra/
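As a rough reference for how the throughput comparisons above can be probed from the client side, here is a sketch against an OpenAI-compatible endpoint, which both SGLang and vLLM serve. The URL, model name, prompt, and concurrency are placeholders; a real comparison would also sweep batch sizes, sequence lengths, and each engine's parallelism settings.

```python
# Crude client-side throughput probe for an OpenAI-compatible chat endpoint.
# All constants below are placeholders for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "Qwen/Qwen2.5-7B-Instruct"                 # placeholder model name

def one_request(_):
    r = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain KV caching in two paragraphs."}],
        "max_tokens": 256,
    }, timeout=300)
    r.raise_for_status()
    # completion token count comes from the response's usage field
    return r.json()["usage"]["completion_tokens"]

concurrency, total_requests = 32, 128
start = time.time()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    generated = sum(pool.map(one_request, range(total_requests)))
elapsed = time.time() - start
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s aggregate")
```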
Llama 4: Context, Performance, and VRAM
Achieving Llama 4's advertised 10M-token context window requires substantial VRAM for the KV cache alone, estimated at ~240GB with interleaved sliding-window attention (iSWA) or ~960GB (FP8) with standard full attention, excluding model weights. Initial performance issues were linked to inference bugs, now reportedly fixed in implementations like Unsloth and recent llama.cpp/vLLM versions.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1k1xd4b/how_much_vram_for_10_millions_context_tokens_with/
- https://www.reddit.com/r/LocalLLaMA/comments/1jta5vj/vram_requirement_for_10m_context/
- https://www.reddit.com/r/LocalLLaMA/comments/1k5arbu/llama_4_after_inferencing_bug_fixes_aftermath/
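A quick back-of-the-envelope check of those KV-cache figures; the architecture numbers used below (48 layers, 8 KV heads, head_dim 128, 8K sliding window, 1 in 4 layers with full attention) are illustrative assumptions rather than official specs.

```python
# Back-of-the-envelope check of the KV-cache estimates above.
def kv_cache_bytes(ctx_tokens, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elt=1):
    # one key and one value vector per layer, KV head, and token (FP8 -> 1 byte/element)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * ctx_tokens

CTX = 10_000_000
full_attn = kv_cache_bytes(CTX)  # every layer caches all 10M tokens
# iSWA-style split: 1 in 4 layers keeps the full context, the rest keep an 8K window
iswa = kv_cache_bytes(CTX, n_layers=12) + kv_cache_bytes(8192, n_layers=36)

print(f"full-attention FP8 KV cache: {full_attn / 1e9:.0f} GB")  # ~983 GB
print(f"iSWA FP8 KV cache:           {iswa / 1e9:.0f} GB")       # ~246 GB
# Both land in the same ballpark as the ~960GB / ~240GB figures quoted above.
```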
Hallucination Detection for RAG
LettuceDetect offers an open-source, encoder-based framework for token-level hallucination detection in RAG pipelines. Built on ModernBERT, it handles up to 4K tokens of context without requiring an LLM judge for verification, aiming for lower latency and cost than LLM-based detectors while achieving competitive F1 scores on the RAGTruth benchmark.
Links:
- https://github.com/KRLabsOrg/LettuceDetect
- https://huggingface.co/blog/adaamko/lettucedetect
- https://arxiv.org/abs/2502.17125
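A minimal sketch of the encoder-based, token-level idea described above, expressed as a plain transformers token-classification pipeline rather than the project's own interface (the repo ships a `lettucedetect` package for that). The checkpoint id and label convention below are assumptions; check the repo for the supported API.

```python
# Sketch: flag unsupported answer spans by running a token-classification model
# over context + question + answer. Checkpoint id and labels are assumptions.
from transformers import pipeline

detector = pipeline(
    "token-classification",
    model="KRLabsOrg/lettucedect-base-modernbert-en-v1",  # assumed HF model id; check the repo
    aggregation_strategy="simple",
)

context = "The Eiffel Tower was completed in 1889 and is about 330 metres tall."
question = "When was the Eiffel Tower completed, and how tall is it?"
answer = "It was completed in 1889 and stands 450 metres tall."

# The detector sees retrieval context, question, and generated answer together;
# the unsupported "450 metres" claim should come back labelled as hallucinated.
for span in detector(f"{context}\n{question}\n{answer}"):
    print(span["entity_group"], repr(span["word"]), round(float(span["score"]), 3))
```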
Training and Fine-tuning Developments
Methods for uncensored model fine-tuning include curating data from specific RP datasets and refusal-ablation ("abliteration") techniques. Efficient fine-tuning of smaller models (3B-8B) on large datasets benefits from cloud services like RunPod or Kaggle, or QLoRA on capable local GPUs. Notes on custom MLLM development flag potential gradient-flow issues in projection layers.
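As a concrete reference point for the QLoRA route mentioned above, here is a minimal 4-bit LoRA setup with transformers, bitsandbytes, and peft; the base model id, target modules, and hyperparameters are placeholders, not recommendations from the linked threads.

```python
# Minimal QLoRA setup sketch for a small (3B-8B) model. All names and
# hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, enables grad checkpointing

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check that only the adapters will train
```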
Links: