Llama 4 Maverick Performance
Llama 4 Maverick reportedly runs at over 45 tokens/s locally on a single RTX 4090 when paired with a Xeon QYFS engineering-sample CPU (56c/112t), 512GB of system RAM, and KTransformers. High system RAM is the critical ingredient, since KTransformers offloads the MoE experts to CPU memory; with enough RAM, a dual-GPU setup may not be necessary for this configuration.
PyTorch 2.7 Release
Stable release adds support for Nvidia Blackwell GPUs (RTX 5090, B200). Includes Mega Cache for portable torch.compile artifacts via save/load_cache_artifacts(). Also brings improvements to Intel AMX and aarch64 support, potentially simplifying deployment on platforms like GH200.
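A minimal sketch of the Mega Cache workflow as described in the release notes: compile once, serialize the compile caches to a portable blob, and pre-populate the caches on another machine before compiling there. The API lives under `torch.compiler`; the exact return shape shown here is based on the 2.7 documentation and should be treated as approximate.

```python
import torch

# Build and compile a small model; the warm-up call triggers compilation
# and populates torch.compile's caches.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).cuda()
compiled = torch.compile(model)
compiled(torch.randn(8, 1024, device="cuda"))

# Mega Cache: serialize all compile artifacts into a single portable blob.
result = torch.compiler.save_cache_artifacts()
assert result is not None, "nothing has been compiled/cached yet"
artifact_bytes, cache_info = result
with open("compile_cache.bin", "wb") as f:
    f.write(artifact_bytes)

# On another machine (same torch version / GPU arch), pre-populate the caches
# before compiling so torch.compile hits warm caches instead of rebuilding.
with open("compile_cache.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())
```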
Production Quantization on A100
Experiments suggest common INT8/INT4 quantization methods (w8a8, w4a16, HQQ) can yield lower throughput than bf16 for smaller models (~3B) on A100 GPUs. Effective production deployment may rely more heavily on methods like AWQ, batching, and potentially speculative decoding for larger models.
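One way to reproduce this kind of comparison is to measure decode throughput for a bf16 checkpoint against a pre-quantized (e.g. AWQ) checkpoint of the same model under identical prompts. The sketch below uses vLLM; the model names are placeholders, and in practice each configuration should be benchmarked in its own process so two engines never share GPU memory.

```python
import time
from vllm import LLM, SamplingParams

PROMPTS = ["Summarize the benefits of quantization."] * 64
PARAMS = SamplingParams(max_tokens=256, temperature=0.0)

def tokens_per_second(llm: LLM) -> float:
    """Rough decode throughput: generated tokens / wall-clock seconds."""
    start = time.perf_counter()
    outputs = llm.generate(PROMPTS, PARAMS)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed

# Placeholder checkpoints; substitute the ~3B model you actually benchmark.
# Run each configuration in a separate process on the same A100.
bf16_tps = tokens_per_second(LLM(model="my-org/model-3b", dtype="bfloat16"))
awq_tps = tokens_per_second(LLM(model="my-org/model-3b-awq", quantization="awq"))
print(f"bf16: {bf16_tps:.0f} tok/s   awq: {awq_tps:.0f} tok/s")
```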
AMD ROCm Software Ecosystem
SemiAnalysis notes recent ROCm progress but highlights that AMD's lower compensation hinders AI talent acquisition compared to Nvidia. AMD also lacks competitive Python kernel DSLs, while Nvidia offers multiple options such as Triton, CuTe Python, cuTile Python, Numba, and Warp.
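For context on what a Python kernel DSL looks like in practice, here is a minimal vector-add kernel in Triton, the most established of the options listed (this is a generic illustration of the programming model, not code from the report):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```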
Distilling Dense Models to Sparse MoEs
Discussion of self-logit distillation techniques for converting dense models (e.g., DeepHermes 24B) into efficient, sparse Llama 4-style MoEs. A major bottleneck is extracting teacher logits at scale, as inference engines like vLLM currently lack robust support for returning full logits.
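A minimal sketch of the distillation objective being discussed, assuming the dense teacher's logits have already been dumped offline (e.g., via an ordinary transformers forward pass, since vLLM cannot easily return them): the sparse-MoE student is trained to match the teacher's token distribution with a temperature-scaled KL loss. Shapes and names below are illustrative only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens.

    Both tensors are [batch, seq_len, vocab] for the same input ids;
    the teacher is the dense model, the student the sparse-MoE conversion.
    """
    t = temperature
    student_logprobs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Flattening tokens + batchmean gives a per-token average KL; the t**2
    # factor is the standard KD scaling that keeps gradients comparable
    # across temperatures.
    kl = F.kl_div(
        student_logprobs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    return kl * (t ** 2)

# Illustrative shapes: 2 sequences of 16 tokens over a 32k vocabulary.
student = torch.randn(2, 16, 32_000, requires_grad=True)
teacher = torch.randn(2, 16, 32_000)
loss = distillation_loss(student, teacher, temperature=2.0)
loss.backward()
```

Storing full-vocabulary teacher logits for every token is what makes extraction at scale expensive, which is why the lack of first-class logit output in fast inference engines is the bottleneck called out above.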