Llama-4-Maverick Performance Boost via llama.cpp Fix
Llama-4-Maverick benchmarks significantly improved after correcting a llama.cpp bug. The relevant fix involved adjusting the QK Norm epsilon value from 1e-6 to 1e-5. Updated GGUF quants incorporating this fix are available (e.g., from unsloth), reflecting the improved performance observed in community benchmarks.
Links:
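To see why the epsilon matters, here is a minimal RMS-norm sketch in plain Python (an illustration of the general formula, not llama.cpp's actual implementation): the epsilon sits inside the square root of the denominator, so for small-magnitude activations a 10x change (1e-6 vs 1e-5) measurably shifts the normalised output.

```python
import math

def rms_norm(xs, eps):
    # RMSNorm: x / sqrt(mean(x^2) + eps); eps stabilises the division
    ms = sum(x * x for x in xs) / len(xs)
    scale = 1.0 / math.sqrt(ms + eps)
    return [x * scale for x in xs]

# For small-magnitude activations the epsilon is comparable to mean(x^2),
# so eps=1e-6 vs eps=1e-5 yields visibly different normalised values.
small = [1e-3, -2e-3, 1.5e-3]
a = rms_norm(small, eps=1e-6)
b = rms_norm(small, eps=1e-5)
diff = max(abs(x - y) for x, y in zip(a, b))
```

With large activations the epsilon is negligible, which is why the wrong value only degrades (rather than breaks) output quality.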
Inference Beyond RAM Limits with mmap
llama.cpp enables running models that exceed physical RAM via memory mapping (mmap). Tested on Llama-4-Maverick IQ2_M (143 GB) with 64 GB RAM + 24 GB VRAM, achieving 3.45 t/s generation after slow prompt processing (16 t/s pp512). Disable this behavior with the --no-mmap flag if needed.
Links:
- https://huggingface.co/unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF/tree/main/UD-IQ2_M
- https://github.com/ggml-org/llama.cpp/discussions/1876
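The mechanism can be sketched with Python's stdlib mmap (a conceptual illustration, not llama.cpp's C code): mapping a file reserves virtual address space rather than RAM, and the OS faults pages in from disk only when they are touched. That is how a 143 GB model fits on a 64 GB machine, and also why cold prompt processing is slow.

```python
import mmap
import os
import tempfile

# A file standing in for a GGUF model (tiny here; the principle is
# identical for a file larger than physical RAM).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * (1 << 20))  # 1 MiB of zeros
os.close(fd)

with open(path, "rb") as f:
    # Map the whole file read-only: this consumes address space,
    # not RAM -- pages are loaded from disk only on first access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0]   # touching a byte triggers a page fault
    last = mm[-1]
    mm.close()
os.remove(path)
```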
Skywork-OR1 Open Reasoning Model
Skywork-OR1 models (32B, 7B) were released with open weights, training code, and data. Fine-tuned from DeepSeek-R1-Distill-Qwen base models, they were trained for 32k context. The models claim strong performance on math (AIME) and coding (LiveCodeBench) benchmarks, reportedly rivalling larger models like DeepSeek-R1.
Links:
- https://github.com/SkyworkAI/Skywork-OR1
- https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reasoner-Series-1d0bc9ae823a80459b46c149e4f51680
- https://huggingface.co/collections/Skywork/skywork-or1-67fa1bcb41b436ef2def76b9
CPU-Only Inference Optimization with ktransformers
Intel Granite Rapids CPUs (e.g., 6944P) utilising AMX instructions show potential for high-throughput CPU inference via ktransformers. Benchmarks suggest competitive prompt processing (330 t/s) and generation speeds (17 t/s for DeepSeek R1), leveraging high DDR5 bandwidth (614.4 GB/s). Overall system cost-effectiveness versus GPU solutions remains debated.
Links:
- https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md
- https://en.wikipedia.org/wiki/Granite_Rapids
- https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
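The quoted 614.4 GB/s is consistent with 12 channels of DDR5-6400 at 8 bytes per transfer. A back-of-envelope sketch (the active-parameter count and bits/weight below are rough assumptions, not benchmark figures) shows why bandwidth-bound decode lands in the teens of tokens per second:

```python
# Peak bandwidth = channels * transfers/s * bytes per transfer
channels = 12          # Granite Rapids 6900-series memory channels
transfers = 6400e6     # DDR5-6400: 6.4e9 transfers/s per channel
bandwidth = channels * transfers * 8    # bytes/s -> 614.4 GB/s

# Decode is bandwidth-bound: each token reads every active weight once.
active_params = 37e9   # DeepSeek R1 active (MoE) params -- assumption
bits_per_weight = 4.5  # rough quantised size -- assumption
bytes_per_token = active_params * bits_per_weight / 8
theoretical_tps = bandwidth / bytes_per_token   # ~30 t/s ceiling
```

The observed 17 t/s is a plausible fraction of that ceiling once real-world memory efficiency and compute overhead are accounted for.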
DeepCoder Training Enhancements
DeepCoder-14B implements GRPO+ training, incorporating DAPO insights: offline difficulty filtering, removal of entropy and KL loss terms for stability, and using DAPO's overlong filtering for context generalization (trained 32K, eval 64K). Iterative context lengthening (16K→32K) also improved performance.
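As a sketch of the GRPO core that GRPO+ builds on (plain Python, not DeepCoder's training code): each rollout's advantage is computed relative to its own sampling group, and in the GRPO+ variant described above the KL and entropy terms are then simply dropped from the loss.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO scores each rollout against its own sampling group:
    # advantage = (reward - group mean) / group std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, rewarded e.g. by unit tests passing.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Passing rollouts get positive advantage, failing ones negative.
```

Offline difficulty filtering fits naturally here: prompts where all rollouts pass (or all fail) yield zero-variance groups and hence no learning signal, so they can be discarded before training.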
Links: