Llama-4-Maverick Performance Boost via llama.cpp Fix
Llama-4-Maverick benchmarks significantly improved after correcting a llama.cpp bug. The relevant fix involved adjusting the QK Norm epsilon value from 1e-6 to 1e-5. Updated GGUF quants incorporating this fix are available (e.g., from unsloth), reflecting the improved performance observed in community benchmarks.
Links:
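To see why the epsilon matters, here is a minimal RMS-norm sketch in plain Python (an illustration of the general formula, not llama.cpp's actual implementation): the epsilon sits inside the square root of the denominator, so for small-magnitude activations a 10x change (1e-6 vs 1e-5) measurably shifts the normalised output.

```python
import math

def rms_norm(xs, eps):
    # RMSNorm: x / sqrt(mean(x^2) + eps); eps stabilises the division
    ms = sum(x * x for x in xs) / len(xs)
    scale = 1.0 / math.sqrt(ms + eps)
    return [x * scale for x in xs]

# For small-magnitude activations the epsilon is comparable to mean(x^2),
# so eps=1e-6 vs eps=1e-5 yields visibly different normalised values.
small = [1e-3, -2e-3, 1.5e-3]
a = rms_norm(small, eps=1e-6)
b = rms_norm(small, eps=1e-5)
diff = max(abs(x - y) for x, y in zip(a, b))
```

With large activations the epsilon is negligible, which is why the wrong value only degrades (rather than breaks) output quality.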
Inference Beyond RAM Limits with mmap
llama.cpp enables running models that exceed physical RAM via memory mapping (mmap). Tested on Llama-4-Maverick IQ2_M (143 GB) with 64 GB RAM + 24 GB VRAM, achieving 3.45 t/s generation after slow prompt processing (16 t/s pp512). Disable this behavior with the --no-mmap flag if needed.
Links:
- https://huggingface.co/unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF/tree/main/UD-IQ2_M
- https://github.com/ggml-org/llama.cpp/discussions/1876
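The mechanism can be sketched with Python's stdlib mmap (a conceptual illustration, not llama.cpp's C code): mapping a file reserves virtual address space rather than RAM, and the OS faults pages in from disk only when they are touched. That is how a 143 GB model fits on a 64 GB machine, and also why cold prompt processing is slow.

```python
import mmap
import os
import tempfile

# A file standing in for a GGUF model (tiny here; the principle is
# identical for a file larger than physical RAM).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * (1 << 20))  # 1 MiB of zeros
os.close(fd)

with open(path, "rb") as f:
    # Map the whole file read-only: this consumes address space,
    # not RAM -- pages are loaded from disk only on first access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0]   # touching a byte triggers a page fault
    last = mm[-1]
    mm.close()
os.remove(path)
```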
Skywork-OR1 Open Reasoning Model
Skywork-OR1 models (32B, 7B) were released with open weights, training code, and data. Fine-tuned from DeepSeek-R1-Distill-Qwen base models, they were trained for 32k context. The models claim strong performance on math (AIME) and coding (LiveCodeBench) benchmarks, reportedly rivalling larger models like DeepSeek-R1.
Links:
- https://github.com/SkyworkAI/Skywork-OR1
- https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reasoner-Series-1d0bc9ae823a80459b46c149e4f51680
- https://huggingface.co/collections/Skywork/skywork-or1-67fa1bcb41b436ef2def76b9
CPU-Only Inference Optimization with ktransformers
Intel Granite Rapids CPUs (e.g., 6944P) utilising AMX instructions show potential for high-throughput CPU inference via ktransformers. Benchmarks suggest competitive prompt processing (330 t/s) and generation speeds (17 t/s for DeepSeek R1), leveraging high DDR5 bandwidth (614.4 GB/s). Overall system cost-effectiveness versus GPU solutions remains debated.
Links:
- https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md
- https://en.wikipedia.org/wiki/Granite_Rapids
- https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
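The quoted 614.4 GB/s is consistent with 12 channels of DDR5-6400 at 8 bytes per transfer. A back-of-envelope sketch (the active-parameter count and bits/weight below are rough assumptions, not benchmark figures) shows why bandwidth-bound decode lands in the teens of tokens per second:

```python
# Peak bandwidth = channels * transfers/s * bytes per transfer
channels = 12          # Granite Rapids 6900-series memory channels
transfers = 6400e6     # DDR5-6400: 6.4e9 transfers/s per channel
bandwidth = channels * transfers * 8    # bytes/s -> 614.4 GB/s

# Decode is bandwidth-bound: each token reads every active weight once.
active_params = 37e9   # DeepSeek R1 active (MoE) params -- assumption
bits_per_weight = 4.5  # rough quantised size -- assumption
bytes_per_token = active_params * bits_per_weight / 8
theoretical_tps = bandwidth / bytes_per_token   # ~30 t/s ceiling
```

The observed 17 t/s is a plausible fraction of that ceiling once real-world memory efficiency and compute overhead are accounted for.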
DeepCoder Training Enhancements
DeepCoder-14B implements GRPO+ training, incorporating DAPO insights: offline difficulty filtering, removal of entropy and KL loss terms for stability, and using DAPO's overlong filtering for context generalization (trained 32K, eval 64K). Iterative context lengthening (16K→32K) also improved performance.
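As a sketch of the GRPO core that GRPO+ builds on (plain Python, not DeepCoder's training code): each rollout's advantage is computed relative to its own sampling group, and in the GRPO+ variant described above the KL and entropy terms are then simply dropped from the loss.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO scores each rollout against its own sampling group:
    # advantage = (reward - group mean) / group std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, rewarded e.g. by unit tests passing.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Passing rollouts get positive advantage, failing ones negative.
```

Offline difficulty filtering fits naturally here: prompts where all rollouts pass (or all fail) yield zero-variance groups and hence no learning signal, so they can be discarded before training.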
Links: