Multi-GPU Configuration Insights
For multi-GPU fine-tuning without NVLink, high PCIe bandwidth (e.g., Gen4 x16) is critical because inter-GPU traffic must route through the CPU/chipset. For inference, adding a lower-spec GPU can significantly boost performance when VRAM is the bottleneck, since it enables full model offload, but it hurts performance when the model already fits entirely on the faster GPU.
Links:
- https://www.reddit.com/r/LocalLLaMA/comments/1jxbj7a/comment/mmp50q4/
- https://www.reddit.com/r/LocalLLaMA/comments/1jykvwz/3090_2070_experiments/
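The "does the model fit?" decision above comes down to simple arithmetic. A minimal sketch (the 1.5 GB runtime overhead is an assumption, and KV-cache growth with context length is ignored):

```python
def fits_in_vram(n_params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Rough check: weight bytes plus a fixed overhead vs. available VRAM.

    overhead_gb is an assumed allowance for the runtime; KV cache,
    which grows with context length, is deliberately ignored here.
    """
    weight_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb <= vram_gb

# A 9B model at ~6.5 bits/weight (Q6_K-ish) needs ~7.3 GB for weights:
print(fits_in_vram(9, 6.5, 24))  # True  -> fits on a 24 GB 3090 alone
print(fits_in_vram(9, 6.5, 8))   # False -> spills past an 8 GB 2070
```

When the check fails on the fast GPU alone, splitting across both cards avoids CPU-RAM offload, which is exactly the case where the slower GPU helps.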
Llama.cpp Updates: Copilot Integration and Llama 4 Fixes
VSCode's GitHub Copilot can now use local models served by llama.cpp, configured through the existing Ollama endpoint setting. Separately, llama.cpp received fixes for Llama 4's RoPE implementation and normalization, improving output quality; GGUF models must be re-converted and re-quantized after the fix.
Links:
- https://github.com/ggml-org/llama.cpp/pull/12896
- https://github.com/ggml-org/llama.cpp/pull/12889
- https://github.com/ggml-org/llama.cpp/pull/12882
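The local-model hookup works because llama-server speaks an OpenAI-style chat API. A minimal sketch of the request shape such a client sends; the endpoint URL (llama-server's usual default) and the model name are assumptions:

```python
import json

# Assumption: a local llama-server is running with its OpenAI-compatible
# API at the usual default address; a client like Copilot (via the Ollama
# endpoint setting) would POST payloads of this shape to it.
ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed default

def build_chat_request(prompt: str, model: str = "local") -> dict:
    """Assemble an OpenAI-style chat-completion payload."""
    return {
        "model": model,  # name is an assumption; the server loads one model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

print(json.dumps(build_chat_request("Say hello."), indent=2))
```

Any HTTP client can POST this JSON to `ENDPOINT`; the response follows the same OpenAI chat-completion schema.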
HIGGS: Advanced LLM Compression Method
Researchers introduced HIGGS, a novel quantization technique achieving high performance with minimal quality loss, particularly effective at 4, 3, and even 2 bits per weight. Notably, it compressed the 671B DeepSeek R1 model effectively. Further comparisons against existing llama.cpp quantizations are warranted.
Links:
- https://arxiv.org/pdf/2411.17525
- https://github.com/HanGuo97/flute
- https://huggingface.co/docs/transformers/main/en/quantization/higgs
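To see why 2-4 bits per weight matters at this scale, here is the weight-storage arithmetic for a 671B-parameter model (weights only; KV cache and activations excluded):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# DeepSeek R1 (671B parameters) at FP16 vs. aggressive quantization:
for bpw in (16, 4, 3, 2):
    print(f"{bpw:>2} bits/weight: {weight_memory_gb(671e9, bpw):,.1f} GB")
```

At 16 bits the weights alone need ~1,342 GB; at 2 bits per weight that drops to ~168 GB, which is what makes running such a model on a single multi-GPU node plausible at all.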
Early AMD Radeon RX 9070 XT Benchmarks
Initial llama.cpp benchmarks for the AMD Radeon RX 9070 XT on Windows (via HIP SDK 6.2, targeting gfx1201) show promising results: for gemma-2-9b-it-Q6_K_L, generation speed reached ~55 t/s. Note that Flash Attention optimizations are currently non-functional under Windows/HIP for this GPU.
Links:
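For comparing such benchmark figures, it can help to convert throughput into per-token latency:

```python
def ms_per_token(tokens_per_s: float) -> float:
    """Convert generation throughput (t/s) to per-token latency (ms)."""
    return 1000.0 / tokens_per_s

# The ~55 t/s reported above works out to roughly 18 ms per token:
print(f"{ms_per_token(55):.1f} ms/token")
```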
New Tools: Docker Local LLMs & Android Agent Control
Docker introduced experimental support for running local LLMs, potentially wrapping llama.cpp, including Apple Silicon compatibility. Additionally, the DroidRun project enables AI agents, powered by vision models like Gemini Flash, to interact with and control Android devices programmatically.
Links: