Dual-GPU Inference Performance with vLLM
Benchmarks show that dual consumer GPUs (e.g., 2x RTX 4070 Ti SUPER) running QwQ-32B-AWQ under vLLM or SGLang can outperform a single higher-end card (RTX 4090), challenging the assumption that splitting inference across two GPUs introduces a crippling sequential bottleneck. Even 2x RTX 5090s surpass H100 performance. Efficient AWQ quantization is a key enabler.
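As a point of reference, a tensor-parallel vLLM launch for such a setup is only a few lines. The sketch below assumes vLLM's offline Python API, the Qwen/QwQ-32B-AWQ Hugging Face repo id, and illustrative sampling and memory settings; it is not the benchmark harness itself.

```python
# Minimal sketch: QwQ-32B-AWQ sharded across two consumer GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/QwQ-32B-AWQ",       # AWQ checkpoint; repo id assumed
    quantization="awq",             # select the AWQ kernels
    tensor_parallel_size=2,         # shard the model across both cards
    gpu_memory_utilization=0.90,    # leave some VRAM headroom per card
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain tensor parallelism in one short paragraph."], params)
print(outputs[0].outputs[0].text)
```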
DeepSeek MLA KV Cache Optimization
DeepSeek's Multi-Head Latent Attention (MLA) drastically reduces KV cache size compared to standard multi-head attention (MHA). Inference frameworks like vLLM, SGLang, and ktransformers support MLA, optimizing memory usage. Llama.cpp support is still pending (see the PR below); verify framework compatibility before counting on MLA for efficient inference.
Links:
- https://github.com/ggml-org/llama.cpp/pull/11446
- https://github.com/vllm-project/vllm/releases/tag/v0.7.1
- https://github.com/sgl-project/sglang/issues/2591
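For a rough sense of why this matters, the back-of-the-envelope sketch below compares per-token KV cache size under plain MHA and under MLA's compressed latent. The layer/head counts and latent dimensions are approximate DeepSeek-V3-style values and are assumptions, not figures from the links above.

```python
# Back-of-the-envelope KV cache size per token: MHA vs. MLA.
# Config values are approximate DeepSeek-V3-style numbers (assumed).
layers = 61
heads = 128
head_dim = 128          # per-head K/V dim if full keys and values were cached (MHA)
kv_lora_rank = 512      # compressed latent cached by MLA
rope_head_dim = 64      # decoupled RoPE key also cached by MLA
bytes_per_elem = 2      # fp16 / bf16

mha_per_token = layers * 2 * heads * head_dim * bytes_per_elem            # K and V for every head
mla_per_token = layers * (kv_lora_rank + rope_head_dim) * bytes_per_elem  # shared latent + RoPE key

print(f"MHA cache: {mha_per_token / 1024:.0f} KiB per token")
print(f"MLA cache: {mla_per_token / 1024:.0f} KiB per token")
print(f"roughly {mha_per_token / mla_per_token:.0f}x smaller")
```

Even if the exact numbers differ by model, the latent cache is tens of times smaller per token, which is why MLA-aware frameworks can fit far longer contexts in the same VRAM.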
Local LLMs via Peer-to-Peer Distribution
An experiment using torrents to distribute Qwen2.5-VL-3B-Instruct explores decentralized model sharing. This method requires redistribution-friendly licenses (e.g., Apache-2.0). IPFS is suggested as an alternative; canonical hashes are crucial for verifying model integrity.
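Whatever the transport (torrent or IPFS), integrity checking comes down to comparing file digests against a trusted publication. A minimal sketch, assuming SHA-256 digests are published alongside the model; the file name and expected digest below are placeholders.

```python
# Sketch: verify downloaded model shards against published canonical hashes.
# The expected digest and file layout are placeholders, not real values.
import hashlib
from pathlib import Path

EXPECTED = {
    "model.safetensors": "0000000000000000...",  # placeholder digest
}

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large shards don't load into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

model_dir = Path("Qwen2.5-VL-3B-Instruct")
for name, expected in EXPECTED.items():
    digest = sha256_of(model_dir / name)
    status = "OK" if digest == expected else "MISMATCH"
    print(f"{name}: {status} ({digest[:16]}...)")
```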
Prompting for Reasoning in Non-Reasoning Models
Employing prompting techniques like "think step-by-step" instructions or explicit <think> tags can enhance reasoning performance even in models not specifically trained for it. The improvement is likely due to context priming or to the additional computation the model performs while generating intermediate tokens.
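A minimal sketch of the trick against an OpenAI-compatible local endpoint; the server URL, model name, and exact prompt wording are assumptions.

```python
# Sketch: prompting a non-reasoning model to reason inside <think> tags
# via an OpenAI-compatible local server (e.g. llama.cpp or vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

system = (
    "Before answering, reason step by step inside <think>...</think> tags, "
    "then give only the final answer."
)
resp = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

The tokens generated inside the tags are where the extra computation happens; stripping them before display is left to the client.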
Llama.cpp Cross-Generation GPU Support
Llama.cpp continues robust development, recently passing 5000 commits. A key strength remains its support for heterogeneous GPU setups, including older architectures like Pascal and Volta alongside newer ones (Ampere/Ada), a feature often lacking in frameworks focused only on recent hardware.
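A sketch of such a mixed-generation split using the llama-cpp-python bindings (an assumption; the llama.cpp CLI exposes the equivalent --n-gpu-layers and --tensor-split flags). The model path and split ratio are placeholders.

```python
# Sketch: splitting a GGUF model across two GPUs of different generations,
# e.g. an older Pascal card plus a newer Ada card. Values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,            # offload all layers
    tensor_split=[0.35, 0.65],  # give the older/smaller card the smaller share
)

out = llm("Q: Why mix GPU generations in one box?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```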