Llama 4 Inference Optimization
Run large Llama 4 MoE models efficiently, even when they exceed combined RAM+VRAM, by offloading only the MoE expert layers (tensors matching `ffn_.*_exps`) to CPU/NVMe with llama.cpp's `--override-tensor` flag, while keeping the base layers on the GPU via `-ngl`. Tune `--ubatch-size` (try removing or increasing it) for optimal prompt-processing speed.
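A minimal launch sketch of this setup, assuming `llama-server`, a hypothetical quantized Llama 4 Scout GGUF path, and batch/context values you would tune for your own hardware:

```sh
# Keep base/attention layers on GPU (-ngl 99 = offload as many layers as fit),
# but route the MoE expert tensors to CPU-side memory via the regex override.
./llama-server \
  -m models/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps.=CPU" \
  --ubatch-size 2048 \
  -c 16384
```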
BitNet b1.58 2B4T Release
Microsoft released BitNet b1.58 2B4T, an open-source native 1-bit LLM (2B parameters, 4T training tokens). Achieving the efficiency benefits reportedly requires the dedicated `bitnet.cpp` implementation. Model features include squared ReLU activations and a 4k context window; performance is comparable to similar-sized full-precision models.
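A rough quick-start sketch with `bitnet.cpp`; the setup/inference script names and flags below are assumptions based on the repository's README at release time, so verify them against the current repo before use:

```sh
# Clone bitnet.cpp with its submodules and install Python deps
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt

# Fetch/convert the 2B4T weights to the 1.58-bit i2_s format
# (script name and flags assumed; check the repo README)
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T-gguf -q i2_s

# Run an interactive CPU inference session
python run_inference.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "You are a helpful assistant" \
  -cnv
```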
Links:
- https://arxiv.org/abs/2504.12285
- https://huggingface.co/microsoft/BitNet-b1.58-2B-4T
- https://github.com/microsoft/BitNet
Serving Frameworks: vLLM vs TensorRT-LLM
TensorRT-LLM can offer significantly higher throughput (20-100%) than vLLM, particularly with FP8 precision and under high load. Deployment is simplified by using `trtllm-serve` instead of traditional Triton setups. Some users report suboptimal performance with GPTQ/AWQ quantizations under load in both frameworks.
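Both frameworks expose single-command OpenAI-compatible servers, so a rough like-for-like comparison can start from two launch commands; the model ID and ports here are placeholders:

```sh
# vLLM: OpenAI-compatible server directly from a Hugging Face model ID
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# TensorRT-LLM: trtllm-serve serves the same API without a hand-built Triton ensemble
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --port 8001
```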
Knowledge Graphs for Enhanced Code Agents
Explore using dynamic, project-specific Knowledge Graphs (KGs) that represent code structure (functions, classes, dependencies) for local coding agents. Queried alongside vector search, such a graph provides structured context for complex cross-file generation and refactoring, potentially exceeding standard RAG capabilities.
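As an illustration of the idea (not tied to any particular agent framework), a minimal Python sketch that builds a project-level graph of definitions, imports, and call edges using the standard-library `ast` module and `networkx`, which an agent could query next to its vector index; names like `load_config` are hypothetical examples:

```python
import ast
from pathlib import Path

import networkx as nx  # assumed dependency: pip install networkx


def build_code_kg(project_root: str) -> nx.DiGraph:
    """Build a tiny knowledge graph of functions, classes, imports, and calls."""
    kg = nx.DiGraph()
    for path in Path(project_root).rglob("*.py"):
        module = path.stem
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "function"
                kg.add_node(node.name, kind=kind, file=str(path))
                kg.add_edge(module, node.name, relation="defines")
            elif isinstance(node, ast.Import):
                for alias in node.names:
                    kg.add_edge(module, alias.name, relation="imports")
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                # Record unresolved call edges; a real agent would resolve scopes.
                kg.add_edge(module, node.func.id, relation="calls")
    return kg


if __name__ == "__main__":
    graph = build_code_kg(".")
    # Example structured query before a cross-file refactor:
    # which modules define or call "load_config"?
    print([(u, v, d) for u, v, d in graph.edges(data=True) if v == "load_config"])
```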
Byte-Latent Transformer (BLT) Weights
Meta Research has released the weights for its 1B and 7B parameter Byte-Latent Transformer (BLT) models. The architecture operates directly on byte sequences rather than on subword tokens. The associated paper details the model architecture and training methodology.
Links: