GLM-4 32B Performance Overview
GLM-4 32B at Q8 shows strong local performance, reportedly outperforming similarly sized models and even some 72B models, particularly at code generation without truncation. It runs well locally (~22 t/s on 3x RTX 3090s). Using the GGUF currently requires a specific llama.cpp PR due to conversion issues. A reasoning variant (GLM-Z1) is also available.
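A quick way to sanity-check the reported throughput on your own hardware is to time a completion against the server's OpenAI-compatible endpoint. The host, port, model name, and prompt below are assumptions, not part of the original report:

```python
# Rough throughput check against a local llama.cpp server (hypothetical host/port).
# Assumes llama-server is already running with the GLM-4 GGUF loaded and that the
# OpenAI-compatible response includes a "usage" block with token counts.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # adjust to your setup

payload = {
    "model": "glm-4-32b",  # placeholder id; llama-server serves whatever model it loaded
    "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
    "max_tokens": 512,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} t/s")
```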
Optimizing Local Inference for Large MoE Models (Llama 4)
Llama 4 MoE models (Maverick/Scout) can run locally (e.g., 16GB VRAM + >64GB RAM) by using llama.cpp tensor overrides (-ot) to offload the conditional expert parameters to CPU. Around 10 t/s is achievable with dynamic quants (IQ2/Q4_K_XL) and mmap on Linux. The models are sensitive to samplers; avoid repetition penalties and consider DRY/XTC.
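A minimal launch sketch, assuming a recent llama.cpp build with --override-tensor (-ot) support; the model filename, context size, and tensor-name regex are assumptions and depend on the tensor names in the specific GGUF (check the verbose load log):

```python
# Launch llama-server with expert tensors kept in system RAM while the rest goes to GPU.
import subprocess

cmd = [
    "./llama-server",
    "-m", "Llama-4-Scout-17B-16E-Instruct-Q4_K_XL.gguf",  # hypothetical filename
    "-ngl", "99",                    # offload all non-overridden layers to GPU
    "-c", "8192",                    # context size
    "-ot", r".*ffn_.*_exps.*=CPU",   # keep conditional expert tensors on CPU (regex is an assumption)
    "--host", "127.0.0.1",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```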
Inference Engine Benchmark Reproducibility (SGLang vs vLLM)
Benchmarks comparing inference engines like SGLang and vLLM are fragile: results can flip based on subtle implementation differences or parameter changes (e.g., the number of prompts). Reproducibility requires consistent infrastructure deployment (e.g., via SkyPilot) and precise flag settings. For production use, run benchmarks tailored to your own workload; a minimal client sketch follows the links below.
Links:
- https://github.com/Michaelvll/llm-ie-benchmarks
- https://github.com/skypilot-org/skypilot
- https://github.com/Michaelvll/llm-ie-benchmarks/pull/1
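As a sketch of such a workload-tailored check (not the linked repo's harness), the same fixed prompt set and pinned sampling parameters can be pointed at either engine's OpenAI-compatible endpoint; the URL, model id, and prompt count here are assumptions:

```python
# Like-for-like throughput check: run the identical prompt set and sampling
# settings against whichever engine (SGLang or vLLM) is serving the endpoint.
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
PROMPTS = [f"Summarize item {i} in one sentence." for i in range(64)]  # fixed prompt set

def run_once(prompt: str) -> tuple[int, float]:
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.0,  # pin sampling so engines are comparable
    }
    start = time.time()
    resp = requests.post(ENDPOINT, json=payload, timeout=300).json()
    return resp["usage"]["completion_tokens"], time.time() - start

total_tokens, total_time = 0, 0.0
for p in PROMPTS:
    tokens, elapsed = run_once(p)
    total_tokens += tokens
    total_time += elapsed

print(f"{total_tokens} tokens, {total_time:.1f}s, {total_tokens / total_time:.1f} t/s")
```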
Addressing RAG Retrieval Latency with Scale
Large knowledge bases can slow RAG retrieval. Consider pre-filtering candidates with BM25 before vector reranking. Evaluate vector database performance under load; alternatives like Elasticsearch with HNSW or pgvector may offer better scalability than default ChromaDB setups. Optimize chunking strategies.
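A minimal sketch of the two-stage approach, assuming rank_bm25 and sentence-transformers as stand-ins for whatever lexical and embedding stack is actually in use:

```python
# Hybrid retrieval: cheap BM25 pre-filter over the whole corpus, then dense
# reranking of only the surviving candidates.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["chunk one ...", "chunk two ...", "chunk three ..."]  # your chunked corpus
bm25 = BM25Okapi([d.lower().split() for d in docs])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, prefilter_k: int = 100, final_k: int = 5) -> list[str]:
    # Stage 1: lexical scoring over all chunks.
    scores = bm25.get_scores(query.lower().split())
    candidates = np.argsort(scores)[::-1][:prefilter_k]
    # Stage 2: dense similarity only over the BM25 survivors.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs[candidates] @ q_vec
    ranked = candidates[np.argsort(sims)[::-1][:final_k]]
    return [docs[i] for i in ranked]

print(retrieve("example question"))
```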
New Multimedia Generation Models: InstantCharacter and SkyReels-V2
InstantCharacter offers character-preserving image generation from a single reference image and is compatible with Flux; it needs roughly 20-30GB of VRAM. SkyReels-V2 (1.3B and 14B) enables infinite-length video generation (T2V/I2V) and is benchmarked favorably against alternatives like HunyuanVideo and Wan2.1. Both releases include code and papers.