DeepSeek V3 (0324) Update Analysis
The updated DeepSeek V3 (0324) shows a significant reasoning boost, with base-model responses resembling chain-of-thought outputs, possibly due to RL training with GRPO. The weights are released under an MIT license. Benchmarks on an M3 Ultra (512 GB, q4_K_M GGUF) show ~6 T/s generation and ~9 T/s prompt processing in KoboldCpp, though MLX reportedly processes prompts faster; a throughput sketch follows the links below.
Links:
- https://composio.dev/blog/deepseek-v3-0324-the-sonnet-3-5-at-home/
- https://www.reddit.com/r/LocalLLaMA/comments/1j9vjf1/deepseek_r1_671b_q4_m3_ultra_512gb_with_mlx/
- https://venturebeat.com/ai/deepseek-v3-now-runs-at-20-tokens-per-second-on-mac-studio-and-thats-a-nightmare-for-openai
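For reference, a minimal sketch of measuring local generation throughput with the mlx-lm package on Apple silicon. The repo id is illustrative (substitute whichever MLX conversion you actually have), and the figures above came from KoboldCpp, not from this script.

```python
# Rough throughput check with mlx-lm; the repo id below is illustrative.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")  # illustrative id

start = time.time()
text = generate(model, tokenizer, prompt="Explain GRPO in two sentences.",
                max_tokens=256)
elapsed = time.time() - start

# Rough tokens/s; includes prompt processing, so it understates pure generation.
print(f"~{len(tokenizer.encode(text)) / elapsed:.1f} tok/s")
```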
Qwen Model Developments
Qwen2.5-Omni-7B, a new multimodal model, has been released: it supports audio input and output but shows benchmark regressions relative to the base text model, and GGUF support is still awaited. Meanwhile, the Qwen2.5-VL models (72B/32B) now top OCR benchmarks at ~75% accuracy, outperforming specialized OCR models such as mistral-ocr; a transcription sketch follows the links below.
Links:
- https://huggingface.co/Qwen/Qwen2.5-Omni-7B
- https://github.com/getomni-ai/benchmark
- https://huggingface.co/datasets/getomni-ai/ocr-benchmark
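A hedged sketch of OCR-style transcription with Qwen2.5-VL via Hugging Face transformers (assumes a release that ships Qwen2.5-VL support; the input filename is illustrative):

```python
# Hedged sketch: OCR-style transcription with Qwen2.5-VL via transformers.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # the 72B variant works the same way
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("receipt.png")  # illustrative input file
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Transcribe all text in this image verbatim."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```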
Autoregressive Image Generation Techniques
Recent improvements in image generation (GPT-4o, Gemini) use autoregressive models rather than diffusion: the LLM itself generates the image token by token, enabling finer control over details and text rendering. Implementations might involve techniques like those in DeepSeek Janus (transformer + rectified flow) or OmniGen (an LLM connected to a VAE).
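To make the contrast with diffusion concrete, here is a self-contained toy sketch of the autoregressive recipe in PyTorch: a causal transformer samples discrete image tokens one grid cell at a time, and a real system would then decode the finished token grid to pixels with a VQ-style decoder. Every size and module here is illustrative, not any production model's internals.

```python
# Toy autoregressive image generation: sample a 16x16 grid of codebook ids
# left-to-right with a causal transformer. Illustrative sizes throughout.
import torch
import torch.nn as nn

VOCAB, GRID, DIM = 1024, 16, 256  # codebook size, token grid side, model width

class ToyImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(GRID * GRID, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (B, T) codebook ids
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (B, T, VOCAB) logits

@torch.no_grad()
def sample_image_tokens(model, temperature=1.0):
    tokens = torch.zeros(1, 1, dtype=torch.long)   # start token (id 0)
    for _ in range(GRID * GRID):                   # one grid cell at a time
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:].reshape(1, GRID, GRID)    # drop the start token

grid = sample_image_tokens(ToyImageLM())  # (1, 16, 16) codebook ids; a real
print(grid.shape)                         # system would VQ-decode these to pixels
```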
Embedding Layer Parameter Efficiency
Discussion has arisen around the parameter inefficiency of large embedding matrices (e.g., ~25% of Llama3-1B's parameters). Potential optimizations include learned low-rank projections (MLP-based), smaller tokenizers, hashing methods (Bloom embeddings with murmurhash, Faiss for candidate selection), byte/character-level models (CANINE, Charformer), and dynamic hashing; a Bloom-embedding sketch follows the links below.
Links:
- https://arxiv.org/abs/2501.16975
- https://explosion.ai/blog/bloom-embeddings
- https://github.com/facebookresearch/faiss
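Of the options above, the hashing route is the easiest to show concretely. Below is a hedged PyTorch sketch of a Bloom-style hashed embedding: each token id is hashed by several seeds into a table far smaller than the vocabulary, and the selected rows are summed. The hash, seeds, and sizes are all illustrative (the Explosion post linked above describes the production variant, which uses murmurhash).

```python
# Bloom-style hashed embedding sketch: k small-table lookups per token,
# summed. Collisions are tolerated since tokens rarely share all k rows.
import torch
import torch.nn as nn

class BloomEmbedding(nn.Module):
    def __init__(self, table_rows=20_000, dim=512, num_hashes=4):
        super().__init__()
        self.table = nn.Embedding(table_rows, dim)  # far smaller than the vocab
        self.rows = table_rows
        self.seeds = torch.arange(1, num_hashes + 1) * 0x9E3779B1  # illustrative

    def forward(self, token_ids):  # (B, T) int64 token ids
        # Cheap multiply-xor hash per seed; murmurhash is the classic choice.
        h = token_ids.unsqueeze(-1) * self.seeds.to(token_ids.device)
        idx = (h ^ (h >> 16)) % self.rows       # (B, T, k) row indices
        return self.table(idx).sum(dim=-2)      # (B, T, dim)

emb = BloomEmbedding()
ids = torch.randint(0, 128_000, (2, 8))  # e.g. a 128k-token vocabulary
print(emb(ids).shape)                    # torch.Size([2, 8, 512])
```

At these illustrative sizes the hashed table holds 20k x 512 ≈ 10M parameters, versus ~65M for a full 128k x 512 embedding matrix.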
Model Performance and Benchmarking Notes
Nemotron-Super-49B requires roughly 70% less KV cache than its Llama 70B base because many of its layers contain no self-attention, a clear win for long-context scenarios (see the sizing sketch below). Gemini 2.5 Pro tops LiveBench but shows weakness on ARC-AGI reasoning tests. Mismatches between official LiveBench scores and local runs (e.g., for DeepSeek-V3.1) have been noted, potentially due to private test sets.
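A back-of-envelope check of the cache claim, using Llama-3 70B's published attention config (80 layers, 8 KV heads via GQA, head dim 128, fp16 cache); the reduced attention-layer count is an assumption chosen purely to illustrate the ~70% figure:

```python
# KV cache is stored only by layers that contain self-attention, so replacing
# attention blocks (as NAS-derived models like Nemotron do) shrinks it linearly.
def kv_cache_gib(attn_layers, kv_heads=8, head_dim=128, ctx=128_000, bytes_per=2):
    return 2 * attn_layers * kv_heads * head_dim * ctx * bytes_per / 2**30  # 2x: K and V

full = kv_cache_gib(attn_layers=80)    # Llama 70B: every layer attends
pruned = kv_cache_gib(attn_layers=24)  # assumed count, chosen to show ~70%
print(f"{full:.1f} GiB -> {pruned:.1f} GiB ({1 - pruned / full:.0%} smaller)")
# 39.1 GiB -> 11.7 GiB (70% smaller)
```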