Llama 4 Release Overview
Meta released Llama 4 Scout (109B total parameters, 16 experts) and Llama 4 Maverick (400B total parameters, 128 experts), both mixture-of-experts (MoE) models with 17B active parameters per token. Initial community reception has been one of disappointment, particularly with coding performance relative to existing models like Qwen and DeepSeek, despite the claimed 10M-token context window on Scout. A routing sketch follows the links below.
Links:
- https://ai.meta.com/blog/llama-4-multimodal-intelligence/
- https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164
- https://www.llama.com/llama4/
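The "17B active parameters" figure means each token runs only a small subset of the experts. Meta's blog describes routing each token to a shared expert plus one routed expert; the sketch below shows generic top-k MoE routing in PyTorch, with the dimensions, the `k` value, and the expert modules as illustrative assumptions rather than Meta's implementation.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, k=1):
    """Generic top-k MoE routing sketch (not Meta's code).

    x:        (tokens, d_model) hidden states
    router_w: (d_model, n_experts) router projection
    experts:  list of feed-forward modules; only the top-k run per token
    """
    logits = x @ router_w                   # (tokens, n_experts)
    gate, idx = logits.topk(k, dim=-1)      # choose k experts per token
    gate = F.softmax(gate, dim=-1)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        hit = (idx == e).any(dim=-1)        # tokens routed to expert e
        if hit.any():
            w = gate[hit][idx[hit] == e].unsqueeze(-1)
            out[hit] += w * expert(x[hit])  # only these tokens pay for e
    return out

# toy usage: 4 experts, top-1 routing, so ~1/4 of expert params are "active"
experts = [torch.nn.Linear(64, 64) for _ in range(4)]
x = torch.randn(10, 64)
y = moe_forward(x, torch.randn(64, 4), experts, k=1)
```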
iRoPE Architecture for Long Context
Llama 4's long-context capability comes from the iRoPE architecture: local chunked attention layers with RoPE handle spans up to 8K tokens, interleaved with global attention layers that cover longer contexts (>8K) without positional embeddings (NoPE). Inference-time attention temperature scaling is applied at the global layers to improve long-range reasoning.
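A minimal PyTorch sketch of the interleaving and the temperature scaling follows. The every-4th-layer global pattern, the scaling constants, and the mask construction are assumptions for illustration; RoPE application and causal masking are omitted.

```python
import math
import torch

CHUNK = 8192  # local-attention chunk size (the 8K limit described above)

def is_global_layer(layer_idx: int, every_n: int = 4) -> bool:
    """Hypothetical interleave: every n-th layer is a global NoPE layer."""
    return (layer_idx + 1) % every_n == 0

def temperature_scale(positions: torch.Tensor, floor: int = CHUNK,
                      scale: float = 0.1) -> torch.Tensor:
    """Inference-time attention temperature scaling (illustrative constants):
    queries at positions beyond the training chunk get a logarithmic boost
    so the softmax stays peaked as context length grows."""
    return 1.0 + scale * torch.log(torch.clamp(positions.float() / floor, min=1.0))

def attend(q, k, v, positions, global_layer: bool):
    """q, k, v: (seq, d); positions: (seq,). Causal masking omitted."""
    d = q.shape[-1]
    if global_layer:
        # global layer: no position embeddings (NoPE), full attention,
        # temperature scaling applied to the queries at inference time
        q = q * temperature_scale(positions).unsqueeze(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    else:
        # local layer: RoPE would be applied to q/k here; attention is
        # restricted to tokens within the same 8K chunk
        chunk = positions // CHUNK
        same_chunk = chunk.unsqueeze(-1) == chunk.unsqueeze(-2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)
        scores = scores.masked_fill(~same_chunk, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```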
Inference-Time Scaling for Reward Modeling
DeepSeek's GRM paper explores inference-time scalability for generalist reward models (GRMs). Their Self-Principled Critique Tuning (SPCT) method trains a GRM to generate principles and critiques before scoring; with parallel sampling and voting at inference, a smaller model (e.g., 27B) can reportedly match or exceed much larger reward models (e.g., 671B).
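The scaling mechanism itself is simple to express: sample several independent principle-and-critique generations from the GRM and vote over the per-response scores (the paper additionally describes a meta-RM to filter low-quality samples). The sketch below assumes a hypothetical `grm.generate` interface and a "Response i: score" output format, neither of which is specified by the paper.

```python
import re
from collections import defaultdict

def sample_scores(grm, prompt: str, responses: list[str]) -> dict[int, int]:
    """One sampled generation: the GRM writes principles, critiques each
    response, and emits lines like 'Response 1: 8'. Both the grm.generate
    interface and the output format are hypothetical."""
    text = grm.generate(prompt, responses, temperature=1.0)
    return {int(m.group(1)): int(m.group(2))
            for m in re.finditer(r"Response (\d+):\s*(\d+)", text)}

def best_response(grm, prompt: str, responses: list[str], k: int = 8) -> int:
    """SPCT-style inference-time scaling: k parallel samples, vote by
    summing each response's scores across the samples."""
    totals = defaultdict(int)
    for _ in range(k):
        for idx, score in sample_scores(grm, prompt, responses).items():
            totals[idx] += score
    return max(totals, key=totals.get)  # index of the highest-voted response
```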
Neural Graffiti Technique
A novel technique termed "Neural Graffiti" proposes splicing a new neuron layer into a pre-trained LLM so that memory recall reshapes token prediction. The aim is neuroplasticity-like behavior, modulating outputs with accumulated context during generation, without full finetuning.
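The writeup is light on specifics, so the following is a hedged sketch of the idea only: a small spliced layer keeps an evolving memory vector of recent hidden states and nudges the final hidden state toward it before the frozen LM head. The update rule, gating, and placement are assumptions, not the author's exact implementation.

```python
import torch
import torch.nn as nn

class GraffitiLayer(nn.Module):
    """Spliced between the base model's last hidden layer and its LM head.
    Only this layer is new; the pre-trained weights stay frozen."""

    def __init__(self, d_model: int, decay: float = 0.9):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # the new "graffiti" neurons
        self.decay = decay
        self.register_buffer("memory", torch.zeros(d_model))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model). Drift the memory vector toward the
        # current context; this is recurrence across calls, not finetuning.
        with torch.no_grad():
            ctx = hidden.detach().reshape(-1, hidden.shape[-1]).mean(dim=0)
            self.memory.mul_(self.decay).add_((1 - self.decay) * ctx)
        # reshape token prediction by blending recalled memory into the state
        return hidden + torch.tanh(self.proj(self.memory))
```

Because the memory buffer persists across forward calls, earlier prompts keep influencing later token predictions through the unchanged LM head.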
o200k_base Tokenizer Bug
The Quasar-Alpha model produces the same incorrect responses as early GPT-4o versions for specific Chinese prompts (e.g., "给主人留下些什么吧", roughly "leave the host a message"). The behavior traces to token ID 177431 in the shared o200k_base tokenizer, suggesting Quasar-Alpha either originates from OpenAI or uses the same flawed tokenizer.
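The claim is easy to probe with the tiktoken library, which ships the o200k_base encoding used by GPT-4o. The token ID below is the one cited in the report; if the report is accurate, the phrase encodes to that single, evidently under-trained token.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer

# The Chinese prompt from the report, roughly "leave the host a message"
# (an old Qzone guestbook placeholder that likely polluted training data).
phrase = "给主人留下些什么吧"

print(enc.encode(phrase))    # reportedly [177431]: a single glitch token
print(enc.decode([177431]))  # should round-trip back to the phrase
```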