Llama 4 Architecture and Inference
Llama 4 interleaves dense and MoE layers and uses chunked attention (a fixed-chunk variant of sliding-window local attention). Inference has shown stability and quality issues across providers. Optimisation discussions include vertical splitting (shared layers on GPU, routed experts on CPU) for memory efficiency, or leveraging dedicated backends like KTransformers for hybrid CPU/GPU execution.
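As a concrete illustration, a minimal PyTorch sketch of a chunked-attention mask, where each query attends causally only within its own fixed-size chunk; the chunk size is illustrative, and the full-attention layers Llama 4 interleaves are omitted:

```python
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """True where attention is allowed: same chunk, and key <= query (causal)."""
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[None, :] <= pos[:, None]
    return same_chunk & causal

mask = chunked_causal_mask(seq_len=16, chunk_size=8)
# Row i marks the keys query i may attend to; position 8 starts a fresh chunk,
# so unlike a sliding window, the visible context resets at chunk boundaries.
```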
Links:
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
- https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/llama4.md
- https://github.com/ggml-org/llama.cpp/pull/12791
Cogito Models and Iterated Distillation & Amplification (IDA)
DeepCogito released Cogito V1 Preview models (3B-70B) trained via Iterated Distillation and Amplification (IDA). IDA amplifies model intelligence with computation-heavy subroutines (e.g., chain-of-thought, verification) and then distills that capability back into the model's parameters, iterating the cycle. The models are reported to outperform similarly sized open counterparts.
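A hedged sketch of the loop this describes; `generate`, `verify`, and `finetune` are placeholder names for illustration, not DeepCogito's actual API:

```python
def amplify(model, problem, n_samples: int = 8) -> str:
    """Amplification: spend extra inference compute, e.g., sample several
    chain-of-thought solutions and keep the one a verifier scores highest."""
    candidates = [model.generate(problem, chain_of_thought=True)
                  for _ in range(n_samples)]
    return max(candidates, key=lambda answer: model.verify(problem, answer))

def ida_round(model, problems) -> None:
    """One IDA iteration: distill amplified answers back into the weights,
    so the next round amplifies an already-stronger baseline."""
    distilled = [(p, amplify(model, p)) for p in problems]
    model.finetune(distilled)  # train to produce amplified answers directly
```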
Links:
- https://www.deepcogito.com/research/cogito-v1-preview
- https://huggingface.co/collections/deepcogito/cogito-v1-preview-67eb105721081abe4ce2ee53
- https://ollama.com/library/cogito
Neural Graffiti: Transformer Neuroplasticity Layer
Neural Graffiti introduces a technique for adding neuroplasticity to transformers. A recurrent "Spray Layer" evolves a fused memory vector from prior prompts and injects it into the final hidden states before the softmax over the vocabulary, influencing token generation based on past interactions without retraining core weights.
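A minimal PyTorch sketch of this idea, assuming the memory is a single vector updated by a linear drift rule and added to final hidden states before the LM head; the exact update rule and injection point in Neural Graffiti may differ:

```python
import torch
import torch.nn as nn

class SprayLayer(nn.Module):
    """Persistent memory vector that drifts toward each new prompt's trace
    and is injected into the states feeding the LM head (pre-softmax)."""
    def __init__(self, hidden_size: int, lam: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lam = lam  # drift rate: how fast memory moves toward new input
        self.register_buffer("state", torch.zeros(hidden_size))  # survives across prompts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, hidden_size) final-layer states for the current prompt.
        target = torch.tanh(self.proj(hidden.mean(dim=0)))
        # Recurrent drift: memory evolves rather than being overwritten.
        self.state = self.state + self.lam * (target - self.state)
        # Inject memory into every position before the output projection.
        return hidden + self.state
```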
Modern LLM-Based Text-to-Speech Architecture
Current TTS systems often use an LLM backbone (e.g., Transformer, SSM) to autoregressively predict discrete acoustic tokens from text input. These tokens, representing compressed audio frames, are then decoded into waveform audio by a separate neural audio codec model (e.g., EnCodec, Lyra, DAC).
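A hedged sketch of that two-stage pipeline; `next_acoustic_token` and `decode` are placeholder interfaces standing in for a real LM head and a codec such as EnCodec, and one token per frame is a simplification (real codecs emit several codebook tokens per frame via RVQ):

```python
import torch

def synthesize(text_lm, codec, text_tokens: torch.Tensor,
               max_frames: int = 500, eos_id: int = 0) -> torch.Tensor:
    """Stage 1: autoregressively predict discrete acoustic tokens from text.
    Stage 2: decode those tokens into a waveform with the neural codec."""
    acoustic = []
    ctx = text_tokens
    for _ in range(max_frames):
        tok = text_lm.next_acoustic_token(ctx)  # placeholder; returns an int
        if tok == eos_id:
            break
        acoustic.append(tok)
        ctx = torch.cat([ctx, torch.tensor([tok])])  # feed prediction back in
    return codec.decode(torch.tensor(acoustic))  # tokens -> waveform samples
```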
Qwen 3 Preparations and Reasoning Benchmarks
Support for the upcoming Qwen 3 dense and MoE models has been merged into vLLM and llama.cpp, suggesting an imminent release. Separately, the MATH-Perturb benchmark was introduced to separate reasoning from memorization by applying minor versus fundamental perturbations to MATH problems; the larger performance drops on fundamentally perturbed variants indicate over-reliance on memorized solution methods.
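A minimal sketch of the comparison such a benchmark enables: score one model on original problems and on both perturbation types, then inspect the accuracy drops. `solve` and the dataset fields are placeholder assumptions, not the MATH-Perturb API:

```python
def accuracy(model, problems) -> float:
    """Fraction of problems where the model's final answer matches the key."""
    correct = sum(model.solve(p["question"]) == p["answer"] for p in problems)
    return correct / len(problems)

def perturbation_gaps(model, original, minor_perturbed, fundamental_perturbed):
    base = accuracy(model, original)
    return base, {
        # Small drop expected if the model genuinely reasons.
        "minor": base - accuracy(model, minor_perturbed),
        # Large drop suggests memorized solution methods rather than reasoning.
        "fundamental": base - accuracy(model, fundamental_perturbed),
    }
```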