Llama 4 Architecture and Inference
Llama 4 interleaves dense and MoE layers and uses chunked attention (a fixed-chunk variant of sliding-window local attention). Inference has shown stability and quality issues across providers. Optimisation discussions include vertical splitting (shared layers on GPU, routed experts on CPU) for memory efficiency, or leveraging dedicated backends like KTransformers for hybrid CPU/GPU execution.
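As a concrete illustration, a minimal PyTorch sketch of a chunked-attention mask, where each query attends causally only within its own fixed-size chunk; the chunk size is illustrative, and the full-attention layers Llama 4 interleaves are omitted:

```python
import torch

def chunked_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """True where attention is allowed: same chunk, and key <= query (causal)."""
    pos = torch.arange(seq_len)
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    causal = pos[None, :] <= pos[:, None]
    return same_chunk & causal

mask = chunked_causal_mask(seq_len=16, chunk_size=8)
# Row i marks the keys query i may attend to; position 8 starts a fresh chunk,
# so unlike a sliding window, the visible context resets at chunk boundaries.
```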
Links:
- https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
- https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/llama4.md
- https://github.com/ggml-org/llama.cpp/pull/12791
Cogito Models and Iterated Distillation & Amplification (IDA)
DeepCogito released Cogito V1 Preview models (3B-70B) trained via Iterated Distillation and Amplification (IDA). IDA amplifies model intelligence with computation-heavy subroutines (e.g., chain-of-thought, verification) and then distills that capability back into the model's parameters, iterating the cycle. The models are reported to outperform similarly sized open counterparts.
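A hedged sketch of the loop this describes; `generate`, `verify`, and `finetune` are placeholder names for illustration, not DeepCogito's actual API:

```python
def amplify(model, problem, n_samples: int = 8) -> str:
    """Amplification: spend extra inference compute, e.g., sample several
    chain-of-thought solutions and keep the one a verifier scores highest."""
    candidates = [model.generate(problem, chain_of_thought=True)
                  for _ in range(n_samples)]
    return max(candidates, key=lambda answer: model.verify(problem, answer))

def ida_round(model, problems) -> None:
    """One IDA iteration: distill amplified answers back into the weights,
    so the next round amplifies an already-stronger baseline."""
    distilled = [(p, amplify(model, p)) for p in problems]
    model.finetune(distilled)  # train to produce amplified answers directly
```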
Links:
- https://www.deepcogito.com/research/cogito-v1-preview
- https://huggingface.co/collections/deepcogito/cogito-v1-preview-67eb105721081abe4ce2ee53
- https://ollama.com/library/cogito
Neural Graffiti: Transformer Neuroplasticity Layer
Neural Graffiti introduces a technique for adding neuroplasticity to transformers. A recurrent "Spray Layer" evolves a fused memory vector from prior prompts and injects it into the final hidden states before the softmax over the vocabulary, influencing token generation based on past interactions without retraining core weights.
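A minimal PyTorch sketch of this idea, assuming the memory is a single vector updated by a linear drift rule and added to final hidden states before the LM head; the exact update rule and injection point in Neural Graffiti may differ:

```python
import torch
import torch.nn as nn

class SprayLayer(nn.Module):
    """Persistent memory vector that drifts toward each new prompt's trace
    and is injected into the states feeding the LM head (pre-softmax)."""
    def __init__(self, hidden_size: int, lam: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lam = lam  # drift rate: how fast memory moves toward new input
        self.register_buffer("state", torch.zeros(hidden_size))  # survives across prompts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, hidden_size) final-layer states for the current prompt.
        target = torch.tanh(self.proj(hidden.mean(dim=0)))
        # Recurrent drift: memory evolves rather than being overwritten.
        self.state = self.state + self.lam * (target - self.state)
        # Inject memory into every position before the output projection.
        return hidden + self.state
```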
Modern LLM-Based Text-to-Speech Architecture
Current TTS systems often use an LLM backbone (e.g., Transformer, SSM) to autoregressively predict discrete acoustic tokens from text input. These tokens, representing compressed audio frames, are then decoded into waveform audio by a separate neural audio codec model (e.g., EnCodec, Lyra, DAC).
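A hedged sketch of that two-stage pipeline; `next_acoustic_token` and `decode` are placeholder interfaces standing in for a real LM head and a codec such as EnCodec, and one token per frame is a simplification (real codecs emit several codebook tokens per frame via RVQ):

```python
import torch

def synthesize(text_lm, codec, text_tokens: torch.Tensor,
               max_frames: int = 500, eos_id: int = 0) -> torch.Tensor:
    """Stage 1: autoregressively predict discrete acoustic tokens from text.
    Stage 2: decode those tokens into a waveform with the neural codec."""
    acoustic = []
    ctx = text_tokens
    for _ in range(max_frames):
        tok = text_lm.next_acoustic_token(ctx)  # placeholder; returns an int
        if tok == eos_id:
            break
        acoustic.append(tok)
        ctx = torch.cat([ctx, torch.tensor([tok])])  # feed prediction back in
    return codec.decode(torch.tensor(acoustic))  # tokens -> waveform samples
```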
Qwen 3 Preparations and Reasoning Benchmarks
Support for the upcoming Qwen 3 dense and MoE models has been merged into vLLM and llama.cpp, suggesting an imminent release. Separately, the MATH-Perturb benchmark was introduced to separate reasoning from memorization by applying minor versus fundamental perturbations to MATH problems; the larger performance drops on fundamentally perturbed variants indicate over-reliance on memorized solution methods.
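A minimal sketch of the comparison such a benchmark enables: score one model on original problems and on both perturbation types, then inspect the accuracy drops. `solve` and the dataset fields are placeholder assumptions, not the MATH-Perturb API:

```python
def accuracy(model, problems) -> float:
    """Fraction of problems where the model's final answer matches the key."""
    correct = sum(model.solve(p["question"]) == p["answer"] for p in problems)
    return correct / len(problems)

def perturbation_gaps(model, original, minor_perturbed, fundamental_perturbed):
    base = accuracy(model, original)
    return base, {
        # Small drop expected if the model genuinely reasons.
        "minor": base - accuracy(model, minor_perturbed),
        # Large drop suggests memorized solution methods rather than reasoning.
        "fundamental": base - accuracy(model, fundamental_perturbed),
    }
```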