Llama 4 Maverick Performance
Llama 4 Maverick reportedly runs at over 45 tokens/s locally on a single RTX 4090 when paired with a Xeon QYFS engineering-sample CPU (56c/112t), 512GB of system RAM, and KTransformers. High system RAM is the critical ingredient, since KTransformers offloads the MoE experts to CPU memory; with enough RAM, a dual-GPU setup may not be necessary for this configuration.
PyTorch 2.7 Release
Stable release adds support for Nvidia Blackwell GPUs (RTX 5090, B200). Includes Mega Cache for portable torch.compile artifacts via save/load_cache_artifacts(). Also brings improvements to Intel AMX and aarch64 support, potentially simplifying deployment on platforms like GH200.
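A minimal sketch of the Mega Cache workflow as described in the release notes: compile once, serialize the compile caches to a portable blob, and pre-populate the caches on another machine before compiling there. The API lives under `torch.compiler`; the exact return shape shown here is based on the 2.7 documentation and should be treated as approximate.

```python
import torch

# Build and compile a small model; the warm-up call triggers compilation
# and populates torch.compile's caches.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).cuda()
compiled = torch.compile(model)
compiled(torch.randn(8, 1024, device="cuda"))

# Mega Cache: serialize all compile artifacts into a single portable blob.
result = torch.compiler.save_cache_artifacts()
assert result is not None, "nothing has been compiled/cached yet"
artifact_bytes, cache_info = result
with open("compile_cache.bin", "wb") as f:
    f.write(artifact_bytes)

# On another machine (same torch version / GPU arch), pre-populate the caches
# before compiling so torch.compile hits warm caches instead of rebuilding.
with open("compile_cache.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())
```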
Production Quantization on A100
Experiments suggest common INT8/INT4 quantization methods (w8a8, w4a16, HQQ) can yield lower throughput than bf16 for smaller models (~3B) on A100 GPUs. Effective production deployment may rely more heavily on methods like AWQ, batching, and potentially speculative decoding for larger models.
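One way to reproduce this kind of comparison is to measure decode throughput for a bf16 checkpoint against a pre-quantized (e.g. AWQ) checkpoint of the same model under identical prompts. The sketch below uses vLLM; the model names are placeholders, and in practice each configuration should be benchmarked in its own process so two engines never share GPU memory.

```python
import time
from vllm import LLM, SamplingParams

PROMPTS = ["Summarize the benefits of quantization."] * 64
PARAMS = SamplingParams(max_tokens=256, temperature=0.0)

def tokens_per_second(llm: LLM) -> float:
    """Rough decode throughput: generated tokens / wall-clock seconds."""
    start = time.perf_counter()
    outputs = llm.generate(PROMPTS, PARAMS)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed

# Placeholder checkpoints; substitute the ~3B model you actually benchmark.
# Run each configuration in a separate process on the same A100.
bf16_tps = tokens_per_second(LLM(model="my-org/model-3b", dtype="bfloat16"))
awq_tps = tokens_per_second(LLM(model="my-org/model-3b-awq", quantization="awq"))
print(f"bf16: {bf16_tps:.0f} tok/s   awq: {awq_tps:.0f} tok/s")
```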
AMD ROCm Software Ecosystem
SemiAnalysis notes recent ROCm progress but highlights that AMD's lower compensation hinders AI talent acquisition compared to Nvidia. AMD also lacks competitive Python kernel DSLs, while Nvidia offers multiple options such as Triton, CuTe Python, cuTile Python, Numba, and Warp.
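For context on what a Python kernel DSL looks like in practice, here is a minimal vector-add kernel in Triton, the most established of the options listed (this is a generic illustration of the programming model, not code from the report):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```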
Distilling Dense Models to Sparse MoEs
Discussion of self-logit distillation techniques for converting dense models (e.g., DeepHermes 24B) into efficient, sparse Llama 4-style MoEs. A major bottleneck is extracting teacher logits at scale, as inference engines like vLLM currently lack robust support for returning full logits.
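A minimal sketch of the distillation objective being discussed, assuming the dense teacher's logits have already been dumped offline (e.g., via an ordinary transformers forward pass, since vLLM cannot easily return them): the sparse-MoE student is trained to match the teacher's token distribution with a temperature-scaled KL loss. Shapes and names below are illustrative only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens.

    Both tensors are [batch, seq_len, vocab] for the same input ids;
    the teacher is the dense model, the student the sparse-MoE conversion.
    """
    t = temperature
    student_logprobs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Flattening tokens + batchmean gives a per-token average KL; the t**2
    # factor is the standard KD scaling that keeps gradients comparable
    # across temperatures.
    kl = F.kl_div(
        student_logprobs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    return kl * (t ** 2)

# Illustrative shapes: 2 sequences of 16 tokens over a 32k vocabulary.
student = torch.randn(2, 16, 32_000, requires_grad=True)
teacher = torch.randn(2, 16, 32_000)
loss = distillation_loss(student, teacher, temperature=2.0)
loss.backward()
```

Storing full-vocabulary teacher logits for every token is what makes extraction at scale expensive, which is why the lack of first-class logit output in fast inference engines is the bottleneck called out above.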