Optimizing AI Workflows: High-Throughput and Cost-Effective Solutions

High-Throughput Inference with Qwen2.5-7B on 2x3090

Achieve ~4500 prompt tokens/s and ~825 generation tokens/s using Qwen2.5-7B W8A8 quantization on 2x 3090 GPUs. Utilizes Aphrodite engine with tensor parallelism (-tp 2) and optimized max_num_seqs=32. W8A8 quantization provided significant speedup over unquantized, AWQ, or GPTQ.

Novel Iterative Search Model: ReZero

ReZero model utilizes GRPO and tool-calling, rewarding retries to improve search result relevance. Achieved 46% performance vs 20% baseline, suggesting repetition isn't necessarily hallucination. Potential use as query generator or abstraction layer. Paper, model, and code available.

Links:

SFT Can Hinder RL Reasoning in LVLMs

Early SFT can induce "pseudo reasoning paths" imitated from expert models, hindering genuine RL progress in Large Vision-Language Models. While SFT teaches format, GRPO-based RL with mixed rewards fosters more adaptive reasoning, per recent research.

Links:

https://ucsc-vlaa.github.io/VLAA-Thinking/

Building High VRAM Rigs with MI50 GPUs

Achieve 160GB VRAM under $1200 using AMD MI50 GPUs (~$90 each) in Octominer cases. Requires ROCm 6.3.0 for MI50 support. Llama.cpp benchmarks show ~5 t/s on Llama3.3 70B Q8 across 5x MI50 (RPC), limited by prompt processing and RPC overhead.

Links:

https://www.reddit.com/r/LocalLLaMA/comments/1jy5p12/another_budget_build_160gb_of_vram_for_1000_maybe/

Local Agent Orchestration with LocalAGI

LocalAGI enables building and running AI agent workflows locally, using any OpenAI-compatible API (like LocalAI) for LLM access. Features WebUI, tool use, connectors (Telegram, Discord), persistent memory via LocalRecall, and Go backend. MIT licensed.

Links:

High-Throughput Inference with Qwen2.5-7B on 2x3090

Novel Iterative Search Model: ReZero

SFT Can Hinder RL Reasoning in LVLMs

Building High VRAM Rigs with MI50 GPUs

Local Agent Orchestration with LocalAGI

Read more