VibeThinker-3B: Small Model, Frontier Reasoning | Generated by AI

2026.06

Question: What is WeiboAI/VibeThinker-3B?

Answer:

VibeThinker-3B is a small language model (SLM) from Weibo AI that aims to push verifiable reasoning to the frontier at 3B parameters. It was published about 5 days before this answer was written (paper: arXiv 2606.16140). Here’s the breakdown:

Base model: Qwen2.5-3B (and Qwen2.5-Coder-3B), fine-tuned with a multi-stage pipeline.

Core hypothesis — Parametric Compression-Coverage Hypothesis:

Verifiable reasoning (math, code, STEM) is a highly compressible, parameter-dense capability. If feedback signals are reliable and the task space is structured, small models can reach near-frontier performance. Open-domain knowledge, by contrast, needs scale to cover the long tail.

This is a meaningful distinction: the claim is that a 3B model can approach a 671B model on verifiable tasks, not that it replaces one for general knowledge.


Training pipeline (Spectrum-to-Signal Principle, SSP):

  1. Curriculum SFT (2 stages) — Stage 1: broad coverage (math, code, STEM, dialogue). Stage 2: harder, longer-horizon samples. Uses Diversity-Exploring Distillation to preserve multiple valid solution paths.
  2. Multi-domain RL — MaxEnt-Guided Policy Optimization (MGPO), applied sequentially to math → code → STEM. 64K context window to keep full reasoning trajectories intact.
  3. Offline Self-Distillation — filters high-quality RL trajectories, distills back into a unified student. Prioritizes traces that are correct but not yet well-modeled (learning-potential score).
  4. Instruct RL — final stage for user-facing controllability, using rule-based validators + rubric-based reward models.
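
To make step 3 concrete, here is a minimal sketch of how "correct but not yet well-modeled" traces might be selected. It is not the authors' implementation. The Trajectory class, the learning_potential formula (mean negative log-likelihood under the student), and the top_k cutoff are all assumptions made for illustration. The summary above only says that such a score exists, not how it is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical container for one RL rollout."""
    prompt_id: str
    is_correct: bool              # verdict from a verifier (assumed to be available)
    token_logprobs: List[float]   # per-token log-probs under the student (assumed)


def learning_potential(traj: Trajectory) -> float:
    """Assumed score: mean negative log-likelihood under the student.

    Higher means the student models the trace less well.
    Incorrect or empty traces score 0.0.
    """
    if not traj.is_correct or not traj.token_logprobs:
        return 0.0
    return -sum(traj.token_logprobs) / len(traj.token_logprobs)


def select_for_distillation(trajs: List[Trajectory], top_k: int) -> List[Trajectory]:
    """Drop incorrect traces, then keep the top_k with the highest score."""
    correct = [t for t in trajs if t.is_correct and t.token_logprobs]
    return sorted(correct, key=learning_potential, reverse=True)[:top_k]


# Toy pool: two correct traces and one incorrect trace.
pool = [
    Trajectory("a", True, [-0.1, -0.2]),   # correct, already well modeled
    Trajectory("b", True, [-1.5, -2.5]),   # correct, poorly modeled
    Trajectory("c", False, [-3.0, -3.0]),  # incorrect, always dropped
]
picked = select_for_distillation(pool, top_k=1)
print([t.prompt_id for t in picked])
```

In this toy pool, trace "b" is preferred over "a" because both are correct but the student assigns "b" lower likelihood, so there is more left to learn from it. A real pipeline would likely also deduplicate and apply quality filters, which this sketch leaves out.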

Performance numbers:


What’s notable for you:


