DeepSeek AI Unveils DeepSeekMath-V2: Open-Weight Self-Verifying Math Model Scoring 118/120 on Putnam 2024

How can an AI system prove complex olympiad-level math problems in clear natural language while verifying its own reasoning? DeepSeek AI has launched DeepSeekMath-V2, an open-weight large language model optimized for self-verified natural language theorem proving. Built on DeepSeek-V3.2-Exp-Base, it runs as a 685B-parameter mixture-of-experts (MoE) model and is available on Hugging Face under the Apache 2.0 license.
In evaluations, DeepSeekMath-V2 achieves gold-level scores on IMO 2025 and CMO 2024, and scores 118/120 on Putnam 2024 with scaled test-time compute.

Why Final-Answer Rewards Are Not Enough

Most recent math models use reinforcement learning (RL) that rewards only final answers on benchmarks such as AIME and HMMT, pushing them to near saturation on short-answer contests within a year (Hugging Face). However, the DeepSeek team identifies two critical flaws:
  • A correct numeric answer doesn’t guarantee sound reasoning (models may reach the right result through algebraic mistakes that cancel out).
  • Proof-based tasks (olympiad theorems) require full natural language arguments—no single numeric answer exists, so standard rewards don’t apply.
DeepSeekMath-V2 prioritizes proof quality over answer accuracy: it evaluates whether a proof is complete and logically sound as its core learning signal.
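To make the contrast concrete, the two reward styles can be sketched as follows. This is illustrative Python, not the actual training code; the function names and the exact rubric mapping are assumptions.

```python
def final_answer_reward(model_answer: str, reference_answer: str) -> float:
    # Standard RL signal: 1 if the final answer string matches, else 0.
    # A derivation with cancelling mistakes can still earn full reward here.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def proof_quality_reward(verifier_score: float) -> float:
    # DeepSeekMath-V2's signal instead grades the whole argument: the verifier's
    # 0 / 0.5 / 1 rubric score becomes the reward, so gaps and unjustified steps
    # cost reward even when the final answer happens to be right.
    return verifier_score
```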

Training a Verifier Before the Generator

The model’s core design is verifier-first:
  1. Data Collection: 17,503 proof-style problems from Art of Problem Solving (olympiads, team tests, post-2010 proof-required tasks).
  2. Candidate Proofs: Generated by a DeepSeek-V3.2 model (prompted for iterative refinement) and human-labeled on a 0/0.5/1 rubric for rigor and completeness.
  3. Verifier Training: Using Group Relative Policy Optimization (GRPO) with two rewards:
    1. Format Reward: Ensures output follows a fixed template (analysis + boxed score).
    2. Score Reward: Penalizes differences between predicted and expert scores.
This yields a verifier that consistently grades olympiad-style proofs.
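The article does not give the exact reward formulas, but the two-signal structure can be sketched roughly as below. The boxed-score template, the 50/50 weighting, and the function name are illustrative assumptions, not details from the paper.

```python
import re

def verifier_reward(output: str, expert_score: float) -> float:
    """Toy reward for training the proof verifier, combining the two signals above."""
    # Format reward: the analysis must end with a boxed rubric score, e.g. \boxed{0.5}.
    match = re.search(r"\\boxed\{(0(?:\.5)?|1)\}\s*$", output.strip())
    if not match:
        return 0.0  # malformed outputs earn nothing

    # Score reward: penalize the gap between the predicted and the expert 0/0.5/1 score.
    predicted = float(match.group(1))
    score_reward = 1.0 - abs(predicted - expert_score)

    return 0.5 * 1.0 + 0.5 * score_reward  # format term + score term
```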

Meta Verification to Control Hallucinated Critiques

Base verifiers can game rewards: they may output correct scores while inventing fake issues in their analysis. To fix this, the team adds a meta verifier:
  • Reads the problem, proof, and verifier analysis to evaluate if the analysis is faithful (restates steps, identifies real defects, aligns narrative with score).
  • Trained with GRPO (format + score rewards) and used as an extra reward term for the base verifier.
Experiments show this raises meta-evaluated analysis quality from ~0.85 to 0.96 (validation split) while keeping proof score accuracy stable.
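How the meta term is folded into the base verifier's reward is not spelled out in the article; a minimal sketch, assuming a simple weighted sum with an illustrative 0.3 weight, looks like this.

```python
def verifier_reward_with_meta(base_reward: float, meta_score: float,
                              meta_weight: float = 0.3) -> float:
    """base_reward is the format + score reward from the earlier sketch;
    meta_score in [0, 1] is the meta verifier's judgment of whether the
    verifier's written analysis is faithful to the proof it grades."""
    # With the meta term added, a verifier that guesses the right score but
    # invents defects in its analysis no longer earns full reward.
    return (1.0 - meta_weight) * base_reward + meta_weight * meta_score
```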

Self-Verifying Proof Generator & Sequential Refinement

Once the verifier is robust, the team trains a proof generator that outputs:
  • A solution to the problem.
  • A self-analysis (following the verifier’s rubric).
The generator’s reward combines three signals, grouped into two weighted terms (α=0.76, β=0.24), as sketched after this list:
  1. Proof Score (α=0.76): The base verifier’s grade of the proof.
  2. Self-Analysis Component (β=0.24): Agreement between the self-reported and verifier scores, plus the meta verification score of the analysis.
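A minimal sketch of this reward follows. The α and β weights are from the article; how agreement and the meta score are combined inside the β term is an assumption (an even split is used here for illustration).

```python
ALPHA, BETA = 0.76, 0.24  # weights reported for the generator's reward

def generator_reward(proof_score: float, self_score: float, meta_score: float) -> float:
    """proof_score is the base verifier's grade, self_score is the model's
    self-reported grade, and meta_score is the meta verifier's rating of the
    self-analysis; all are assumed to lie in [0, 1]."""
    agreement = 1.0 - abs(self_score - proof_score)  # rewards honest self-assessment
    return ALPHA * proof_score + BETA * 0.5 * (agreement + meta_score)
```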
For hard problems, the generator uses sequential refinement (leveraging the base model’s 128K token context):
  • Generates a proof + self-analysis.
  • Feeds the output back as context.
  • Asks the model to fix the detected issues, repeating until the context budget is exhausted (a loop sketched below).
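The refinement loop can be sketched as follows. The `model.generate` interface, the stop condition, and the character-based stand-in for the 128K-token budget are all illustrative assumptions.

```python
def refine_proof(model, problem: str, max_context_chars: int = 500_000) -> str:
    """Illustrative sequential-refinement loop. `model` is assumed to expose a
    generate(prompt) -> (proof, self_analysis) method."""
    transcript = problem
    best_proof = ""
    while len(transcript) < max_context_chars:
        # Each pass produces a proof plus a self-analysis following the verifier's rubric.
        proof, analysis = model.generate(transcript)
        best_proof = proof
        if "no issues" in analysis.lower():
            break  # the model judges its own proof complete and rigorous
        # Feed the attempt and its critique back as context for the next pass.
        transcript += (
            "\n\nPrevious attempt:\n" + proof +
            "\n\nDetected issues:\n" + analysis +
            "\n\nPlease revise the proof to fix the issues above."
        )
    return best_proof
```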

Scaling Verification & Auto-Labeling

As the generator improves, manual labeling of its harder proofs becomes costly. The team uses an auto-labeling pipeline:
  • For each candidate proof, sample multiple independent verifier analyses.
  • Evaluate each with the meta verifier.
  • Label as incorrect if high-quality analyses converge on issues; label as correct if no valid issues are found.
Spot checks confirm this pipeline aligns with expert labels.
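The decision rule can be sketched like this. The `verifier` and `meta_verifier` callables, the sample count, and the quality threshold are hypothetical stand-ins; the article only describes the convergence logic.

```python
def auto_label(proof: str, verifier, meta_verifier,
               n_samples: int = 8, quality_threshold: float = 0.9):
    """verifier(proof) -> (score, analysis); meta_verifier(proof, analysis) -> quality in [0, 1]."""
    # Sample several independent verifier analyses of the same candidate proof.
    reviews = [verifier(proof) for _ in range(n_samples)]
    # Keep only analyses the meta verifier considers faithful and high quality.
    trusted = [(score, analysis) for score, analysis in reviews
               if meta_verifier(proof, analysis) >= quality_threshold]
    if not trusted:
        return None  # no reliable analyses; defer to manual review
    if all(score < 1.0 for score, _ in trusted):
        return "incorrect"  # high-quality analyses converge on real issues
    if all(score == 1.0 for score, _ in trusted):
        return "correct"    # no valid issues found by any trusted analysis
    return None  # disagreement among trusted analyses; defer
```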

Competition & Benchmark Results

DeepSeekMath-V2 delivers state-of-the-art results:
  • Internal CNML Set: Top mean proof score across all categories (algebra, geometry, number theory, combinatorics, inequalities) vs Gemini 2.5 Pro/GPT-5 Thinking High.
  • IMO Shortlist 2024: Sequential refinement improves pass@1 and best-of-32 metrics.
  • IMO 2025: Gold level (solves 5/6 problems).
  • CMO 2024: Gold level (4 full solutions + partial credit on 1).
  • Putnam 2024: 118/120 (surpasses the best human score of 90).
Access the model: Paper PDF | GitHub Repo

Key Takeaways

  • Model Specs: 685B MoE, built on DeepSeek-V3.2-Exp-Base, open weights (Apache 2.0).
  • Core Innovation: Verifier-first pipeline (base + meta verifier) that prioritizes proof rigor over final answers.
  • Generator Training: Rewards for proof quality, self-analysis honesty, and meta-verified faithfulness.
  • Performance: Gold-level results on IMO 2025 and CMO 2024, plus a near-perfect Putnam 2024 score.