SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review

Video: SWE-Review in action, resolving a real GitHub issue with the generate–review–revise loop in Claude Code.

AI coding agents can propose PRs, but often lack a reliable way to diagnose failures and guide revision. SWE-Review closes this loop — agentic code review turns one-shot generation into iterative improvement.

Figure 1: SWE-Review turns one-shot PR generation into closed-loop issue resolution. The generate–review–revise loop continuously raises resolve rate across three PR generators of varying capability.

Abstract

We introduce SWE-Review, a framework for closing the issue-resolution loop with agentic code review. A reviewer agent explores the repository, produces a structured diagnosis, and drives revision — turning one-shot PR generation into a generate–review–revise loop. Key results:

Closed-loop pipeline — The generate–review–revise loop continuously improves PRs, raising resolve rate by up to +29.4pp.
Test-time scaling — Review-guided iterative revision achieves 3.7× the TTS gain at 6.7× the efficiency of independent resampling.
Agentic > single-turn — Outperforms fixed-context review in both decision accuracy and revision usefulness, with the largest gains on harder tasks.
Training transfer — Mixed training improves direct resolve rate and enables a self-contained review-then-revise loop.

To support reproducible research, we release:

SWE-Review-Bench — 1,384 candidate PRs across three quality tiers with executable verification.
SWE-Review-Traj — 8,914 decision-correct agentic review trajectories for open reviewer training.

+29.4 pp resolve rate (closed-loop gain)
3.7× / 6.7× gain / efficiency (test-time scaling)
1,384 benchmark PRs (3 quality tiers)
8,914 trajectories (for open training)

Review-Guided Test-Time Scaling for Issue Resolution

At test time, trained reviewers outperform dedicated verifiers in both effectiveness and efficiency. The approval decision gates sampling; the structured diagnosis is fed back for targeted revision. Structured diagnoses enable iterative refinement that scalar scorers cannot support, because a score alone says nothing about what to change. The result is a higher resolve rate at a fraction of the sample cost.

Figure 2(a): Resolve rate vs sample budget.

Figure 2(b): Actual samples spent.
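The gating-and-feedback mechanism described in this section can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the released implementation: the `Review` type, the `generate` and `review` callables, and the budget semantics (one unit per generated patch) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    approve: bool    # gate: an approval stops further sampling
    diagnosis: str   # structured feedback handed to the next revision


def review_guided_resolve(
    issue: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    review: Callable[[str, str], Review],
    budget: int,
) -> Tuple[str, int]:
    """Return (final_patch, samples_spent) for one issue.

    `generate(issue, previous_patch, diagnosis)` and `review(issue, patch)`
    are hypothetical stand-ins for the PR generator and the reviewer agent.
    """
    if budget < 1:
        raise ValueError("budget must allow at least one sample")

    patch = generate(issue, None, None)  # initial one-shot attempt
    spent = 1
    while True:
        verdict = review(issue, patch)
        # Stop on approval, or when no samples remain for another revision.
        if verdict.approve or spent >= budget:
            return patch, spent
        # Feed the diagnosis back instead of resampling independently.
        patch = generate(issue, patch, verdict.diagnosis)
        spent += 1
```

Because an approval ends the loop early, the number of samples actually spent can fall well below the nominal budget, which is the quantity Figure 2(b) reports.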

Agentic Beats Single-Turn Review

Holding the reviewer model fixed (Claude Opus 4.6), agentic review consistently outperforms single-turn review (diff-only and diff+context) in both Decision Accuracy (DA) and Resolve Rate after Revision (RRR). The main advantage is not merely more context, but the ability to adaptively gather the right evidence for the specific PR under review. Interactive evidence gathering is especially valuable when candidate PRs are of lower quality and more likely to contain partial or non-local fixes.

Review Mode	GLM-5 (high)		Qwen3-Coder-30B-A3B-Instruct (medium)		Qwen3-30B-A3B-Instruct-2507 (low)
	DA ↑	RRR ↑	DA ↑	RRR ↑	DA ↑	RRR ↑
Single-turn (diff-only)	71.3	71.4	72.0	56.7	80.8	41.2
Single-turn (diff+context)	69.8	72.6	73.8	57.6	82.7	44.1
Agentic review	75.6	75.2	80.5	67.3	89.4	52.6

Table 1: All settings use Claude Opus 4.6 as reviewer. Gains are largest on the hardest split: +6.7pp DA, +8.5pp RRR over best single-turn. DA (Decision Accuracy ↑) = fraction of correct approve/reject decisions. RRR (Resolve Rate after Revision ↑) = final resolve rate after the review-then-revise loop.
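The per-tier gains of agentic review over the best single-turn setting can be recomputed directly from Table 1. The snippet below is only an arithmetic check on the published numbers; the dictionary layout and the function name are illustrative choices, not part of the release.

```python
# (DA, RRR) pairs copied from Table 1, keyed by candidate-PR quality tier.
TABLE1 = {
    "high":   {"diff-only": (71.3, 71.4), "diff+context": (69.8, 72.6), "agentic": (75.6, 75.2)},
    "medium": {"diff-only": (72.0, 56.7), "diff+context": (73.8, 57.6), "agentic": (80.5, 67.3)},
    "low":    {"diff-only": (80.8, 41.2), "diff+context": (82.7, 44.1), "agentic": (89.4, 52.6)},
}

SINGLE_TURN = ("diff-only", "diff+context")


def gain_over_best_single_turn(tier: str) -> tuple:
    """Return (DA gain, RRR gain) in percentage points for one tier."""
    rows = TABLE1[tier]
    best_da = max(rows[mode][0] for mode in SINGLE_TURN)
    best_rrr = max(rows[mode][1] for mode in SINGLE_TURN)
    agentic_da, agentic_rrr = rows["agentic"]
    return round(agentic_da - best_da, 1), round(agentic_rrr - best_rrr, 1)


for tier in TABLE1:
    da_gain, rrr_gain = gain_over_best_single_turn(tier)
    print(f"{tier:>6}: DA +{da_gain}pp, RRR +{rrr_gain}pp")
```

Note that the best single-turn baseline is chosen per metric, so the DA and RRR baselines within a tier may come from different single-turn modes.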

Review Trajectories Generalize to Issue Resolution

Review trajectories can be mixed with issue-resolution data to produce a single model that is better at both roles. Review and resolution share a transferable repository-reasoning skill: mixed training simultaneously improves one-shot issue resolution and enables a self-contained generate–review–revise loop.

Training Data	RR ↑ (one-shot resolve)	CR ↑ (review ability)	DA ↑ (review quality)	RRR ↑ (self-review + revise)
Issue-resolution 1k	27.6	9.4	50.0	27.6
+ Review 1k	28.4 (+0.8)	67.6	67.4	34.6 (+7.0)
Issue-resolution 2k	31.2	13.4	51.7	31.2
+ Review 2k	36.8 (+5.6)	85.4	69.5	41.8 (+10.6)
Issue-resolution 3k	34.0	33.5	51.6	34.0
+ Review 3k	37.8 (+3.8)	87.4	72.3	41.2 (+7.2)

Table 2: Qwen3-8B fine-tuned with issue-resolution trajectories alone vs. paired with an equal volume of review trajectories. RR (Resolve Rate ↑) = one-shot resolve rate on SWE-bench Verified. CR (Completion Rate ↑) = fraction of instances with a parseable review.
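The deltas shown in Table 2 follow from simple subtraction against the issue-resolution-only row of the same data scale. As a hedged worked example, the snippet below recomputes them and also derives one quantity the table does not list: the gap between the mixed model's RRR and its own one-shot RR. All names are illustrative.

```python
# (RR, CR, DA, RRR) copied from Table 2, keyed by training-data scale.
TABLE2 = {
    "1k": {"issue_only": (27.6, 9.4, 50.0, 27.6), "mixed": (28.4, 67.6, 67.4, 34.6)},
    "2k": {"issue_only": (31.2, 13.4, 51.7, 31.2), "mixed": (36.8, 85.4, 69.5, 41.8)},
    "3k": {"issue_only": (34.0, 33.5, 51.6, 34.0), "mixed": (37.8, 87.4, 72.3, 41.2)},
}


def deltas(scale: str) -> dict:
    """Percentage-point differences for one data scale."""
    base_rr, _, _, base_rrr = TABLE2[scale]["issue_only"]
    mix_rr, _, _, mix_rrr = TABLE2[scale]["mixed"]
    return {
        "rr_gain": round(mix_rr - base_rr, 1),       # mixed vs. issue-only, one-shot
        "rrr_gain": round(mix_rrr - base_rrr, 1),    # mixed vs. issue-only, after revision
        "loop_gain_mixed": round(mix_rrr - mix_rr, 1),  # derived: RRR minus own RR
    }


for scale in TABLE2:
    print(scale, deltas(scale))
```

For the issue-resolution-only rows, RRR equals RR at every scale in the table, so the corresponding derived gap is zero.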

SWE-Review-Bench

We construct SWE-Review-Bench from 500 SWE-bench Verified issues with executable test suites. Three models of varying capability generate 1,384 candidate PRs spanning high-, medium-, and low-quality distributions. We evaluate review with three complementary metrics: Completion Rate (CR) measures whether the reviewer produces a parseable final review; Decision Accuracy (DA) measures whether the approve/request-changes decision is correct; Resolve Rate after Revision (RRR) measures whether the review improves the final patch outcome.
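The three metrics can be made concrete with a small scoring function. This is a sketch under assumptions that the text above does not settle: the record fields are hypothetical, and DA is computed here only over reviews that parsed, which may differ from the benchmark's actual denominator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewRecord:
    decision: Optional[str]  # "approve", "request_changes", or None if unparseable
    pr_resolves: bool        # ground truth: candidate PR passes the executable tests
    final_resolves: bool     # outcome after the review-then-revise loop


def score(records: List[ReviewRecord]) -> dict:
    """Compute CR, DA, and RRR as fractions in [0, 1]."""
    if not records:
        raise ValueError("need at least one record")
    total = len(records)
    completed = [r for r in records if r.decision is not None]

    # CR: share of instances with a parseable final review.
    cr = len(completed) / total
    # DA: a decision is correct when "approve" coincides with a resolving PR.
    # Assumption: unparseable reviews are excluded from the denominator.
    correct = sum((r.decision == "approve") == r.pr_resolves for r in completed)
    da = correct / len(completed) if completed else 0.0
    # RRR: share of instances resolved once the loop has finished.
    rrr = sum(r.final_resolves for r in records) / total
    return {"CR": cr, "DA": da, "RRR": rrr}
```

Counting unparseable reviews as wrong decisions instead would lower DA for reviewers with low CR, so the choice of denominator matters when comparing models.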

PR Generator	Resolve Rate	# PRs
GLM-5 (high)	72.2%	500
Qwen3-Coder-30B-A3B-Instruct (medium)	50.9%	462
Qwen3-30B-A3B-Instruct-2507 (low)	27.5%	422

Reviewer Model	CR (%)	DA (%)	RRR (%)	Δ RR

Table 3: SWE-Review-Bench Leaderboard. SWE-Review-8B and SWE-Review-30B-A3B are Qwen3 models fine-tuned on SWE-Review-Traj. Δ RR = improvement over no-review baseline.

SWE-Review-Traj

High-quality trajectories for agentic code review remain scarce in the open-source community. We construct SWE-Review-Traj, a dataset of 8,914 decision-correct agentic review trajectories to support training and evaluation of open reviewers:

1. Candidate PR Generation: Three models generate patches for ~6,000 SWE-rebench issues (excluding any repository in SWE-Review-Bench to prevent leakage), producing 14,156 candidate PRs.
2. Teacher Review: Open-weight GLM-5 with thinking enabled reviews each PR agentically via OpenHands-SDK. The prompt asks the teacher to first understand the issue and trace the root cause before inspecting the PR, which reduces confirmation bias.
3. Decision-Correct Filtering: We retain trajectories where the reviewer correctly approves a resolving patch or correctly requests changes on a non-resolving one, yielding 8,914 trajectories.
4. Quality Validation: We validate on two axes. Semantic: two judges (Claude Opus 4.6, GPT-5.4) rate diagnosis quality (mean >3.0/5, Cohen's κ = 0.72). Functional: diagnoses carry actionable information beyond the binary decision, with revision RRR rising from 3% (no review) to 8% (decision only) to 21% (teacher review) to 32% (oracle).
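The decision-correct filter in step 3 reduces to a single predicate. The sketch below assumes a hypothetical `Trajectory` record with a reviewer decision and an execution-based label; the field names are illustrative and not taken from the released dataset schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    decision: str         # reviewer verdict: "approve" or "request_changes"
    patch_resolves: bool  # executable verification of the candidate PR


def is_decision_correct(t: Trajectory) -> bool:
    """True when approval coincides with a resolving patch, and vice versa."""
    return (t.decision == "approve") == t.patch_resolves


def filter_decision_correct(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories whose final decision matches the test outcome."""
    return [t for t in trajectories if is_decision_correct(t)]


# Ratio of retained trajectories to candidate PRs, from the counts quoted above.
retained_per_candidate = 8914 / 14156
```

With the reported counts, roughly 63% of the 14,156 candidate PRs end up with a retained trajectory. The filter checks only the final decision, which is why the separate quality validation in step 4 is needed to assess the diagnoses themselves.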

Key Takeaways

Closing the Loop Works — The generate–review–revise loop continuously improves PRs, raising resolve rate by up to +29.4 percentage points on SWE-bench Verified.

Structured Diagnoses Enable Efficient Test-Time Scaling — Review-guided iterative revision achieves 3.7× the test-time scaling gain at 6.7× the efficiency of independent resampling.

Review Trajectories Transfer to Issue Resolution — Mixed training with review data improves one-shot resolve rate (+5.6pp) and enables self-contained review-revise loops within a single model (+10.6pp).

Citation

@misc{wang2026swereview,
      title={SWE-Review: Closing the Loop on Issue Resolution with Agentic Code Review},
      author={Ruoyu Wang and Jierun Chen and Shaowei Wang and Chaofan Tao and Sidi Yang and Yuxin Jiang and Kim-Hui Yap and Lifeng Shang and Xiaohui Li and Haoli Bai},
      year={2026},
      eprint={2607.06065},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2607.06065},
}