Benchmarks

EvoCode-Bench

Claude-Opus-4.7: 54.0 MT@4 · Coding

Iteration · Multi-turn · Harbor + Terminus-2

EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It evaluates 13 agents on 26 stateful coding tasks and 227 rounds, with the same workspace and agent session preserved for 5-15 rounds.

Tasks: 26 · Rounds: 227 · Rounds / Task: 5-15 · Agents: 13 · Main Metric: MT@4

Leaderboard

The main score is MT@4: a best-of-four fail-stop multi-round score. A round receives credit only if at least one attempt reaches that round with a workspace that still satisfies all active cumulative tests.
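The crediting rule can be illustrated with a minimal sketch. It assumes each attempt is summarized by the number of consecutive rounds it passed before the fail-stop, and that the final score is the share of credited rounds over all rounds. The exact aggregation across tasks is not stated here, so the round-weighted average, the function names, and the toy data are all illustrative assumptions.

```python
from typing import Dict, List


def credited_rounds(passed_prefixes: List[int]) -> int:
    """Rounds credited for one task.

    Under fail-stop, an attempt passes rounds 1..p and nothing after, so a
    round is credited when at least one attempt has p >= that round. The
    number of credited rounds is therefore the largest p over all attempts.
    """
    return max(passed_prefixes, default=0)


def mt_at_k(prefixes_by_task: Dict[str, List[int]],
            rounds_by_task: Dict[str, int]) -> float:
    """Percentage of credited rounds over all rounds (assumed round-weighted)."""
    credited = sum(credited_rounds(prefixes_by_task.get(task, []))
                   for task in rounds_by_task)
    total = sum(rounds_by_task.values())
    return 100.0 * credited / total


# Toy data (hypothetical): four attempts per task, each recorded as the
# number of rounds passed before the cumulative verifier first failed.
prefixes = {"task_a": [2, 5, 0, 3], "task_b": [1, 1, 4, 2]}
rounds = {"task_a": 5, "task_b": 10}

print(mt_at_k(prefixes, rounds))  # (5 + 4) / 15 rounds credited -> 60.0
```

Taking the maximum over attempts is what makes the score best-of-four: a single deep attempt credits every round it reached, regardless of how the other three attempts ended.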

| # | Agent | Setting | MT@4 | SR | Comp |
|---|---|---|---|---|---|
| 1 | Opus 4.7 | high reasoning effort | 54.0 | 76.7 | 42.3% |
| 2 | GPT 5.5 | high reasoning effort | 52.4 | 74.4 | 38.5% |
| 3 | Opus 4.6 | default configured reasoning | 44.0 | 78.9 | 34.6% |
| 4 | GLM 5.1 | thinking enabled | 36.2 | 63.9 | 15.4% |
| 5 | Kimi K2.6 | thinking enabled | 31.9 | 59.0 | 23.1% |
| 6 | DS V4 Pro | high reasoning effort | 30.6 | 56.4 | 19.2% |
| 7 | Qwen 3.6 | thinking enabled | 29.4 | 57.3 | 15.4% |
| 8 | MiMo V2.5 | high reasoning effort | 17.3 | 7.9 | 11.5% |
| 9 | Gemini 3.1 | high reasoning effort | 13.7 | 46.7 | 11.5% |
| 10 | DS V4 Flash | high reasoning effort | 9.4 | 46.3 | 0.0% |
| 11 | Qwen 3.5 | thinking enabled | 4.6 | 44.1 | 0.0% |
| 12 | MiniMax M2.7 | reasoning split enabled | 3.7 | 30.0 | 0.0% |
| 13 | Doubao 2.0 | high reasoning effort | 1.9 | 23.8 | 0.0% |

MT@4 score · Harbor + Terminus-2 scaffold · four attempts per multi-round task

Full Results

Metrics. MT@4 is the four-attempt fail-stop multi-round score. SR is single-round pass rate after earlier rounds have been completed by the reference solution. Gap is SR minus MT@4, showing how much isolated round-solving overstates persistent execution. Comp is full-task completion through the final round in at least one attempt. Model sublabels show the evaluation setting from the paper appendix.
| # | Agent | Organization | Setting | MT@4 | SR | Gap | Comp (%) | Avg Turns | Output Tok. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude-Opus-4.7-High | Anthropic | high reasoning effort | 54.0 | 76.7 | +22.7 | 42.3 | 590.6 | 50.0K |
| 2 | GPT-5.5-High | OpenAI | high reasoning effort | 52.4 | 74.4 | +22.0 | 38.5 | 456.3 | 74.1K |
| 3 | Claude-Opus-4.6 | Anthropic | default configured reasoning | 44.0 | 78.9 | +34.9 | 34.6 | 747.5 | 734.2K |
| 4 | GLM-5.1 | Zhipu AI | thinking enabled | 36.2 | 63.9 | +27.7 | 15.4 | 859.8 | 104.2K |
| 5 | Kimi-K2.6 | Moonshot AI | thinking enabled | 31.9 | 59.0 | +27.1 | 23.1 | 1155.5 | 92.5K |
| 6 | DeepSeek-V4-Pro | DeepSeek | high reasoning effort | 30.6 | 56.4 | +25.8 | 19.2 | 1134.8 | 168.8K |
| 7 | Qwen3.6-Plus | Alibaba | thinking enabled | 29.4 | 57.3 | +27.9 | 15.4 | 629.3 | 103.1K |
| 8 | Xiaomi-MiMo-V2.5-Pro | Xiaomi | high reasoning effort | 17.3 | 7.9 | -9.4 | 11.5 | 754.8 | 125.7K |
| 9 | Gemini-3.1-Pro-Preview | Google | high reasoning effort | 13.7 | 46.7 | +33.0 | 11.5 | 261.3 | 72.7K |
| 10 | DeepSeek-V4-Flash | DeepSeek | high reasoning effort | 9.4 | 46.3 | +36.9 | 0.0 | 1104.7 | 148.7K |
| 11 | Qwen3.5-397B-A17B | Alibaba | thinking enabled | 4.6 | 44.1 | +39.5 | 0.0 | 587.8 | 53.0K |
| 12 | MiniMax-M2.7 | MiniMax | reasoning split enabled | 3.7 | 30.0 | +26.3 | 0.0 | 600.4 | 59.2K |
| 13 | Doubao-Seed-2.0-Pro | ByteDance | high reasoning effort | 1.9 | 23.8 | +21.9 | 0.0 | 211.1 | 18.5K |
EvoCode-Bench main results
Main results. SR measures isolated single-round success from a reference-fast-forwarded workspace; MT@4 measures whether the agent's own workspace keeps passing as rounds accumulate.
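The Gap and Comp columns can be checked with a few lines of arithmetic. The sketch below uses three rows copied from the results table. Reading Comp as a whole number of completed tasks out of 26 is an inference from the reported percentages, not something the page states, and the helper names are hypothetical.

```python
N_TASKS = 26  # number of tasks in the benchmark

# (MT@4, SR, Comp %) copied from the results table for three agents.
rows = {
    "Claude-Opus-4.7-High": (54.0, 76.7, 42.3),
    "Claude-Opus-4.6": (44.0, 78.9, 34.6),
    "GLM-5.1": (36.2, 63.9, 15.4),
}


def gap(mt_at_4: float, sr: float) -> float:
    """Gap = SR minus MT@4, rounded to one decimal like the table."""
    return round(sr - mt_at_4, 1)


def completed_tasks(comp_pct: float, n_tasks: int = N_TASKS) -> int:
    """Nearest whole number of fully completed tasks implied by Comp (assumed)."""
    return round(comp_pct / 100 * n_tasks)


for agent, (mt, sr, comp) in rows.items():
    print(agent, gap(mt, sr), completed_tasks(comp))
# Claude-Opus-4.7-High 22.7 11
# Claude-Opus-4.6 34.9 9
# GLM-5.1 27.7 4
```

Under this reading, the top agent finishes 11 of 26 tasks through the final round in at least one of its four attempts.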

How It Works

Each task is a sequence of user rounds. At round i, the agent receives a new instruction, edits the same Docker workspace, and is evaluated by cumulative tests for all still-active requirements through round i. If the cumulative verifier fails, the multi-turn attempt stops and later rounds receive zero credit.
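A single fail-stop attempt can be sketched as a loop over rounds. This is an illustration of the protocol described above, not the scaffold's code: `run_attempt`, `toy_agent`, and `toy_verifier` are hypothetical names, the workspace is modeled as a set of feature names, and every earlier requirement is assumed to stay active.

```python
from typing import Callable, List, Sequence


def run_attempt(instructions: Sequence[str],
                apply_round: Callable[[int, str], None],
                verify_cumulative: Callable[[int], bool]) -> List[bool]:
    """Run rounds in order on one shared workspace and stop at the first failure."""
    results: List[bool] = []
    for i, instruction in enumerate(instructions, start=1):
        apply_round(i, instruction)       # the agent edits the same workspace
        passed = verify_cumulative(i)     # tests for all requirements through round i
        results.append(passed)
        if not passed:
            break                         # fail-stop: the attempt ends here
    # Rounds after the failure were never run and receive zero credit.
    results.extend([False] * (len(instructions) - len(results)))
    return results


# Toy workspace: a set of implemented features (hypothetical).
workspace = set()


def toy_agent(i: int, instruction: str) -> None:
    workspace.add(instruction)
    if i == 3:
        workspace.discard("feature-1")    # simulated regression of an earlier feature


def toy_verifier(i: int) -> bool:
    return all(f"feature-{j}" in workspace for j in range(1, i + 1))


instructions = [f"feature-{j}" for j in range(1, 6)]
print(run_attempt(instructions, toy_agent, toy_verifier))
# [True, True, False, False, False]
```

The toy agent implements round 3 correctly but breaks round 1 in the process, so the cumulative verifier fails at round 3 and rounds 4 and 5 are never attempted.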

[Figure: Overview of EvoCode-Bench]
Overview of EvoCode-Bench. Task construction pairs every round with an instruction, reference solution, and cumulative tests. MT@4 keeps one container and one agent session across rounds; SR fast-forwards to the reference state before the target round.

Evaluation Scaffold

EvoCode-Bench is evaluated with harbor_multiturn, the multi-turn Harbor fork released at github.com/UniPat-AI/harbor_multiturn. The scaffold adds persistent Docker workspaces, continuous agent sessions, round-boundary verifier swaps, reference fast-forwarding for SR, snapshot/resume lineage, and fail-stop reward aggregation.

Where it fits. A stock Harbor task runs one instruction followed by one verifier. EvoCode-Bench needs one task to run as a sequence of rounds. harbor_multiturn handles that protocol: it delivers each round instruction, preserves the workspace, swaps in cumulative tests for the current round, records verifier/multiround_results.json, and writes the aggregate reward used by MT@4.
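Reference fast-forwarding for SR can be pictured with a small sketch. It is not the harbor_multiturn API: `run_single_round`, the dictionary workspace, and the toy callables are hypothetical stand-ins used only to show the order of operations.

```python
from typing import Callable, Dict, List

Workspace = Dict[str, bool]


def run_single_round(target_round: int,
                     reference_steps: List[Callable[[Workspace], None]],
                     agent_step: Callable[[int, Workspace], None],
                     verify_cumulative: Callable[[int, Workspace], bool]) -> bool:
    """Evaluate one round in isolation from a reference-completed state."""
    workspace: Workspace = {}
    # Fast-forward: replay the reference solutions for rounds 1..target_round-1.
    for step in reference_steps[: target_round - 1]:
        step(workspace)
    # Only the target round is solved by the agent.
    agent_step(target_round, workspace)
    return verify_cumulative(target_round, workspace)


def make_reference_step(i: int) -> Callable[[Workspace], None]:
    def step(ws: Workspace) -> None:
        ws[f"feature-{i}"] = True
    return step


reference_steps = [make_reference_step(i) for i in range(1, 6)]


def toy_agent(i: int, ws: Workspace) -> None:
    ws[f"feature-{i}"] = True


def idle_agent(i: int, ws: Workspace) -> None:
    pass  # does nothing, so the target round's requirement stays unmet


def toy_verifier(i: int, ws: Workspace) -> bool:
    return all(ws.get(f"feature-{j}", False) for j in range(1, i + 1))


print(run_single_round(4, reference_steps, toy_agent, toy_verifier))   # True
print(run_single_round(4, reference_steps, idle_agent, toy_verifier))  # False
```

Because rounds before the target come from the reference solution, an SR result says nothing about whether the agent could have kept its own earlier edits consistent.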

Dataset Overview

The paper groups EvoCode-Bench tasks along two axes: interaction style, or how users communicate across rounds, and engineering activity, or what kind of code change the round asks for. Each cell reports tasks / rounds.

| Activity | Capability Measured | Explorative | Contractual | Document-Driven | Total |
|---|---|---|---|---|---|
| Construction | Building a system incrementally while preserving earlier features and interfaces. | 9 / 80 | 3 / 37 | 1 / 7 | 13 / 124 |
| Spec Evolution | Updating an implementation after a later round overturns a core assumption. | 1 / 8 | 1 / 7 | 1 / 7 | 3 / 22 |
| Review | Improving non-functional properties such as performance, security, and observability without regression. | 3 / 21 | 1 / 7 | 1 / 9 | 5 / 37 |
| Migration | Moving a legacy system to a new implementation style while keeping backward compatibility. | 3 / 29 | 1 / 7 | 1 / 8 | 5 / 44 |
| Total | | 16 / 138 | 6 / 58 | 4 / 31 | 26 / 227 |

Round Length. Tasks span 5-15 rounds of evolving state, with 227 cumulative verification points. The longer tasks expose late-stage state accumulation and regression risk.

Requirement Pressure. Rounds are annotated with non-exclusive change types: 198 extensions, 69 corrections, and 42 conflicts. In total, 110 rounds carry at least one correction or conflict.

Technical Coverage. Tasks cover MLOps, data engineering, systems programming, scientific computing, testing and automation, infrastructure, and security.

Behavioral Verification. Tests check observable behavior rather than the reference implementation path, so different valid internal designs can pass the same cumulative contract.
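Because the change types are non-exclusive, the requirement-pressure counts overlap. The sketch below shows the inclusion-exclusion check on a toy annotation set and then applies it to the reported totals. It assumes each count is a number of rounds carrying that tag; the toy rounds are invented for illustration.

```python
from collections import Counter

# Toy annotations (hypothetical): one set of change types per round.
toy_rounds = [
    {"extension"},
    {"extension", "correction"},
    {"conflict"},
    {"correction", "conflict"},
    {"extension"},
]

tag_counts = Counter(tag for tags in toy_rounds for tag in tags)
pressured = sum(1 for tags in toy_rounds if tags & {"correction", "conflict"})
both = sum(1 for tags in toy_rounds if {"correction", "conflict"} <= tags)

# Inclusion-exclusion: |C or F| = |C| + |F| - |C and F|
assert pressured == tag_counts["correction"] + tag_counts["conflict"] - both

# Applied to the reported totals: 69 corrections, 42 conflicts, 110 rounds with either.
rounds_with_both = 69 + 42 - 110
print(rounds_with_both)  # 1
```

Under that assumption, the reported totals imply that exactly one round is annotated as both a correction and a conflict.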

Analysis

Single-Round Skill Does Not Imply Persistent Reliability

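SR exceeds MT@4 by roughly 22 to 40 points for 12 of the 13 agents (gaps of +21.9 to +39.5 in the results table). The one exception is Xiaomi-MiMo-V2.5-Pro, whose reported SR of 7.9 falls below its MT@4 of 17.3. Claude-Opus-4.6 has the highest SR at 78.9 but ranks third on persistent execution at 44.0 MT@4. This reranking shows that solving an isolated round from a clean reference state is different from living with one's own earlier edits.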

[Figure: Single-round versus multi-turn score by round]
SR vs. MT@4 by round. SR stays comparatively stable while MT@4 falls with depth, showing the cost of accumulated workspace state.

Workspace State Drives a Large Share of Failures

A controlled comparison shows that 57.0% of failed MT@4 round records are solvable under SR from a reference-completed state. The state penalty grows with depth: only 15.0% of round-1 MT failures are SR-solvable, rising above 80% beyond round 12.

[Figure: Workspace state penalty]
Workspace state penalty. Many multi-turn failures are not isolated instruction failures. They happen because the agent's accumulated workspace no longer satisfies the active contract.
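The state-penalty statistic can be sketched as a per-round share. The sketch assumes each failed MT@4 round record is paired with whether the same round passes under SR. The record layout and the toy values are hypothetical and do not reproduce the reported 57.0% or 15.0% figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sr_solvable_share(failed_records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Percentage of failed MT round records that pass under SR, per round index.

    Each record is (round_index, sr_passed) for a round that failed in MT.
    """
    counts = defaultdict(lambda: [0, 0])  # round -> [sr_solvable, total_failed]
    for round_index, sr_passed in failed_records:
        counts[round_index][0] += int(sr_passed)
        counts[round_index][1] += 1
    return {r: 100.0 * solvable / total
            for r, (solvable, total) in sorted(counts.items())}


# Toy failed-round records (hypothetical).
records = [
    (1, False), (1, False), (1, True), (1, False),
    (8, True), (8, False),
    (13, True), (13, True),
]

print(sr_solvable_share(records))  # {1: 25.0, 8: 50.0, 13: 100.0}
```

A rising share with depth, as in the toy data, is the pattern the paragraph describes: later failures are more often attributable to accumulated workspace state than to the instruction itself.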

Failure Patterns Are Tier-Dependent

Missed active requirements dominate every tier, but the secondary modes differ. Lower-tier agents fail early: 57.4% of lower-tier trial failures occur in the first 20% of rounds. Stronger agents survive long enough for stale behavior, regressions, and conflict-resolution errors to appear.

[Figure: First failure distribution]
First failure distribution. Lower-tier failures concentrate early; top-tier failures are spread deeper into trajectories.
[Figure: Failure mode summary]
Failure modes. Regressions and stale behavior become visible only after agents build enough working functionality for later rounds to break or replace it.
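The early-failure share can be computed from first-failure positions. The sketch assumes that "the first 20% of rounds" means the first failing round index divided by the task's round count is at most 0.2. That reading, the function name, and the toy trials are assumptions.

```python
from typing import Iterable, Tuple


def early_failure_share(failures: Iterable[Tuple[int, int]],
                        cutoff: float = 0.2) -> float:
    """Percentage of failed trials whose first failure falls within the cutoff.

    Each item is (first_failed_round, total_rounds), with 1-based round indices.
    """
    items = list(failures)
    early = sum(1 for first_failed, total in items
                if first_failed / total <= cutoff)
    return 100.0 * early / len(items)


# Toy failed trials (hypothetical): (first failed round, rounds in the task).
trials = [(1, 5), (1, 10), (2, 10), (4, 8), (9, 10), (12, 15)]

print(early_failure_share(trials))  # 3 of 6 trials fail early -> 50.0
```

Normalizing by task length keeps 5-round and 15-round tasks comparable: failing round 1 of a 5-round task and failing round 3 of a 15-round task both sit at the 20% mark.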

Failing Rounds Consume More Tokens

At the same round index, failed trials usually produce more output tokens than passing trials. Across rounds 1-9, failing trials emit 1.1x to 3.1x as many generated tokens as passing trials at the same depth. This is a diagnostic association rather than a causal claim.

[Figure: Within-round token usage]
Within-round token usage. Higher effort often accompanies harder or degraded workspace states, but more output does not guarantee recovery.
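The same-depth comparison can be sketched as a ratio of mean output tokens per round index. Whether the reported 1.1x to 3.1x range uses means or another statistic is not stated, so the use of the mean, the record layout, and the toy trials below are assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def fail_pass_token_ratio(trials: Iterable[Tuple[int, bool, int]]) -> Dict[int, float]:
    """Mean output tokens of failing trials divided by passing trials, per round.

    Each trial is (round_index, passed, output_tokens). Rounds that lack either
    a passing or a failing trial are skipped because no ratio is defined.
    """
    tokens = defaultdict(lambda: {True: [], False: []})
    for round_index, passed, output_tokens in trials:
        tokens[round_index][passed].append(output_tokens)
    return {r: mean(groups[False]) / mean(groups[True])
            for r, groups in sorted(tokens.items())
            if groups[True] and groups[False]}


# Toy trial records (hypothetical).
toy_trials = [
    (1, True, 1000), (1, True, 3000), (1, False, 4000),
    (2, True, 2000), (2, False, 2500), (2, False, 3500),
    (3, True, 1500),  # no failing trial at round 3, so no ratio
]

print(fail_pass_token_ratio(toy_trials))  # {1: 2.0, 2: 1.5}
```

Comparing within the same round index controls for depth, but as noted above the ratio remains an association: it does not show that longer outputs cause failure.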