SWE-bench

Overview

Unlike the oracle-context tasks in standard Code Generation, Heavy Context Code Generation problems embed each challenge inside a real-world open-source repository. Rather than supplying only the minimal context, these tasks expose models to larger codebases in which the relevant modules, dependencies, and hierarchy must often be navigated before a correct solution can be produced. The tasks are designed to evaluate agentic systems only, with emphasis on repository exploration, tool usage, and integration.

Task Categories

This track shares the same categories as Code Generation, but with much heavier emphasis on debugging.

Formats

Agentic only: multi-step tasks requiring repository inspection, navigation, and interaction with simulator and synthesis tools.

Evaluation Criteria

Success is measured by a successful repository build and testbench execution.
pass@k metrics are computed across sampled runs.
Consistency checks ensure that the solution integrates without breaking dependencies.
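
The pass@k criterion can be sketched as follows. The text above does not say which estimator is used, so this sketch assumes the commonly used unbiased estimator, pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of sampled runs for a task and c is the number of those runs that build and pass the testbench. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not part of any benchmark tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed unbiased estimator).

    n: number of sampled runs for the task
    c: number of runs that succeeded (build + testbench)
    k: sample budget being evaluated, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing runs, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no success,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run counts for three tasks, 10 sampled runs each.
    runs = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {mean_pass_at_k(runs, 1):.3f}")
    print(f"pass@5 = {mean_pass_at_k(runs, 5):.3f}")
```

Averaging per-task estimates, rather than pooling all runs together, keeps a task with many successful runs from masking tasks that never succeed.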