LBC-bench
Overview
Code Generation tasks evaluate whether models can produce correct RTL and verification code. These problems mirror real workflows such as module design, modification, refinement, and testbench development, and test both syntactic and functional correctness. All tasks are constructed in an oracle-context setting, where only the minimal relevant information is provided; this keeps the evaluation focused on code reasoning and correctness rather than retrieval.
Task Categories
- RTL Code Completion – fill in missing segments of RTL.
- Specification to RTL Translation – implement modules from natural-language descriptions (see the sketch after this list).
- RTL Code Improvement – refine code for lint-clean results or better quality of results (QoR).
- Design Verification – generate testbench stimulus, checkers, and assertions, or perform bug fixing (see the assertion sketch after this list).
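
As an illustration of the Specification to RTL Translation category, a task pairs a short natural-language spec with a module to implement. The spec wording and the `counter8` module below are invented for illustration and are not drawn from the benchmark itself; a minimal sketch might look like:

```systemverilog
// Hypothetical spec: "Implement an 8-bit up-counter with a synchronous
// active-high reset and a count-enable input. On reset, count clears to 0;
// otherwise, when enable is high, count increments on each rising clock edge."
module counter8 (
    input  logic       clk,
    input  logic       rst,    // synchronous, active-high
    input  logic       en,     // count enable
    output logic [7:0] count
);
    always_ff @(posedge clk) begin
        if (rst)
            count <= 8'd0;
        else if (en)
            count <= count + 8'd1;
    end
endmodule
```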
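Similarly, a Design Verification task might ask the model to produce checkers or assertions against a given design. The concurrent assertion below, written against the hypothetical `counter8` above, is one sketch of what such an output could look like:

```systemverilog
// Hypothetical checker for counter8: when enable is high and reset is
// inactive, the count on the next cycle must be the previous count plus one.
module counter8_checker (
    input logic       clk,
    input logic       rst,
    input logic       en,
    input logic [7:0] count
);
    property p_increment;
        @(posedge clk) disable iff (rst)
        en |=> (count == $past(count) + 8'd1);
    endproperty
    assert property (p_increment)
        else $error("counter8 failed to increment");
endmodule
```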
Formats
- Non-Agentic – single-turn prompts with direct outputs.
- Agentic – multi-step problems with tool use (e.g., simulators).
Evaluation Criteria
- Pass/fail via simulation harnesses (see the testbench sketch after this list).
- pass@k metrics (typically pass@1; see the estimator after this list).
- Syntax and lint checks for validity.
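
For reference, pass@k is commonly computed with the unbiased estimator from the code-generation literature: draw $n \ge k$ samples per problem, count the $c$ samples that pass, and average over problems. Whether LBC-bench uses this exact estimator or plain single-sample pass@1 is not stated here, so treat the formula as the conventional definition rather than the benchmark's confirmed procedure:

$$\mathrm{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]$$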
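As one sketch of what "pass/fail via simulation harnesses" can mean in practice, a self-checking testbench drives stimulus, mirrors the expected behavior, and emits a single machine-readable verdict that the harness parses. The testbench below targets the hypothetical `counter8` from earlier; its structure and PASS/FAIL convention are assumptions, not the benchmark's actual harness:

```systemverilog
// Hypothetical self-checking harness for counter8: drive random enables,
// track the expected count in the testbench, and report one verdict.
module tb_counter8;
    logic       clk = 0, rst, en;
    logic [7:0] count;
    logic [7:0] expected = 8'd0;
    int         errors = 0;

    counter8 dut (.clk, .rst, .en, .count);

    always #5 clk = ~clk;  // free-running clock

    initial begin
        rst = 1; en = 0;
        @(posedge clk);     // apply synchronous reset for one cycle
        rst = 0;
        repeat (100) begin
            en = $urandom_range(0, 1);
            @(posedge clk);
            if (en) expected++;
            #1;             // sample after the DUT's clocked update settles
            if (count !== expected) errors++;
        end
        if (errors == 0) $display("PASS");
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```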