CRSBench

A unified, full-pipeline benchmark for OSS-CRS.

Overview

CRSBench is the benchmark suite for OSS-CRS. It evaluates the full bug-finding and bug-fixing pipeline of any OSS-CRS-compatible CRS under production-style infrastructure (pre-collected corpora, incremental builds, and Regression Test Selection), and it is integrated back into OSS-CRS as its standard evaluation framework.

Figure: CRSBench architecture (benchmark construction, builder, executor, and verifier).

Supports every CRS

Fuzzers, LLM agents, and hybrid systems run on the same sanitizer-based harness with the same resource limits. Any OSS-CRS-compatible CRS can run without changes.

Full-pipeline evaluation

The framework takes the PoVs found by the bug-finding CRS and sends them to patching, so bug finding and patching are evaluated as one connected flow.
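
The connected flow described above can be sketched as follows. This is a minimal illustration, not CRSBench's implementation: `PoV`, `TaskResult`, `run_find_then_fix`, and the toy `find`/`fix` callables are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PoV:
    """A proof-of-vulnerability input produced by the bug-finding stage."""
    vuln_id: str
    payload: bytes


@dataclass(frozen=True)
class TaskResult:
    vuln_id: str
    found: bool
    fixed: bool


def run_find_then_fix(
    vuln_id: str,
    find: Callable[[str], Optional[PoV]],
    fix: Callable[[PoV], bool],
) -> TaskResult:
    """Run bug finding, then hand the resulting PoV (if any) to patching."""
    pov = find(vuln_id)
    if pov is None:
        # Nothing to patch against: the task fails at the finding stage.
        return TaskResult(vuln_id, found=False, fixed=False)
    return TaskResult(vuln_id, found=True, fixed=fix(pov))


# Toy stand-ins for real CRS stages (hypothetical behaviour).
def toy_find(vuln_id: str) -> Optional[PoV]:
    return None if vuln_id == "cpv-2" else PoV(vuln_id, b"A" * 64)


def toy_fix(pov: PoV) -> bool:
    return True


results = [run_find_then_fix(v, toy_find, toy_fix) for v in ("cpv-1", "cpv-2")]
for r in results:
    print(r)
```

Because the patching stage only ever sees PoVs that the finding stage produced, a miss in finding propagates to the end-to-end outcome.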

Faster evaluation

Redis/RQ workers run trials across machines. Docker snapshot-based incremental builds skip full project rebuilds after each patch attempt, giving CRSs more tries within the same LLM budget.

Production-style infra

Pre-collected fuzzing corpora and Regression Test Selection (RTS) reflect the setup real deployments already maintain, so scores focus on CRS performance instead of infrastructure overhead.
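
As a rough illustration of the idea behind Regression Test Selection, the sketch below picks only the tests whose covered files intersect a patch. It is a generic file-level selection under assumed inputs, not the RTS implementation CRSBench uses; the test names and file paths are hypothetical.

```python
from __future__ import annotations


def select_tests(coverage: dict[str, set[str]], changed_files: set[str]) -> list[str]:
    """Return the tests whose covered files overlap the changed files."""
    return sorted(
        test for test, files in coverage.items() if files & changed_files
    )


# Hypothetical per-test coverage map: test name -> source files it exercises.
coverage = {
    "test_parse_header": {"src/parser.c", "src/buffer.c"},
    "test_render": {"src/render.c"},
    "test_roundtrip": {"src/parser.c", "src/render.c"},
}

# A patch that only touches the parser needs only the parser-related tests.
selected = select_tests(coverage, {"src/parser.c"})
print(selected)
```

Running a subset like this after each patch attempt is what keeps functionality checks cheap compared with rerunning the whole test suite.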

Statistics

CRSBench comprises C/C++ and Java projects with both manually curated synthetic vulnerabilities and real-world bugs, packaged with ground-truth PoVs, patches, and functionality tests.

Metric	Value
Projects	124
Vulnerabilities	315
Unique CWEs	91
CWE Top 25 (2025) entries covered	21
Languages	C/C++, Java

1-day vs synthetic complexity

CRSBench spans a wide range of difficulty. Across crash-stack depth, the number of files involved, and ground-truth patch size, the benchmark mixes easy single-line cases with deep multi-file ones, so CRSs are evaluated over the full difficulty spectrum rather than a single difficulty level.

Results

Every CRS runs 3 trials per task, each with a $30 LLM budget, on 16 cores and 64 GB of RAM. Bug finding has an 8-hour timeout and bug fixing a 2-hour timeout; end-to-end runs chain the two stages, each under its own limit. The full evaluation used 245,330 CPU-hours and cost $31K ($10K compute + $21K LLM API spend). Headline results are summarized below; the full results are available interactively.
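
The limits and totals quoted above imply a few derived figures, worked through in the short sketch below. The upper bounds assume every core is busy for the entire timeout, and the dollar amounts are the rounded values from the text, so the results are approximate.

```python
# Figures quoted in the Results paragraph (dollar amounts are rounded there).
CPU_HOURS = 245_330
COMPUTE_USD = 10_000
LLM_USD = 21_000

CORES = 16
TRIALS_PER_TASK = 3
LLM_BUDGET_PER_TRIAL_USD = 30
FIND_TIMEOUT_H = 8
FIX_TIMEOUT_H = 2

total_usd = COMPUTE_USD + LLM_USD
usd_per_cpu_hour = COMPUTE_USD / CPU_HOURS

# Upper bounds per trial, assuming all cores run for the full timeout.
max_find_cpu_hours = CORES * FIND_TIMEOUT_H
max_fix_cpu_hours = CORES * FIX_TIMEOUT_H

# LLM spend ceiling for one task across its trials.
max_llm_usd_per_task = TRIALS_PER_TASK * LLM_BUDGET_PER_TRIAL_USD

# End-to-end runs chain both stages, each under its own limit.
max_e2e_wall_clock_h = FIND_TIMEOUT_H + FIX_TIMEOUT_H

print(f"total cost:               ${total_usd:,}")
print(f"compute cost per CPU-hour: ${usd_per_cpu_hour:.3f}")
print(f"max CPU-hours, find trial: {max_find_cpu_hours}")
print(f"max CPU-hours, fix trial:  {max_fix_cpu_hours}")
print(f"max LLM spend per task:    ${max_llm_usd_per_task}")
print(f"max end-to-end wall clock: {max_e2e_wall_clock_h} h")
```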

Bug-Finding

We ran a fuzzer-only CRS and an LLM agent CRS (Claude Code, Opus 4.6) on 304 CPVs across 117 benchmarks, then a hybrid of the two on the hard subset neither fully solved. Each style finds bugs the others miss: the agent solves 244 CPVs to the fuzzer's 80, and the hybrid recovers 12 of the 54 CPVs missed by both.

Category	CPVs
Found by both	74
Agent only	170
Fuzzer only	6
Missed by both (before hybrid)	54
Recovered by hybrid	12
Still missed by both	42
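
The breakdown can be cross-checked against the totals in the paragraph above. The sketch below takes the chart's counts as 74 found by both, 170 agent only, 6 fuzzer only, and 54 missed by both, of which the hybrid recovers 12 and 42 remain missed; the variable names are illustrative only.

```python
# Counts from the bug-finding breakdown.
found_by_both = 74
agent_only = 170
fuzzer_only = 6
missed_by_both = 54

# Hybrid run on the CPVs that neither single-style CRS solved.
recovered_by_hybrid = 12
still_missed = 42

agent_total = found_by_both + agent_only
fuzzer_total = found_by_both + fuzzer_only
all_cpvs = found_by_both + agent_only + fuzzer_only + missed_by_both
hybrid_recovery_rate = recovered_by_hybrid / missed_by_both

print(f"agent solves:   {agent_total}")
print(f"fuzzer solves:  {fuzzer_total}")
print(f"total CPVs:     {all_cpvs}")
print(f"hybrid recovers {hybrid_recovery_rate:.1%} of the CPVs missed by both")
```

The sums reproduce the 244, 80, and 304 figures stated in the text, and the 6 fuzzer-only CPVs are the cases behind the claim that each style finds bugs the others miss.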

Bug-Fixing

Three frontier coding agents patch every benchmark vulnerability (912 tasks, 3 trials each). Every patch must survive CRSBench's multi-PoV and functionality-test verification. Success rates are close, but the agents differ sharply in speed and cost.

CRS	Delta mode	Full mode	Overall	Time/trial	$/trial
🥇 Codex GPT-5.4	88%	85.7%	87.3%	589s	$1.29
🥈 Gemini CLI Gemini 3.1 Pro	87.9%	84%	86.6%	1,255s	$0.89
🥉 Claude Code Opus 4.6	88.3%	82%	86.3%	607s	$1.43
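
One rough way to compare the agents on speed and cost is to normalize each per-trial figure by the overall success rate. This is a derived illustration, not a metric CRSBench reports, and it assumes the time and cost columns are averages over all trials, successful or not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixResult:
    crs: str
    overall_success: float   # fraction of trials with a verified patch
    seconds_per_trial: float
    usd_per_trial: float

    @property
    def usd_per_success(self) -> float:
        """Expected spend per verified patch (assumes per-trial averages)."""
        return self.usd_per_trial / self.overall_success

    @property
    def seconds_per_success(self) -> float:
        """Expected time per verified patch (assumes per-trial averages)."""
        return self.seconds_per_trial / self.overall_success


# Values copied from the bug-fixing table.
fix_results = [
    FixResult("Codex GPT-5.4", 0.873, 589, 1.29),
    FixResult("Gemini CLI Gemini 3.1 Pro", 0.866, 1255, 0.89),
    FixResult("Claude Code Opus 4.6", 0.863, 607, 1.43),
]

cheapest = min(fix_results, key=lambda r: r.usd_per_success)
fastest = min(fix_results, key=lambda r: r.seconds_per_success)

for r in fix_results:
    print(f"{r.crs}: ${r.usd_per_success:.2f} and {r.seconds_per_success:.0f}s per verified patch")
```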

End-to-End

Five agent-based CRSs run find-then-fix end to end on a 51-vulnerability subset. The finding stage decides the outcome and dominates the cost.

CRS	Find	Fix	End-to-End	E2E $/trial
🥇 Claude Code Opus 4.6	92%	89%	82% (42/51)	$9.52
🥈 Opencode GLM-5.1	84%	81%	69% (35/51)	$1.73
🥉 Gemini CLI Gemini 3 Flash	61%	94%	57% (29/51)	$1.55
Codex GPT-5.4-mini	59%	93%	55% (28/51)	$1.15
Claude Code Haiku 4.5	45%	70%	31% (16/51)	$0.91
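
The table's columns can be checked for internal consistency. The sketch below compares each end-to-end rate (successes out of 51) with the product of the Find and Fix columns; the close agreement suggests, as an inference rather than something the source states, that the Fix rate is measured over the vulnerabilities a CRS actually found.

```python
SUBSET_SIZE = 51

# (CRS, find rate, fix rate, end-to-end successes) from the table above.
rows = [
    ("Claude Code Opus 4.6", 0.92, 0.89, 42),
    ("Opencode GLM-5.1", 0.84, 0.81, 35),
    ("Gemini CLI Gemini 3 Flash", 0.61, 0.94, 29),
    ("Codex GPT-5.4-mini", 0.59, 0.93, 28),
    ("Claude Code Haiku 4.5", 0.45, 0.70, 16),
]

gaps = {}
for crs, find_rate, fix_rate, successes in rows:
    e2e_pct = 100 * successes / SUBSET_SIZE
    product_pct = 100 * find_rate * fix_rate
    # Gap in percentage points between the reported and the implied rate.
    gaps[crs] = abs(e2e_pct - product_pct)
    print(f"{crs}: e2e {e2e_pct:.1f}%, find x fix {product_pct:.1f}%")

# Ranking by end-to-end successes versus ranking by find rate alone.
by_e2e = [r[0] for r in sorted(rows, key=lambda r: r[3], reverse=True)]
by_find = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
```

In this table the ordering by find rate matches the ordering by end-to-end successes, even though the two lowest-cost mid-table agents have the highest fix rates, which is the sense in which the finding stage decides the outcome.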


Quick Start

CRSBench runs on Linux hosts with Docker. A minimal first run installs CRSBench, pulls the managed dependencies, downloads the sanity benchmark suite, and runs one experiment with a local queue-backed worker.

0. Request dataset access

The benchmark dataset on HuggingFace is gated. Before anything else, open huggingface.co/datasets/sslab-gatech/crsbench-dataset, request access, and wait for approval. Without it, crsbench download will fail.

1. Install and prepare

git clone --recurse-submodules https://github.com/sslab-gatech/CRSBench.git
cd CRSBench
uv sync
./scripts/setup-third-party.sh

uv run crsbench prepare
uv run crsbench prepare --coverage

# Gated dataset: accept the DUA on HuggingFace first
uv run hf auth login
uv run crsbench download --benchmark-suite sanity

2. Write a first-run config

Save the following as first-run.yaml. atlantis-multilang-given_fuzzer is the bundled starter CRS, and litellm.skip: true means no external LLM keys are required.

experiment:
  name: first-run
  task: bugfinding
  mode: full
  benchmark_suite: sanity
  sanitizers: [address]

runtime:
  trials: 1
  max_total_time: 3600
  redis_host: localhost:6379
  litellm:
    skip: true

storage:
  experiment_filestore: ./results/experiment-data
  report_filestore: ./results/report-data

crs_compose:
  atlantis-multilang-given_fuzzer:
    num_cores: 4

3. Launch worker + orchestrator

uv run python scripts/valkey-helper.py start

# Terminal 1: worker executes CRS trial jobs
uv run crsbench worker --experiment-config first-run.yaml

# Terminal 2: orchestrator enqueues jobs
uv run crsbench run --experiment-config first-run.yaml
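
For a single-terminal run, the three commands above could be wrapped in a small launcher like the sketch below. The `build_commands` and `launch` helpers are hypothetical and not part of CRSBench. The sketch also assumes that `crsbench run` blocks until the enqueued trials finish; if it returns earlier, the worker should not be terminated at that point.

```python
import subprocess
from typing import Dict, List


def build_commands(config_path: str) -> Dict[str, List[str]]:
    """Return the documented commands as argument lists for one config file."""
    crsbench = ["uv", "run", "crsbench"]
    return {
        "valkey": ["uv", "run", "python", "scripts/valkey-helper.py", "start"],
        "worker": crsbench + ["worker", "--experiment-config", config_path],
        "run": crsbench + ["run", "--experiment-config", config_path],
    }


def launch(config_path: str) -> int:
    """Start the queue, a background worker, then the orchestrator."""
    cmds = build_commands(config_path)
    subprocess.run(cmds["valkey"], check=True)
    worker = subprocess.Popen(cmds["worker"])
    try:
        # Assumption: the orchestrator returns only after all trials are done.
        return subprocess.run(cmds["run"]).returncode
    finally:
        worker.terminate()
        worker.wait()


# Inspect the commands without executing anything.
commands = build_commands("first-run.yaml")
for name, argv in commands.items():
    print(name, "->", " ".join(argv))
```

Calling `launch("first-run.yaml")` from the repository root would then replace the two-terminal setup; the block itself only prints the commands so it can be inspected safely.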

The CRS lifecycle reuses oss-crs prepare, oss-crs build-target, oss-crs artifacts, and oss-crs run, so any CRS listed in the OSS-CRS Registry plugs straight into crs_compose. For the distributed-experiment guide and configuration reference, see the upstream README and docs/.