Benchmarking AI Code Review at Scale

Measuring AI code review performance

Qodo’s code review benchmark is a public, large-scale evaluation designed to measure how well AI systems perform real code review. It covers both functional bugs and best practice violations, reflecting the range of problems reviewers encounter in practice.

Results are reported as precision, recall, and F1 score: recall captures issue coverage, precision captures noise, and F1 captures the overall balance between the two across the many concurrent issues present in production-grade pull requests.
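
As a rough illustration, the sketch below shows how the three metrics relate; the counts and the helper function are hypothetical and not drawn from the benchmark itself. Precision penalizes noisy comments, recall penalizes missed issues, and F1 rewards tools that do well on both.

```python
def review_metrics(true_positives: int, false_positives: int, false_negatives: int):
    """Compute precision, recall, and F1 for a set of review findings.

    true_positives:  findings that match a real issue in the pull request
    false_positives: findings that flag nothing real (noise)
    false_negatives: real issues the tool never commented on
    """
    flagged = true_positives + false_positives
    present = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / present if present else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Example: a "quiet" reviewer flags 5 issues, all correct, but misses 15 others.
print(review_metrics(true_positives=5, false_positives=0, false_negatives=15))
# precision = 1.00, recall = 0.25, F1 = 0.40
```

This is the pattern described in the next section: perfect precision can still mean weak coverage, which the F1 score makes visible.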

Based on these results, Qodo demonstrates the strongest overall performance among evaluated AI code review tools.

Qodo: A balance of precision and recall

Most tools cluster toward high precision and low recall. This means they comment only on the most obvious issues and miss a large portion of problems in a pull request. The result is quiet reviews with limited coverage.

Qodo, by contrast, maintains broad coverage without flooding pull requests with noise. When precision and recall are considered together, that balance translates into the strongest overall F1 score among the evaluated tools.

Representative of real-world codebases

The selected repositories cover a broad, representative range of programming languages and system types, including full-stack applications, distributed systems, databases, runtimes, and mobile apps. They span both single-language and polyglot codebases, reflecting the diversity and complexity of real-world software projects.

About the benchmark methodology

The benchmark is built on real, merged pull requests from active, production-grade open-source repositories.

Verified bugs and best practice violations were then injected into these pull requests. This approach evaluates both code correctness and code quality in full pull request contexts, going beyond benchmarks that focus on isolated bug types.
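
As a loose illustration of this setup, the sketch below assumes each injected issue and each review comment can be reduced to a (file, line, kind) triple and scored by exact matching; the data structures and matching rule are assumptions for the example, not the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One injected issue or one review finding, reduced to a comparable key."""
    file: str
    line: int
    kind: str  # e.g. "functional_bug" or "best_practice"


def score_pull_request(injected: set[Issue], flagged: set[Issue]) -> tuple[int, int, int]:
    """Compare a tool's findings on one pull request with the issues injected into it."""
    true_positives = len(injected & flagged)    # injected issues the tool caught
    false_positives = len(flagged - injected)   # comments that match no injected issue
    false_negatives = len(injected - flagged)   # injected issues the tool missed
    return true_positives, false_positives, false_negatives
```

Per-pull-request counts like these can then be aggregated across the benchmark and converted into the precision, recall, and F1 figures discussed above.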