
Qodo in the NVIDIA GTC Keynote. Here’s Why…

NVIDIA recently launched Nemotron 3 Super, a 120-billion parameter open-source model designed to close the gap with frontier proprietary models. And when they needed a benchmark rigorous enough to validate that claim, they turned to Qodo’s Code Review Benchmark. That’s what put Qodo on the GTC stage.

Why NVIDIA used Qodo’s Code Review Benchmark

When new foundation models like NVIDIA’s Nemotron 3 Super are released, evaluating them on standard benchmarks (like HumanEval or MBPP) is no longer enough to prove their enterprise readiness. Writing a standalone function is a “low-cognitive” task; performing a code review is a “high-cognitive” one.

NVIDIA used Qodo’s Code Review Benchmark for exactly this reason. As they optimized Nemotron 3 Super for agentic workflows, they needed a rigorous stress test that mirrored the complexity of real-world software engineering.

The Qodo Code Review Benchmark evaluates models on two metrics that map directly to production reality:

  • Precision: when the system flags an issue, is it actually an issue? A low-precision tool buries engineers in noise and gets ignored.
  • Recall: of all the real issues in the code, how many does the system actually catch? Low recall means things slip through to production.
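
The two metrics above reduce to three counts per evaluation run. A minimal sketch of the arithmetic (the counts here are illustrative, not benchmark results):

```python
# Precision and recall from a review tool's findings.
# tp: flagged issues that are real, fp: flagged issues that are not,
# fn: real issues the tool missed. Counts below are illustrative.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)  # of everything flagged, how much is real
    recall = tp / (tp + fn)     # of everything real, how much was flagged
    return precision, recall

p, r = precision_recall(tp=45, fp=15, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

The two counts pull in opposite directions in practice: a tool can inflate recall by flagging everything (destroying precision), or inflate precision by flagging almost nothing (destroying recall), which is why the benchmark reports both.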

When we ran Nemotron 3 Super through the benchmark, it achieved 73.4% precision, the highest of any open-source model tested, outperforming models two to three times its size. That number means something: when Nemotron flags an issue in a production PR, developers can act on it. The recall gap to frontier models remains real, but the trajectory is clear.

For Qodo’s enterprise customers running self-hosted models in air-gapped environments (regulated industries, financial services, government), this is a direct path to production-grade independent code review inside their own infrastructure. No code leaving the firewall. No external API dependencies. Full control over data residency.

Establishing a Standard for AI Code Review

We developed this benchmark to address a lack of standardized, non-static evaluations for AI code review. Unlike standard coding benchmarks that focus on isolated code generation, this evaluates a model’s ability to function as a reviewer within a complex, multi-file pull request.

Methodology and Dataset

  • Dataset: 100 real-world, production-grade pull requests.
  • Defect Injection: 580 verified defects (logic errors, security vulnerabilities, and best-practice violations) injected across 8 repositories.
  • Language Coverage: TypeScript, Python, JavaScript, C, C#, Rust, and Swift.
  • Execution: Every tool is evaluated under identical conditions: default configurations, no manual tuning, and a consistent LLM-as-a-judge system to verify findings against a human-validated ground truth.
  • Open source: the methodology, the dataset, and the evaluation scripts are publicly available.

In the latest run of the benchmark, we also evaluated Anthropic’s newly released Claude Code Review under the same conditions. The result: Qodo outperforms Claude by 12 F1 points. Precision was identical. Both tools are accurate when they flag something. The difference is recall: Qodo catches significantly more of the real issues. In a workflow where AI is writing the code, the issues that don’t get flagged are the ones that make it to production.
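
Because F1 is the harmonic mean of precision and recall, a recall gap alone can produce a double-digit F1 difference even when precision is tied. A quick worked example (the precision and recall values below are illustrative, not the published benchmark numbers):

```python
# F1 is the harmonic mean of precision and recall, so it is dragged
# down by whichever of the two is lower.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Two hypothetical tools with identical precision but different recall.
f1_a = f1(precision=0.75, recall=0.60)  # ~0.667
f1_b = f1(precision=0.75, recall=0.42)  # ~0.538
print(f"gap = {100 * (f1_a - f1_b):.0f} F1 points")  # gap = 13 F1 points
```

The point of the arithmetic: with precision held constant, the entire F1 gap is attributable to how many real issues each tool actually catches.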

As the industry advances, we will continue to add new models and proprietary tools to this benchmark, ensuring it remains a living, transparent standard for code reasoning.

Code review as the governance layer

NVIDIA’s use of the Qodo Code Review Benchmark to validate Nemotron 3 Super signals where the AI development stack is heading: generation is accelerating, but verification is becoming the layer that determines whether that speed is safe to ship.

This is the layer Qodo focuses on, acting as an independent quality gate that bridges the gap between raw model intelligence and production-grade reliability.

When AI generates a change in minutes, someone still needs to confirm that the logic is correct, that bugs are caught and fixed, and that the change won’t break existing behavior. In some AI workflows today, the same model or agent that generates the code is also asked to review it. That approach works for simple examples, but it has an obvious limitation: a model reviewing its own output tends to share the blind spots that produced the bug in the first place.

This is why independence matters. NVIDIA did not validate Nemotron using its own internal evaluation alone. It used an external benchmark designed specifically to test code review performance. The same principle applies inside the development workflow.

Operationalizing the Governance Layer with Qodo

The benchmark results are a vital data point, but the broader argument is architectural. We are entering an era where the majority of code in production will be written, at least in part, by AI. This shift fundamentally changes the role of code review.

Code review is no longer just about catching human typos; it is becoming the governance layer for the AI-augmented SDLC. Its purpose is to verify that AI-generated output actually reflects organizational intent: ensuring that logic is sound, conventions are respected, and changes align with the broader system architecture.

That requires three things working together:

Detection: Catching “silent killers” requires reasoning beyond the immediate code diff. Qodo utilizes a specialized agentic architecture designed to identify failure modes, such as logical regressions, security vulnerabilities, or performance bottlenecks, that monolithic models often overlook when reviewing their own output.

Governance: knowing what “good” looks like for your specific organization, not just in general. Qodo’s Rules System turns scattered engineering standards into a centralized, versioned, enforceable layer. Standards that live in senior engineers’ heads get codified. Tribal knowledge doesn’t leave when people do. Rules are defined once and enforced everywhere — across every PR, every team, every agent.
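
As a rough illustration of what a codified rules layer can look like, here is a hypothetical sketch. The rule structure, rule IDs, and checks are invented for illustration; they are not Qodo’s actual Rules System format.

```python
# Hypothetical: organizational standards expressed as data, then
# applied uniformly to every changed file in a pull request.
RULES = [
    {"id": "no-print", "applies_to": ".py", "forbid": "print(",
     "message": "Use the structured logger instead of print()."},
    {"id": "no-todo", "applies_to": ".py", "forbid": "TODO",
     "message": "Track work in the issue tracker, not TODO comments."},
]

def review(changed_files: dict[str, str]) -> list[str]:
    """Return one finding per rule violation across the changed files."""
    findings = []
    for path, content in changed_files.items():
        for rule in RULES:
            if path.endswith(rule["applies_to"]) and rule["forbid"] in content:
                findings.append(f"{path}: [{rule['id']}] {rule['message']}")
    return findings

pr = {"app/service.py": "def handler():\n    print('debug')\n"}
for finding in review(pr):
    print(finding)
```

The design point is that the rules live in one versioned place and the enforcement loop is the same for every PR, which is what turns tribal knowledge into something a team can audit and evolve.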

Trust: giving engineering leaders confidence that what gets merged reflects intent, not just passing tests. This is the hardest part to measure and the most important. 73.8% of Qodo’s suggestions get accepted by developers. That number matters because it shows developers are reading the comments and acting on them, not dismissing them as noise.

The question for every engineering organization shipping AI-generated code is no longer just “how fast can we generate?” It’s “how do we know what we shipped is what we meant to build?”

What this moment signals

Jensen Huang’s inclusion of Qodo at GTC reflects a broader shift in software engineering. We are moving from AI as an assistant to AI as part of the core development infrastructure.

Code generation will continue to improve, and teams will likely switch between models based on performance, much like they switch IDEs today.

What does not change is the need to verify what gets merged. As generation accelerates, organizations need an independent layer that can review changes with the same rigor.

The collaboration between NVIDIA and Qodo points to that model: separating the system that writes code from the system that reviews it. That separation is what makes AI-driven development reliable at production scale.
