How to Catch AI-Generated Bugs with Layered Code Review
AI coding assistants made developers genuinely faster.
A developer using Codex or Claude Code can produce thousands of lines of code in just a few days. Jellyfish’s 2025 research put a number on this: teams with high AI adoption saw PRs per engineer jump by 113%, more than doubling output per engineer.
More PRs also mean more code to review, and review capacity doesn’t scale with output. When PRs arrive faster than reviewers can handle them, teams find shortcuts to prevent backlogs. PRs are skimmed. Reviewers check only that tests and linters pass, or have their coding agent review the code. Then they approve, and the code ships.
Tests confirm the code runs, linters catch style and syntax, and a coding agent review validates patterns it already recognizes. None of them reliably catches logic errors, unintended behavioral changes, or architectural choices that compound quietly across sprints.
Why Your Coding Agent Can’t Review Its Own Code
Having your coding agent review its own code seems like a logical approach.
It wrote the code, so asking it to review it makes sense. The problem is that the agent operates from the same context, the same assumptions, and the same training distribution it had when writing the code.
If it wrote a flawed pattern, it will likely confirm on review that the pattern looks correct, because it’s looking at something familiar.
What you need is an independent perspective on your code.
According to a 2025 study across 14 open source non-reasoning LLMs, models failed to correct errors in their own outputs nearly two-thirds of the time, while successfully correcting identical errors attributed to external sources.
Your agent is still a valid part of your code quality workflow, but it should be one layer in the review stack, not the whole stack.
The 3 Review Levels for AI-Generated Code
A layered code review approach reviews AI-generated code at three levels of the workflow: before the code is written, after the code is written, and before the PR is merged.
Generation Guardrails are the instructions you add to your agent’s instruction files, whether that’s AGENTS.md or CLAUDE.md. When you define naming conventions, security rules, and architectural constraints there, the coding agent applies them during generation, meaning the code that reaches review is already cleaner.
Independent Agent Review puts a separate agent on the diff after the code is written. This can be a sub-agent spawned from the same tool with a different system prompt, or a different model entirely. The point is that it carries none of the prior reasoning that shaped the implementation, no assumptions, no decisions already made. The agent that wrote the code has already committed to those choices. A reviewer with no memory of making them will question them.
PR Review Gate uses a specialized review tool at merge time. Tools like Qodo, CodeRabbit, and Graphite review each PR with full-repo context. The difference between your coding agent and a purpose-built review tool is that your coding agent only knows what you’ve told it to look for. Dedicated review tools, on the other hand, are trained on production defect patterns across thousands of codebases and on exploited vulnerabilities, so they review your codebase more extensively.
Each review level catches what the others can’t, because each one sees the code from a different point in the workflow. Stack them, and you cover blind spots that no single pass, human or AI, would surface on its own.
Setting Up Generation Guardrails
Your AGENTS.md (or CLAUDE.md) is the earliest possible intervention point in the stack.
Your coding agent reads its instructions file before generating any code, which means any standard you define there is considered as it writes code.
Add review-focused guidelines that tell the agent what to watch for, such as:
- Naming and structure conventions
- Security rules (no string interpolation in queries, no secrets in env-accessible code, required input validation)
- Architectural constraints (which layers own which responsibilities, banned patterns)
- Testing requirements (minimum coverage per module, required edge cases for external inputs)
Here are some instructions you can drop into your agent file right now:
## Code Review Standards

### Security
- Never use string interpolation or concatenation in SQL queries. Use parameterized queries only.
- Validate and sanitize all inputs received from external sources before processing.
- Do not log sensitive fields: passwords, tokens, PII, payment data.
- Use environment variables for secrets. Never hardcode credentials.

### Architecture
- Service layer handles business logic. Controllers handle routing only.
- Do not import database models directly in route handlers.
- Shared utilities live in /lib. Do not duplicate utility functions across modules.

### Testing
- Every function that handles external input requires a test for invalid/malformed input.
- New API endpoints require integration tests covering the success path and at least two failure paths.
- Do not mock the database in integration tests.

### Code Style
- Functions over 40 lines should be split. Add a comment explaining the split if the reason isn't obvious.
- Prefer explicit error returns over silent catches.
- Type all function parameters and return values.
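To make the first security rule concrete, here is a minimal sketch of the difference the agent is being asked to respect. It uses Python's standard `sqlite3` module; the `users` table and both function names are hypothetical and exist only for illustration.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Violates the rule: user input is interpolated into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Follows the rule: the driver binds `name` as data via a placeholder.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# A classic injection payload: it matches every row in the unsafe version
# and no rows in the parameterized version.
payload = "nobody' OR '1'='1"
leaked_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)
```

A rule written this specifically gives the agent a pattern it can apply during generation, and gives later reviewers something unambiguous to check.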
Pro Tip: Add nested AGENTS.md or CLAUDE.md files to specific directories if a subdirectory has additional or domain-specific constraints. This is especially useful when working in a monorepo. Agents resolve the most specific applicable file first and traverse upward.
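The lookup order described in the tip can be sketched as follows. This is an illustration of "most specific first, then upward", not the actual resolution logic of any particular tool, which may differ; the function name `applicable_instruction_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the article; a given tool may read only one of them.
INSTRUCTION_FILENAMES = ("AGENTS.md", "CLAUDE.md")


def applicable_instruction_files(start_dir: Path, repo_root: Path) -> list[Path]:
    """Return instruction files from the most specific directory up to the repo root."""
    start_dir = start_dir.resolve()
    repo_root = repo_root.resolve()
    if start_dir != repo_root and repo_root not in start_dir.parents:
        raise ValueError("start_dir must be inside repo_root")

    found: list[Path] = []
    current = start_dir
    while True:
        for name in INSTRUCTION_FILENAMES:
            candidate = current / name
            if candidate.is_file():
                found.append(candidate)
        if current == repo_root:
            break
        # Move one level up toward the repository root.
        current = current.parent
    return found
```

In a monorepo, calling this for a directory such as `services/billing` would list that directory's file before the root-level one, which is the order in which the more specific constraints should win.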
Running an Independent Agent Review
Most agentic code tools have a subagent feature.
When your coding agent is done writing code, it can spin up subagents to review the code, each focusing on a single criterion.
Examples of these criteria are:
- Type Safety: checks for type errors and unsafe casts.
- Style & Clean Code: catches style guide violations and code smells.
- Consistency: checks for inconsistent patterns across the codebase.
- Defensive Programming: catches missing validation and error handling.
- Security: checks for injection surfaces and exposed secrets.
- Performance: checks for inefficient queries, excessive resource usage, and bottlenecks.
Package the instructions for each criterion as a skill, and invoke it at the end of every code session. By having one subagent per criterion, you catch more issues than a single agent trying to assess everything at once.
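The one-reviewer-per-criterion idea can be sketched in a tool-agnostic way. The code below assumes a hypothetical `run_reviewer` callable that sends a prompt to a subagent and returns its findings as strings; how you actually spawn a subagent depends on your tool. The instruction wording and the `fake_reviewer` stand-in are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable

# Criteria mirror the list above; the instruction wording is illustrative.
CRITERIA = {
    "Type Safety": "Check for type errors and unsafe casts.",
    "Style & Clean Code": "Flag style guide violations and code smells.",
    "Consistency": "Flag patterns that are inconsistent with the codebase.",
    "Defensive Programming": "Flag missing validation and error handling.",
    "Security": "Check for injection surfaces and exposed secrets.",
    "Performance": "Check for inefficient queries and bottlenecks.",
}


@dataclass(frozen=True)
class Finding:
    criterion: str
    message: str


def build_prompt(criterion: str, instructions: str, diff: str) -> str:
    """Build a prompt that scopes one reviewer to exactly one criterion."""
    return (
        f"Review the diff below for one criterion only: {criterion}.\n"
        f"{instructions}\n"
        "Report each issue on its own line. Report nothing if you find none.\n\n"
        f"{diff}"
    )


def review_diff(diff: str, run_reviewer: Callable[[str], list[str]]) -> list[Finding]:
    """Run one scoped review per criterion and collect the tagged findings."""
    findings: list[Finding] = []
    for criterion, instructions in CRITERIA.items():
        prompt = build_prompt(criterion, instructions, diff)
        for message in run_reviewer(prompt):
            findings.append(Finding(criterion, message))
    return findings


def fake_reviewer(prompt: str) -> list[str]:
    # Stand-in for a real subagent call, used only to demonstrate the flow.
    if "one criterion only: Security" in prompt:
        return ["query built with string concatenation"]
    return []
```

Tagging each finding with its criterion keeps the output easy to triage and shows which reviewer prompts are pulling their weight.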
The listed criteria are not exhaustive. You can add more criteria and refine the instructions as you find more issues slipping through at this review level.
The easiest place to get this signal is at the PR Review Gate.
Using Qodo as the PR Review Gate
The PR Review Gate catches issues that slip through Generation Guardrails and Independent Agent Review. It also gives you feedback you can use to improve your independent review prompts.
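One way to turn that feedback into prompt improvements is to tally which categories keep reaching the gate. The sketch below assumes you have exported gate findings as a list of dictionaries with a `category` key; that shape is hypothetical and is not the output format of any specific review tool.

```python
from collections import Counter


def escaped_categories(gate_findings: list[dict], min_count: int = 2) -> list[tuple[str, int]]:
    """Return categories that reached the PR gate at least `min_count` times, most frequent first."""
    counts = Counter(finding["category"] for finding in gate_findings)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


# Hypothetical findings collected from recent PRs.
recent_findings = [
    {"category": "security", "pr": 101},
    {"category": "security", "pr": 104},
    {"category": "security", "pr": 109},
    {"category": "performance", "pr": 104},
    {"category": "performance", "pr": 110},
    {"category": "style", "pr": 111},
]

# Categories that recur are candidates for tighter independent-review prompts.
to_tighten = escaped_categories(recent_findings)
```

A category that recurs at the gate suggests the matching reviewer instructions upstream are too vague or missing.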
You want a tool with high recall, one that prioritizes flagging coverage over precision, reducing the chance of bugs slipping into production.
In Martian’s independent code review benchmark, Qodo achieved an F1 score of 60.1%, the highest of any tool tested, with the gap attributed to recall. Qodo’s Extended mode runs specialized sub-agents in parallel, each focused on a specific issue category, giving it broader coverage per PR than a single-pass review.
At the final gate, coverage matters more than precision because you can filter a noisy comment, but you can’t recover a bug the tool never flagged.
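For readers who want the metrics spelled out, here is a short sketch of how precision, recall, and F1 relate. The two tools and their counts are invented for illustration and are not benchmark data.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts, returning 0.0 where undefined."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical PR set containing 10 real bugs.
# Noisy tool: flags 9 real bugs plus 9 false alarms, and misses 1 bug.
noisy = precision_recall_f1(true_positives=9, false_positives=9, false_negatives=1)

# Quiet tool: flags 5 real bugs with no false alarms, and misses 5 bugs.
quiet = precision_recall_f1(true_positives=5, false_positives=0, false_negatives=5)
```

The quiet tool never wastes a reviewer's time, but it lets five bugs through. The noisy tool costs some filtering effort and lets through one, which is the trade-off the paragraph above argues for at the final gate.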
To start using Qodo, install it on your repository. It supports GitHub, GitLab, and Bitbucket. Select the repositories you want covered, and it reviews every PR your team opens, leaving its findings as comments that your coding agent can read and fix.
Think of Qodo as your final wall of defense, guarded by your army of Armadillo soldiers ready to sound an alarm for anything that even remotely looks like a threat.
With all three review levels stacked, most defects get caught before they’re expensive to fix.
Next Steps
Start with Generation Guardrails. Adding review-focused instructions to your AGENTS.md is a single-file edit covering your security rules, architectural constraints, and testing requirements.
Once your guardrails are in place, add the dedicated review skill. Starting with cleaner code means the reviewing sub-agents spend their attention budget on logic issues and edge cases.
Add Qodo as the PR Review Gate, then use the issues it catches to improve your independent review prompts.
With these three review levels stacked together, PRs arrive with fewer issues, which keeps review queues short and maintains your shipping velocity.
See Also
- Qodo installation guide (GitHub, GitLab, Bitbucket)
- Qodo State of AI Code Quality 2026
- How we built a real-world benchmark for AI code review
- 2025 AI Metrics in Review: What 12 Months of Data Tell Us About Adoption and Impact