Closing the gap: AI agent hype and production reality
Summary
In this episode, Dexter Horthy, CEO of HumanLayer, shares hard-won lessons from building AI agents in production: what breaks, what the “12-factor agents” framework gets right, and why more engineering teams are pulling back on automation to rescue architectures that went sideways.
This episode’s guest
Dex Horthy is CEO and Co-Founder of HumanLayer, a YC-backed AI Context Lab. He coined “context engineering,” authored the 12-Factor Agents manifesto, and spent seven years learning how enterprises actually adopt software. Now he’s closing the gap between what AI agents promise and what they ship.
Key takeaways
- Unsupervised AI agents don’t just write bad code — they silently destroy your architecture until it’s too expensive to fix
- The industry swings between “stop reading code” and “oh no, the codebase is a mess” on a predictable 5-month cycle
- Generating low-quality code is nearly free; defending against it is expensive — that asymmetry is only getting worse
- A 200-line design doc before writing code reduces PR rework from 50% to under 5% — the leverage is upstream, not at review
- Qodo’s Rules System captures the institutional knowledge and engineering standards that prevent AI agents from drifting — turning senior engineer judgment into enforceable guardrails at scale
Chapters
- Closing the Hype-Reality Gap
- The Code-Reading Pendulum
- The Asymmetric War on Slop
- Context Engineering as a Definitive Skill
- Shifting Quality Left with Design Docs
- The Evolution of Senior Engineering
Transcript
[00:00:00] Dex: The models are really good at certain things, but they’re not good at codebase architecture. And if you leave them unattended, they will make a big mess.
[00:00:08] Itamar: Welcome to Agentic Review, the podcast where we explore what good code really means in the age of AI software development.
[00:00:16] Nnenna: I’m Nnenna Ndukwe, developer relations lead.
[00:00:18] Itamar: And I’m Itamar Friedman, the cofounder and CEO of Qodo.
[00:00:22] Nnenna: So let’s get into it.
[00:00:27] Itamar: Today, we’re joined by Dex, CEO and cofounder of HumanLayer, a YC-backed AI context lab and the author of the 12-Factor Agents manifesto, which has become one of the most widely referenced frameworks for building reliable AI agents in production.
[00:00:44] Nnenna: Dex coined the term context engineering and spent nearly seven years at Replicated learning how enterprises actually adopt software in lockdown environments and is now on a very public mission to close the gap between what AI agents promise and what they actually ship.
[00:01:01] Itamar: So without further ado, I didn’t mention your family name because I feel like Dex is like Madonna. Welcome to the show, and please tell us more about yourself, especially those things that we have missed.
[00:01:11] Dex: No. I love that. Thank you. I’m super stoked to be here, and appreciate the invitation. I like I’ve never heard it in that in that, uh, exact phrasing, but, yeah, very public mission to close the gap between what we were told agents can do and what they can actually do. And that comes from my own journey. Right? Like, a year ago, we published the 12-Factor Agents essay or paper or manifesto, if you want to call it that. And it was kind of my journey of, like, getting sucked into the AI hype machine and building for this whole ecosystem that if you were paying attention to most of what was happening online, you were told, hey. This is the future. This is how it’s going to be done. It’s going to be like, I’m not going to name specific names, but a lot of open source frameworks that were very, very popular. And the kind of promise was like, hey. If you build for this ecosystem, then you will be able to distribute to lots and lots of people, and enterprises will be able to use your stuff. It was kind of like building to a standard that hadn’t actually been adopted yet, and then we actually went to go sell tools that were built for that ecosystem. And we found out that everybody building production AI and, like, actually selling six-figure contracts to enterprises wasn’t actually building in that way and wasn’t using any of that stuff because it didn’t really work, and it was all just hype. And so I’m now like, okay. I got burned by that. How can I share the lessons? And then that’s just been, like, do that and then just repeat every two or three months. Go get in the trenches with customers, learn stuff, and then teach people about it and try to save folks from hitting all of the mistakes and landmines that we hit in the last six months or year or whatever it is.
[00:02:41] Nnenna: Yeah. I think that’s awesome because that I felt like that was a big part of the energy, the build-in-public, and the, I guess, the honesty that you showed for your talk at AI Engineer Miami. So I would love for you to explain or give a recap of how that also went and what it was about.
[00:03:00] Itamar: And were you referring to AutoGPT? I’m not going to get you, like...
[00:03:04] Dex: No. It wasn’t AutoGPT. It was... and, basically, all of the things now that are... I don’t know. People are getting completely obsessed with context engineering, and it’s like, okay. 80% of what you guys are excited about is a thing that people have been doing since 2023, which is, like, agent calling tools in a loop. And no one is focused on the cool part of context engineering, which is, like, RL’ing a model on the harness that’s built. Anyways, you asked about Miami.
[00:03:30] Nnenna: Yes. Yeah. A recap of, uh, what you spoke about because, uh, you know, I was there. I saw it. And then also, you know, the Internet saw it too. So there was a lot of excitement and discussion about your some of your, I guess, controversial slides about reading the code. So
[00:03:47] Dex: And if people saw it, I apologize. And if you haven’t seen it yet, I apologize in advance. We gave a talk back in November that was really, really popular. I think it has, like, 500,000 views on YouTube, which was our, like, RPI, research-plan-implement system. And we had been rolling that out to companies of all sizes from small startups to Fortune 500, you know, thousands of engineers. And in that kind of two-, three-, four-month period, we learned a lot about what we got wrong about it in terms of, like, the difference between building for power users and building for average users who can’t be bothered, the difference between, like, building again, like, what is needed to take something from one person’s workstation and workflow and make it work across everybody’s workstation and workflow? And, like, how do you basically as the models change in character, how do you do context engineering, split your workflow up into smaller sessions to get better results and to kind of raise the ceiling about what is the hardest problem you can solve. Although I will say, I guess, the most exciting part for most people or controversial perhaps was, like, in August, we were like, stop reading the code. Just write really good specs and let the model cook and, like, hope that it’s smart enough, and you’ll be fine. And Ralph Wiggum made an entire programming language in June and, like, yada yada yada. And then we did that for five-ish months, and then we suddenly were like, you know what? This code is terrible, and it’s really hard to work in. And, like, we’re going to go back to reading it. And we, like, ripped out and replaced almost all of our product and, like, rebuilt it from scratch. And I tell you, my cofounder was in VS Code, not even Cursor, typing every character by hand for about two weeks to, like, make sure that the bones of the project were really, really good. And that was around November, December. 
And what basically happened since then was, like, in December, the Opus 4.5 thing happened, and there were way more people... Like, Opus 4.5, I actually think, is not as smart a model as Opus 4.1, but it was a little bit faster, and it was better for, like, people who didn’t know what they were doing. And that’s what caught on. It was much cheaper, and they gave everybody, like, double limits over the holidays. And the whole thing blew up, and everyone got really into it. And then Stainless published their, like, lights-off software factory: we’re going to build a bunch of products. Dark factory. No one’s going to write the code. No one’s going to read the code. The only thing humans are allowed to interact with is, like, tickets and specs, but no one reads the code.
[00:06:09] Itamar: So until now, you described, like, pendulum. Like, we were there. We went back and back again. Like, that’s So no.
[00:06:16] Dex: So, like, we were there in August, and it took us about four or five months to swing back to, like, nope. The models are really good at certain things, but they’re not good at codebase architecture. And if you leave them unattended, they will make a big mess. And the whole rest of the industry is, like, swinging maybe they’re higher up to the left or maybe they’re further down, but the whole the rest of the industry is, like, they just hit this point of, like, hey. Let’s see. Like, what if we just don’t read the code? And that was in December, January. And you actually see now in, like, April like, in the last week or two, there is a huge, huge outpouring of people, at least on Twitter, is like, oh, actually, like, we’re going to go back to reading the code. There’s kind of this, like, five month it takes you five months from, like, we don’t read the code anymore to, like, damn. You know what? We actually can’t go 10 times faster. Let’s go back. Let’s guarantee stability, and then maybe we can, like, incrementally go up to, like, two or three times faster.
[00:07:07] Itamar: Yeah. As the CEO and cofounder of Qodo, I am on a streak of, like, one or two PRs a day. Don’t worry. I’m not touching, like, core functionality. I’m leaving that for the engineering team, but it’s very important for me to do day to day. And I feel like I’m doing that pendulum, like, every day or so. And it is kind of converging. Like, sometimes I find myself, like, being much better. And what I think you described in some of your talks, like, doing the research part prior to the coding, like, I do know what to ask for the research. Like, hey. Look at the software patterns there. Look at the design patterns there. But then, like, I feel like I have a really good spec. And in many cases, I’m feeling I didn’t write one human token in the code itself. And then, like, I get maybe a little bit too confident, and then I kind of find myself, like, opening up a PR that probably could have been half the size or so because I didn’t think thoroughly, for example, about the architecture, like you mentioned, although I gave it a prompt. So I’m kind of, like, having this pendulum myself, almost daily. And that leads to, like, kind of, a tagline. I think you mentioned that we are in an asymmetric war on slop. Is that related? And, like, who’s actually winning right now?
[00:08:23] Dex: Yeah. So that was actually from Swyx’s talk in November. He gave, like, the opening talk for AI Engineer Code, and it was kind of like the war on slop. And it was, like, a very short talk, but he just encouraged the entire audience to just, like, hey. When your boss asks you to ship this even though you know it’s not good, like, what do you say? No more slop. When you want to publish the blog post, but you know it’s... no more slop. Like, 2026 has to be the year where we pull back and we say no more slop. And he pulled it up. I forget the original quote, but you can go watch the talk. But he was basically, like... I think the original word was bullshit. It was like the work required to, like, combat and refute bullshit is, like, 100 times harder than the work required to produce it. And I think slop has the same thing of, like, it’s an asymmetric war where, like, it’s basically almost free for people to generate bullshit comments on LinkedIn or, like, generate terrible trash code that’s not going to be usable in three months or whatever it is. And it’s much harder to defend against these things. I mean, the OpenCode guys and the Pydantic team, they’re all having this issue where they’re drowning in PRs from bots that are terrible, and it’s way more work for them to kind of, like, just keep people happy and keep the reputation okay and, actually, like, let through the good PRs than it is for all of the people who are just, like, barraging the entire repo with garbage.
[00:09:46] Itamar: Yeah. So what do you think, by the way, about AI for code review? And after that, Nnenna, I’d love to hear your thoughts about it.
[00:09:53] Dex: AI for code review, I’m I’m kind of on the fence. I’ve, like, publicly kind of talked about, like, it’s very easy to accidentally, like, have AI for code review be kind of useless. Like, obviously, the most naive one is like, hey. GPT, review my coworker’s PR and tell me everything that’s wrong with it. And it’s like, okay. It’s going to come up with a list of, like, 30 issues. If you’re like, hey. Is this PR good? You’re going to get the opposite. You’re going to be, oh, yes. Comprehensive. It’s got unit tests. It’s, like, fully it does the whole it’s like it’s way too easy to oversteer these models to have opinions on things. And so, like, for code review specifically, I think the right like, there are ways to do it well, and it’s, like, part of it, I’ve seen code review bots do things like explain the PR, like, explain the surface and basically, like, compress it down where instead of reading thousands of lines of code, I’m reading 100 or 200 lines of Markdown, and that at least grounds me before I go look. I know which parts I actually want to go check. But it’s also, like, doing classifications is really easy. Like, if you ask the model, is it x or y, it’s pretty good at saying x or y, but it’s up to you to apply a value function to those things. Like, you can’t ask a model on a scale of one to 100, how good is this PR? But you could say, does it have unit tests? Do the unit tests actually cover the thing that matters? And, like, answer all of those, and then, like, you create the score deterministically based on like, the model doesn’t know what you can’t let you can’t leak the understanding of what is good or bad into the model because then it’s going to try to tell you what you want to hear.
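The scoring approach Dex describes here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anyone's actual implementation: the check names and weights in `CHECKS` are hypothetical, and the `answers` dict stands in for yes/no classifications a model would return one question at a time. What it tries to show is that the weights live only in code, so the model is never told what counts as good or bad.

```python
# Hypothetical yes/no questions put to the model one at a time. The weights
# are an assumption for illustration and are never included in any prompt.
CHECKS = {
    "has_unit_tests": 3,
    "tests_cover_changed_behavior": 4,
    "touches_public_api": -1,
    "adds_new_dependency": -2,
}


def score_pr(answers: dict[str, bool]) -> int:
    """Combine the model's yes/no answers into a score using fixed weights."""
    missing = set(CHECKS) - set(answers)
    if missing:
        # Refuse to score a partially classified PR rather than guess.
        raise ValueError(f"unanswered checks: {sorted(missing)}")
    return sum(weight for name, weight in CHECKS.items() if answers[name])


answers = {
    "has_unit_tests": True,
    "tests_cover_changed_behavior": True,
    "touches_public_api": True,
    "adds_new_dependency": False,
}
print(score_pr(answers))
```

Because the model only answers narrow factual questions, changing what the team values means editing `CHECKS`, not rewriting a prompt.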
[00:11:19] Nnenna: So there’s, like, a bias is what you’re what you’re saying. Or I guess there needs to be intentional blindness or independence in that process.
[00:11:27] Dex: Yeah. I don’t know. Itamar, what’s your take?
[00:11:29] Itamar: So first of all, of course, we’re biased at Qodo. We’re the AI code review and governance platform. So we are noticing that the models themselves, as LLMs, follow instructions. They’re trained to follow instructions. And therefore, if you give them that bias of, like, hey, can you check if this actually works and it’s really, really wonderful? Then they will tell you, yes, this is super wonderful. Although I do see them getting better on that. But I think companies, and it’s not only Qodo, I don’t want to make this a commercial for Qodo, that are focused on building AI code review systems value that objectivity. Building opinionated systems, opinionated agents, is exactly the job of the tool: to take different sources of information, take different considerations, actually many types, architecture, performance, maintainability, etcetera, and bake them into the agent so it is very thoughtful about it. That is what we do. And I think that’s maybe a little bit of the difference. Have you ever seen Claude Code tell you, no, I’m not going to do this for you, I’m not completing this for you? It’s not how it’s built. Right? Correct me if I’m wrong. AI code review tools are built the other way around. Like
[00:12:50] Dex: I tried to get Claude Code to help me play poker with my buddies last night by taking screenshots of the browser and telling me what’s going on, and it outright refused to help me gamble. So I had to tell it I had to convince it that it was a play money game.
[00:13:04] Nnenna: Yeah. There are certain things they won’t let me do. I think I always have that problem with Gemini when I’m like, generate this image, and it just gets so sensitive, which is just ridiculous. But well, actually, you when I was when Itamar was mentioning the whole code review, different types of specialized agents, and it made me think about, like, the whole context concept that we really talk about a lot here in the context engine that we have and how there’s different context types that you need to bring in to refine and improve the actual code review system, like, customized to teams and developers. And you’re a part of how the term the phrase context engineering exists today. And I’m, like, curious, like, do you think that this has been maybe overused? The term has been, I guess, diluted in some way since since then?
[00:13:56] Dex: That’s interesting. I want to say no. I mean, one of the things that actually... like, I do a podcast with my buddy, Vaibhav, called AI That Works, and he’s actually, like, the person who shaped the way I came up with 12-Factor Agents and context engineering. A lot of it was just spending way too much time with Vaibhav. He’s one of the best thinkers on AI systems and AI pipelines, and how do you get a small model to perform as well as a big model, or how do you get a big model to perform way better on a specific task? He’s just, like, I think, one of the best thinkers on those kinds of things. And I was watching an old podcast from, like, a year ago about 12-Factor Agents, and I was like, you know what? I still believe almost all of this. So either I’m getting sloppy and I’ve left myself behind, or it’s one of those things that is, like... I think the takeaway there would be, like, as long as you are working with transformer-based LLMs, yes, the models will get smarter, and they’ll be able to do a harder task with less context engineering required. You can just YOLO-prompt it, and they’ll be able to do it as the models get smarter. But, overall, like, whatever the task is, if you’re able to apply context engineering to it, you will always be able to get better results than someone who is just YOLO-prompting.
[00:15:08] Nnenna: Yeah. That’s a very clear, like, difference.
[00:15:10] Itamar: Actually, plus one on the context. I double down. I think, like, if context engineering was needed a year or two ago, I think now it’s actually one of the biggest things. Like, of course, workflow, etcetera. But the more agents are better at tool usage and developing their workflow as they go, which is getting better and better... I’m not saying, like, you should not do workflow. You should. But the more they’re better at designing and building their workflow, the more one of the most important concepts in building the solution is the context. And it’s not simply, like, throwing everything into files. Like, the file system is all you need. I’ll just, like, generate 1,000 files, give each one of them a header, and that’s it. You need to properly engineer a context. Like, what would you say about that?
[00:15:56] Dex: Yeah. Again, like, agentic search is great. Tools in a loop is great. It’s much better than it was a year ago. But, like, even if you’re doing agentic tools in a loop, like, the smaller you can make those tool cycles, the more you can break down, like, hey. Your agent, rather than being responsible for taking in 100 different types of inputs and calling a thousand different tools, if you can break that down into, like, okay, classify the input and hand it to one of 20 different agents or whatever it is, where each agent only has, like, 30 tools, you can... it becomes more brittle. Right? It’s a trade-off. Like, agent with tools in a loop is like, okay. Give it as many tools as possible. Or, I don’t know, a thousand is overkill. But, like, the point is, like, the more open it is, the more different things you can solve, but the more specific you make it, it’s like, the better you can solve those specific things.
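The classify-then-hand-off pattern from this answer might look roughly like the Python sketch below. Everything named here is an assumption for illustration: the agent labels, the tool names, and `keyword_classifier`, which is a toy stand-in for what would in practice be a single narrow LLM call returning one label.

```python
from typing import Callable

# Hypothetical specialist agents: each one owns a small, fixed tool list.
AGENTS: dict[str, list[str]] = {
    "billing": ["lookup_invoice", "issue_refund"],
    "bug_report": ["search_issues", "create_ticket"],
    "general": ["search_docs"],
}


def route(message: str, classify: Callable[[str], str]) -> tuple[str, list[str]]:
    """Classify the input, then hand it to exactly one narrow agent.

    Labels the router does not know fall back to the general agent
    instead of failing.
    """
    label = classify(message)
    if label not in AGENTS:
        label = "general"
    return label, AGENTS[label]


def keyword_classifier(message: str) -> str:
    """Toy stand-in for the model call so the sketch runs offline."""
    text = message.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "bug_report"
    return "other"


print(route("I need a refund for last month", keyword_classifier))
```

The trade-off Dex mentions shows up directly: each agent sees only its own few tools, which makes it easier to get right, but any input that fits none of the labels lands in the fallback.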
[00:16:46] Itamar: So what I’m actually hearing from you is that you’re kind of, like, still arguing for building workflows. I think there are two topics here. First of all, I’m hearing from you that building workflows, like step one, classify, step two, choose the next agent accordingly, etcetera, is still very important. Am I right? Like, that’s what you’re saying, and do you think it’s still going to be like that? And the second thing, I think, like, context is one of those tools. Context is one of those agents. And, like, would you treat it the same? Like, do you feel it’s the same category as another agent, or would you give it a treatment by itself?
[00:17:24] Dex: I mean, maybe this is where the term is starting to get a little overloaded then. Like, when I say context engineering, I mean, like, what do you deterministically push into the prompt? Or, like, how are you controlling and being mindful of, like, which tools you steer the agent toward to get context. But the word context has become a little bit bigger with MCP, the Model Context Protocol. If you open up the Claude Desktop app, they have a context section, and the context section says it is a list of the MCPs and tool sets you have connected. And so some people say context, and they refer to the bundle of things you have given the model access to, which I would call tools or integrations or harness or whatever it is. So when I say context engineering, I’m thinking more at the granular level of, like, a single call that is going to the LLM, whether it’s the initial, like, system prompt, tools, user message, or it’s turn 50 of an agentic conversation. You’re still caring about every token in, because that is the only thing that influences the tokens you get back, which might be a tool call to edit a file or to go search, look something up in a knowledge base, or to execute a dangerous operation, whatever it is. Like, those tokens coming back, the only thing that controls how good they are and how likely they are to be correct is the tokens you’re putting into the model at that point of the workflow or that point of the agent. I have a question.
[00:18:50] Itamar: Go ahead.
[00:18:50] Dex: Relevant. You all do a lot of context engineering. How do you think about the bitter lesson and, like, this challenge of, like, hey. We do a bunch of context engineering, and then a new model comes out and actually is like, oh, we can just prompt it, and it’ll go gather its own context or figure it out itself. And now all the code we wrote is irrelevant.
[00:19:07] Itamar: Sure. First of all, I am absorbing your definition. I think I only now got it, to be frank. Like, you’re saying context engineering is more of a concept where every time you’re making an LLM call, you want to make sure you’re organizing what information and context you’re putting into it in a way that is more deterministic, that is steering in the directions that you want the agent to go. I’m with you on that. Maybe I’m mixing in as well the concept of memory or database, etcetera. And that’s where, at least for Qodo, we think it’s critical to accumulate that. So roughly speaking, there is the context engineering that is just in time, like, at inference time, and there is sleep-time compute that is happening to gather that context. And that’s part of what we call context engineering. Maybe that’s something that we should reframe as memory or some engineering work. And the reason it’s so important is because when you’re doing AI code review, just as an example of software development in the era of AI, you want to bring the subjectivity that teams and individuals have into the review. Quality is subjective. Like, there is some global agreement about what is good and bad in code, but a lot depends on the specific repo and the specific use case and even on the specific people that are owning the repo, and that’s how they want it if they want to take ownership. And then, like, we’re gathering a lot of context from PR discussions, from user interactions, and we want to accumulate that over time and create that tool that is in charge of enriching the context with history that was accumulated and agreed upon. At least for us, that’s how we include data in a context engine, but maybe the right definition is context engineering and memory or collection or so. Would love to hear your thoughts about it.
[00:21:15] Dex: Yeah. I mean, memory is a really interesting one. And, like, I think a lot of the really sophisticated memory implementations I’ve seen are quite deterministic. Like, they’re not they’re not tool-based at all. Like, some of the early memory things were, like, agent could, like, record memory and then search memories, and it was a very simple, like, RAG-based thing. And then I’ve seen some really sophisticated ones where it’s like every time a conversation ends, it gets handed to another model to do inference on that conversation and compact it into memories. And then you have sleep-time compute, which is, like, basically doing the they call it, like, decaying-resolution memory where you have, like, summaries for the last 14 days, and then you have weekly summaries for the preceding six weeks, and then you have monthly summaries going back in time forever. And every night, you take the oldest day summary and compact it into the last week, and you’re constantly, like, shifting and resummarizing this whole set. And then that all just gets deterministically injected in. So the model who is doing the conversation has no way to modify its own memory. It is deterministic what gets injected in, and the way it gets compacted and recorded is a completely separate context window. So you’re not asking the model who’s trying to help the user with their task to also think about, like, this complex thing of, like, make really good choices about what gets recorded into memory. Like, it’s just a completely separate prompt with a separate model that is making all those decisions.
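The decaying-resolution scheme Dex outlines can be sketched in Python as follows. This is a rough sketch under assumptions, not a description of any particular product: the class and method names are invented for illustration, a "month" is simplified to four weekly summaries, and `summarize` stands in for the separate model call that does the compaction. The window sizes (14 daily summaries, then six weekly ones, then monthly summaries kept indefinitely) follow the numbers in the conversation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Bucket:
    text: str
    parts: int  # number of lower-level summaries folded into this one


@dataclass
class DecayingMemory:
    # Stand-in for a separate model call with its own context window.
    summarize: Callable[[list[str]], str]
    max_days: int = 14
    days_per_week: int = 7
    max_weeks: int = 6
    weeks_per_month: int = 4  # simplifying assumption
    daily: list[str] = field(default_factory=list)      # oldest first
    weekly: list[Bucket] = field(default_factory=list)  # oldest first
    monthly: list[Bucket] = field(default_factory=list)  # never dropped

    def _fold(self, buckets: list[Bucket], text: str, capacity: int) -> None:
        """Re-summarize text into the newest bucket, or open a new one if full."""
        if buckets and buckets[-1].parts < capacity:
            newest = buckets[-1]
            newest.text = self.summarize([newest.text, text])
            newest.parts += 1
        else:
            buckets.append(Bucket(text, 1))

    def end_of_day(self, day_summary: str) -> None:
        """Nightly job: add today's summary and push the overflow down a level."""
        self.daily.append(day_summary)
        if len(self.daily) > self.max_days:
            self._fold(self.weekly, self.daily.pop(0), self.days_per_week)
        if len(self.weekly) > self.max_weeks:
            oldest_week = self.weekly.pop(0)
            self._fold(self.monthly, oldest_week.text, self.weeks_per_month)

    def render(self) -> str:
        """Text injected into the prompt; the chat model cannot modify it."""
        lines = [f"[month] {b.text}" for b in self.monthly]
        lines += [f"[week] {b.text}" for b in self.weekly]
        lines += [f"[day] {d}" for d in self.daily]
        return "\n".join(lines)


# Toy summarizer so the sketch runs offline; a real one would call a model.
mem = DecayingMemory(summarize=lambda parts: " + ".join(parts))
for i in range(1, 58):
    mem.end_of_day(f"d{i}")
print(mem.render())
```

The conversational model only ever receives the output of `render()`. Deciding what to record and how to compact it happens in `end_of_day`, outside the conversation, which is the separation Dex emphasizes.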
[00:22:37] Itamar: I agree about that. I think the purpose of a memory is to reduce variance. I can’t really, like, see how it is, like, completely deterministic, but the whole idea is grounding with information that was over time accumulated and agreed upon during, like, computation that has been before inference and user interfaces that are meant to, like, create alignment, etcetera. Like, I think I think that’s that’s completely, like, awesome. And I think, like, there’s I believe that’s it’s still, like, a big portion of advancement that we’re going to see in the future is around memory and context engineering, etcetera, on on that sense. Nice.
[00:23:18] Nnenna: I’ve seen a lot of folks actually talk about memory now more than ever. I think there’s one guy who’s, like, a director of developer relations at Oracle, and, like, their main push is, uh, memory.
[00:23:31] Dex: Yeah. I we’ll see where it goes. I a lot of I actually someone someone pulled me over on the floor at AI Engineer Miami, and he was talking to somebody. And he was just like, hey, Dex. Come over here. I was like, what do you think about memory for AI agents? And I was like, well, I think there’s a lot of interesting off-the-shelf memory systems. I think if you pull an off-the-shelf memory system, it’s going to be similar to an off-the-shelf agent, like, agent SDK or agent harness where, like, if you grab the thing off-the-shelf, it’ll get you to 80% really fast. And then once you hit a wall, you’re going to end up reverse engineering the whole thing to figure out, like, okay, where do these tools get injected? Where does this come from? How are sub agents called? And, eventually, you’re going to just end up building it yourself. And so I think any good implementation of memory like, we haven’t found the right primitives because everything is changing so fast, and there’s so many different approaches that work well that, like, I admire the people working in the memory field trying to build, like, horizontal tooling for agent memory. It must be incredibly hard because every app I’ve seen that is, like, high quality actually selling same thing with same thing with context engineering and agents. It’s like, if you are actually building a really high quality product, you are in the weeds really engineering every single token in and out of your memory system to make sure that it solves your exact use case.
[00:24:52] Itamar: Agree. Totally agree. So it looks like you have a lot of experience, obviously, in building, like, real-world systems that scale and go to production. You mentioned working with Fortune 500, etcetera. I’d love to hear more about that. Like, our audience loves to hear, like, how to bring, you know, agents to production, and it could be auth. Like, how do you do, like, authentication with agents, like, different
[00:25:18] Dex: Alright. I think we’re good now.
[00:25:20] Itamar: Privacy, etcetera. Whatever you like to share in that front will be interesting.
[00:25:24] Dex: Okay. Cool. I guess my question is, like, what about PR reviews? When you go to when you go to sell someone AI code reviews, like, what is the pain point that you all are most addressing? Because I’m I’m really curious about, like, how people find their way to I mean, obviously, like, anything that AI AI can do that a human was doing before that AI can do as well or almost as well is like, great. Hell, yeah. Like, spend tokens on that instead of spending human hours. But I’m curious, like, when people that’s like, okay. Let me just be more efficient everywhere. That’s the vitamin category. Like, for whom is your product a painkiller? And, like, what is the what is the pain that you’re solving? For whom is, like, AI code review a burning fire? And they’re like, we have to get this in ASAP.
[00:26:06] Itamar: So code quality has been a trillion-dollar problem since the time of ancient Greece. It’s always a problem. Like, you want fewer issues in production. You don’t want to go back again and again to fix, like, issues and bugs and technical debt, etcetera. But the thing is that if you want to cross some levels of velocity, then you have to treat that bottleneck in a more serious way. Like, once you get to a point where you’re opening more PRs than your human developers can review, and they cannot go line by line, you want to increase automation on quality. And then it’s not just, you know, a vitamin, it’s a painkiller. Now, the stakeholder that you’re supposedly solving for is the technical reviewer who is reviewing one PR. But, actually, you’re solving also for the dev team, the organization, to collect their tribal knowledge over time. And that’s why the idea about memory is very important for AI code review. You’re solving for the entire team to collect their tribal knowledge over time such that eventually, the AI code review could provide feedback at the level of not only a senior developer, but a senior developer that has some tenure and experience in your company, in your team. And that’s not trivial to do. And the last thing is that once those PRs are at scale, like, in some of our clients and customers, and not only for us, for others, we’re seeing, like, every two months, 2x the amount of PRs. It’s not only about automating that review per pull request. It’s also looking at all the pull requests as a whole. Like, as a tech lead, as a manager, I want to understand how my rules are doing, which rule is actually being violated and caught, and which rule is not, etcetera.
So that holistic view of what extra guardrails you could put in place to further automate across your teams, and how one team is doing versus another, is, like, superpowers that we didn’t have before. Like, tell me how many companies you know that have the level of policies of Uber or Google, and they’re actually being enforced. You couldn’t do that. Yeah. And now you actually must. So this is, like, how we see the importance. And the market is, like, catching on to that, like, where we’re seeing, like, it crossed from the visionaries to the early adopters, and it’s getting to the mass market. Like, by 2026, I think that at least half of those clients that have code generation and are harnessing it, not to the extreme but really well, would also have an AI code review tool or two. That’s how we’re seeing it.
[00:29:11] Dex: Yeah. I think that makes sense. I think your point about the bottleneck is really interesting. Like, you know, we do this RPI thing, and one of the first things we did back in, I want to say, like, September is we built, like, a software factory version of RPI, where, like, you create a Linear ticket and you assign it to some agent, and it will take it through the RPI process. But instead of just being, like, dev to code review, it had, like, research in progress, research in review, plan in progress, plan in review. And only then would it go to in dev. And so we’d run through the backlog, and we’d find all the small stuff, and we’d just be like, cool. Like, that’s easy. Just assign that to the agent. Just assign that to the agent. We go through and, like, you know, 80% of our thing is just assigned to the agent. It would, like, flow through. We check it at the end. And what we realized was all we had done was convert our backlog from, you know, 40 tickets of, like, to do to 40 tickets of research in review. And so it’s like no matter what the bottleneck is, whether it’s the research or the design or the plan or the code review, you’re just shifting the bottleneck to some other place where it’s like, okay. Cool. The agent can do all the coding, but we still have to decide if this is the right approach, if the code is high quality, all of this. And it’s probably faster, but it’s like, yeah. “We’re drowning in PRs” is a common problem I hear. And my one counterpoint here, which I will push us to, like, acknowledge as an industry, is that you actually don’t have too many PRs. What you have is too many bad PRs. If the PR is perfect, it’s actually a joy to review. And the problem is even if a PR only needs, like, let’s say, 20% rework, and in the age of AI, it’s usually more like 50% or more, or it’s just complete crap.
If you just gave it to an agent to YOLO it through for any, like, reasonable-sized task, not, like, a little tiny copy change or something, it, like, usually needs... And so whether it’s a human writing the code or an agent writing the code, it’s always, like, an intellectual burden on the reviewer. And then if a human is behind it, then it’s an intellectual burden on the reviewer and the submitter to work through that review, and even if it’s just 20%, that’s a significant burden. And if it’s two humans, it’s also a significant, like, emotional burden, whether the PM wrote the ticket and the engineer is reviewing it or whatever it is. And so what we have built internally is, we don’t do a ton of, like, AI code review yet. We’re focused more on, like, what can we do upstream from the code being written to make sure that most of the PRs that come through need less than 5% rework? Like, that the decisions are made, and we’ve kind of aligned as a team on, like, a 200-line Markdown doc for a 2,000-line code change, which is, like, much higher leverage. So we have this design doc as part of our process, which is, like, kind of in the middle. It happens after the research but before the plan gets written, and it’s, like, the very high-level, here’s the end-state architecture. Here’s all the decisions I’ve made about how it’s going to look. It’s an optional review step in our process. We don’t require people to get their designs reviewed before they go to pull request. We’re a three-person team, so we can kind of be a little bit more loose there. But I am incentivized. I almost always send Kyle, my CTO, my design docs, because I would rather have him tear apart a 200-line Markdown doc. It’s easier for him to do, and easier for me to go change, before I’ve written all the code and gotten the tests passing and, like, polished it and gotten it to where now I’m attached to it. And now if I need to change it, I’m like, oh, fuck. I don’t want to change it.
And now if someone is asking for changes, like, now I got to ask him to change it. You know what I mean? And so it’s like, how do you optimize your SDLC, and can you apply AI upstream from the code review itself in order to, like, ensure that you have fewer surprises at PR time, and all of your PRs are almost perfect?
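The ticket flow Dex describes can be sketched as a small state machine. This is only an illustration, not HumanLayer's implementation: the stage names come from the conversation, while the `DONE` stage, the `HUMAN_GATES` set, and the helper functions are hypothetical. It shows why handing 40 tickets to an agent only moves the queue to the first human review gate.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    # Stage names follow the transcript; DONE is an assumed terminal stage.
    RESEARCH_IN_PROGRESS = "research in progress"
    RESEARCH_IN_REVIEW = "research in review"
    PLAN_IN_PROGRESS = "plan in progress"
    PLAN_IN_REVIEW = "plan in review"
    IN_DEV = "in dev"
    CODE_REVIEW = "code review"
    DONE = "done"


# Hypothetical: stages where a ticket waits on a human decision, not the agent.
HUMAN_GATES = {Stage.RESEARCH_IN_REVIEW, Stage.PLAN_IN_REVIEW, Stage.CODE_REVIEW}

ORDER = list(Stage)


def advance(stage: Stage, human_approved: bool = False) -> Stage:
    """Move a ticket one stage forward; review stages need human approval."""
    if stage is Stage.DONE:
        return stage
    if stage in HUMAN_GATES and not human_approved:
        return stage  # the agent cannot push past a review gate on its own
    return ORDER[ORDER.index(stage) + 1]


def run_agent_until_blocked(stage: Stage) -> Stage:
    """Let the agent advance a ticket until it reaches a human gate."""
    while stage not in HUMAN_GATES and stage is not Stage.DONE:
        stage = advance(stage)
    return stage


# 40 tickets assigned to the agent all pile up at the first review gate.
backlog = [Stage.RESEARCH_IN_PROGRESS] * 40
after = Counter(run_agent_until_blocked(s) for s in backlog)
print(after[Stage.RESEARCH_IN_REVIEW])  # 40
```

The agent's throughput never appears in the result. The count at each gate depends only on how fast humans approve, which is the bottleneck shift described above.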
[00:32:57] Nnenna: Yeah. That is, like, literally what a big part of my talk was about at AI Engineer Miami. I was like, okay. If people can just walk away remembering anything, it’s, like, code quality practices, like, early and often, just to reduce all of the stress or all of the problems that could arise later on down the line at code review. And there are many different ways to do that, and what you just described was definitely one of them, on top of, and underneath that layer, just the way in which you are leveraging AI and the context.
[00:33:34] Dex: Yeah. And, I mean, this is what’s so exciting about, you know, hey, we’re going to build a context and memory layer. Like, let’s capture the human taste and judgment from all the senior engineers in your org and then expose that to an AI that can then go apply it to a problem. Like, that same knowledge is just as valuable, maybe even more valuable, in the design stage as it is in the code review stage.
[00:33:58] Itamar: One, I agree. Two, but. I’ll explain. I think, like, what you’re saying, which I completely agree with, about the value of spending more time on the research and the spec, if we can call it that, makes sense. Actually, we can even claim that this is better engineering from the get-go regardless of AI. But to some extent, if you take the spec into the smallest details there are, it’s basically coding. Right? So if you mention for every part which rules it needs to follow, which, like, error handling it needs to catch at the level of, like, if and else, then you’re writing code. So then, practically, you’re, like, adding another stage. Like, if you’re looking at the future, I am adding to what you presented about how the future should look, putting effort into upstream. At the same time, we should put effort into the downstream, not in reviewing, but rather in writing down what our rules are, what our guidelines are, how we are expecting AI code review to review the code. So, eventually, not only will we get a much better percentage of good-quality code, but we’ll also get a much better percentage of high-quality review by AI, and then we streamline.
[00:35:18] Dex: Interesting.
[00:35:19] Itamar: Perfect, like, cases where we did a really proper job in writing the spec and a proper job in writing rules and guidelines and skills for reviewing, having this multi-agent system for review. And then we’re left with the AI raising, like, issues that need human attention, etcetera. I’m not claiming it for right now, but I actually do think we can get to this within 12 months or so. That’s how I see it.
[00:35:48] Dex: So this is why I hate the word spec, because you have said spec and meant a completely different thing. I don’t use the word spec because the word spec is completely semantically diffused. For some people, it’s like a PRD.
[00:35:58] Itamar: For some people, it’s a very detailed PRD.
[00:36:01] Dex: For some people, it’s a detailed PRD. For some people, it is, like, what you talked about, of, like, down to the if/else level, like, it’s basically just code written in English. I mean, we had this problem. I talked about this at AI Engineer as well. We were reviewing the plan file, and the plan was a thousand lines of, like, line by line, here are the files we’re going to change, what order we’re going to change them in. And then people would review those, and then they would have to go write the code. And there would be a couple surprises. And so the end code state would be 20% different from what was in the plan, either because it was poorly specified or because it was wrong or because there was a type error, whatever it is. And then you’re reviewing the code. And so, like, you’re looking for leverage. Right? Any specification that’s going to take the same amount of time to read as the code itself, just read the code. But if you can find a type of document that describes what you want at, you know, a 20% compression ratio to the actual code, then you have leverage because then you’re making decisions at a higher level. And so, yeah, I don’t know. I’m making three comments in one go. But, like, yes, I agree with what you’re saying of, like, if your Markdown is too detailed, then no one should read it. You should just read the code that comes out the other end.
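The leverage argument comes down to simple arithmetic, sketched here under stated assumptions. The function names and the 0.20 cutoff (taken from the "20% compression ratio" remark) are illustrative, not a rule from the conversation, and the 1,000-line code change in the last example is a hypothetical size, since the transcript gives only the length of the plan.

```python
def compression_ratio(doc_lines: int, code_lines: int) -> float:
    """Lines of review document per line of resulting code."""
    if doc_lines < 0 or code_lines <= 0:
        raise ValueError("doc_lines must be >= 0 and code_lines must be > 0")
    return doc_lines / code_lines


def has_leverage(doc_lines: int, code_lines: int, max_ratio: float = 0.20) -> bool:
    """True when the document is short enough, relative to the code, to be worth reviewing first.

    max_ratio is an assumed cutoff based on the 20% figure mentioned in the episode.
    """
    return compression_ratio(doc_lines, code_lines) <= max_ratio


# The 200-line design doc for a 2,000-line change mentioned earlier in the episode.
print(compression_ratio(200, 2000))  # 0.1
print(has_leverage(200, 2000))       # True

# A 1,000-line plan for a (hypothetical) 1,000-line change: just read the code.
print(has_leverage(1000, 1000))      # False
```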
[00:37:12] Itamar: Awesome. We’re aligned. That’s not good, so we need to change the topic.
[00:37:18] Dex: Oh, do you want to fight? I’ve got lots of spicy things we could argue about. Let me go get... no, I’m just kidding.
[00:37:23] Nnenna: I think one of the things I wanted to talk to you about is, like, in an engineering organization or a company, what type of responsibility would senior engineers or those above them in engineering leadership have when it comes to closing the gap on any AI slop or AI code quality issues? Like, who is responsible for that apart from the actual engineer in their day-to-day life? And in which ways do you think that they’re responsible?
[00:37:56] Dex: This one’s tricky. Because in the old world, you could say, like, okay. If you ship a bug, you have to fix it. You own it. Right? Like, as a senior engineer, it’s not my job to go clean up after the mistakes of the junior engineers. It’s my job to educate and mentor and coach them and hold them accountable so that they actually learn. And nowadays, it’s like, okay. If there’s a bug, I fix it. And, like, there’s this new meta layer, which is like, am I shipping more bugs every... right? It’s like, okay. Cool. There’s a bug. I told the AI to fix it, and we shipped it. Oh, but I broke something else. And there’s a bug, and I told the AI to fix it, and it shipped it. And, like, I’m taking ownership of the things that break, but I’m not taking ownership of, like, the velocity of actual valuable new work or, you know, the amount of codebase churn where we’re like, okay, I make a change here and... Historically, that has been the job of architects and senior engineers. It’s like, okay. How can we design the system, and how can we put guardrails in code review or whatever, using my, like, accumulated taste and judgment from ten years of writing software and writing bad software and getting paged at three in the morning and seeing all the patterns that I hate, and making sure that we don’t introduce those? So, like, how do I hold people accountable for noticing that higher-up layer? It’s almost like, as a junior engineer, your job is now to wear the senior engineering hat of, like, how do I make the system safer so that when I prompt the model, it just goes?
And, like, maybe everybody is a software factory engineer now, but it’s like, as a junior engineer, you don’t have that accumulated... like, Mario Zechner, the Pi guy, has this thing of, like, you should embrace the friction and you should slow down, because the things that are hard to understand, if you don’t understand them and you just let the model YOLO it, then you never learn. The friction is the part of the process where you learn.
[00:39:45] Nnenna: Yeah. Completely agree. I feel like when we talk about engineers and what they need to do with AI, and what you just said about the architects, I’m just like, oh, yeah. Like, are we talking about what architects are responsible for now in this age? Because we talk about software engineers co-architecting, and I say this myself, co-architecting with, like, Codex or Claude Code when we’re in planning mode or whatnot. But now, like, what does an architect do moving forward in this? I mean, I don’t know. I’m curious to hear both of your thoughts on this.
[00:40:22] Itamar: Sure. I could start, in short. I think, like, it’s actually writing the skills and the architecture files and things like that, which can be harnessed by AI. I think that’s exactly the, you know, next step that we can expect at this level of automation. I mean, like, even twenty years ago, I worked at Mellanox in hardware development. There were, like, specs all around, but you actually needed to know where to look for them, where to fetch them, pull them up on time, etcetera. And I think that these days, like, tech leads and architects can actually leverage AI even further in writing those, and then these, like, documents or rules, etcetera, are probably going to be fetched more often by the coding agents, for example, by the AI code review tools, etcetera. So I actually think, like, if we go back to what we said, like, ten, fifteen minutes ago, like, let’s do more planning and designing and, sorry for swearing here, specs, which is the best practice anyway, the same thing applies here. The more you write down in an AI-native new world, make it a skill, make it, like, approachable to harness with AI, the better your job will be. And from the get-go, I love, Dex, what you said, that we want those clean PRs from the get-go. That’s my feeling about it.
[00:41:48] Dex: The cleanest way to put it is, like, the software engineering role hasn’t changed much. It’s completely different, but it also hasn’t changed. It’s like your job has gone from writing high-quality working code that scales and is maintainable and is evolvable over time to producing high-quality working code. You’re not actually writing the code, but it’s still your job to produce code, to ship. I mean, I think it was Guillermo Rauch who said this whole thing of, like, yeah, shipping is different from coding. Coding is writing the program. Shipping is, like, getting it deployed, testing it, monitoring it in production, getting feedback, iterating on it, like, making it really, really good. And all of that is, like, you can automate parts of this, and you could wire up your support queue and your Sentry and your Datadog. You can, like, build this feedback loop where just everything that breaks goes back to a model and the model goes and ships it. The problem is, like, I still don’t see a path to models being really good at software architecture. And you can do it with skills, and you can, like, bake those opinions into skills. But at the end of the day, like, the thing that took, like, LangChain, CrewAI, all these frameworks that were in vogue eighteen months ago, and, like, the gap between that and Claude Code is, like, reinforcement learning and, like, training the model on the harness it’s going to run in. And that was the first time you got really, really good, reliable, consistent tool calling. And you can RL a model. You look at, like, SWE-bench and SWE-bench Multilingual. Like, you can RL a model on getting the tests to pass. You can RL a model on writing good code that solves a problem. What you can’t RL a model on today, or what is, like, hard to evaluate deterministically in a, like, reinforcement learning gym, is: is the code well architected? Is it designed well? The only thing you can do there is ask another model, is it designed well?
And that’s super nondeterministic, and you could, like, bias it one way or the other. So one thing we’re kind of toying around with at HumanLayer is, like, a benchmark for, like, long-term code maintainability where, like, the test dataset is, like, the model actually gets, like, 20 issues, and it doesn’t know the road map in advance. It’s just, like, feature number one, implement it. Like, do that for, like, 20 features and then hand that codebase to a smaller, dumber model and see if that model can implement the 21st feature and just see how it does. And then you see, okay, what does Codex produce after 20 features? What does Opus produce after 20 features? And then you run a SWE-bench on both of those codebases to add another three or four or five features, and you have, like, challenges there. So, like, I don’t know exactly what it looks like, but I want a way to basically, like, evaluate, does this code get worse over time, and do models get worse at working in this codebase over time, versus just this one-off, like, here’s a PR from Django. Can the model reproduce it? So I don’t know. That’s what I’m excited about. I think that’s what’s coming next.
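A rough sketch of the benchmark loop Dex outlines, under heavy assumptions: Dex says he does not know exactly what it looks like, so every name here (`Agent`, `Grader`, `evaluate_builder`, the toy builders and grader) is hypothetical. Real agents and test suites are replaced by deterministic stand-ins so the control flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a "codebase" is a list of strings, an agent is a
# function that returns a new codebase after attempting a feature, and a
# grader says whether the feature works (for example, whether tests pass).
Codebase = List[str]
Agent = Callable[[Codebase, str], Codebase]
Grader = Callable[[Codebase, str], bool]


@dataclass
class MaintainabilityResult:
    builder: str
    followup_passed: int
    followup_total: int

    @property
    def pass_rate(self) -> float:
        return self.followup_passed / self.followup_total if self.followup_total else 0.0


def evaluate_builder(name: str, builder: Agent, weaker_agent: Agent,
                     grader: Grader, roadmap: List[str],
                     followups: List[str]) -> MaintainabilityResult:
    """Build features one at a time, then see how a weaker agent fares on the result."""
    codebase: Codebase = []
    for feature in roadmap:  # the builder never sees the road map in advance
        codebase = builder(codebase, feature)
    passed = 0
    for feature in followups:  # each follow-up starts from the same built codebase
        attempt = weaker_agent(list(codebase), feature)
        if grader(attempt, feature):
            passed += 1
    return MaintainabilityResult(name, passed, len(followups))


# Toy agents: one builder keeps the codebase clean, the other leaves a hack per feature.
def tidy_builder(codebase: Codebase, feature: str) -> Codebase:
    return codebase + [feature]


def messy_builder(codebase: Codebase, feature: str) -> Codebase:
    return codebase + [feature, f"hack-for-{feature}"]


def weaker_agent(codebase: Codebase, feature: str) -> Codebase:
    return codebase + [feature]


def toy_grader(codebase: Codebase, feature: str) -> bool:
    # Assumed rule for the demo: too many accumulated hacks break new work.
    hacks = sum(1 for item in codebase if item.startswith("hack-for-"))
    return feature in codebase and hacks < 10


roadmap = [f"feature-{i}" for i in range(1, 21)]
followups = ["feature-21"]
tidy = evaluate_builder("tidy", tidy_builder, weaker_agent, toy_grader, roadmap, followups)
messy = evaluate_builder("messy", messy_builder, weaker_agent, toy_grader, roadmap, followups)
print(tidy.pass_rate, messy.pass_rate)  # 1.0 0.0
```

The point of the structure is that the score is assigned to the builder, but it is measured on the weaker agent's follow-up work, so it reflects how the codebase holds up over time instead of a one-off task.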
[00:44:34] Nnenna: That’s super exciting. I don’t know if you wanted to say anything else, Itamar, but I realized we went over time. And I think that was actually a perfect way to wrap this up. It kind of hit, like, all of the points: where we’re headed, you know, what you’re excited about, what you see in the future, what you guys are building and trying to test. So that was wonderful.
[00:44:53] Itamar: Really enjoyed it. Thank you so much for being on our show.
[00:44:54] Dex: That was great. If you want to try our stuff, humanlayer.dev, we have a wait list up. We will be launching in the next couple weeks. I know I’ve been saying that for a couple weeks, but it’s actually freaking happening. So if you want to get hands-on with the Crispyprompt or try our IDE, we’re getting ready to give that to a lot more people. So if you want to learn more, please go check that out. I run a podcast called AI That Works, which has terrible SEO. But if you type a unicorn emoji and then AI That Works, you will probably find the podcast. That’s the secret sauce. This was super fun.
[00:45:24] Nnenna: Where do people find you most? I know this answer, but I want others who are listening to know this answer. Where are you yapping the most?
[00:45:32] Dex: I yap the most on Twitter. Come find me on Twitter, twitter.com/dexhorthy. I’m sure we’ll put it in the show notes or whatever. But, yeah, come shout at me. Tell me what I’m wrong about. Tell me what you liked. Tell me what you hated. I love chatting and learning with the community here.
[00:45:45] Nnenna: Love it. Awesome.
[00:45:46] Itamar: Thank you.
[00:45:47] Nnenna: Thank you.
[00:45:48] Itamar: If today’s conversation challenged how you think about AI and code quality, that’s the point.
[00:45:54] Nnenna: At Qodo, we believe that independent, context-aware code review with rules as guardrails is how engineering teams maintain standards at scale.
[00:46:03] Itamar: If you’re leading an enterprise team and want to see how intelligent AI code review can reinforce governance, visibility, and accountability in your workflow, visit qodo.ai to learn how we help teams turn AI productivity into production-ready quality.
[00:46:20] Itamar: And if you enjoyed this episode, subscribe, share it with your engineering leadership circle, and leave us a review.
[00:46:26] Itamar: Until next time, keep humans in the loop.
[00:46:28] Nnenna: And keep shipping.
About the hosts