
Rethinking Code Review for the AI Era

AI is accelerating code generation faster than we can review the code. I'm wondering whether we're reviewing the wrong thing at the wrong time.
[Diagram: a development workflow in which the Review step is moved from after Implementation to before it, labeled "shift left".]

Pull requests have always seemed to me like a great idea in theory that just doesn't work smoothly in practice. We create them faster than we review them, complex ones might wait days for a review, and it's often unclear what the goal of the review should be: knowledge sharing, scrutiny, or a quick sanity check.

Now add coding agents to the mix. AI is accelerating how fast developers can ship code, but it's not accelerating how fast humans can review it. PRs seem to be getting larger, not smaller, and larger PRs take exponentially longer to review. We're making the problem worse.

So I am wondering: what if the answer isn't improving the review process, but dropping human code reviews as the default entirely? What would the consequences be if we relied on automated reviews instead?

I've been thinking about this a lot as we build Aonyx, where we're using coding agents heavily ourselves. I don't have all the answers yet, but here's the model I'm exploring.


Are Pull Request Reviews Really a Bottleneck?

Let's first look at the data and confirm whether pull request reviews really are an issue. After all, the problem might be me and not the reviews. Sadly, though, the data is quite clear on this:

  • A LinearB study of a million PRs across around 25,000 developers found that it takes about five days to get through a review and merge the code.
  • Another LinearB study found that pull requests are waiting on average 4+ days before being picked up, and PR review pickups are the number one bottleneck in cycle time.

What makes this even worse is the relationship between PR size and review time. The data suggests it's not linear - it's probably closer to exponential:

  • The same study by LinearB found that cycle time and idle time doubled for pull requests of 200 lines of code compared to pull requests with 100 lines of code.
  • And Propel found that each additional 100 lines of code increases review time by 25 minutes.

This tracks with my experience. Small PRs are easy to reason about and quick to approve. Large PRs sit in the queue because nobody wants to context-switch into a 500-line diff, and when someone finally does, they're more likely to skim than scrutinize.
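
To get a feel for how quickly this compounds, here's a rough back-of-envelope sketch in Python. It compares a linear model against a doubling model for review time; the parameters (25 minutes per 100 lines, doubling with every additional 100 lines) are illustrative assumptions loosely inspired by the figures above, not numbers taken from either study.

```python
# Back-of-envelope comparison of two review-time models for a PR of a
# given size. The parameters are illustrative assumptions, not data
# from the LinearB or Propel studies cited above.

def linear_model(loc, minutes_per_100=25):
    """Review time grows by a fixed amount per 100 lines of code."""
    return minutes_per_100 * loc / 100

def doubling_model(loc, base_minutes=25):
    """Review time doubles with every additional 100 lines of code."""
    return base_minutes * 2 ** (loc / 100 - 1)

if __name__ == "__main__":
    print(f"{'LOC':>5} {'linear (min)':>14} {'doubling (min)':>15}")
    for loc in (100, 200, 300, 500):
        print(f"{loc:>5} {linear_model(loc):>14.0f} {doubling_model(loc):>15.0f}")
```

Whether the real curve is exponential or merely steeply superlinear, the takeaway is the same: past a few hundred lines, review time stops scaling gracefully.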

Are Coding Agents Making It Worse?

My gut feeling is that coding agents push larger units of work than humans typically do. When I work without an agent, I can make fine-grained commits and push them one by one. But when I hand a feature to an agent, I'm not going to review and push every small task–I'm going to push the whole implementation. That naturally means bigger pull requests.

The early data supports this. Jellyfish found that pull requests are getting about 18% larger as AI adoption increases. And a study of Claude Code PRs on GitHub found that over 27% of agent-authored PRs combined multiple tasks, with "too large" being among the top three rejection reasons.

This compounds the bottleneck. If larger pull requests take exponentially longer to review, and agents push larger pull requests, then agents aren't just increasing the volume of code–they're increasing it in the shape that's hardest to review.

Why Optimization Doesn't Help

So where does this lead us? If AI-generated code is already creating review bottlenecks, and AI capabilities are only accelerating, what does the future look like?

The trajectory seems clear: coding agents will generate more code–not just more pull requests, but larger ones. As AI coding agents become more capable, they'll tackle bigger features, refactor more aggressively, and iterate faster. The volume of code flowing into review queues will keep growing.

The natural response, and what I've certainly done in the past, is to optimize our way out. Make reviews faster by providing more context to reviewers. Streamline our processes. Create more dedicated review time. Build better tooling. And yes, these help—but they don't solve the fundamental problem.

Because even if we get twice as fast at reviewing code, and AI lets us generate twice as much code, we end up spending the same amount of time reviewing. The productivity gains from code generation get entirely consumed by the increased review burden. We're running faster just to stay in place.

In fact, it's likely worse than that. As review volume increases, quality probably degrades: reviewers get fatigued, and context switching adds overhead. The review queue will grow, delays will compound, and the bottleneck will tighten.

We can't optimize our way out of this. Optimization keeps us playing the same game at higher speed. What we need is to change the game itself.

If reviewing generated code is the bottleneck, maybe we're reviewing the wrong thing. Or more precisely: maybe we're reviewing at the wrong time. The answer isn't to review code faster—it's to review something else, earlier in the process, before the expensive code generation happens at all.

Shifting Review Left

Here's the idea I want to explore: instead of reviewing code after it's written, we review the specification before any code gets generated.

This works well with spec-driven development, an approach that's been gaining traction with coding agents. The idea is that you write a detailed specification–scope, approach, expected behavior, edge cases–and hand that to an agent to implement. In this model, the spec becomes the artifact that gets reviewed, not the code. Your team aligns on what you're building before any code exists, which is when feedback is cheapest to act on.

Even without full spec-driven development, this could be as simple as writing a GitHub Issue thoroughly enough that it can be handed to a coding agent. The team iterates on the issue together, and once there's alignment, the developer takes it from there.

The key shift: we trust the developer to verify that the implementation matches the spec. They're responsible for reviewing the AI-generated code, running tests, and confirming it does what it's supposed to do. We don't gate on a formal PR review by another teammate, because we already agreed on what this feature should be.

Human code review becomes the exception, not the rule. Instead, we rely on automated checks running in CI - linting, tests, type checking, security scanning. We cover as much as we can with deterministic, rule-based approaches. For security-critical paths or sensitive data, tools like CODEOWNERS can require human review on specific files or directories.

I'm also curious about using AI to check whether the implementation matches the specification. If an AI can review the code against the original spec and request a human review when something doesn't align, that adds another layer of confidence without blocking on human time for every PR.
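
To make that concrete, here's a minimal sketch of what such a check could look like as a CI step: it asks an LLM whether the diff matches the spec and requests a human review only when it doesn't. Everything in it is an assumption for illustration: the SPEC.md file, the PR_NUMBER and FALLBACK_REVIEWER values, the prompt, and the OpenAI-style model. It's a sketch of the workflow shape, not a finished tool.

```python
# spec_check.py - illustrative sketch, not production code.
# Asks an LLM whether a PR's diff matches its spec; if they don't align,
# it requests a human review via the GitHub API instead of blocking
# every PR on a human by default.
import os
import subprocess

import requests
from openai import OpenAI  # assumes OPENAI_API_KEY is set

REPO = os.environ["GITHUB_REPOSITORY"]      # e.g. "org/repo" (provided by GitHub Actions)
PR_NUMBER = os.environ["PR_NUMBER"]         # passed in by the workflow (placeholder)
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
FALLBACK_REVIEWER = "some-teammate"         # placeholder username

spec = open("SPEC.md").read()               # the spec the team already agreed on
diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",                    # placeholder model
    messages=[{
        "role": "user",
        "content": (
            "Does this diff implement the spec below without going out of "
            "scope? Answer MATCH or MISMATCH, then explain briefly.\n\n"
            f"SPEC:\n{spec}\n\nDIFF:\n{diff}"
        ),
    }],
).choices[0].message.content

print(answer)

if not answer.strip().upper().startswith("MATCH"):
    # Escalate: ask a human to review this PR instead of letting it through.
    requests.post(
        f"https://api.github.com/repos/{REPO}/pulls/{PR_NUMBER}/requested_reviewers",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"reviewers": [FALLBACK_REVIEWER]},
        timeout=10,
    ).raise_for_status()
    raise SystemExit("Spec mismatch: human review requested.")
```

How reliable a simple MATCH/MISMATCH prompt turns out to be is exactly the kind of thing that needs testing in practice; the interesting part is the shape of the workflow, where a human gets pulled in on demand rather than by default.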

The result is that human collaboration happens earlier, when it's most valuable - shaping what gets built rather than nitpicking how it was built. And the review bottleneck disappears because we're not blocking on humans to approve every line of code.

Open Questions

I don't have this fully figured out. There are a few things I'm still thinking through.

How do we maintain knowledge sharing?

One of the underrated benefits of code review is that it spreads awareness of how the codebase is evolving. If we stop reviewing each other's code, how does the team stay informed? One idea: a regular, non-blocking review of recent changes. Not to approve or reject, but to understand what's changed and surface follow-up ideas. Maybe a weekly session where the team skims through merged pull requests together, not for gatekeeping but for learning.

Can automated checks provide enough confidence?

Linting, tests, and security scanning catch a lot–but not everything. There's judgment involved in code review that's hard to automate: is this the right approach? Does this duplicate something that already exists? Is this going to be maintainable? I suspect we'll need better AI tooling here, particularly around checking implementation against specification.

How does this work with compliance?

Frameworks like SOC 2 and ISO 27001 often expect peer reviews as part of secure development practices. The good news is these frameworks are more flexible than they first appear–they require you to define and follow your own controls rather than specifically mandating peer review. If you can demonstrate that automated checks plus AI-triggered escalation achieve the security objectives, that might be defensible with auditors. But it's an open question, and probably depends on your company, auditor, and risk profile.

What's Next?

This is still a hypothesis, not a playbook. But it's one we're going to test at Aonyx. As we build with coding agents ourselves, we'll see how spec-driven collaboration holds up in practice–what works, what breaks, and what we didn't anticipate.

If you've been thinking about this too, or have tried something similar, I'd love to hear about it. Let me know on Mastodon or Bluesky.
