If learning agentic coding is a Hero’s Journey, I feel like I’m in the “Initiation” phase - a road of trials and ordeals, but with the reward (big leaps in productivity) in sight. In this post I’ll cover how I became a “1x developer” leaning hard on coding agents, and the path I see toward becoming a “5x developer” or so with currently available tools.

Refusing the Call

I’ve been messing around with AI coding assistants since they were just AI-assisted auto-complete. They had some delightful moments, but they were far too inconsistent to be relied upon for anything other than novelty apps, and certainly not for writing actual production code. So in some ways I’m a latecomer to this compared to others - but recently the tools have gotten good enough that I’ve been taking a much deeper look.

Crossing the First Threshold

In the second half of 2025, my work has been on AWS DevOps Agent. As I started to see what agentic applications were capable of, I decided it was time to dive in and write all my production code with AI coding agents. And so I did - as much as possible, every line of production code I worked on across various parts of that codebase was written using AI coding agents. Naturally, all of it still had to be production quality, and the fact that I used AI coding agents could never be an excuse for mistakes.

Becoming a 1x Developer

I had one critical rule guiding the entire process: Do not have an AI agent write code you don’t understand how to write yourself.

  • Full “vibe” coding is fine for prototypes and experimentation. What we need in production is “vibe engineering”.
  • If you don’t understand the code or techniques involved, you can’t properly review it.
  • If you can’t review it, you can’t stand behind it.

In essence, AI coding agents became my typist. I would produce short design and specification documents (sometimes with AI help, sometimes not), review them for accuracy, and then send AI agents off to implement the specification, aiming for tasks that would take no more than half an hour to complete so that code reviews stayed small.

Sidebar: If this sounds a lot like how Kiro works, that’s not a coincidence. Sometimes I used Kiro and sometimes I did not, but I was often inspired by its process and still am, no matter which coding agent I use.
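
For a sense of scale, here’s roughly what one of those short specification documents looked like. The feature, endpoint names, and acceptance criteria below are invented for illustration rather than pulled from the real codebase:

```markdown
# Task: Add pagination to the /orders list endpoint

## Context
- GET /orders currently returns every record; large accounts time out.

## Requirements
- Accept optional `limit` (default 50, max 200) and `nextToken` query parameters.
- Return a `nextToken` field in the response when more results exist.
- Invalid `limit` values return a 400 with a descriptive message.

## Out of scope
- No changes to the underlying data model or to other endpoints.

## Acceptance
- Unit tests cover the default, max, invalid, and multi-page cases.
- Existing tests continue to pass; no drop in coverage.
```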

The result was that I became a 1x developer. With AI agents doing the work for me, I found myself about 1x as productive (anecdotally I think somewhat more productive, but not clearly measurably so). But the breakdown of why sets the table for the challenges to come:

  1. Writing Code: Probably ~10x faster. AI agents can read and write code really, really fast. Even accounting for the time to write small specification documents and ensure they’re accurate, initial development was almost always done in a flash.
  2. Reviewing “Your Own” Code: This probably takes as long as, if not a bit longer than, the agent takes to vibe out your first pass of code. You have to look very carefully for anti-patterns, and you’ll often go through a few loops of “no, that structure is not optimized, please refactor it.” Understanding your intent is not something an AI coding agent does as well as an intern you’ve handed a copy of “Design Patterns”.
  3. Debugging: 0.1x-0.2x speed. This was my worst experience, and by far the area that most often made me drop out of using coding agents and revert to manual work. Our build process took about 5 minutes to run in the best circumstances, with a huge (read: context-window-destroying) output size, so if I ever asked an AI agent to resolve its own test failures, I’d better have something lined up to occupy a long stretch of time: it would either fix the problem immediately, or spin helplessly for approximately one geological era. With current tools, this is most often where you should take the wheel directly. It’s also why you should not have AI agents write production code that you couldn’t write yourself: you have to understand when it’s wrong and why it’s wrong.
    • This raises a broader point: fast-running test suites are really important. A human can methodically work through failures and have reasonable confidence the tests will pass. An AI agent far more often has to brute force it, and you want that loop to be as smooth as possible. You also want fast tests so you can notice when the AI agent is clearly stuck in an infinite loop of failure, and you want the output the agent sees to be small and focused (a rough sketch of one way to do that follows this list).
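
One small thing that helps here is wrapping the test command so the agent only ever sees failures rather than megabytes of build output. The sketch below is illustrative, not the actual tooling from that codebase; the script name and the `mvn test` command are assumptions, so substitute whatever your project really uses:

```python
#!/usr/bin/env python3
"""run_tests_quiet.py - hypothetical wrapper so a coding agent only sees test failures.

Assumes a Maven-style build; substitute whatever test command your project uses.
"""
import subprocess
import sys

# Run the real test command, capturing its (potentially huge) output.
result = subprocess.run(["mvn", "-q", "test"], capture_output=True, text=True)

if result.returncode == 0:
    # Success: one line is all the agent needs to see.
    print("ALL TESTS PASSED")
else:
    # Failure: surface only the lines that look like failures, capped in size,
    # so the agent's context window survives the feedback loop.
    lines = (result.stdout + result.stderr).splitlines()
    failures = [line for line in lines if "FAIL" in line or "ERROR" in line]
    print("\n".join(failures[-200:]) or "Tests failed; see the full log locally.")

sys.exit(result.returncode)
```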

Nevertheless, as I learned what coding agents were good at and what they were not, the overall experience was enjoyable. But the agents made mistakes often enough that things didn’t feel anywhere close to that promised future of writing a product definition and “letting it rip” unsupervised.

Chasing 5x

Road of Trials

By forcing myself to dive into writing production code with both feet, I found a few key sticking points to work through, along with a massive benefit to leverage.

The clear benefit is multithreading: when an AI coding agent is spinning on a problem, EVEN WHEN it’s not doing so super efficiently, you can leave it in the background and do something else. As somebody who has always struggled with true multitasking, this felt more like a sequence of discrete tasks: write the specification doc, spin up an agent, review the result later. While the agent was running, I didn’t have the same “open thread” in my brain that blocked other tasks. I could review other work, chase down other teams I needed something from (human-router work), write a doc, or anything else. At times this meant running another AI coding agent on something else, and even that felt…actually quite manageable.

But some sticking points remained:

  1. AI code quality was super dependent on existing test harness quality and code quality. Simply put, if you have great code with great tests, you tend to get great results (so long as you always keep the quality bar high before pushing code). Any gaps in testing, ambiguous structure, or poorly documented code would lead to inevitable debugging cycles or unwelcome discoveries when reviewing code. In other words, it’s great for greenfield code when you’re disciplined, but it’s not exactly a master of improving the design of existing code.
  2. Debugging was extremely slow and painful. Once you know this, you either do it yourself and keep the coding agent out of it, or let it run in the background and give you suggestions to review asynchronously. You also really, really make sure you keep the batch sizes small.
  3. Long sessions quickly led to breakdowns in rule following. You can make AGENTS.md as sophisticated as you want, but as soon as the context window gets compacted, if not sooner, all those rules fly out the window. When this happens - and it happens often - you have to come back and start a new agent session. (Perhaps leading us to sub-agent based development soon enough.) Incidentally, people who make their coding agents call them funny names are definitely onto something: it’s the Brown M&Ms clause of agentic development (a small illustration follows this list).
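
As a tiny illustration of that Brown M&Ms idea (the name and rules below are hypothetical, not my actual configuration), you can plant a cheap canary in AGENTS.md so it’s obvious the moment the rules fall out of context:

```markdown
# AGENTS.md (excerpt)

- Your name is "Gerald". Begin every response with "Gerald here:".
- Never push code until the full test suite passes locally.
- Keep changes under ~200 lines per task; ask before going larger.
```

The moment a response stops opening with the agreed-upon name, you know the rest of the rules are probably gone too, and it’s time for a fresh session.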

Tests, Allies, and Enemies

I’m taking the last month of 2025 off of work, and in some of my free time I’ve been really digging in on these sticking points. I’m not all the way there, but I’ve found some very interesting things so far.

First, I pulled up an old side project from 2022 which had both excellent test coverage and, for comparison, to-the-minute logs of how long each task took to implement. How did that go? I ripped through my entire task backlog in a week of scattered hours, and found I was roughly 2x as productive as before, all-in. And if I sped up my CI/CD pipeline or improved my ability to spin up one-off developer stacks of the app, this could easily be 3x (as hinted above, debugging loops were where I lost the most time).

Second, in parallel with these tasks, I was working on improvements to a React front-end website and a brand new iOS/watchOS mobile app built on top of the API. While these aren’t production quality and probably have some edge-case bugs (I don’t know how to make production iOS apps yet, so I’d call the app a prototype for sure), two conclusions were clear:

  1. I was finishing web/mobile tasks 10x-20x faster than it would have taken me to do them myself. AI coding agents are indisputably amazing at making working prototypes.
  2. I was able to work on these tasks in parallel with my REST API changes. So when waiting on CI/CD pipelines or debugging cycles, I could work on something else without feeling overwhelmed.

Third, I got into the habit of making skills/commands and using them. Rather than relying on AGENTS.md and free-form prompting alone, I’ll invoke something like /make-design-document, /tdd-workflow, /code-review, /add-documentation, or /protocol-test-coverage alongside my request, which keeps the agent on task. Anecdotally, once I started doing this I saw a huge difference. (More posts on these later.)
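
As a rough sketch of what one of these looks like: in Claude Code, for example, a custom slash-command is just a markdown prompt file under .claude/commands/. The contents below are a simplified stand-in for my real /code-review command, not a copy of it:

```markdown
<!-- .claude/commands/code-review.md -->
Review the changes on the current branch as a skeptical senior engineer.

1. List every file changed and summarize its purpose in one line.
2. Flag anti-patterns, missing tests, and anything that diverges from the
   design document for this task.
3. Do NOT rewrite code; produce a numbered list of findings, ordered by severity.
4. End with a clear "ship it" or "needs changes" verdict.
```

Typing /code-review at the prompt then expands to that full instruction set every time, instead of me re-typing the process or the agent half-remembering it.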

Finally, I feel like the biggest gains will come when, and if, long-running agents can produce higher quality outcomes, so I’ve been exploring that as a side thread. I reached out to Adrian Cockcroft and learned about his brazil-bench project, which looks at how well different coding agents, prompts, and helper tools can “one-shot” a reasonably complex greenfield coding problem. These are all run in GitHub Codespaces (ergo, a clean environment) and use tool combos like Claude Code + Beads (another great tool to discuss another time) or Claude Code + Hive. At the time of writing, Claude Code (with Opus 4.5) + Beads performs the best on this benchmark, and well enough overall that I’m encouraged to do more long-running agent experiments in the near future.

Road to the Reward

I’ve seen enough to know that agentic coding is how I’m going to be working moving forward, unless and until a better way comes along.

For now, I feel like I’m somewhere between a 1x and a 2x engineer by focusing aggressively on code quality standards from the start, carefully iterating on design specifications before starting, and using custom slash-commands to ensure the agent sticks to instructions consistently. When I can multi-thread with multiple agents working on different things, >2x feels like it’s here.

Eventually, if long-running agents improve, the promised land of “writing up a prompt as soon as I think of something that needs doing, firing off an async job, and coming back later to a high quality PR” could be an absolute game changer, and building the setup that makes this possible is worth the effort.

Colophon

This post was written entirely by me, a human. AI was used to proofread (a task at which it is…okay if perhaps a bit excessive in praise), but the words are all my own. My thanks to Nikki Pinski and Adrian Cockcroft for reviewing drafts of this post.