There’s a growing narrative that AI agents can now “code for hours” and build meaningful applications with minimal human involvement.
I’ve spent some time looking into this, and I find myself slightly unconvinced—not because it isn’t happening, but because the framing doesn’t quite line up with how software engineering actually works.
If an agent is genuinely producing useful software over an extended period, then one of three things must be true:
- The problem is very tightly constrained
- The quality bar is lower than we’d normally accept
- Or we’re not talking about typical engineering problems
In practice, it’s usually a combination.
Three patterns hiding behind the headline
What’s being described as “hours of autonomous coding” tends to fall into three fairly distinct categories.
1. Well-bounded, conventional problems
If you point an agent at something like a CRUD application, an API layer, or a simple front end, it can make steady progress for quite a long time.
That’s not especially surprising. These are problems with well-established patterns, strong framework defaults, and relatively little ambiguity. The “specification” is largely implicit.
The agent isn’t designing something new—it’s assembling something familiar.
2. Iteration loops at scale
A lot of what looks like sustained progress is really just persistence:
generate → run → error → fix → repeat
Given enough cycles, this converges more often than you might expect, particularly if there are tests or clear runtime signals to guide it.
This is useful, but it’s worth being precise about what it is. It’s not architectural reasoning—it’s search with feedback.
That distinction matters once you move beyond straightforward problems.
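To make the “search with feedback” framing concrete, here is a toy sketch of that loop. Everything in it is invented for illustration: `CANDIDATES` stands in for a model proposing successive patches, and a tiny assertion plays the role of the runtime signal. No real agent framework works exactly like this.

```python
# Toy sketch of the generate -> run -> error -> fix loop.
# CANDIDATES stands in for a model proposing successive patches;
# the error message is the only feedback signal guiding the search.

CANDIDATES = [
    "def mean(xs): return sum(xs) / len(x)",    # NameError: x is undefined
    "def mean(xs): return sum(xs) // len(xs)",  # wrong: integer division
    "def mean(xs): return sum(xs) / len(xs)",   # satisfies the check
]

def check(src: str):
    """Run a candidate against a tiny test; return an error string or None."""
    ns: dict = {}
    try:
        exec(src, ns)
        assert ns["mean"]([1, 2]) == 1.5
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

def loop(candidates):
    """Try candidates in order until one passes; return (attempt, source)."""
    for attempt, src in enumerate(candidates, start=1):
        err = check(src)
        if err is None:
            return attempt, src
        # A real agent would feed `err` back into the next generation step.
    return None

print(loop(CANDIDATES))  # prints (3, 'def mean(xs): return sum(xs) / len(xs)')
```

Notice that nothing in the loop reasons about design; it simply retries until the signal goes green. That is exactly why it converges well on closed problems and poorly on open-ended ones.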
3. Pre-engineered environments
Many of the more impressive examples rely on a well-prepared starting point: clean codebases, sensible structure, decent test coverage.
In other words, the environment is doing a lot of the work.
Agents perform well when the constraints are already in place. Remove those constraints, and the results become far less predictable.
Where this breaks down
The limitations show up in the areas you’d expect:
- Ambiguous or evolving requirements
- Trade-offs between competing concerns
- Long-term maintainability
- Cross-cutting issues like security, cost, and performance
These aren’t edge cases. This is most real-world software.
Agents don’t handle these particularly well—not because they’re flawed, but because these problems aren’t easily reduced to a closed loop with a clear success condition.
What is changing
There is a genuine shift, but it’s not quite the one being advertised.
The leverage is moving upstream.
The more precisely you define:
- constraints
- structure
- expected behaviour
the more effective the agent becomes.
Which leads to a slightly counterintuitive point:
If you want an agent to run for hours and produce something useful, you generally need to invest more effort in the specification, not less.
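One concrete way to invest in the specification is to write the expected behaviour down as executable checks before any implementation exists. The `slugify` function below is a made-up example, not taken from any particular project, but the pattern is the point: the assertions pin down constraints and edge cases precisely enough for an iteration loop to converge on them.

```python
import re

# A tiny executable specification, written before the implementation.
# Each assertion pins down a constraint the generated code must satisfy;
# the more precise these are, the better a generate-and-fix loop converges.

def spec(slugify) -> None:
    assert slugify("Hello, World!") == "hello-world"      # lowercase, punctuation dropped
    assert slugify("  spaced  out  ") == "spaced-out"     # whitespace collapsed
    assert slugify("already-a-slug") == "already-a-slug"  # idempotent on valid input
    assert slugify("") == ""                              # degenerate case defined, not implied

# A reference implementation that satisfies the spec -- the kind of
# localised code an agent can reliably iterate toward:
def slugify(text: str) -> str:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)

spec(slugify)  # silence means every constraint holds
```

The effort here went into the four assertions, not the two-line implementation. That ratio is the upstream shift in practice.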
A more realistic model
The pattern that seems to work is relatively straightforward:
- Humans define the architecture, invariants, and boundaries
- Agents handle localised implementation and iteration
- Tooling provides continuous validation
That combination is powerful. It can significantly increase execution throughput.
But it’s not autonomous software engineering in any meaningful sense. It’s assisted implementation with a very fast feedback loop.
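A minimal sketch of that division of labour, using nothing beyond the standard library: the human fixes the boundary (a function signature and its invariants), the agent would fill in the localised body, and a validation harness supplies the fast feedback loop. `insertion_sort` and the harness are illustrative names, not any real tooling.

```python
import random

# Human-owned: the boundary and its invariants. These stay fixed
# across agent iterations.
def invariants(sort_fn) -> None:
    """Validation harness: random inputs checked against a trusted oracle."""
    for _ in range(200):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        assert sort_fn(list(xs)) == sorted(xs)

# Agent-owned: a localised implementation that only has to satisfy
# the harness, not redesign the system around it.
def insertion_sort(xs: list) -> list:
    for i in range(1, len(xs)):
        j = i
        while j > 0 and xs[j - 1] > xs[j]:
            xs[j - 1], xs[j] = xs[j], xs[j - 1]
            j -= 1
    return xs

# Tooling-owned: continuous validation on every change.
invariants(insertion_sort)
```

The human decisions here are small but load-bearing: what the function promises, and how that promise is checked. Those are exactly the parts the loop cannot search its way into.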
Final observation
There’s real value here, particularly for well-understood domains and repetitive tasks.
But it’s worth separating two ideas that are often conflated:
- an agent generating code continuously for an extended period
- an agent designing and delivering a robust, production-quality system
Those are not the same thing.
And for now, the second still depends heavily on human judgement.