Agent Loops: How to Make Claude Code or Codex Work Until the Job Is Done

There is a shift happening in how people use coding agents. For the last year, most of us treated Claude Code and Codex like a very fast assistant: you ask for one thing, it does it, you check it, you ask for the next thing. You were the loop. You were the part that kept saying "okay, now do this."

The newer pattern hands that job to the agent. You give it a goal, a way to check its own work, and a clear signal for "done." Then it runs the loop itself: try, check, fix, check again, repeat until the goal is met. People are calling these agent loops, and the simplest version is the /goal pattern: a goal plus a way to self-verify.

This guide explains what an agent loop actually is, why it works better now than it did six months ago, and how to write a good loop prompt. It also gives you three copy-paste templates you can use today, and an honest section on when not to loop, because a loop will chase a bad goal just as fast and confidently as a good one.

Quick picks

The pattern: Goal, verify, stop. An agent loop is a clear goal, a way for the agent to check its own work, and a stop condition. Miss one and it falls apart.
Why now: Models hold a goal longer. The current generation stays on a goal across a long session, and the tools can now run tests, build, and even look at a screenshot to verify.
The one rule: No loop without a real check. If the only check is the agent's own opinion, you have built a machine for generating confident nonsense.
Best fit: Cheap, repeatable checks. Loops shine on work with a real, repeatable verification and a low blast radius. They are risky where "done" is a judgment call.

What an agent loop actually is

An agent loop has three parts. Miss any one and it falls apart.

A goal. A clear, concrete description of what "finished" looks like. Not "improve the app." Something like "every page loads without a console error" or "all 142 tests pass."
A way to self-verify. The agent needs to check its own work without you. That might be running the test suite, taking a screenshot and looking at it, running a linter, or ticking items off a checklist it maintains.
A stop condition. The loop ends when verification passes, when a budget runs out (number of attempts, time, or token spend), or when it gets stuck and asks for help.

That is the whole idea. Goal, verify, stop. The agent iterates between those points on its own, and you only step back in to review the result or unstick it.

Here is the mental model. You are not the typist anymore. You are not even the pair programmer. You are the manager who set the objective and will inspect the finished work. Ethan Mollick has been making this point for a while in "The Shape of the Thing" (March 2026): the skill that matters now is managing agents, not working alongside them keystroke by keystroke.
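If it helps to see the goal, verify, stop structure as code, here is a minimal sketch in Python. It is an illustration of the control flow only, not how Claude Code or Codex work internally. The names run_loop, LoopResult, attempt, and verify are made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    done: bool      # True if verification passed
    attempts: int   # how many try/check rounds were used
    reason: str     # "verified" or "budget exhausted"


def run_loop(attempt: Callable[[int], None],
             verify: Callable[[], bool],
             max_attempts: int = 5) -> LoopResult:
    """Try, check, repeat until verify() passes or the budget runs out."""
    for n in range(1, max_attempts + 1):
        attempt(n)       # one round of work toward the goal
        if verify():     # an external check, not the worker's own opinion
            return LoopResult(True, n, "verified")
    return LoopResult(False, max_attempts, "budget exhausted")


# Toy usage: each round of "work" fixes one failing test.
failing = ["test_a", "test_b", "test_c"]
result = run_loop(lambda n: failing.pop(), lambda: not failing, max_attempts=5)
print(result)  # LoopResult(done=True, attempts=3, reason='verified')
```

Take away any one of the three arguments and the sketch stops making sense, which is the same point the list above makes about the loop itself.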

Why this works now and didn't really before

Three things changed.

The models got better at long, multi-step work. Earlier agents drifted. Give them ten steps and step seven would quietly forget step two. The current generation holds a goal across a long session and keeps coming back to it. That alone makes looping viable.

The tools learned to verify. A loop is only as good as its check. Coding agents can now run your tests, read the output, run a build, and increasingly look at the result. Victor Mustar shared a loop where the agent built a Boeing 747 in Three.js, then rendered it, inspected the render, and kept improving the model until he said it was "100% satisfied" — the visual check was part of the loop, not a separate human step. When the agent can see whether it succeeded, it can decide whether to keep going.

The patterns got shared and named. Once something has a name, people copy it. Matthew Berman launched a Loop Library of reusable agent loops you can grab and adapt. Tom Osman (@tomosman) described running a loop that went over every feature, wrote user stories, tracked them in a spreadsheet, then looped through testing and fixing until done. OpenAI is even building the "show it once, it repeats forever" idea into the product: Codex Record and Replay (announced by @OpenAIDevs, June 18, 2026, rolling out to select markets) lets you demonstrate a task once — say, filing an expense report — and Codex turns that demo into an inspectable, editable skill it can run again.

None of this removes the need for a human to set a good goal. More on that below. But the machinery to run the loop is finally reliable enough to lean on.

The anatomy of a good loop prompt

A loop prompt is different from a normal request. A normal request says "do X." A loop prompt says "do X, here is how to know if you succeeded, keep going until you have, and stop when this is true." It has four ingredients.

A clear, testable goal. Vague goals produce vague loops that never end or end wrong. "Make the dashboard better" has no finish line. "Every chart on the dashboard renders with real data and no console errors on page load" does. If you cannot describe how you would check it, the agent cannot either.

A canonical tracker. Give the loop one place to record state, and tell it to keep that place up to date. A markdown checklist in the repo works great. So does a spreadsheet or a simple table. This matters more than it sounds: the tracker is the agent's memory of what is done, what is left, and what failed last time. Without it, a long loop loses the plot. Tom Osman's spreadsheet was doing exactly this job.
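A tracker is also easy for you to inspect from outside the loop. The sketch below reads a markdown checklist and reports the next open item. It assumes the common "- [ ]" / "- [x]" task-list syntax; the function names and the sample checklist are invented for illustration.

```python
import re
from typing import List, Optional, Tuple

# Matches markdown task-list lines such as "- [ ] item" or "- [x] item".
TASK = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_checklist(text: str) -> List[Tuple[bool, str]]:
    """Return (done, label) pairs for every task line in the tracker."""
    items = []
    for line in text.splitlines():
        m = TASK.match(line)
        if m:
            items.append((m.group(1) != " ", m.group(2).strip()))
    return items


def next_unchecked(text: str) -> Optional[str]:
    """Label of the first item that is not ticked, or None if none remain."""
    for done, label in parse_checklist(text):
        if not done:
            return label
    return None


def all_done(text: str) -> bool:
    """True only if the tracker has at least one item and all are ticked."""
    items = parse_checklist(text)
    return bool(items) and all(done for done, _ in items)


sample = """# LOOP
- [x] Add login form
- [ ] Validate email field
- [ ] Show error on bad password
"""
print(next_unchecked(sample))  # Validate email field
print(all_done(sample))        # False
```

Note that all_done treats an empty tracker as not done, so a loop that forgot to write its checklist cannot pass by accident.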

An explicit verify step. Spell out how to check, not just that it should check. "Run npm test and read the output." "Take a screenshot and confirm the header is centered." "Run npx tsc --noEmit and confirm zero errors." If you leave verification implicit, the agent will often declare victory early. The verify step is the most important line in the whole prompt.
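To make "a real check" concrete: a command-line check passes or fails on its exit code, not on anyone's reading of the situation. Here is a small sketch using Python's standard subprocess module. The helper name run_check is hypothetical, and the demo uses the Python interpreter as a stand-in for a real test runner so it runs anywhere.

```python
import subprocess
import sys
from typing import List, Tuple


def run_check(command: List[str], timeout_s: float = 600) -> Tuple[bool, str]:
    """Run a verification command; it passes only on exit code 0."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # A check that cannot run is a failed check, never a silent pass.
        return False, f"check could not complete: {exc}"
    return proc.returncode == 0, proc.stdout + proc.stderr


# Stand-in commands; in a real project this might be ["npm", "test"].
ok, output = run_check([sys.executable, "-c", "print('2 passed')"])
bad, _ = run_check([sys.executable, "-c", "import sys; sys.exit(1)"])
print(ok, bad)  # True False
```

The design choice worth copying is the except branch: a missing tool or a timeout counts as a failure, which is the opposite of declaring victory early.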

A stop condition and a budget. Tell it what counts as success and when to give up. The success condition: "Stop when all items in the checklist are checked and tests pass." The safety valve: "If you cannot fix the same failure after 3 attempts, stop and show me what you tried." Without a budget, a stuck loop will keep burning tokens and editing code in circles.
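The "same failure, 3 attempts" rule is simple bookkeeping. Here is one way it could look, as a sketch: the Budget class, its method names, and the default limits are assumptions made for this example, not part of any agent's API.

```python
import time
from collections import Counter
from typing import Optional


class Budget:
    """Tracks repeated failures and wall-clock time for one loop run."""

    def __init__(self, max_same_failure: int = 3, max_seconds: float = 1800.0):
        self.max_same_failure = max_same_failure
        self.deadline = time.monotonic() + max_seconds
        self.failures = Counter()

    def record_failure(self, check_id: str) -> None:
        self.failures[check_id] += 1

    def record_pass(self, check_id: str) -> None:
        self.failures.pop(check_id, None)  # a pass resets that check's streak

    def should_stop(self) -> Optional[str]:
        """Return a reason to stop, or None if the loop may continue."""
        for check_id, count in self.failures.items():
            if count >= self.max_same_failure:
                return f"'{check_id}' failed {count} times in a row"
        if time.monotonic() > self.deadline:
            return "time budget exhausted"
        return None


budget = Budget(max_same_failure=3)
budget.record_failure("npm test")
budget.record_failure("npm test")
print(budget.should_stop())  # None
budget.record_failure("npm test")
print(budget.should_stop())  # 'npm test' failed 3 times in a row
```

A prompt-level budget does the same job in words; the point of the sketch is that the loop needs a reason to stop that does not depend on it succeeding.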

Three copy-paste loop templates

The three templates are in the "Copyable prompts" section further down. Adapt them by replacing the bracketed parts. They are written for Claude Code or Codex, but the structure works in any capable agent.

On the visual template (Template 3): that "list what is wrong in writing" line does real work. Forcing the agent to name the defects before fixing them keeps it from declaring something done just because it ran without errors.

When NOT to loop, and how to stay safe

A loop is an amplifier. Point it at a good goal and it gets you there fast. Point it at a bad goal — or no real verification — and it will confidently produce a lot of wrong work, fast, and tell you it succeeded. This is the part the demos skip.

The sober version comes from people who build with these tools and watch where they fail. In "Why AI hasn't replaced software engineers" (June 14, 2026), Arvind Narayanan and Sayash Kapoor — and separately Simon Willison, who has been beating this drum for a while — make the same point: the bottleneck was never the typing. It is requirements, verification, and judgment. A loop automates the typing and the retrying. It does not automate knowing what you actually want or recognizing when the result is subtly wrong. That part is still yours.

The honest summary: loops are great for work that has a cheap, real, repeatable check and a low blast radius. They are risky for work where "done" is a judgment call or where a wrong answer is expensive. Match the tool to the task.

Do not loop without a real verify step. "Looks done to me" is not verification. The check has to be something external and concrete: a test suite, a build, a screenshot you also look at, a linter.
Do not loop on goals you cannot define. If you cannot say in one sentence how you would know it is finished, you are not ready to loop. Do the thinking first.
Keep a clean git status before you start. Commit or stash everything first, so git diff shows exactly what the loop changed and nothing else. A clean starting point is your undo button.
Scope it tight. Tell the loop which folders and files it may touch, and what is off-limits (secrets, deploy config, generated files, anything in production).
Review the diff like you mean it. Do not merge a loop's work just because the tests it wrote pass — it may have written weak tests, or changed the wrong thing in a way the tests do not catch.
Keep a human in the approval loop for anything with consequences. Database migrations, money, deletes, deploys, customer data, security settings. Let the loop prepare the change. You approve the change.
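Two of the rules above, the clean starting point and the tight scope, are easy to check mechanically. The sketch below is one way to do it, assuming git is installed and you run it from the repository root. The helper names and the example paths are hypothetical; `git status --porcelain` and `git diff --name-only` are standard git commands.

```python
import subprocess
from typing import Iterable, List


def is_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def out_of_scope(changed: Iterable[str],
                 allowed_prefixes: Iterable[str]) -> List[str]:
    """Return changed paths that fall outside every allowed prefix.

    Pass folders with a trailing slash ("src/") so "src2/" does not match.
    """
    allowed = tuple(allowed_prefixes)
    return [p for p in changed if not p.startswith(allowed)]


def git_lines(*args: str) -> str:
    """Run a git subcommand in the current repo and return its stdout."""
    return subprocess.run(["git", *args], capture_output=True, text=True,
                          check=True).stdout


# Before the loop:  is_clean(git_lines("status", "--porcelain"))
# After the loop:   out_of_scope(
#                       git_lines("diff", "--name-only").splitlines(),
#                       ["src/", "tests/"])

# Demo with a made-up list of changed files:
changed = ["src/app.ts", "tests/app.test.ts", ".env", "deploy/prod.yaml"]
print(out_of_scope(changed, ["src/", "tests/"]))  # ['.env', 'deploy/prod.yaml']
```

One caveat: `git diff --name-only` does not list brand-new untracked files, so run `git status --porcelain` after the loop as well to catch files it created.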

Where this is heading

The early loops are hand-written prompts. The direction is loops becoming reusable, named, and shared. Matthew Berman's Loop Library is an early version of that — grab a loop someone already tuned instead of writing your own. Codex's Record and Replay points at the same future from the product side: demonstrate a task once, get back an editable skill.

The bigger picture is teams wiring loops together. Ethan Mollick described a three-person team at StrongDM that built what he calls a "Software Factory": coding agents and testing agents looping against each other until the work satisfies the tests, with the humans only reviewing the output. That is the same /goal pattern, just stacked — one agent's verify step is another agent's goal.

You do not need a factory to start. You need one clear goal, one real way to check it, and the discipline to scope it and read the diff. Start a loop on something small and verifiable today, watch how it behaves, and grow from there. The site's usual advice applies here more than anywhere: it is genuinely useful, so try it — and verify what it gives you.

Copyable prompts

Template 1: The generic loop

Goal: [describe the finished state in one or two concrete sentences].

Work in a loop:
1. Create a checklist file at ./LOOP.md listing every sub-task needed to reach the goal.
2. Pick the next unchecked item and do it.
3. Verify it by [exact check — run this command / take a screenshot / inspect this output].
4. If the check passes, mark the item done in ./LOOP.md. If it fails, fix it and re-verify.
5. Repeat until every item in ./LOOP.md is checked.

Stop when: all items are checked AND [final verification, e.g. "npm test" passes with zero failures].
Safety: if the same check fails 3 times in a row, stop and show me what you tried.
Keep changes scoped to [folder/files]. Do not touch [anything off-limits].

Template 2: Audit every feature, then test and fix

Goal: every feature in this app works as a user would expect, with no broken paths.

Phase 1 — map it:
- Go over every feature in the app. For each one, write a short user story
  ("As a [user], I can [do thing] and see [result]").
- Track them in a table in ./FEATURES.md with columns: Feature | User story | Status | Notes.
  Set every Status to "untested".

Phase 2 — loop:
1. Pick the next "untested" or "failing" row.
2. Actually exercise that feature [run it / hit the endpoint / click through the UI].
3. Verify the result matches the user story.
4. If it works, set Status to "pass". If not, set "failing", note the cause, fix it, re-test.
5. Repeat until no row is "untested" or "failing".

Stop when: no row in ./FEATURES.md is "untested" or "failing" (every row is "pass" or "blocked"). Then list the "blocked" rows for me.
Safety: if a feature fails 3 fix attempts, mark it "blocked", note why, and move on.
Commit after each feature passes so I can review the diff per feature.
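If you want to confirm the end state of Template 2 yourself rather than take the agent's word for it, a few lines of Python can read the tracker. This is a sketch that assumes ./FEATURES.md uses a standard pipe-delimited markdown table with the four columns named in the template, and that no cell contains a "|" character. The function name and sample rows are invented.

```python
from collections import Counter
from typing import List


def feature_statuses(markdown: str) -> List[str]:
    """Extract the Status column from a
    'Feature | User story | Status | Notes' table."""
    statuses = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 4:
            continue  # not a table row
        if cells[0].lower() == "feature" or set(cells[0]) <= set("-: "):
            continue  # header row, separator row, or row with no feature name
        statuses.append(cells[2].strip('"').lower())
    return statuses


sample = """| Feature | User story | Status | Notes |
| --- | --- | --- | --- |
| Login | As a user, I can sign in and see my dashboard | pass | |
| Export | As a user, I can export a CSV and see a download | blocked | API 500 |
"""
statuses = feature_statuses(sample)
remaining = [s for s in statuses if s in ("untested", "failing")]
print(Counter(statuses))  # Counter({'pass': 1, 'blocked': 1})
print(remaining)          # []
```

An empty remaining list means the loop has nothing left to pick up; any "blocked" rows in the count are the ones waiting on you.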

Template 3: Build and visually self-check until it looks right

Goal: build [the thing] so that it looks [describe the target — "a clean, centered
landing hero with the logo, headline, and one button, no overlap on mobile"].

Work in a loop:
1. Build or edit the [component / scene / page].
2. Render it and take a screenshot.
3. Look at the screenshot. Compare it to the goal. List, in writing, what is wrong
   or missing (spacing, alignment, color, proportion, anything off).
4. If the list is empty and it matches the goal, you are done.
5. Otherwise fix the top issue and go back to step 2.

Stop when: the screenshot matches the goal and your "what's wrong" list is empty,
OR after 8 render-and-check rounds (then show me the latest screenshot and the
remaining issues).
Show me the screenshot at the start, the midpoint, and the end.

Related Power of AI pages

Claude Code Best Practices: The inspect-plan-edit-test-review workflow that loops build on.
Claude Code vs Codex: Which agent fits which kind of loop.
AI Coding Agents: The foundation for how these agents work.
How to Question AI: The verify-everything habit, applied to agent output.
