Claude Code Creators Boris Cherny And Cat Wu & Addy Osmani Explain How To Use Agent Loops

Claude Code’s origin story begins in the most startup way possible: an internal Slack demo that got two emoji reactions.

One year later, Boris Cherny, Anthropic’s Head of Claude Code, says his workflow looks less like typing prompts and more like managing “armies of agents”, with agents prompting agents in trees of thousands. Cat Wu, Head of Product for Claude Code, keeps steering the conversation back to how the team actually works: verification, routines, auto mode, Loop, agent fleets, and context minimalism.

The two sat down for a conversation to celebrate one year of Claude Code, which you can watch below.

ICYMI, Claude Code is Anthropic’s terminal-based coding agent, which means it can read a project, edit files, run commands, and work through a task across multiple steps. Anthropic posts product updates and best practices through the ClaudeDevs channel.

Below, we break down the top insights from the chat and how to apply them in your own work. We also weave in insights from Addy Osmani’s article on loop engineering, itself a reaction to a debate on X this weekend about designing “loops” for your agents so you don’t have to prompt them at all. His framing turns the video’s biggest idea into a usable blueprint for both Claude Code and Codex. We’ll cover both!

The useful lesson here is this: agent work in 2026 is becoming an operating system, so long as you make three key changes: 1. You preserve mistakes (for learning). 2. You build verification into the loop. 3. You move recurring work into routines.

Let’s get into it.

Both Boris and Cat’s actual Claude Code workflows, in one place
How to set up your Claude Code like its creator Boris Cherny
The core lesson: every agent mistake should improve the system
Verification means: can the agent run the thing?
The surprising adoption story: adjacent roles start shipping
Routines are agents that watch the work queue
Auto mode replaces approval spam with evaluated trust
Loop is the second leap
Addy Osmani turns Loop into a working blueprint
Organizations get value when the agent sits at the center of the process
The new manager job: supervising a fleet
Context minimalism: give the model less, but give it a way to find more
What Boris thinks changes next
The credible counterpoint: Anthropic is describing the frontier case
Agent view is the missing control room for agent fleets
How to apply Claude Code’s first-year lessons this week
The official docs
Timecode map: all the moments worth watching

Both Boris and Cat’s actual Claude Code workflows, in one place

The conversation was less a product tour than a snapshot of an AI-native team mid-transition.

Here are the key habits they shared:

Boris turns mistakes into memory. When Claude makes a repeated mistake, he has Claude write the lesson into CLAUDE.md or a skill, so the fix persists into future runs instead of staying private to one session.
Cat turns local app testing into a skill. Her desktop app workflow uses a desktop development skill that starts the local app, uses computer use to click through the interface, invokes the new UX, tests edge cases, fixes issues, and rechecks.
Cat lets Claude inspect the team context. When staging breaks or a bug looks environmental, she has Claude read Slack to see whether staging is down or whether another teammate has already hit the same problem. The official Slack docs explain the same mechanism: Claude can gather thread and recent channel context before routing a coding task into Claude Code on the web.
The team treats coding as a shared medium. Boris says “everyone codes”: PMs, designers, developer relations, data science, and even finance use Claude Code for work that used to require handoffs.
Routines watch for work before humans ask. One Claude Code routine listens for voice mode tickets, GitHub issues, and bug reports, then drafts fixes and pings the PR owner. Another finds bug reports untouched for five hours and fixes the easy-to-verify ones.
Boris uses auto mode as his default. He moved away from plan mode because newer models need less explicit planning for many tasks. Auto mode starts work, checks tool-use safety through another classifier model, and lets him move to the next agent.
Loop changes the interface. Boris frames the progression as source code to agent to loop: first you edited code, then you asked an agent to edit code, and now a /loop, /goal, or routine prompts Claude for you.
Agent view and Remote Control change the ergonomics. Boris moved from six terminal tabs and six checkouts to one agent view, automatic worktrees in the desktop app, Remote Control from his phone, and voice mode for spinning up agents from an idea mid-conversation.
Context minimalism is the new default. Boris recommends the minimal system prompt, minimal tools, and a way for the model to pull context. Cat calls herself “a context minimalist”: tell the model only what it needs, then let it choose the route. The context window docs explain why: every file read, rule, hook, skill, and subagent changes what the model is carrying around.

The thread running through all of it is simple: Claude Code becomes more useful when the team stops treating the agent as a chatbot and starts treating it as a system that can remember, verify, watch, delegate, and report.

How to set up your Claude Code like its creator Boris Cherny

If you want to apply all the insights from the chat, here’s the shortest version of Boris Cherny’s Claude Code setup, stitched together from everything he says in the interview:

Teach Claude from its mistakes, give it real verification (as in let it actually test what it’s building as a user, not just run “tests”), default to auto mode when scoped, turn repeated work into routines and loops, run parallel agents in isolated worktrees, supervise them in agent view (from the desktop app), keep an eye on them from your phone with Remote Control, and keep the context lean enough that the model can still think.

Now, stepping that out a bit more:

Make Claude remember its mistakes. Boris’s first rule is that when Claude makes a repeated mistake, he has it write the lesson into CLAUDE.md or a skill. That way, the correction does not stay trapped in one chat. His basic move is: correction → durable instruction → better future runs.
Build verification around “can the agent run the thing?” Boris says people often think verification means unit tests, lint, or type checks. His sharper version is whether Claude can actually run and test what it changed. That means your setup should include a real way for Claude to use the product: Bash, an iOS simulator, an Android simulator, a desktop app, or computer use that lets it click through the interface and verify the result.
Use auto mode once the work is scoped. Boris says he used to rely on plan mode, where Claude proposes a plan before editing. Now he mostly uses auto mode, because newer models need less explicit planning for many tasks. The workflow is: start a Claude, let it work, then move to the next agent. Cat’s security point matters here: if humans accept 99% of permission prompts, those prompts stop being meaningful. Auto mode tries to route attention toward the moments that actually need review.
Move recurring work into routines and loops. Cat describes routines that listen for tickets, GitHub issues, and bug reports, then put up fixes and ping the PR owner. Boris says routines became the first obvious use of the Agent SDK: code review, PR babysitting, CI fixes, and rebasing. Boris’s bigger framing is that the interface moved from source code, to agent, to loop. In practice, turn repeated work into a routine, /loop, or /goal instead of manually prompting Claude every time.
Use the desktop app and worktrees for parallel sessions. Boris says his old setup was six terminal tabs and six checkouts of the same repo. Now he uses agent view and the desktop app, which handles worktree cloning. Translation: each agent gets its own isolated checkout, so several agents can work on the same repo without trampling each other’s files.
Manage the fleet from agent view. Agent view gives one screen for background sessions: what is working, what needs input, what is ready for review, and what finished. This is the control room. You dispatch agents, scan rows, peek when something needs attention, attach when you need the full conversation, then detach and let the session keep running.
Use Remote Control and voice mode away from the laptop. Boris says about half his engineering now happens on his phone through Remote Control. He starts agents, checks in while walking around, kicks off another agent while getting coffee, and uses voice mode to spin up work from an idea mid-conversation. Cat’s couch anecdote captures the setup: Boris’s laptop stayed open and locked at the office while PRs kept landing because he was supervising agents remotely.
Keep Claude at the center of the workflow. Boris says Claude is where he goes for questions, code, code review, security review, and even forms. During onboarding, new employees ask Claude instead of interrupting coworkers. His “I don’t have a to-do list anymore” line is the operating model: ideas become agents, routines, reviews, and PRs.
Use context minimalism. Boris says the progression was prompt engineering with Sonnet 3.5, context engineering with Opus 4, and now context minimalism. His advice is a minimal system prompt, minimal tools, and a way for Claude to pull the context it needs. Cat says too much context becomes micromanagement. The Boris-style setup is “give Claude the goal, constraints, and retrieval path, then let it work.”

Now we’ll go into all of that with a bit more detail.

The core lesson: every agent mistake should improve the system

Boris’s most important workflow is hidden in the first minute: every Claude mistake becomes a reusable instruction in CLAUDE.md or a skill.

CLAUDE.md is a project instruction file Claude reads at the start of every session. Anthropic’s docs say to add to it when Claude makes the same mistake twice, when a code review catches something Claude should have known, or when you keep typing the same correction across sessions. A skill is better when the instruction has become a reusable procedure rather than a permanent fact. In plain English: when the agent messes up, the fix should go into the system, not only into that one chat.

There is one caveat worth keeping: Claude treats CLAUDE.md as context, not enforced configuration. If the rule is “prefer this testing pattern,” memory is fine. If the rule is “block this dangerous action,” Anthropic says to use a PreToolUse hook.
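To make the memory-versus-enforcement split concrete, here is a minimal sketch of the decision logic a PreToolUse hook script might run. It assumes the hook receives a JSON payload with `tool_name` and `tool_input` fields and that exit code 2 blocks the call. Those details, the `command` key, and the deny-list patterns are all assumptions to check against Anthropic’s hooks docs, not something from the interview.

```python
import json
import re
import sys

# Hypothetical deny-list; tune it to your own project.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",               # recursive delete of an absolute path
    r"\bgit\s+push\s+.*--force\b",   # force-push
]


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 is ASSUMED to mean 'block'."""
    if payload.get("tool_name") != "Bash":
        return 0, ""  # this sketch only inspects shell commands
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return 2, f"Blocked by project policy: matches {pattern!r}"
    return 0, ""


def main(raw: str) -> int:
    """Parse the hook payload (JSON text) and report the decision on stderr."""
    code, message = decide(json.loads(raw))
    if message:
        print(message, file=sys.stderr)
    return code
```

To wire it up as a script, you would end the file with something like `sys.exit(main(sys.stdin.read()))`. The point of the sketch is the split: a preference can live in CLAUDE.md, but a hard "never do this" belongs in code that runs before the tool does.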

Boris’s exact operating principle is that he does not tell Claude to do one thing differently. He tells it to write the lesson down somewhere durable. His reason is practical: if you keep doing that, Claude can keep running, because each correction is recorded once instead of staying trapped in a single session.

That changes the compounding curve. A one-off correction saves the current task. A reusable instruction improves every future task that touches the same pattern.
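The correction → durable instruction step can be sketched in a few lines. This is an illustration, not Boris’s actual tooling: the `record_lesson` helper and the "Lessons learned" heading are hypothetical, and it simply appends to the end of the file, so it assumes the lessons section is the last section of CLAUDE.md.

```python
from pathlib import Path

HEADING = "## Lessons learned"  # hypothetical section name


def record_lesson(memory_file: Path, rule: str) -> bool:
    """Append `rule` as a bullet at the end of the file.

    Returns False when the exact rule is already recorded, so repeated
    corrections do not bloat the file.
    """
    text = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    lines = text.splitlines()
    bullet = f"- {rule.strip()}"
    if bullet in lines:
        return False
    if HEADING not in lines:
        if lines:
            lines.append("")  # blank line before the new section
        lines.append(HEADING)
    lines.append(bullet)
    memory_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

In practice you would have Claude write the rule itself. The helper just shows the property that matters: the lesson lands on disk once and stays there.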

AI Skill: Build a mistake-to-skill loop

Try this skill out with your Claude Code agent:

When Claude makes a mistake, stop and name the pattern.

Ask it to convert that pattern into a project rule.

Add the rule to CLAUDE.md, a project doc, or a reusable skill.

Add a verification step so the agent checks the rule next time.

You made this mistake: [describe the mistake].
Turn it into a reusable project instruction.

Output:
1. A one-sentence rule for CLAUDE.md.
2. A verification step the agent should run before finishing.
3. Two examples: one correct, one incorrect.
4. The shortest possible wording that will prevent this mistake next time.

Verification means: can the agent run the thing?

Cat asks for tips on making Claude Code good at verification, and Boris says many people hear “verification” and think of unit tests, lint, or type checks. Those matter, but they were already easy to automate.

The agent version is more direct: “can the agent run the thing?”

That means the agent needs some way to use the software it changed. Boris points to a moment with Opus 4 testing itself in bash, then says Anthropic now has loops for iOS simulators, Android simulators, and desktop computers. Uhhh, please make those public my guys??!

Cat gives the concrete example: an engineer built a desktop development skill that teaches Claude how to run the local desktop app. Claude spins up the app, uses computer use to click around, invokes the new UX, tests edge cases, fixes bugs, and rechecks.

The docs make this workflow concrete. Computer use lets Claude open apps, click, type, and see the screen from the CLI, which means it can validate native apps, click through onboarding flows, reproduce visual bugs, screenshot results, patch code, and verify the fix. The docs also say Claude tries narrower tools first, like MCP, Bash, or Chrome, then falls back to computer use when those do not apply.

Her extra move: when Claude hits staging bugs, she has it read Slack for environmental context. It checks whether staging is down or whether someone else has already hit the issue, then updates the desktop development skill after the debugging loop completes.

That is the real upgrade. The agent moves from “I changed the code” to “I changed the code, opened the product, tried the new thing, found the weird case, fixed it, and tried again.”
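The "tried it, found the weird case, fixed it, tried again" cycle is a bounded retry loop. Here is a minimal sketch, assuming you can express "use the product" as a function that returns a list of failures. `run_product_check` and `attempt_fix` are hypothetical stand-ins for whatever actually drives your app (a simulator, computer use, a script).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VerificationReport:
    passed: bool
    rounds: int
    evidence: list[str] = field(default_factory=list)


def verify_with_fixes(
    run_product_check: Callable[[], list[str]],  # empty list means everything passed
    attempt_fix: Callable[[list[str]], None],
    max_rounds: int = 3,
) -> VerificationReport:
    """Run the product, fix what failed, and recheck, up to `max_rounds` times."""
    evidence: list[str] = []
    for round_number in range(1, max_rounds + 1):
        failures = run_product_check()
        evidence.append(f"round {round_number}: {len(failures)} failure(s)")
        if not failures:
            return VerificationReport(True, round_number, evidence)
        if round_number < max_rounds:
            attempt_fix(failures)  # no point fixing after the final check
    return VerificationReport(False, max_rounds, evidence)
```

The cap matters: an agent that cannot get to green should stop and report its evidence rather than keep burning tokens.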

AI Skill: Write a verification runbook

Use this for coding agents, spreadsheet agents, content agents, or anything else where “done” should mean “checked.” (Note: feel free to customize these for whatever work you ACTUALLY do).

Create a verification runbook for this workflow: [workflow].

Include:
1. The exact command, file, app, or page the agent must open.
2. The happy-path test it must run.
3. Three edge cases it must test.
4. What evidence it should capture before saying it is done.
5. What it should do if the test fails.

Keep it short enough to paste into a project instruction file.

The surprising adoption story: adjacent roles start shipping

One of the most useful sections of the convo starts when Boris says “everyone codes” now (at least at Anthropic). His point is broader than programming. When Claude writes more of the code, the scarce skill shifts toward having a strong idea, understanding the product, knowing the user, and owning the business context.

He tells a small story about a designer named Meaghan (that would be Meaghan Choi, design lead for Claude Code) who now opens PRs, which are proposed code changes for review. At first, he was horrified. Then he saw she was fixing a button, and the code actually looked good. Nowadays that behavior has become totally normal.

Cat says the adoption pattern repeats across enterprises: engineers adopt Claude Code first, then adjacent roles look over their shoulders and try it. She says designers become more productive by making prototypes and app changes directly instead of waiting on an engineer. PMs make app changes. Finance runs projections in Claude Code. Data scientists keep it open on their screens.

That is the role shift in one sentence: the person closest to the idea can now move closer to the implementation.

AI Skill: Turn product context into a prototype

This is useful for PMs, designers, marketers, operators, and founders who know what should exist before they know how to build it.

I want to prototype this product change: [describe the change].

Before touching any files, ask me up to 5 questions about:
- The user problem
- The desired behavior
- Constraints
- Edge cases
- What “good” looks like

Then propose the smallest prototype that would prove the idea.
After I approve it, implement only that prototype and give me a verification checklist.

Routines are agents that watch the work queue

The biggest practical idea in the video is routines. Cat describes an engineer who built a routine for voice mode that listened for every ticket, GitHub issue, and bug report, picked up issues proactively, put up a fix, and pinged him with the PR.

Then he made a second routine that found bug reports unanswered for five hours, put up fixes, and let him merge the ones that were easy to verify. Cat’s example lands because she was about to fix a bug from her own feature, and Claude told her another Claude had already fixed it.

Boris says routines became the first obvious use of the Agent SDK, which lets teams use Claude Code programmatically. The early “what do we use this for?” answer became code review, PR babysitting, CI fixes, and rebasing. CI means continuous integration, the automated checks that run when code changes. Rebasing means bringing a branch up to date with the latest version of the codebase.

The official routines docs add two important details. First, a routine is a saved configuration: prompt, repositories, environment, connectors, and triggers. Second, routines run autonomously in the cloud with no permission prompts during a run. That makes scoping matter: repositories, network access, environment variables, and connectors should match the job, because anything the routine does through GitHub, Slack, Linear, or another connected service appears as you.

That is a subtle but important shift. The routine does not wait for the owner to remember the task. It monitors the environment, finds neglected work, drafts the fix, and routes the result back to a person only where judgment is still needed.
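The "find neglected work" step of the five-hour routine is easy to picture as a filter. A rough sketch follows, with a hypothetical `BugReport` shape. The `easy_to_verify` flag is an assumption for illustration: in the real routine, that judgment is part of the agent’s work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BugReport:
    id: str
    opened_at: datetime
    has_response: bool
    easy_to_verify: bool  # hypothetical flag, see the note above


def select_stale_reports(
    reports: list[BugReport],
    now: datetime,
    max_age: timedelta = timedelta(hours=5),
) -> list[BugReport]:
    """Return unanswered reports that have waited at least `max_age`."""
    return [r for r in reports if not r.has_response and now - r.opened_at >= max_age]


def route(report: BugReport) -> str:
    """Decide how much human judgment the drafted fix needs."""
    if report.easy_to_verify:
        return "draft fix, owner merges after a quick check"
    return "draft fix, owner reviews in depth"
```

Usage would look like `select_stale_reports(reports, datetime.now(timezone.utc))` on each trigger, followed by `route` for every report that comes back.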

The non-code version is easy to imagine:

A support routine watches unanswered customer issues.
A content routine checks drafts against house style.
A finance routine watches for weird changes in a forecast.
A sales routine drafts follow-ups when a deal has gone quiet.
An ops routine flags tickets that have sat untouched for too long.

The magic is the trigger. The agent starts because something happened in the system, not because a person remembered to ask.

AI Skill: Design a watch-and-fix routine

Design an agent routine for this recurring workflow: [workflow].

Define:
1. Trigger: what event should start the routine?
2. Scope: what is the agent allowed to inspect?
3. Action: what can it draft, fix, or update?
4. Verification: how does it prove the work is correct?
5. Escalation: when should it ping a human?
6. Safety: what is it forbidden to change?

Keep the first version low-risk and human-approved.

Auto mode replaces approval spam with evaluated trust

Boris says he used to rely on plan mode, where Claude researches and proposes changes before editing. Now he mostly uses auto mode, because newer models need less explicit planning for many tasks.

His model-specific note is useful: plan mode mattered for Opus 4 through roughly 4.5, then became less necessary around 4.6 and especially 4.7. Some people still like the planning artifact. Boris prefers auto mode because he can start a Claude, let it work, and move to the next one without watching every step.

The old Claude Code pattern was permission prompts. The agent wanted to run a tool, then asked the human to approve it. Boris says that made sense when models and classifiers were weaker. Auto mode routes routine work through background safety checks that block actions that escalate beyond your request, target unrecognized infrastructure, or appear driven by hostile content Claude read.

Cat’s security point is the one every company should copy. When a human accepts 99% of permission prompts, the review stops being meaningful. Auto mode can be safer if it makes the human pay attention only when the risk is real.

Anthropic did the heavy version of this. Cat says the team collected thousands of agent transcripts, had auto mode classify safety, brought in red teamers, created evals, and asked internal teams to prompt inject and hack Claude Code. Red teaming means trying to break your own system before attackers do. Evals are repeatable tests that measure whether the system behaves correctly. Prompt injection means hiding instructions in content so an AI follows the wrong command.

Boris’s meta-lesson is that building on models keeps breaking old engineering instincts. He says ideas that sounded wrong at first, like routing a permission prompt to another model, turned out empirically strong. The job becomes less about defending old patterns and more about testing which weird model-native patterns actually work.

The docs keep the trust boundary clear: auto mode is a research preview that reduces prompts, and sensitive operations still need review. The takeaway: autonomy should be earned at the system level, not granted one button-click at a time.
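To see why evaluated trust beats approval spam, here is a toy version of the gate. It is emphatically not how auto mode is implemented: the real system uses a classifier model, while this sketch takes the three risk signals described above as pre-computed booleans. Every name in it is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass(frozen=True)
class ProposedAction:
    description: str
    within_request: bool       # stays inside what the human asked for
    known_target: bool         # touches only infrastructure the project recognizes
    from_untrusted_text: bool  # appears driven by content the agent read


def review(action: ProposedAction) -> Verdict:
    """Block on any of the three risk signals; otherwise let the work continue."""
    if not action.within_request or not action.known_target or action.from_untrusted_text:
        return Verdict.BLOCK
    return Verdict.ALLOW


def needs_attention(actions: list[ProposedAction]) -> list[str]:
    """Return only the actions a human should actually look at."""
    return [a.description for a in actions if review(a) is Verdict.BLOCK]
```

If a session proposes a hundred actions and `needs_attention` returns two, the human reads two things carefully instead of clicking "approve" a hundred times.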

AI Skill: Build an autonomy ladder

Create an autonomy ladder for this agent workflow: [workflow].

Use 4 levels:
Level 1: Suggest only.
Level 2: Draft changes, human applies.
Level 3: Apply low-risk changes, human approves before publish or merge.
Level 4: Apply and complete automatically with audit logs.

For each level, define:
- Allowed actions
- Forbidden actions
- Required verification
- When to escalate
- Rollback plan

Recommend the safest level to start with and explain why.

Loop is the second leap

Boris calls Loop the next leap (docs, agent loop explained). The first leap moved engineers from editing source code directly to talking to an agent that edits source code. The next leap moves people from talking to one agent to talking to a loop or routine that prompts Claude for them.

His version of the history: the old object of attention was source code, then it became the agent, and now it is the loop. The person designs the repeated process, and the process prompts Claude.

A loop is a repeated process: check the state, decide what to do, act, verify, and repeat. That sounds small until you apply it to real work. A person no longer has to wake up the agent, remember the context, assign the task, and check whether anything happened. The loop keeps the work alive.

This is where agent work starts to feel organizational. A single prompt helps one person. A loop changes how the work moves.
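The check, decide, act, verify, repeat shape fits in a dozen lines. This is a generic sketch, not Claude Code’s /loop: the four callables are hypothetical placeholders for whatever your loop observes and does.

```python
from typing import Callable, Optional


def run_loop(
    observe: Callable[[], list[str]],
    decide: Callable[[list[str]], Optional[str]],
    act: Callable[[str], object],
    verify: Callable[[str], bool],
    max_iterations: int = 10,
) -> list[str]:
    """Repeat observe -> decide -> act -> verify until there is nothing left to do."""
    log: list[str] = []
    for _ in range(max_iterations):
        state = observe()
        action = decide(state)
        if action is None:
            log.append("nothing to do; stopping")
            break
        act(action)
        outcome = "verified" if verify(action) else "FAILED verification"
        log.append(f"{action}: {outcome}")
    return log
```

The log is the report. The `max_iterations` cap is the seatbelt: a loop with no budget is how a small automation turns into a large bill.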

AI Skill: Convert a task into a loop

Convert this recurring task into an agent loop: [task].

Map the loop:
1. Observe: what should the agent check?
2. Decide: what rule determines the next action?
3. Act: what should it do?
4. Verify: how should it check the result?
5. Report: what should it tell the human?
6. Improve: what should it add to memory after a failure?

Make the first version small, reversible, and easy to audit.

Addy Osmani turns Loop into a working blueprint

Addy Osmani’s loop engineering post sharpens the idea: “loop engineering” means replacing yourself as the person who prompts the agent. You design the system that prompts the agent instead.

He defines a loop as a recursive goal: you give the system a purpose, then the AI iterates until the goal is complete. That matches what @steipete argued about designing loops that prompt agents, and what Boris Cherny said about writing loops that figure out what to do.

Addy is careful about the hype. He thinks this may be the future of working with coding agents, but he flags three real concerns: token costs can vary wildly, quality can drop, and worries about slop are valid. In other words, the loop is leverage. It can also become a very confident conveyor belt for bad work.

His clearest explanation is the shift from turn-by-turn prompting to system design. For the last two years, most coding-agent work meant writing a prompt, adding context, reading the result, and typing the next instruction. Now the higher-leverage move is building a small system that finds work, hands it out, checks it, records what happened, and decides the next step.

That places loop engineering one level above agent harness engineering, which designs the environment one agent runs inside, and above the factory model, which treats software work as a system that produces changes. The loop runs on a timer, spawns helpers, feeds itself, and keeps state outside any one chat.

The practical surprise in Addy’s post is that this has become a product pattern. A year ago, you needed a pile of custom bash scripts to build a loop. Now the pieces ship inside both Codex and Claude Code. The tool names differ, but the shape is the same enough that the real job becomes designing a loop that survives across tools.

The five loop building blocks, plus memory

Addy says a useful loop needs five building blocks and one memory layer:

Automations that run on a schedule, API call, or GitHub event and do discovery or triage by themselves.
Worktrees so multiple agents can work in parallel without writing over each other.
Skills that store the project knowledge the agent would otherwise guess.
Plugins and connectors that plug the agent into the tools you already use.
Subagents so one agent can make the thing and another can check it.
Memory, like CLAUDE.md, auto memory, a markdown state file, or a Linear board, that lives outside the conversation and records what is done and what comes next.

The memory point is easy to miss. Addy connects it to his post on long-running agents: the model forgets between runs, so the memory has to live on disk, in a repo, or in a system like Linear. The agent forgets. The repo does not.
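A markdown state file can be as simple as a checklist the loop reads at the start of a run and rewrites at the end. Here is a minimal sketch, assuming a hypothetical file made of `- [x] task` and `- [ ] task` lines. The format and function names are illustrative, not something Addy prescribes.

```python
from pathlib import Path

DONE_PREFIX = "- [x] "
TODO_PREFIX = "- [ ] "


def load_state(path: Path) -> dict[str, bool]:
    """Parse checklist lines into {task: done}; a missing file means a fresh start."""
    state: dict[str, bool] = {}
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith(DONE_PREFIX):
                state[line[len(DONE_PREFIX):]] = True
            elif line.startswith(TODO_PREFIX):
                state[line[len(TODO_PREFIX):]] = False
    return state


def save_state(path: Path, state: dict[str, bool]) -> None:
    """Write the checklist back so the next run can pick up where this one stopped."""
    lines = [f"- [{'x' if done else ' '}] {task}" for task, done in state.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Because the file is plain markdown in the repo, a human can read it in a diff, and the next run starts from facts instead of from whatever the model happens to recall.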

Automations are the heartbeat

In Addy’s framing, automations are what make a loop more than a one-off run. In Codex, he says you can create one in the Automations tab, choose the project, choose the prompt, choose the cadence, and choose whether it runs on a local checkout or background worktree. Runs that find work go to a Triage inbox. Runs that find nothing archive themselves.

He says OpenAI uses automations internally for “boring” work like daily issue triage, CI failure summaries, commit briefings, and hunting bugs introduced the week before. The maintainability trick is that an automation can call a skill. Instead of pasting a giant prompt into a schedule, you can fire $skill-name and update the skill when the workflow changes.

Claude Code reaches a similar place through routines, scheduled tasks, and hooks. Addy mentions /loop for running a prompt or command on an interval, cron tasks for scheduled work, shell hooks that fire at specific points in the agent lifecycle, and GitHub Actions for loops that keep running after your laptop closes.

Anthropic’s scheduling docs split the options clearly. Use cloud routines for reliable work that should run without your machine. Use desktop scheduled tasks when the job needs local files and tools. Use /loop for quick polling inside a live CLI session.

He also calls out /goal, which matters because it gives the loop a stopping condition. /loop re-runs on a cadence. /goal keeps going until a condition you wrote is true, such as “all tests in test/auth pass and lint is clean.” After every turn, a separate small model checks whether the work is done, so the writer is not also the grader. Anthropic’s docs add a useful constraint: the evaluator does not run commands or read files independently, so Claude has to surface evidence in the transcript. Addy says Codex has the same primitive, also called /goal, with pause, resume, and clear.
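The difference between a cadence and a stopping condition is easier to see in code. This sketch is not the /goal implementation in either tool. It only mirrors the two properties described above: the grader is separate from the worker, and it sees nothing but the transcript. `work_turn` and `goal_met` are hypothetical callables.

```python
from typing import Callable


def run_until_goal(
    work_turn: Callable[[int], str],         # does one turn of work, returns what it reported
    goal_met: Callable[[list[str]], bool],   # separate grader; sees only the transcript
    max_turns: int = 20,
) -> tuple[bool, list[str]]:
    """Keep working until the grader says the goal is met or the turn budget runs out."""
    transcript: list[str] = []
    for turn in range(1, max_turns + 1):
        transcript.append(work_turn(turn))
        if goal_met(transcript):
            return True, transcript
    return False, transcript
```

Since the grader cannot run commands itself, the worker has to put evidence (test output, lint results) into the transcript. A turn that says "done, trust me" gives the grader nothing to pass.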

Worktrees keep parallel work from turning into file chaos

The moment you run multiple agents, file collisions become the failure mode. A git worktree gives each agent a separate working directory on its own branch while sharing the same repo history. That means one agent’s edits cannot touch another agent’s checkout.

Addy says Codex builds worktree support into the app so multiple threads can hit the same repo at once. Claude Code gives similar isolation through git worktrees, a --worktree flag, and an isolation: worktree setting for subagents. The docs add one useful operating detail: the desktop app creates a worktree for every new session automatically, and CLI users can start one with claude --worktree feature-auth. He links this to the orchestration tax: worktrees remove mechanical collisions, but your review bandwidth still decides how many agents you can actually run.
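If you are scripting this yourself rather than relying on the desktop app, the underlying primitive is plain `git worktree add`. Here is a small sketch that builds the command for one agent. The `agent/<task>` branch name and the sibling-directory layout are hypothetical conventions, not something either tool requires.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build the git command that gives one agent its own checkout and branch."""
    branch = f"agent/{task}"                      # hypothetical naming convention
    checkout = repo.parent / f"{repo.name}-{task}"  # sibling directory next to the repo
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


def create_worktree(repo: Path, task: str) -> None:
    """Actually create the worktree; raises CalledProcessError if git fails."""
    subprocess.run(worktree_command(repo, task), check=True)
```

For a repo at `/src/app` and a task named `feature-auth`, the agent gets a checkout at `/src/app-feature-auth` on branch `agent/feature-auth`, sharing history with the main repo.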

Skills keep intent from leaking out of the loop

Addy’s skills point lines up with Boris’s CLAUDE.md advice. A skill is a folder with a SKILL.md file, instructions, metadata, and optional scripts, references, or assets. Codex can run a skill when you call it with $ or /skills, or automatically when the task matches the description. Claude Code uses the same pattern through /skill-name invocation and automatic relevance matching.

The docs add the key design rule: use CLAUDE.md for facts Claude should hold in every session, then move multi-step procedures or long references into skills. Since skill bodies load only when invoked, a skill can hold a big checklist without wasting context on every turn.

The useful detail is that boring descriptions beat clever ones. The agent has to know when to call the skill. A clever label that humans enjoy can make the skill harder for the model to match.
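Here is what "boring" looks like in practice, as a small generator for a SKILL.md file. It assumes the file opens with `name` and `description` frontmatter fields, which is worth confirming against the skills docs for your tool. The skill content itself is made up for illustration.

```python
def render_skill(name: str, description: str, steps: list[str]) -> str:
    """Render a minimal SKILL.md: frontmatter first, then numbered steps.

    Assumes `description` contains no characters that need YAML quoting
    (a colon followed by a space, for example).
    """
    body = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"---\nname: {name}\ndescription: {description}\n---\n\n{body}\n"


skill_text = render_skill(
    name="desktop-app-check",
    # Plain and literal on purpose: says what the skill does and when to use it.
    description="Start the local desktop app and click through changed screens. Use after editing desktop UI code.",
    steps=[
        "Start the app locally.",
        "Open the screen that changed.",
        "Try the happy path and two edge cases.",
        "Report what you saw, with screenshots.",
    ],
)
```

A description like "UI wizard" would be more fun and much less useful. The model matches on the description, so it should read like a trigger condition.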

Addy connects this to agent skills and intent debt. Intent debt is what happens when the agent starts cold and fills gaps with confident guesses. A skill writes the team’s conventions, build steps, and scar tissue into a reusable place the loop can read every run. Without skills, the loop re-derives the project from zero every cycle. With skills, the loop compounds.

He also draws a useful distinction: the skill is the authoring format, and a plugin is how you ship it. In Claude Code, plugins can package skills, agents, hooks, and MCP servers. Use standalone .claude/ configuration for personal or project-specific experiments, then convert to a plugin when the workflow needs to be shared, versioned, or reused across projects. Addy says the same pattern applies in both Codex and Claude Code.

Connectors let the loop touch real work

A loop that can only read files is small. Connectors, built on MCP, let the agent read an issue tracker, query a database, hit a staging API, or post in Slack. Anthropic’s docs frame MCP as the answer to copy-paste work: if you keep copying data from a monitoring dashboard, issue tracker, database, or design tool into chat, connect that system so Claude can read and act on it directly. Addy says both Codex and Claude Code speak MCP, so a connector written for one often works in the other.

This is the gap between “here is the fix” and “I opened the PR, linked the Linear ticket, and pinged the channel once CI turned green.” Plugins can bundle connectors and skills together, so a teammate installs the setup instead of rebuilding it from memory. The docs also flag the safety tradeoff: servers that fetch outside content can expose Claude to prompt-injection risk, so trust and scope matter.

Subagents keep the maker away from the checker

Addy’s strongest structural rule is to split the agent that writes from the agent that checks. The model that wrote the code can be too forgiving when grading its own work. A second agent with different instructions, and sometimes a different model, catches what the first one rationalized.

In Codex, he says subagents only spawn when asked, run in parallel, and fold results back into one answer. You define them as TOML files in .codex/agents/, with a name, description, instructions, and optional model or reasoning effort. That means a security reviewer can use a stronger model at higher effort, while an explorer can be fast and read-only.

Claude Code follows the same pattern with subagents in .claude/agents/ and agent teams that pass work between them. The common split is explorer, implementer, and verifier.

The docs explain why this saves more than time. Each subagent runs in its own context window with its own system prompt, tool access, and permissions, then returns only the result. That keeps high-volume searches, logs, and file reads from flooding the main conversation. Clear descriptions matter because Claude uses each subagent’s description to decide when to delegate.

Addy ties this to the code agent orchestra and adversarial code review. The loop runs while you are away, so the verifier is the only reason you can trust the loop enough to walk away. The tradeoff is cost: subagents burn more tokens because each one does its own model and tool work. Spend that second opinion where it is worth paying for.
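
To make the maker/checker split concrete, here is a minimal Python sketch of the control flow. It is an illustration under stated assumptions, not how Codex or Claude Code implement subagents: `Verdict`, `run_maker_checker`, `maker`, and `checker` are hypothetical names, and the two callables stand in for separate agents with different instructions (and possibly different models).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """What the checker hands back: a decision plus notes for the maker."""
    approved: bool
    notes: str


def run_maker_checker(
    task: str,
    maker: Callable[[str, Optional[str]], str],
    checker: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Verdict]]:
    """Alternate between a writing agent and a separate reviewing agent.

    Both callables are hypothetical stand-ins. The maker never grades its
    own draft; it only sees the checker's notes from the previous round.
    """
    feedback: Optional[str] = None
    history: List[Verdict] = []
    for _ in range(max_rounds):
        draft = maker(task, feedback)
        verdict = checker(task, draft)
        history.append(verdict)
        if verdict.approved:
            return draft, history
        feedback = verdict.notes
    # No approval within the budget: return nothing and let a human decide.
    return None, history
```

The `max_rounds` cap is the cost control: every extra round is another pair of model calls, so an unapproved draft goes to a human instead of looping forever.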

AI Skill: Build a loop spec from Addy’s framework

Design a loop for this recurring workflow: [workflow].

Use Addy Osmani’s six-part loop structure:
1. Automations: what runs on a schedule or trigger?
2. Worktrees: how will parallel work stay isolated?
3. Skills: what project knowledge should live outside the chat?
4. Plugins/connectors: which tools must the loop touch?
5. Sub-agents: who explores, who implements, and who verifies?
6. Memory: where will the loop record what happened and what comes next?

Then add:
- Token-cost risk
- Quality/slop risk
- Human review checkpoint
- Stop condition
- First safe version to test

One real loop shape

Addy’s example turns the building blocks into a simple daily workflow. An automation runs every morning on the repo. Its prompt calls a triage skill that reads yesterday’s CI failures, open issues, and recent commits. It writes findings into a markdown file or Linear board.

For each worthwhile finding, the thread opens an isolated worktree and sends one subagent to draft the fix. A second subagent reviews the draft against project skills and existing tests. Connectors open the PR and update the ticket. Anything the loop cannot handle goes to the triage inbox. The state file becomes the spine: what got tried, what passed, and what remains open, so tomorrow’s run picks up where today’s stopped.

That is the whole shift. You designed the loop once. You did not prompt each step. The same shape works in Codex or Claude Code because the pieces are now similar enough.
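
The state file is the piece people tend to skip, so here is a rough Python sketch of one. Everything in it is an assumption for illustration: Addy’s example writes to a markdown file or a Linear board, while this version uses JSON because it is easy to parse, and the keys `tried`, `passed`, and `open` plus the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List

State = Dict[str, List[str]]


def load_state(path: str) -> State:
    """Read yesterday's state, or start empty on the first run."""
    p = Path(path)
    if not p.exists():
        return {"tried": [], "passed": [], "open": []}
    return json.loads(p.read_text(encoding="utf-8"))


def record_result(state: State, finding: str, passed: bool) -> State:
    """Track what got tried, what passed, and what remains open."""
    if finding not in state["tried"]:
        state["tried"].append(finding)
    if passed:
        if finding not in state["passed"]:
            state["passed"].append(finding)
        if finding in state["open"]:
            state["open"].remove(finding)
    else:
        # A regression moves the finding back from passed to open.
        if finding in state["passed"]:
            state["passed"].remove(finding)
        if finding not in state["open"]:
            state["open"].append(finding)
    return state


def save_state(path: str, state: State) -> None:
    """Persist the state so tomorrow's run picks up where today stopped."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")
```

Tomorrow’s run would call `load_state` first and work through whatever is still in `open`, which is the “picks up where today’s stopped” behavior in the example.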

What the loop still does not do

Addy’s ending is the counterweight the hype needs. A loop changes the work. It does not delete the human from the work.

Verification remains your job. A loop running unattended can make mistakes unattended. Even with a verifier subagent, “done” is a claim that needs evidence. Addy connects this to code review in the age of AI: your job is to ship code you confirmed works.
Understanding can rot. The faster the loop ships code you did not write, the wider the gap can grow between what exists and what you understand. Addy calls that comprehension debt.
Comfort can become the risk. When the loop runs itself, it is easy to stop having an opinion and accept whatever comes back. Addy calls that cognitive surrender.

His final advice is balanced: set up loops, keep prompting agents directly when that is the right tool, and build loops like someone who intends to stay the engineer. Two people can build the same loop and get opposite outcomes. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work. The loop cannot tell the difference. You can.

Organizations get value when the agent sits at the center of the process

Boris puts on his “business cat hat” and uses a 1990s productivity analogy. Companies saw limited gains from computers when they kept paper processes and put computers on the side. The productivity gains came when computers moved to the center of the process.

His claim is that the same pattern is happening with AI, only faster. The personal-computer transition took 10 to 15 years. AI can move faster because so much work is already digital, and Claude can use a computer, write code, and run code.

Anthropic’s internal behavior is the case study. During onboarding, new employees ask Claude instead of interrupting coworkers. Boris says Claude is where he goes for questions, code, code review, security review, and even forms. The point is organizational design: Claude sits in the middle of the workflow, not beside it.

The human upside is quieter than the productivity pitch. Boris says he bugs people less because Claude can handle many of the information requests. When he does interact with people, it is more often for collaboration, ideas, customer conversations, and shared creation.

His strongest personal line comes near the back of the interview: “I don’t have a to-do list anymore.” The literal claim is about his own workflow, not every employee in every company. The useful insight is that the to-do list moves from a static pile of tasks to a set of agents, routines, and loops that keep converting ideas into drafts, fixes, reviews, and PRs.

That is also where roles merge. Cat answers the product-versus-engineering question with “Everyone’s going to be both.” On her team, product, developer relations, and design all write code. Engineers ship products end to end, working with legal, marketing, and security along the way.

Her filter for who benefits is worth saving: AI rewards curiosity, product taste, and end-to-end ownership. In other words, the leverage goes to people who can define what should happen, judge whether the output is good, and carry the work across the old departmental seams.

AI Skill: Run an agent-center audit

Audit my team workflow for places where AI should sit at the center.

Workflow: [describe workflow]

Find:
1. Repeated questions people ask teammates.
2. Repeated checks people perform manually.
3. Decisions that need context from multiple docs or systems.
4. Tasks with clear inputs and outputs.
5. Tasks where a draft from AI would save time even if a human approves.

Rank the top 5 agent opportunities by value, risk, and ease of verification.

The new manager job: supervising a fleet

Once people run many agents, the interface has to change. Boris says his old workflow was six terminal tabs and six checkouts of the same repo. Now he uses the new agent view and the desktop app, which handles worktree cloning. The official agent view docs describe it as one screen for background sessions: what is running, what needs input, and what is done. A worktree is a separate working copy of the same codebase, so multiple tasks can happen in parallel without stepping on each other.

The ergonomics matter because agent work can become a coordination problem. One tab plus agent view gives him a control surface with session states, row summaries, PR labels, peeks, replies, and full attach/detach controls. The desktop app creates the worktrees. Remote Control lets him start from a computer and keep supervising from his phone while the session still runs locally on his machine, with local files, tools, MCP servers, and project configuration still available.

Then comes the strangest detail: Boris says about half his engineering happens on his phone. He starts agents from Remote Control, checks in while walking around, starts another agent while getting coffee, and uses voice mode to spin up work from an idea while talking to someone.

Cat’s couch anecdote makes the workflow visible. Boris left his laptop open, plugged in, and locked at the office while PRs kept landing from that machine. Cat thought he had forgotten it. Then it happened again. The punchline was that he was coding from his couch because Remote Control had gotten good enough.

Somewhere, a standing desk shed a single tear.

The management lesson is practical. When one person runs many agents, the job becomes briefing, labeling, checking, merging, and stopping work that drifts.

AI Skill: Create an agent fleet brief

I am about to run multiple agents in parallel.
Create a fleet-management plan.

For each agent, define:
- Name
- Goal
- Files or systems it can touch
- What success looks like
- Verification required
- Check-in frequency
- Stop condition
- Merge or publish owner

Also create a 5-minute end-of-day review checklist.

Context minimalism: give the model less, but give it a way to find more

The last major advice section is about context. Boris says people used to talk about prompt engineering, then context engineering. His timeline is model-specific: with Sonnet 3.5, you had to prompt engineer. With Opus 4, you had to context engineer. With today’s models, his advice is minimal system prompt, minimal tools, and some way for the model to pull in the context it needs.

A system prompt is the standing instruction the AI follows. Tools are the things it can use, like file search, a browser, terminal commands, or app integrations. Cat calls herself “a context minimalist”: tell the model only what it needs to know and let it figure out the rest.

Her reasoning is subtle. Too much context can become micromanagement. Sometimes the model knows a better route to the same outcome, so she prefers giving it freedom inside clear constraints. She also says the team is making the harness leaner, which leaves more room for the user’s own prompt and helps Claude follow that prompt better.

The context window docs show the mechanical reason this matters. Before you type anything, Claude may already have CLAUDE.md, auto memory, MCP tool names, and skill descriptions in context. As it works, file reads, path-scoped rules, hooks, and skill bodies can all add weight. The docs recommend clearing between unrelated tasks, compacting with focus instructions, and delegating large reads to subagents so the main context stays useful.

That lesson applies far beyond coding. Many people respond to better models by pasting more instructions, more background, more examples, and more constraints. The better move is usually a smaller brief plus a clear path to retrieve details.

AI Skill: Practice context minimalism

Rewrite this prompt for context minimalism.

Original prompt:
[paste prompt]

Create:
1. A minimal version with only the goal, constraints, and output format.
2. A “context access” section telling the AI where to look if it needs more information.
3. A list of details I should remove because they micromanage the model.
4. A final version under 150 words.

What Boris thinks changes next

The back end of the interview is the most future-facing part. Boris says it would be surprising if Claude Code’s current usage patterns still looked the same in a year, because agents are running longer, acting more autonomously, and moving from one-at-a-time work to many-at-once work.

He says he rarely runs only one agent now. It is usually a few, dozens, hundreds, or thousands. That shift forces a new form factor, because a terminal interface built for one synchronous session does not naturally map to a fleet of long-running agents.

His prediction is also cultural. The next ideas will come from the team and the builder community because everyone is close to the product, everyone talks to users, and everyone is encouraged to come up with ideas. The workflow changes the product team that builds the workflow.

The credible counterpoint: Anthropic is describing the frontier case

The strongest caveat is simple: this is the Claude Code team describing its own workflow inside a company with unusually high AI fluency, and Addy is describing the pattern from the vantage point of someone who knows how to design agent systems.

Most companies lack at least one ingredient: clean documentation, safe staging environments, agent-ready tooling, red team support, permission classifiers, thousands of transcripts for evals, or enough review bandwidth to manage parallel agents. They may also have compliance rules that make “coding from the couch” a security conversation before it becomes a productivity story.

Addy’s warning makes the adoption path narrower and more useful. Start with work that has three traits:

A clear trigger, like a bug report, draft submission, failed check, or stale ticket.
A low-risk first action, like a draft, suggestion, summary, or proposed fix.
A concrete verification step, like opening the app, running the command, checking the source, or comparing against a rubric.

The unresolved issue is whether ordinary companies can build enough verification and shared context to trust autonomous agents without recreating Anthropic’s internal infrastructure. That answer will decide whether agent fleets become everyday work systems or stay trapped in impressive demos.

Agent view is the missing control room for agent fleets

So the part that actually caught me by surprise in this video was the new agent view panel, which is apparently opened with claude agents. This was the first time I had heard of it, maybe because we covered it in a burst of other news, or because it was a bit more of a stealth launch. But basically, it gives you one screen for every background Claude Code session: what is running, what needs input, what is ready for review, and what has finished. That makes Boris’s “six terminal tabs” line veeeery literal. Agent view turns agent work from terminal sprawl into a queue you can scan.

The docs position it for independent tasks that can run without you watching every step. Their examples are exactly the kind of work Boris and Cat describe: dispatch a bug fix, a PR review, and a flaky-test investigation as separate rows, keep working somewhere else, then return when a row needs you or has a result.

That also explains where agent view fits beside Claude Code’s other parallel-work tools. The Run agents in parallel docs split the options this way:

Subagents are delegated workers inside one Claude Code session. Use them when a side task would flood your main context with logs, search results, or file contents.
Agent view is a dispatch board for independent background sessions. Use it when you want to hand off several tasks, check status at a glance, and step in only when one needs judgment.
Agent teams are coordinated sessions with a lead agent, shared task list, and inter-agent messaging. Use them when Claude should split a project and keep workers in sync.
Dynamic workflows are scripted multi-agent runs, useful when the work needs many subagents, several verification passes, or cross-checked results.

So agent view is the human-managed version of parallel work. Claude is not coordinating the whole swarm for you. You are. The interface gives you a control surface for that job.

How agent view actually works

The quick start flow is simple: run claude agents, type a task, and hit Enter. That prompt starts a new background session. If you type another prompt and hit Enter, Claude starts a second session alongside the first instead of treating it as a follow-up. Each session uses your quota independently, which is the “please watch the token meter” footnote hiding under every agent-fleet fantasy.

Each row shows whether the session is working, waiting on input, idle, completed, failed, stopped, or sleeping between /loop iterations. The row can also show a PR label, and the PR status is color-coded: yellow for waiting or failed checks, green for ready, purple for merged, and grey for draft or closed.

The clever bit is the peek panel. Press Space on a row and you get the latest output, the question Claude is waiting on, and any PRs it opened. Most of the time, that is enough. You can reply without opening the full transcript, pick a multiple-choice answer with a number key, use Tab for a suggested reply, or prefix the reply with ! to send a Bash command. If you need the full conversation, press Enter or → to attach.

Attaching temporarily turns the row into a normal Claude Code session. Claude gives you a recap of what happened while you were away, then you can use the usual commands and keyboard shortcuts. When you detach, the session keeps running. That is the key mental model: agent view is not a transcript viewer. It is a live switchboard for sessions that can keep working without a terminal attached.

The dispatch tricks are where this gets useful

The dispatch docs show how flexible this gets. From the agent view input, you can mention a custom subagent with @agent-name, target a sibling repo with @repo, launch a skill or command with /, run a background shell command with !, or jump to an existing PR session with #123 or a PR URL.
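
If the prefixes blur together, a toy classifier can work as a cheat sheet. This Python sketch only restates the prefixes listed above and does not reflect how Claude Code actually parses input: the function name, the returned labels, and the GitHub-style `/pull/<number>` URL pattern are all assumptions.

```python
import re


def classify_dispatch(text: str) -> str:
    """Label an agent view input line by its prefix (illustrative only)."""
    s = text.strip()
    # "#123" or a PR URL jumps to an existing PR session.
    if re.fullmatch(r"#\d+", s):
        return "pr_session"
    # Assumption: PR URLs look like GitHub's ".../pull/<number>".
    if re.fullmatch(r"https?://\S+/pull/\d+/?", s):
        return "pr_session"
    # "!" runs a background shell command.
    if s.startswith("!"):
        return "shell_command"
    # "/" launches a skill or command.
    if s.startswith("/"):
        return "skill_or_command"
    # "@name" mentions a custom subagent or targets a sibling repo. The two
    # cannot be told apart without knowing which names exist.
    if re.search(r"(?:^|\s)@[\w./-]+", s):
        return "prompt_with_mention"
    return "prompt"
```

For example, `classify_dispatch("!npm test")` returns `"shell_command"`, while a plain sentence falls through to `"prompt"` and would start an ordinary background session.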

You can also launch background sessions from the shell with claude --bg, run a named subagent with claude --agent code-reviewer --bg, name the row with --name, or run a shell command with --exec. This is where agent view starts feeling less like a screen and more like an operating layer. You can create sessions from the dashboard, from inside an existing session with /bg, or directly from the shell.

The file-editing safety layer matters too. Before a background session edits files, Claude moves it into an isolated git worktree under .claude/worktrees/. That means parallel sessions can read the same repo while each writes to its own checkout. Deleting a session from agent view can delete Claude’s created worktree, including uncommitted changes, so the practical rule is boring and important: merge, push, or save the work you care about before cleanup.
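
Claude handles this for you, but it helps to see what an isolated worktree amounts to at the git level. The Python sketch below uses the real `git worktree add` and `git status --porcelain` commands. The function names and the `agent/<name>` branch convention are hypothetical, and this is not Claude Code’s own implementation.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(repo_root: str, name: str) -> List[str]:
    """Build the git command that creates an isolated checkout for one task."""
    root = Path(repo_root).resolve()
    path = root / ".claude" / "worktrees" / name
    # "agent/<name>" is a made-up branch convention for this sketch.
    return ["git", "-C", str(root), "worktree", "add", "-b", f"agent/{name}", str(path)]


def create_worktree(repo_root: str, name: str) -> Path:
    """Run the command and return the new worktree's path."""
    cmd = worktree_add_command(repo_root, name)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


def has_uncommitted_changes(worktree: Path) -> bool:
    """True if the worktree has edits that deleting it would lose.

    This does not detect commits that were never pushed; check those
    separately before cleanup.
    """
    result = subprocess.run(
        ["git", "-C", str(worktree), "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())
```

Running `has_uncommitted_changes` before deleting a row is the “merge, push, or save” rule expressed as a quick check.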

The boring infrastructure is the feature

The hosting section explains why this works after you close the terminal. Background sessions are hosted by a per-user supervisor process, separate from agent view and separate from your terminal. The supervisor starts automatically, uses the same credentials as interactive sessions, and keeps each background session as its own Claude Code process.

Once a session finishes and sits unattached for about an hour, the supervisor may stop its process to free resources. The transcript and state stay on disk, so attaching, peeking, or replying starts it again from where it left off. Pinned sessions keep their process running while idle. Sessions survive sleep and auto-updates, but shutting down the machine stops running sessions because the work is still local.

This is the part that connects back to Boris. If agents are moving from one-at-a-time work to dozens or hundreds of longer-running sessions, the important product surface becomes the thing that tells you where attention is needed. Agent view’s real job is attention routing: which agent needs a decision, which one has a PR, which one is stuck, which one can be ignored, and which one should be stopped before it burns another small civilization of tokens.
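
Attention routing boils down to a sort order. As a rough Python sketch, assuming the row states described earlier (working, waiting on input, idle, completed, failed, stopped, sleeping), you could rank sessions like this. The ranking itself, the `Row` class, and the short status strings are my assumptions, not something the docs specify.

```python
from dataclasses import dataclass
from typing import Dict, List

# Lower number = look at it sooner. This ordering is an assumption.
PRIORITY: Dict[str, int] = {
    "waiting": 0,    # blocked on a human decision
    "failed": 1,     # needs a diagnosis or a restart
    "completed": 2,  # has a result or PR to review
    "idle": 3,
    "stopped": 4,
    "working": 5,    # fine to ignore for now
    "sleeping": 6,   # between /loop iterations
}


@dataclass
class Row:
    name: str
    status: str


def attention_order(rows: List[Row]) -> List[Row]:
    """Sort sessions so the ones that need a human come first.

    Unknown statuses sort last; ties keep their original order because
    Python's sort is stable.
    """
    return sorted(rows, key=lambda r: PRIORITY.get(r.status, len(PRIORITY)))
```

The point of writing it down is that you can change it: a team that reviews PRs slowly might put `completed` above `failed`.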

AI Skill: Set up your first agent-view board

I want to use Claude Code agent view for this project: [project].

Design my first 5 background sessions.

For each session, define:
1. Session name
2. Exact dispatch prompt
3. Whether it should run in the current repo, a sibling repo, or a worktree
4. Whether it should use a subagent, skill, shell command, or normal Claude session
5. What status or PR result I should look for in agent view
6. When I should peek, attach, stop, or delete it
7. Token-cost risk and human review checkpoint

Keep the tasks independent so they can safely run in parallel.

The safest first use is not “run my whole engineering org.” Start with three independent, low-risk rows: one flaky-test investigation, one docs cleanup, and one PR review. Watch how often they need input. Watch how useful the row summaries are. Watch how much quota they spend. The skill is learning your own review bandwidth before the interface makes running ten agents feel normal.

How to apply Claude Code’s first-year lessons this week

Use the video and Addy’s post as a practical maturity model. You can apply the lessons without running hundreds of agents (or, y’know, do run hundreds, if you don’t mind paying the Claude tax for them!).

Pick one repeated workflow. Choose a task with a clear input and output.
Write the memory layer. Add the key rules to CLAUDE.md, auto memory, a project doc, a markdown state file, or a Linear board.
Add verification. Define how the agent proves the work is correct, and keep the maker away from the checker.
Turn failures into skills. Every repeated mistake becomes a rule, skill, or project convention.
Add a trigger. Start with a routine, /loop, cron job, hook, channel, or GitHub Action that finds the work.
Use isolation. Put parallel agents in separate worktrees so file collisions do not become the failure mode.
Connect the real tools. Use MCP connectors only where the loop needs them: issue tracker, database, staging API, Slack, PRs, or CI.
Set a stopping condition. Use a concrete goal like “all tests pass and lint is clean,” then require evidence. TBH though, your tests should be more like “you opened the app and tested the feature end to end”, based on Boris’s advice on testing.
Watch the human risks. Track token cost, slop risk, comprehension debt, and the temptation to stop reviewing. Keep yourself in the loop, basically. As Karpathy says, you can’t outsource your understanding.
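
For the stopping-condition step, “require evidence” can be as simple as keeping the exit code and the tail of the output for every check. Here is a minimal Python sketch, assuming your checks are shell commands; `Evidence` and `check_goal` are hypothetical names, and this is not how /goal is implemented.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Evidence:
    command: List[str]
    exit_code: int
    output_tail: str


def check_goal(
    commands: Sequence[Sequence[str]], tail_chars: int = 500
) -> Tuple[bool, List[Evidence]]:
    """Run every check and keep proof of what happened.

    The goal counts as met only if at least one check ran and all of them
    exited with code 0. An empty list is treated as "not met", because a
    "done" with no evidence is just a claim.
    """
    evidence: List[Evidence] = []
    for cmd in commands:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        combined = result.stdout + result.stderr
        evidence.append(Evidence(list(cmd), result.returncode, combined[-tail_chars:]))
    met = bool(evidence) and all(e.exit_code == 0 for e in evidence)
    return met, evidence
```

You would pass in your own test and lint commands, then attach the returned evidence to the PR or state file so a human can see why the loop stopped.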

There are obvious implications and applications here for developers (for engineering teams, it could mean a routine that fixes easy CI failures in isolated worktrees, or any of the other examples Cat and Boris talk about in the vid), but I want to expand the scope to non-devs as well. Imagine applying this across domains. For content teams, this could mean a routine that checks drafts against style rules and writes notes to a state file. For support teams, it could mean a routine that watches unresolved tickets and drafts safe first responses. For ops teams, it could mean a routine that flags stale work before it becomes someone’s Friday evening nightmare.

The first year of Claude Code points to a broader agent lesson: the prompt is becoming the smallest unit of work, while the loop is becoming the workflow. The human job is moving toward designing the loop, checking the work, and staying close enough to the system to still understand what shipped. We are all managers of teams now. It’s best we start learning agent-manager best practices.

The most useful way to read Boris and Cat is as operators, not forecasters. Their unique insights come from the details: Cat updates the skill after debugging. Boris treats auto mode as a trust solver, not a convenience feature (it draws your attention to the most important security flags). Routines fix bugs before the owner sees them (freeing you up to do so much more). Agent view replaces terminal sprawl. Context minimalism replaces giant prompt packets. The future of agent work will be built out of habits like those, and many more to come. Perhaps it’s time for us all to adapt, eh?

The official docs

The Claude Code docs linked below should help add some context to the discussions above. The tools Boris and Cat describe are now documented as separate building blocks you can combine:

CLAUDE.md and auto memory: every session starts with a fresh context window, so persistent knowledge has to reload from somewhere. CLAUDE.md is the instruction file you write; auto memory is the set of notes Claude writes from your corrections, preferences, build commands, and debugging patterns. Anthropic also makes an important distinction: memory is context, not a hard safety policy. If you need to block an action, use a PreToolUse hook.
Skills: reusable SKILL.md files that package a workflow, checklist, or reference material. The key docs detail is cost and context: the body of a skill loads only when Claude uses it, so long procedures do not sit in the model’s context all day.
Routines: saved Claude Code configurations that bundle a prompt, repositories, connectors, and triggers. They can run on a schedule, API call, or GitHub event from Anthropic-managed cloud infrastructure, so they keep running after your laptop closes.
/loop: an in-session scheduler for polling and recurring checks. It can run on a fixed interval, let Claude choose the next interval, or run a built-in maintenance prompt that continues unfinished work, checks PR comments, fixes failed CI, handles merge conflicts, and does cleanup passes.
/goal: a condition-based loop. You write a measurable end state, like tests passing and lint being clean, and Claude keeps working turn by turn until a separate small model says the condition is met.
Worktrees: isolated working directories on separate branches. They let multiple agents edit the same repo in parallel without touching each other’s files.
Subagents: specialized assistants with their own context window, system prompt, tools, and permissions. They keep bulky research out of the main conversation and make the maker/checker split practical.
MCP connectors and plugins: connectors let Claude read and act on outside systems like Jira, Slack, databases, and monitoring tools. Plugins package skills, agents, hooks, and MCP servers so teams can share the setup.
Agent view, Remote Control, and computer use: the interface layer for managing many sessions, supervising local work from a phone, and letting Claude open apps, click, type, and test GUI-only flows.

Timecode map: all the moments worth watching

Below are the key moments of the video you can skip to if you’re interested.

0:00: Claude Code started as a small internal demo that got two Slack reactions.
0:31: The one-year jump from simple tasks to armies of agents and agent trees.
0:51: Boris’s best workflow: turn every mistake into CLAUDE.md instructions or skills instead of only correcting the current run.
1:11: Verification means more than unit tests, lint, and type checks.
1:50: Opus 4 tested its own feature in bash through the Claude CLI.
2:24: Cat explains a desktop development skill that runs the app, clicks through UX, tests edge cases, fixes, and rechecks.
2:43: Cat has Claude read Slack when staging issues may explain a bug, then update the skill after debugging.
3:14: Everyone codes on the Claude Code team because Claude handles more of the implementation.
3:48: Designer PRs became normal after the code proved useful.
4:04: Enterprise adoption starts with engineers, then spreads to adjacent roles looking over their shoulders.
4:19: Designers prototype, PMs make app changes, finance runs projections, and data scientists use Claude Code.
4:51: A routine listens for tickets, GitHub issues, and voice mode bug reports.
5:46: Another routine fixes bug reports unanswered for five hours.
6:04: Routines became the first obvious use of the agent SDK.
6:26: Agents now babysit PRs, code review, CI, and rebases.
6:43: Boris uses auto mode instead of plan mode.
6:57: Newer models need less explicit planning for many tasks.
7:21: Claude Code’s early permission-prompt model gave way to auto mode.
8:04: If humans accept 99% of prompts, permission review turns into approval spam.
9:00: Anthropic tested auto mode with thousands of transcripts, red teamers, evals, and internal prompt-injection attempts.
9:50: Boris says building on models forces engineers to relearn old assumptions and trust empirical results over intuition.
10:25: Loop is the next leap after talking directly to agents.
10:30: The interface shifts from source code, to agent, to loop or routine.
11:06: Boris uses the personal computer productivity analogy.
12:02: Anthropic onboarding routes questions through Claude.
12:40: The computer transition took 10 to 15 years; AI may move faster because work is already digital.
12:54: Boris says the tedious parts fall away, leaving more room for ideas and customer conversations.
13:21: Boris says he no longer has a to-do list because Claude builds from the ideas.
13:31: Cat says the future is both product and engineering.
14:07: AI benefits people with curiosity, product taste, and end-to-end ownership.
14:29: Agent view and desktop worktree cloning replace six terminal tabs.
14:59: Boris says about half his engineering now happens on his phone through Remote Control and voice mode.
15:32: Cat realizes Boris is landing PRs while coding from his couch.
16:06: The team moves from prompt engineering and context engineering toward context minimalism.
16:27: Boris recommends minimal system prompt, minimal tools, and a way to pull context.
16:41: Cat says too much context can micromanage the model, and a leaner harness leaves more room for the user’s prompt.
17:14: Agents are running longer, becoming more autonomous, and shifting from single agents to dozens, hundreds, or thousands.
17:46: Boris predicts the next form factors will come from the team and the broader builder community because everyone is close to users.

And that’s all for this one! If you took the time to read this, you probably want to check out our breakdown of the top takeaways for agentic app developers from Apple’s WWDC 2026 Platforms State of the Union (video), which shows how to use Apple’s new framework ecosystem to teach Apple apps to interact with Siri. We also broke down the launch of Claude’s new model, Fable 5, which you can check out here.

Source: The Neuron