MLA 024 Agentic Software Engineering

Apr 13, 2025 (updated Feb 22, 2026)

Agentic engineering shifts the developer role from manual coding to orchestrating AI agents that automate the full software lifecycle from ticket to deployment. Using Claude Code with MCP servers and git worktrees allows a single person to manage the output and quality of an entire engineering organization.

Vibe Coding Mini Series

Part 1: Vibe Coding
Part 2: Claude Code Components
Part 3: Agentic Software Engineering

Resources

Latent Space: The AI Engineer Podcast

DeepLearning.AI Short Courses

AI Engineering: Building Applications with Foundation Models - Chip Huyen

Hugging Face AI Agents Course

DeepLearning.AI: AI Agents in LangGraph

DeepLearning.AI: AI Agentic Design Patterns with AutoGen

Microsoft AI Agents for Beginners (GitHub)

LangChain: RAG From Scratch

Microsoft Generative AI for Beginners (GitHub)

Show Notes

The Shift: Agentic Engineering

Andrej Karpathy coined "vibe coding" in February 2025 and proposed "agentic engineering" as its successor in February 2026. The shift is from casual AI use to using agents as the primary interface for production coding. The goal is to automate the software engineering lifecycle so that a single person manages system design and outcomes while agents handle implementation.

Tooling and Context Efficiency

Minimize MCP servers to preserve context. One developer reported that 12 active servers consumed 66,000 tokens before any conversation started, about one-third of Claude's 200K window. Lazy-loading MCP definitions through MCP Tool Search can reduce that usage by up to 95%.

GitHub MCP: Accesses GitHub API for PR creation, issue management, and Actions.
Context7: Fetches version-specific documentation to prevent hallucinations in libraries like React or Prisma.
Sequential Thinking: Forces structured reasoning for complex architecture decisions.
Playwright: Performs browser automation for E2E testing and UI debugging.
Memory: Local knowledge-graph for persistent project context across sessions.
Hooks: PostToolUse auto-formats code via Prettier. PreToolUse blocks dangerous commands like rm -rf or writes to .env. SessionStart with a compact matcher re-injects instructions after context compaction.

High-Impact Workflows

Plan-First Mode: Press Shift-Tab twice to enter Plan Mode for read-only exploration. Create TODOs and milestones before implementation to reduce backtracking.
Git Worktrees: Claude Code supports parallel sessions via the --worktree flag. This allows 3 to 5 simultaneous agents to work on different branches in a single repository.
Headless Mode: Use the --print flag with JSON output to script Claude from external automation or CI/CD pipelines.

The Automated Engineering Pipeline

Trigger: Issues are filed or labels like claude-autofix are applied. Tools like n8n or OpenClaw can also trigger sessions via webhooks or Slack.
Implementation: Claude plans, implements changes, and writes tests in an isolated worktree.
Self-Review: The code-review plugin runs four parallel review agents that assign each finding a confidence score; only findings above the threshold (80 out of 100 by default) are flagged.
CI and Auto-Fix: Claude monitors CI status, auto-fixes failures, and merges PRs to staging via squash once checks pass.
Human Gate: The engineer reviews the accumulated changes in the staging branch before merging to main for production deployment.

Career Transition

The role of the engineer moves from writing code to acting as an engineering operator. Daily work involves triaging issues, making architectural judgment calls, and optimizing the automation system. Keeping the CLAUDE.md file under 100 lines conserves context tokens for the agents' actual work.

Transcript

Agentic Software Engineering

This guide assumes you already know Claude Code's components: CLAUDE.md, Skills, Commands, Hooks, MCP, Subagents, Agent Teams, and headless mode. If you need the fundamentals, read the companion guide first. Here we're wiring everything together into real workflows, and ultimately, into a system that lets you operate as a one-person engineering organization.

From Vibe Coding to Agentic Engineering

On February 2, 2025, Andrej Karpathy coined the term "vibe coding." The idea was simple: give in to the vibes, forget the code exists, let the AI handle it. It resonated. Collins Dictionary named it their Word of the Year. For a while, vibe coding was how people talked about writing software with AI assistance.

Exactly one year later, Karpathy walked it back. Vibe coding was fine for throwaway weekend projects, he said, but the reality on the ground had changed. Professional developers were now using AI agents as their primary coding interface, not for fun but for production work. He proposed a replacement term: agentic engineering. The word "agentic" because the new default is that you are not writing the code directly ninety-nine percent of the time. You are orchestrating agents who do. The word "engineering" to emphasize that there is an art and a science and a real expertise to doing this well.

This tracks with what Karpathy calls "jagged intelligence," which is one of the most useful mental models for working with LLMs. These models solve hard problems easily and trip on simple ones, and the frustrating part is that it's rarely obvious in advance which is which. The job of the agentic engineer isn't pressing a button. It's knowing when to trust the agent, when to intervene, when to run three agents in parallel and pick the best output. It's closer to managing a junior engineering team than to writing code yourself.

But here's the part most guides miss: the endgame isn't just "code faster." It's to automate the entire software engineering lifecycle, from ticket to deployment, and to elevate your role from the person writing code to the person designing systems, making architectural decisions, and owning outcomes. The software engineer who masters agentic engineering doesn't become obsolete. They become the one person doing the work of a team.

This guide gives you the tools, then the workflows, then the full automated pipeline.

Part 1: The Power Tools

Your MCP Stack

Every MCP server loads its tool definitions at session startup, consuming roughly four to ten thousand tokens of context. That might not sound like much, but it adds up fast. A developer who reported having twelve MCP servers active simultaneously was losing sixty-six thousand tokens before any conversation happened. That's a third of Claude's two-hundred-thousand-token context window, gone before writing a single line of code. The rule is ruthless minimalism. Load only what you need, use project-scoped config files so tools load only in relevant workspaces, and lean on Claude Code's MCP Tool Search feature, which lazy-loads definitions on demand and can cut context usage by up to ninety-five percent.
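To make the budget concrete, here is a small worked calculation using only the numbers quoted above (twelve servers, sixty-six thousand tokens, a two-hundred-thousand-token window, and a best-case ninety-five percent reduction from lazy loading). The function name and the returned field names are illustrative, not part of any Claude Code API.

```python
def mcp_overhead(tokens_used: int, window: int, lazy_reduction: float = 0.0) -> dict:
    """Return the context cost of MCP tool definitions after an optional reduction."""
    overhead = round(tokens_used * (1 - lazy_reduction))
    return {
        "overhead_tokens": overhead,
        "share_of_window": overhead / window,
        "free_tokens": window - overhead,
    }

WINDOW = 200_000   # Claude's context window, per the text
REPORTED = 66_000  # overhead reported for twelve active servers

per_server = REPORTED / 12                      # 5500.0, inside the 4k-10k range
eager = mcp_overhead(REPORTED, WINDOW)          # everything loaded at startup
lazy = mcp_overhead(REPORTED, WINDOW, 0.95)     # best case with lazy loading

print(per_server, eager["share_of_window"], lazy["overhead_tokens"])
```

In the best case, the same twelve servers cost about 3,300 tokens instead of 66,000, which leaves 196,700 tokens of the window for actual work.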

With that constraint in mind, here are the servers worth having and why each one earns its context budget.

GitHub MCP is the single most-used server in the ecosystem. It gives Claude full GitHub API access: creating pull requests, searching code across repositories, managing issues, triggering Actions, and reviewing diffs. This is the server that makes "read the ticket, implement it, open the PR, update the ticket" a single Claude Code session instead of four separate workflows. You add it as an HTTP transport server pointed at the GitHub Copilot MCP endpoint.

Context7 fetches real-time, version-specific library documentation. This solves one of the most common failure modes in AI-assisted coding: Claude hallucinating an API that doesn't exist in your version of React or Prisma or whatever library you're using. Context7 pulls the actual docs for the actual version you have installed. The usage pattern is simple: append "use context7" to prompts involving unfamiliar libraries or when you're working with a library you recently upgraded.

Sequential Thinking isn't an external service. It's a meta-cognitive enhancement that forces structured reasoning for complex architectural decisions. When you're designing a migration strategy or debating between three different approaches to a problem, Sequential Thinking prevents Claude from jumping to conclusions. It forces the agent to lay out its reasoning step by step, which makes it much easier for you to spot where the logic goes wrong.

Playwright MCP gives Claude browser automation through accessibility snapshots. That means Claude can write and run end-to-end tests, scrape pages, debug UI issues, and fill out forms, all from within the agent session. When Claude builds a feature and then tests it in a real browser in the same session, the feedback loop tightens dramatically. Instead of you manually testing and reporting back, the agent sees the result of its own work and can iterate.

Brave Search provides real-time web search. It's straightforward but essential. When Claude needs to look up an error message, check a library changelog, or find a Stack Overflow solution mid-task, Brave Search handles it without you having to alt-tab and search manually.

Memory MCP offers persistent knowledge-graph storage backed by a local file on your machine. Architectural decisions, accumulated project context, your preferences and conventions: it all survives across sessions. This is particularly valuable for solo developers working on long-running projects where context loss between sessions is the primary bottleneck. Without Memory MCP, every new session starts from zero and you end up re-explaining the same architectural decisions.

Supabase or PostgreSQL MCP is for anyone touching databases. Either server gives Claude the ability to inspect your schema, write and test queries, generate migrations, and validate them, all without you copy-pasting SQL between tools. If database work is part of your daily workflow, this saves a remarkable amount of friction.

For configuration, the best practice is to put universal servers like Brave Search, Sequential Thinking, and Memory at user scope so they're always available regardless of which project you're in. Put team-shared servers like GitHub and database servers in a project-scoped config file at your repo root, checked into git so every team member gets the same setup. Use the /mcp command to disable servers per session when you don't need them, which reclaims their context budget for that session.
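As a rough sketch of what a project-scoped config might contain, the snippet below builds one as a Python dictionary and prints it as JSON. Treat the details as assumptions to check against the current Claude Code docs: the file name `.mcp.json`, the top-level `mcpServers` key, the `type`/`url` fields, and the GitHub endpoint URL are my best understanding, and authentication setup is left out entirely.

```python
import json

def project_mcp_config(servers: dict) -> str:
    """Serialize a server map into the assumed project-scoped config layout."""
    return json.dumps({"mcpServers": servers}, indent=2)

# Team-shared servers only; universal servers would live at user scope instead.
team_servers = {
    "github": {
        "type": "http",                               # HTTP transport, per the text
        "url": "https://api.githubcopilot.com/mcp/",  # assumed hosted endpoint
    },
}

config_text = project_mcp_config(team_servers)
print(config_text)  # write this to .mcp.json at the repo root and commit it
```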

Notable Skills and Plugins

The skills ecosystem has matured fast. Three marketplaces now index community contributions, each with a different approach. Anthropic's official skills repo on GitHub offers production-quality, well-maintained skills for design work, document handling, and MCP server building. These are the ones you can trust without much vetting. SkillsMP indexes over two hundred thousand agent skills scraped from GitHub. That's quantity over curation, but the search functionality is good enough to find what you need if you know what you're looking for. SkillHub takes the opposite approach, curating about seven thousand skills rated on practicality and quality, which gives it much better signal-to-noise for browsing.

The standout community resource is awesome-claude-code on GitHub by the user hesreallyhim. It's a curated index of skills, hooks, slash commands, agent orchestrators, and plugins, and it's the single best place to discover what exists in the ecosystem. A few notable entries from that collection are worth calling out. There's a TDD slash command that enforces Red-Green-Refactor discipline and manages PR creation through the entire cycle. There's a code analysis command that generates knowledge graphs and surfaces optimization suggestions. There's a hook-creation wizard that walks you through setup with smart suggestions based on your specific project. And there's Auto-Claude, which is an autonomous multi-agent framework with a kanban-style UI and full software development lifecycle integration.

The Skilz universal installer is also worth knowing about. It provides a package-manager-like experience for installing skills, so instead of manually copying files around, you run a single command and the skill is set up in the right place with the right configuration.

Hook Patterns That Matter

Hooks are where CLAUDE.md's suggestions become mechanical enforcement. Your CLAUDE.md can say "always format your code" and Claude might forget. A hook runs the formatter every single time, no exceptions. Here are the patterns that power users consistently converge on, described by what they do rather than the specific JSON configuration. You can find the exact code in the official hooks guide and the awesome-claude-code repo.

The first essential pattern is auto-formatting every edit. This is a PostToolUse hook that matches on Edit and Write operations and runs Prettier, or whatever formatter your project uses, on whatever file Claude just touched. Every edit comes out formatted, every time, without relying on Claude to remember the convention or even know which formatter you use.

The second is blocking writes to protected files. This is a PreToolUse hook that checks whether Claude is about to modify an environment file, a lockfile, or anything else you've designated as off-limits. If the target file matches your protected list, the hook returns a deny decision and Claude gets told why it was blocked. This prevents a whole category of accidents where Claude helpfully "fixes" your env configuration or updates a lockfile you manage manually.
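A minimal sketch of the decision logic behind such a hook, under a few assumptions: the hook receives a JSON payload whose `tool_input.file_path` names the target file, and exiting with code 2 while writing a reason to stderr is one way to signal a block. The protected list and function names here are hypothetical examples.

```python
import fnmatch
from pathlib import PurePosixPath

# Hypothetical protected list; adjust to your project.
PROTECTED_PATTERNS = [".env", ".env.*", "package-lock.json", "yarn.lock", "pnpm-lock.yaml"]

def is_protected(file_path: str, patterns=PROTECTED_PATTERNS) -> bool:
    """Match on the file name only, so the rule applies in any directory."""
    name = PurePosixPath(file_path).name
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)

def decide(payload: dict) -> tuple:
    """Return (exit_code, message); exit code 2 is assumed to mean 'block'."""
    path = payload.get("tool_input", {}).get("file_path", "")
    if path and is_protected(path):
        return 2, f"Blocked: {path} is on the protected list."
    return 0, ""

# A real hook script would end with something like:
#   code, msg = decide(json.load(sys.stdin)); print(msg, file=sys.stderr); sys.exit(code)
```

Matching on the file name rather than the full path keeps the rule simple, at the cost of also blocking a legitimately editable file that happens to share a protected name.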

The third is blocking dangerous shell commands. Another PreToolUse hook, this time matching on Bash operations, that scans the command for patterns like rm -rf, sudo, chmod 777, or DROP TABLE. If any of those patterns are detected, the command is blocked before it ever executes. You can customize the pattern list for your environment.
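The pattern scan can be sketched in a few lines. This assumes the hook payload carries the tool name in `tool_name` and the shell command in `tool_input.command`, and that exit code 2 blocks the call. The regular expressions are illustrative and easy to evade (for example `rm -r -f`), so treat this as a seatbelt rather than a sandbox.

```python
import re

# Illustrative patterns for the four examples named in the text.
DANGEROUS_PATTERNS = [
    r"\brm\s+-(?=\w*r)(?=\w*f)\w+",      # rm with r and f in one flag group
    r"\bsudo\b",
    r"\bchmod\s+(-\w+\s+)*777\b",
    r"\bdrop\s+table\b",
]

def find_dangerous(command: str) -> list:
    """Return every pattern that matches the command (case-insensitive)."""
    return [p for p in DANGEROUS_PATTERNS if re.search(p, command, re.IGNORECASE)]

def decide(payload: dict) -> tuple:
    """Return (exit_code, message); only Bash tool calls are inspected."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    hits = find_dangerous(payload.get("tool_input", {}).get("command", ""))
    if hits:
        return 2, "Blocked dangerous command; matched: " + ", ".join(hits)
    return 0, ""
```

The word-boundary anchors matter: without them, harmless commands such as `npm run format` or a file named `pseudo.txt` would trip the `rm` and `sudo` rules.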

The fourth, and one of the most underappreciated, is injecting reminders after context compaction. This is a SessionStart hook with the "compact" matcher. Here's the problem it solves: when Claude's context window fills up and gets compressed, your early instructions lose fidelity. The CLAUDE.md you carefully wrote gets summarized and degraded. This hook re-injects critical reminders, like which package manager to use, what sprint you're in, or what to test before committing, after every compaction event. The instructions stay fresh even in long sessions.

The fifth is desktop notifications when Claude needs input. This is a Notification hook that fires an OS-level alert whenever Claude is waiting for permission or has finished a task. It sounds minor, but it changes how you work. Instead of staring at the terminal waiting for Claude to finish, you switch to other work and get pinged when your attention is needed.

The sixth is the full-stack quality chain. This is a PostToolUse hook that runs your formatter, then your linter, then a type check in sequence, piping all output back to Claude. If anything fails, Claude sees the errors and self-corrects on the next iteration. This is the mechanical enforcement that replaces "please remember to lint" in your CLAUDE.md. It doesn't ask Claude to remember. It just runs the tools every time and lets Claude react to the results.
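Here is one way the chain could be structured, as a sketch rather than the official recipe. It runs every step even if an earlier one fails, so Claude sees all the errors at once. The Prettier, ESLint, and tsc commands in `steps_for` are a hypothetical toolchain for a TypeScript project; swap in whatever your project uses.

```python
import subprocess
import sys

def run_quality_chain(steps):
    """Run each (name, argv) step in order and collect output from failing steps."""
    failures = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append(f"[{name}] exit {proc.returncode}\n{proc.stdout}{proc.stderr}")
    return failures

def steps_for(file_path: str):
    """Hypothetical formatter, linter, and type check for the edited file."""
    return [
        ("format", ["npx", "prettier", "--write", file_path]),
        ("lint", ["npx", "eslint", file_path]),
        ("typecheck", ["npx", "tsc", "--noEmit"]),
    ]

# Harmless demo that needs no Node tooling: one passing and one failing step.
demo_failures = run_quality_chain([
    ("always-ok", [sys.executable, "-c", "pass"]),
    ("always-fails", [sys.executable, "-c", "import sys; print('E1 unused import'); sys.exit(1)"]),
])
print("\n".join(demo_failures))
```

In a real hook you would call `run_quality_chain(steps_for(path))` with the edited file's path and, if the list is non-empty, write it to stderr and exit non-zero so the output reaches Claude. The exact exit-code convention is worth confirming in the hooks guide.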

You configure hooks through the interactive /hooks UI or by editing your settings files directly. Global settings go in ~/.claude/settings.json in your home directory. Project-shared settings go in .claude/settings.json at your project root, which gets committed to git. Personal settings that shouldn't be committed, like notification preferences, go in .claude/settings.local.json, which is gitignored by default.
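For orientation, this sketch generates a hooks section for a project settings file. The layout (event name, then a list of matcher entries, each with a list of `command` hooks) reflects my understanding of the settings format, so verify it against the official hooks guide. The script paths and the reminders file are hypothetical placeholders.

```python
import json

def hook_entry(matcher: str, command: str) -> dict:
    """One matcher entry that runs a single shell command."""
    return {"matcher": matcher, "hooks": [{"type": "command", "command": command}]}

settings = {
    "hooks": {
        "PreToolUse": [
            hook_entry("Bash", "python3 .claude/hooks/block_dangerous.py"),   # hypothetical script
            hook_entry("Edit|Write", "python3 .claude/hooks/protect_files.py"),  # hypothetical script
        ],
        "PostToolUse": [
            hook_entry("Edit|Write", "python3 .claude/hooks/quality_chain.py"),  # hypothetical script
        ],
        "SessionStart": [
            # Re-inject reminders after compaction; the file name is a placeholder.
            hook_entry("compact", "cat .claude/reminders.md"),
        ],
    }
}

settings_text = json.dumps(settings, indent=2)
print(settings_text)
```

Generating the file from code is optional; the point is the shape. Most people will simply paste the resulting JSON into the settings file that matches the scope they want.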

Part 2: Power-User Workflows

The Plan-First Discipline

Every power user converges on the same workflow. The specific prompts vary from person to person, but the underlying structure is universal.

Start by clearing context. Use /clear to get a fresh context budget. Old conversations pollute new work. Then enter Plan Mode by pressing Shift-Tab twice. Plan Mode makes Claude read-only. It can explore your codebase, read files, reason about approaches, and ask questions, but it cannot modify anything. This is important because it separates thinking from doing.

Describe the feature you want. Claude reads the relevant files, thinks through the architecture, and proposes an approach. This is where you push back, and pushing back is the whole point of the planning phase. Ask what could go wrong. Ask about edge cases Claude hasn't mentioned. Ask about alternative approaches and their tradeoffs. The few minutes you spend here consistently save thirty or more minutes of backtracking later, because catching a wrong assumption during planning is cheap and catching it after implementation is expensive.

When the plan looks right, have Claude create a TODO or spec file with numbered milestones. Review it. Edit if needed. Then switch out of Plan Mode and tell Claude to implement milestone one, writing tests first and then implementation, committing when done. Watch the work, approve tool calls, redirect when Claude goes off track. When Claude commits milestone one, move to the next. When everything's done, have Claude run the full test suite and summarize results.

The discipline is Plan Mode. If you skip it, if you jump straight to "build me a login page," Claude will produce something. It might even be good. But for anything non-trivial, the agent makes better decisions when forced to think before it acts, and you make better decisions when you can see the plan before committing to it. Treat planning as non-negotiable.

Worktrees: Running Agents in Parallel

Git worktrees are the most underrated Claude Code feature. If you're not using them, you're leaving a significant multiplier on the table.

A worktree creates a separate working directory from the same repository. Each worktree has its own branch and its own copy of the files, but they all share history and remotes. The practical implication is that you can run three to five Claude Code instances simultaneously on the same repo without the agents overwriting each other's files. Agent one is building a feature in one worktree. Agent two is fixing a bug in another. Agent three is refactoring a module in a third. They don't interfere with each other because they're working in isolated directories on separate branches. Overlapping edits can still conflict when the branches are merged, so give each agent a distinct area of the codebase where you can.
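If you want to see the underlying git mechanics, this sketch generates the plain `git worktree add` commands for a list of tasks. The `agent/` branch prefix, the `../worktrees` directory, and the `main` base branch are arbitrary choices for illustration, not conventions Claude Code requires.

```python
import shlex

def worktree_commands(tasks, base_dir="../worktrees", base_branch="main"):
    """Build one `git worktree add` command per task, each on its own new branch."""
    commands = []
    for task in tasks:
        argv = ["git", "worktree", "add", "-b", f"agent/{task}", f"{base_dir}/{task}", base_branch]
        commands.append(shlex.join(argv))
    return commands

for command in worktree_commands(["fix-login", "refactor-billing", "add-export"]):
    print(command)
```

Each command creates a new directory checked out on a fresh branch from `main`, which is the isolation that lets several agents work at once.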

As of February 2026, Claude Code has built-in worktree support. Start Claude with the --worktree flag and a name, and it creates an isolated worktree automatically. Start another session with a different name, and you've got two agents working in parallel. Add the tmux flag to launch Claude in its own terminal session for background operation, so you can detach and reattach as needed.

Claude handles cleanup intelligently. If a worktree session ends with no changes, the worktree and its branch are auto-removed. If there are changes or commits, Claude prompts you to keep or remove the worktree. Add the .claude/worktrees directory to your gitignore so these temporary directories don't clutter your repository.

Custom subagents also support worktree isolation. Add "isolation: worktree" to your subagent's frontmatter and it automatically gets its own working directory. This is particularly powerful for large batched changes and code migrations where you want multiple agents working on different parts of the codebase at the same time without stepping on each other.

The daily rhythm with worktrees looks like this: open two or three worktrees in the morning for the day's tasks. Kick off each agent with a plan-first prompt. Cycle between terminals throughout the day, approving and redirecting as each agent progresses. Close each one with your PR command when the work is done. Review and merge in the evening. Three PRs in parallel time instead of serial time.

Headless Mode and External Automation

The --print flag makes Claude Code scriptable. Instead of an interactive terminal session, Claude runs a single prompt, produces output, and exits. Combined with JSON output format, it returns structured data including cost, duration, session ID, and the actual result. This is the bridge between interactive development and full automation.

You can chain sessions using the --resume flag with a captured session ID. Start a session that analyzes a problem, capture its session ID from the JSON output, then continue the conversation with a second command that implements the fix, and a third that opens a PR. Each command picks up where the last left off, maintaining the full conversation context. This means you can break complex multi-step workflows into discrete commands that compose together.
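A minimal sketch of that chaining pattern, with some assumptions called out: it presumes the `claude` CLI accepts `-p`, `--output-format json`, and `--resume <id>` in this arrangement, and that the JSON output carries a `session_id` field. The `runner` parameter is my own addition so the logic can be exercised without the CLI installed.

```python
import json
import subprocess

def _run_cli(argv):
    """Run the command and return its stdout; raises if the CLI exits non-zero."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout

def run_step(prompt, session_id=None, runner=_run_cli):
    """Run one headless prompt, resuming a prior session when an ID is given."""
    argv = ["claude", "-p", "--output-format", "json"]
    if session_id:
        argv += ["--resume", session_id]
    argv.append(prompt)
    return json.loads(runner(argv))

def run_chain(prompts, runner=_run_cli):
    """Feed each step's session ID (assumed field name) into the next step."""
    session_id, results = None, []
    for prompt in prompts:
        output = run_step(prompt, session_id, runner)
        session_id = output.get("session_id", session_id)
        results.append(output)
    return results

# Example (requires the CLI):
#   run_chain(["Analyze the failing test", "Implement the fix", "Open a PR"])
```

The chain re-reads the session ID after every step instead of capturing it once, which keeps it working even if a resumed session reports a new ID.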

This is also how Claude Code connects to external automation platforms, and that connection is where the real transformation happens.

Part 3: SWE Orchestration, the Automated Pipeline

This is the section that matters most. Everything above, the MCP servers, the hooks, the worktrees, the headless mode, they're building blocks. Here we wire them into a system that automates the entire software engineering lifecycle, from ticket to deployment. This is what separates a developer who uses AI from a developer who runs an AI-powered engineering operation.

The Vision: Ticket In, Deployment Out

Imagine this: someone files a GitHub issue describing a bug or requesting a feature. Without you touching anything, an agent picks up the ticket, reads it, explores the codebase, plans an approach, implements the fix, writes tests, runs the test suite, opens a pull request, gets an automated review, fixes any review feedback, monitors CI, and auto-merges when everything passes, all into a staging branch. The only thing left for you is to review the merged result and approve the production deployment.

This isn't hypothetical. Every piece of this pipeline exists today as of February 2026. The question is which tools you use to wire it together. There are several approaches, each with different tradeoffs in complexity, flexibility, and infrastructure requirements.

Approach 1: Claude's Native GitHub Integration

This is the simplest path and the one to start with. Anthropic's official GitHub Action, called claude-code-action, turns Claude into a first-class participant in your GitHub workflow. You install it by running /install-github-app inside Claude Code, which walks you through setting up the GitHub app and the required secrets.

Once installed, you create a workflow file in your repo's GitHub Actions directory. The workflow triggers on issue comments, PR events, or custom events. When someone mentions @claude in an issue or PR comment, the action activates. Claude reads the context, analyzes the relevant code, and either answers questions or implements changes directly, pushing commits to the branch.

The real power comes when you set it up to trigger automatically. You can configure a workflow that fires whenever an issue is labeled with something like "claude-autofix." When that label is applied, Claude reads the issue, implements a fix, writes tests, and opens a PR, entirely unattended. You can also trigger on new issue creation, on a schedule, or on any GitHub event that the Actions platform supports.

As of February 20, 2026, just two days before this guide was written, Claude Code's desktop app added auto-fix and auto-merge capabilities for pull requests. Here's how that works: when you open a PR, Claude monitors its CI status in the background. If CI fails, auto-fix reads the failure output and attempts to resolve the issue automatically. If auto-merge is enabled, Claude merges the PR via squash once all checks pass. The workflow implication is significant. You open a PR, move on to your next task, and by the time you circle back, the original PR is either ready for final review or already merged into your target branch.

Putting all of this together, the full pipeline with Approach 1 looks like this: an issue gets filed. A GitHub Action triggers Claude to implement a fix and open a PR against a staging branch. Claude monitors CI on that PR, auto-fixing any failures that come up. When all checks pass, Claude auto-merges into staging. You review the staging branch periodically, and that review is your human checkpoint. When you're satisfied with what's accumulated, you merge staging to main, which triggers your existing deployment pipeline.

The beauty of this approach is that it requires almost no infrastructure beyond what you already have. GitHub Actions runs the compute. Claude's API handles the intelligence. Your existing CI/CD pipeline handles deployment. You're just adding a trigger at the beginning and a human gate at the end.

Approach 2: n8n or Workflow Automation Calling Headless Claude

If you want more control over the orchestration, or if your trigger isn't a GitHub event, workflow automation platforms like n8n give you a visual pipeline builder that can call Claude Code in headless mode.

The proven pattern, demonstrated by NetworkChuck's n8n-claude-code-guide, uses n8n's SSH node to connect to a server where Claude Code is installed. The flow works like this: a trigger fires. That trigger could be a Slack message, a scheduled cron job, a webhook from your project management tool, an email, anything n8n can listen to. n8n extracts the relevant information from the trigger event. An SSH node executes a Claude Code headless command on your server, passing the task as a prompt. A code node parses the JSON output. Then n8n takes action based on the result, which might mean creating a GitHub issue, sending a Slack notification, updating a database, or chaining into another Claude session for the next step.

For deeper integration, a dedicated n8n node for Claude Code called n8n-nodes-claudecode provides native support for local, SSH, and Docker execution with persistent session management. This makes Claude Code a first-class n8n node alongside HTTP Request, Slack, GitHub, and everything else in the n8n ecosystem. You can maintain conversation context across multiple n8n workflow executions using session IDs, so a complex multi-step task can span multiple triggers and still maintain continuity.

Going the other direction, the n8n-MCP server gives Claude Code deep knowledge of all twelve hundred and thirty-six n8n nodes. So you can describe what you want in natural language, something like "build me an n8n workflow that monitors Stripe webhooks and creates Linear tickets for failed payments," and Claude generates a deployable workflow configuration.

The n8n approach is more powerful than the GitHub Action approach because your triggers aren't limited to GitHub events. A Slack command that says "fix issue 42" can kick off the entire pipeline. A scheduled nightly job can scan for outdated dependencies, update them, run tests, and open a PR with the changes. A webhook from your error monitoring service can trigger Claude to investigate and fix production bugs as they're reported. The triggering possibilities are essentially unlimited.

Approach 3: OpenClaw as the Orchestration Layer

OpenClaw, the open-source personal AI assistant formerly known as Clawdbot, takes a fundamentally different approach. Where Claude Code is scoped to your codebase and n8n is scoped to workflows you explicitly define, OpenClaw is what its creators call a "life OS" that runs on your device and connects to everything: WhatsApp, Slack, Telegram, your file system, a headless browser, and yes, Claude Code.

The relevant capability for our purposes is that OpenClaw can act as an always-on orchestrator that triggers Claude Code headless sessions based on events from any channel. You could message your OpenClaw bot on Telegram saying "The authentication is broken in production, see Sentry alert 4521." OpenClaw reads the Sentry alert, spins up a Claude Code headless session pointed at your repo, has it investigate the issue and implement a fix, then reports back to you on Telegram with a link to the PR it opened. The entire cycle happens through a chat message.

OpenClaw's Heartbeat feature, its built-in cron scheduler, means it can run proactive tasks without any trigger at all. You could configure a morning routine that checks your GitHub notifications, summarizes open PRs, identifies stale issues, and auto-assigns any that match patterns you've defined. All of that runs before you even sit down at your desk.

The tradeoff is significant though. OpenClaw is self-hosted, requires Docker, and has had real security vulnerabilities, including a critical remote code execution exploit that was publicly disclosed. It requires substantially more DevOps knowledge to set up safely than either the GitHub Action or n8n approaches. Security researchers recommend running it in an isolated environment with strict sandboxing. If you're comfortable managing infrastructure and understand the security implications, it's the most flexible option. If you're not, start with Approach 1 or 2 and come back to OpenClaw when your needs outgrow those tools.

Approach 4: Direct Headless Scripts on Your Machine

The simplest version of automation doesn't require any platform at all. You write shell scripts that call Claude Code in headless mode and trigger them however you like: cron jobs, filesystem watchers, git hooks, or manual invocation.

A nightly maintenance script, for example, runs Claude Code with a prompt to check for outdated dependencies, update any that have security vulnerabilities, run the test suite, and either open a PR with the updates or create an issue if something breaks. You add it to your crontab and forget about it.

An issue-to-PR script takes a GitHub issue number as an argument, starts a Claude session to read the issue and analyze the codebase, captures the session ID, then continues the session to implement the fix, write tests, and open a PR. You could trigger this manually when you triage issues in the morning, or wire it to a GitHub webhook for automatic execution.

The advantage of direct scripts is zero infrastructure. Just your machine and your Claude API key. The disadvantage is that it only runs when your machine is on and connected. For many solo developers, that's a perfectly acceptable tradeoff.

The Complete Pipeline: Putting It All Together

Regardless of which approach you choose for the trigger and orchestration layer, the end-to-end pipeline follows the same shape. Seven stages from intake to production.

Stage 1 is Intake. A ticket is filed, either by a human, by an error monitoring service, by a scheduled audit, or by you manually triaging and labeling. This is the trigger that starts everything.

Stage 2 is Implementation. Claude reads the ticket, explores the codebase, plans an approach in plan-first mode, implements the change, and writes tests. This happens in a headless session or a GitHub Action, working in its own branch or worktree.

Stage 3 is Self-review. Before opening the PR, Claude runs a code review process. In the most sophisticated version, this launches four parallel review agents that independently score each finding for confidence. Anything above the threshold, which defaults to eighty out of a hundred, gets flagged as a genuine issue. Claude addresses its own review findings. This is the automated equivalent of a developer reviewing their own work before requesting a peer review.
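The threshold filter is simple enough to sketch directly. The `Finding` shape and reviewer names below are invented for illustration, and whether a score of exactly eighty counts as flagged is my assumption (this sketch treats the threshold as inclusive).

```python
from dataclasses import dataclass

@dataclass
class Finding:
    reviewer: str     # which review agent raised it (hypothetical labels)
    message: str
    confidence: int   # 0-100 score assigned during review

def flag_findings(findings, threshold=80):
    """Keep findings at or above the threshold, highest confidence first."""
    kept = [f for f in findings if f.confidence >= threshold]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)

findings = [
    Finding("security", "SQL built by string concatenation", 95),
    Finding("style", "Variable name could be clearer", 40),
    Finding("correctness", "Off-by-one in pagination loop", 82),
    Finding("tests", "Edge case for empty input untested", 79),
]

for f in flag_findings(findings):
    print(f.confidence, f.reviewer, f.message)
```

Two of the four example findings survive the filter, which is the noise reduction the threshold is there to provide.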

Stage 4 is PR and CI. Claude opens a pull request against your staging or development branch. CI runs. If CI fails, Claude's auto-fix reads the failure output and iterates on the code. This loop continues until CI passes or a maximum retry count is reached, at which point it flags the issue for human attention.
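The retry loop in this stage has a simple shape, sketched below. Both callables are stand-ins you would supply yourself: `run_ci` returns a pass flag plus the failure log, and `attempt_fix` hands that log to the agent. The default retry limit and the `needs_human` status label are arbitrary choices for this example.

```python
def fix_until_green(run_ci, attempt_fix, max_retries=3):
    """Re-run CI after each fix attempt until it passes or retries run out."""
    passed, log = run_ci()
    attempts = 0
    while not passed and attempts < max_retries:
        attempt_fix(log)          # give the failure output to the agent
        attempts += 1
        passed, log = run_ci()    # check whether the fix worked
    status = "passed" if passed else "needs_human"
    return {"status": status, "fix_attempts": attempts}

# Simulated CI that fails twice, then passes.
ci_runs = iter([(False, "test_a failed"), (False, "lint error"), (True, "")])
seen_logs = []
result = fix_until_green(lambda: next(ci_runs), seen_logs.append)
print(result)  # {'status': 'passed', 'fix_attempts': 2}
```

The hard cap is what keeps an agent from burning budget on a failure it cannot resolve; once the cap is hit, the loop stops and hands the problem to a person.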

Stage 5 is Auto-merge to staging. When all checks pass, auto-merge lands the PR into your staging branch via squash merge. No human involvement needed for this step.

Stage 6 is Human review. This is the critical gate, and it's the step that should never be automated away. You review what's accumulated in staging: reading the PRs, checking the diffs, testing the behavior. This is where your judgment, your understanding of the product, and your architectural vision matter most. The agents did the implementation work. You make the decision about whether the result is right.

Stage 7 is Production deployment. You merge staging to main. Your existing CI/CD pipeline, whether that's GitHub Actions, Vercel, AWS, or whatever you use, handles the actual deployment. This part of your workflow doesn't change at all. You just added five automated stages and a human review gate in front of it.

The key insight is that the human never leaves the loop. You've just moved from Stage 2, where you were writing the code, to Stage 6, where you're reviewing the result. That's a fundamentally different job, and it's one that scales in a way that writing code by hand never can.
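The fix-and-retry loop in Stage 4 is easy to get subtly wrong, so here is a minimal sketch of its control flow. The run_ci and apply_fix callables are hypothetical stand-ins: in a real pipeline the first would poll your CI provider and the second would hand the failure output to a headless Claude session. The default of three retries is also an assumption.

```python
from typing import Callable


def fix_until_green(
    run_ci: Callable[[], tuple[bool, str]],
    apply_fix: Callable[[str], None],
    max_retries: int = 3,
) -> str:
    """Re-run CI, feeding each failure log to a fixer, up to max_retries fixes.

    Returns "passed" once CI is green, or "needs-human" when retries run out.
    """
    for attempt in range(max_retries + 1):
        passed, log = run_ci()
        if passed:
            return "passed"
        if attempt < max_retries:
            # Only attempt a fix if there is still a CI run left to verify it.
            apply_fix(log)
    return "needs-human"
```

The bounded loop is what keeps a stubborn failure from burning API budget indefinitely. When it returns "needs-human", that is the point where the pipeline flags the PR for your attention.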

Automated Review by Separate Agents

The pipeline described above uses Claude reviewing its own code, which is better than no review but has obvious limitations. You wouldn't ask a developer to be the sole reviewer of their own pull request, and the same logic applies to agents.

The more sophisticated pattern, described in an O'Reilly article on auto-reviewing Claude's code, uses a separate subagent with a deliberately critical mindset to do the review. The main agent that wrote the code never reviews it. A fresh agent with a different prompt, one that's explicitly told to be skeptical and thorough, evaluates the work independently. The separation matters because the reviewing agent has no sunk-cost attachment to the implementation choices.

You can layer this further with Anthropic's official code-review plugin, which spawns four parallel review agents that independently assess the changes. Each finding gets a confidence score, and only issues scoring above your configured threshold make it into the final review. This filtering eliminates noise and focuses human attention on genuine problems.
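The confidence filtering step can be illustrated with a short sketch. This is not the plugin's actual code: the Finding type and its fields are hypothetical, and treating the threshold as inclusive (a score of exactly eighty is kept) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str        # which review agent reported it
    message: str      # what the agent thinks is wrong
    confidence: int   # score from 0 to 100 attached to the finding


def filter_findings(findings: list[Finding], threshold: int = 80) -> list[Finding]:
    """Keep findings at or above the threshold, highest confidence first."""
    kept = [f for f in findings if f.confidence >= threshold]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)
```

Raising the threshold trades recall for a quieter review. Lowering it surfaces more speculative findings, which may be worthwhile on security-sensitive code.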

For teams, the GitHub Action itself becomes the reviewer. When any team member opens a PR, Claude automatically reviews it and leaves comments. The team member addresses the feedback, Claude re-reviews, and the cycle continues until the code is clean. This creates a continuous review loop that catches issues before human reviewers ever look at the code, which means human review time is spent on architectural and product questions rather than catching lint errors and missing test cases.

Part 4: Stepping Up the Ladder

From Code Writer to Engineering Operator

Here's the career shift that agentic engineering enables, and it's the real reason to invest in all of this infrastructure.

A traditional software engineer spends most of their time in Stage 2 of the pipeline: writing code. They read tickets, implement features, fix bugs, write tests, and submit PRs. Their value is measured in output: how many features shipped, how many bugs fixed, how fast they can turn around a request.

An engineer who has automated Stages 2 through 5 spends their time in a completely different way. Their day looks like this: they wake up and review what the agents did overnight. They triage new issues, deciding which ones are worth automating and which need human design work. They review accumulated PRs in staging, exercising judgment about correctness, security, and product fit. They do the creative work (system design, architecture decisions, cross-cutting concerns, and user experience decisions) that agents can support but should never own. And they maintain and improve the automation itself: tuning CLAUDE.md instructions, adding hooks for newly discovered failure patterns, refining prompts based on what worked and what didn't, and analyzing logs to find systemic issues.

This is a different job. It's closer to a tech lead or engineering manager than a traditional individual contributor. You're making decisions, not keystrokes. You're designing systems, not implementing them. You're evaluating quality, not producing it.

Two days before the research for this guide was written, a Claude Code creator at Anthropic predicted that the software engineering title would "start to go away" in 2026, saying the tool now writes nearly a hundred percent of internal code at Anthropic. Whether or not you agree with that timeline, the direction is clear. The engineers who will be most valuable aren't the ones who code fastest. They're the ones who can orchestrate, evaluate, and make judgment calls at scale.

The Data-Driven Flywheel

The best agentic engineers treat their setup as a system to optimize, not a static configuration you set up once and leave alone.

Claude Code stores session logs in the ~/.claude/projects directory on your machine. These logs contain everything: what Claude tried, what failed, what it had to retry, what you corrected, and how long each step took. The raw data for understanding your automation's performance is already being collected.

The practice is to periodically review these logs, or better yet, have Claude review them, looking for recurring patterns. If Claude consistently gets the import style wrong in your project, add the correct convention to CLAUDE.md. If it keeps trying to use a deprecated API, add a hook that catches the pattern and blocks it before execution. If a particular type of task always requires multiple correction cycles, create a skill or custom command that front-loads the right context so Claude gets it right on the first try.
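A first pass at that log review doesn't need Claude at all. Here is a minimal sketch that counts how often named patterns show up across log lines. It deliberately avoids assuming any log schema and just matches regular expressions against raw text. The .jsonl extension in the helper and the pattern names in the usage note are assumptions.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator


def iter_log_lines(root: Path) -> Iterator[str]:
    """Yield every line from log files under root (extension is an assumption)."""
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8", errors="ignore") as handle:
            yield from handle


def count_patterns(lines: Iterable[str], patterns: dict[str, str]) -> Counter:
    """Count how many lines match each named regex, case-insensitively."""
    compiled = {name: re.compile(rx, re.IGNORECASE) for name, rx in patterns.items()}
    counts: Counter = Counter()
    for line in lines:
        for name, rx in compiled.items():
            if rx.search(line):
                counts[name] += 1
    return counts
```

You might call count_patterns(iter_log_lines(Path.home() / ".claude" / "projects"), {"deprecated_api": r"oldClient\.", "import_style": r"require\("}) and treat any pattern with a high count as a candidate for a CLAUDE.md rule or a hook.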

Research from Arize found that optimizing CLAUDE.md through this kind of iterative prompt learning yielded five percent or more general improvement and eleven percent or more on repo-specific benchmarks. Those numbers might sound modest, but the gains compound. Every improvement reduces future correction cycles, which means faster completion times, lower API costs, and fewer interruptions to your own work. Over weeks and months, the cumulative effect is substantial.

Context Management as a Meta-Skill

One practice cuts across everything else in this guide: aggressive context management. Claude Code's two-hundred-thousand-token context window is finite, and power users treat it like a budget that needs active management.

Clear context liberally. When you start a new task, start with a fresh window. Don't let old conversation from the previous task pollute new work. The context from debugging a CSS issue is not helpful when you're implementing a database migration, and carrying it forward just wastes tokens.

Delegate research to subagents. When Claude needs to explore an unfamiliar part of the codebase or investigate a complex dependency, use a subagent to do the investigation in a separate context window. The subagent does the deep dive, then summarizes its findings back to the main session. This keeps the main session's context focused on the actual task.

Watch the compaction warnings. When context gets compressed, your early instructions lose fidelity. That's exactly why the SessionStart hook with the compact matcher exists, to re-inject critical reminders after every compaction event. If you're not using that hook, your long sessions are gradually losing the instructions you set up at the beginning.
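A script behind that hook could look like the sketch below. Treat the details as assumptions rather than a definitive implementation: it assumes the hook payload arrives as JSON on stdin, that it carries a "source" field set to "compact" after compaction, and that whatever the script prints is added to Claude's context. The reminder text is hypothetical.

```python
import json
import sys

# Hypothetical reminders; use the rules your project cannot afford to lose.
REMINDERS = [
    "Plan before implementing.",
    "Run the test suite before committing.",
]


def reminders_for(event: dict) -> str:
    """Return reminder text for a compaction-triggered session start, else ''."""
    # The "source" field and its "compact" value are assumptions about the payload.
    if event.get("source") != "compact":
        return ""
    return "\n".join(f"- {reminder}" for reminder in REMINDERS)


def main() -> None:
    event = json.load(sys.stdin)  # hook input is assumed to be JSON on stdin
    text = reminders_for(event)
    if text:
        print(text)  # assumed to be injected back into the session context
```

Keep the list short. Everything printed here competes for the same context budget the hook is trying to protect.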

Use worktrees instead of context switching. Don't try to work on feature A and bug B in the same session. Each task gets its own worktree, its own context window, its own focus. Context switching within a single session means both tasks get worse context instead of one task getting good context.

Keep CLAUDE.md short, under a hundred lines. Every token in that file loads on every session start and survives every compaction cycle. The developers finding measurable improvement from CLAUDE.md optimization are the ones keeping it tight, focused on rules that actually change Claude's behavior, not the ones writing novels full of general advice that Claude already knows.
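If you want that line budget enforced rather than remembered, a small check can run in CI or a pre-commit step. This is a minimal sketch. Counting only non-blank lines is an assumption, and the function name is hypothetical.

```python
def check_claude_md(text: str, max_lines: int = 100) -> tuple[bool, int]:
    """Return (within_budget, non_blank_line_count) for a CLAUDE.md body.

    "Within budget" means strictly under max_lines, matching "under a hundred".
    """
    count = sum(1 for line in text.splitlines() if line.strip())
    return count < max_lines, count
```

Reading the file with pathlib.Path("CLAUDE.md").read_text() and failing the build when the first value is False turns the guideline into something mechanical.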

Part 5: The Three Setups

Three progressive setups. Start with the first. Come back for the second when it feels natural. The third is for when you're ready to let agents run without you watching.

Setup 1: The Disciplined Solo Developer

You're a developer who uses Claude Code daily but mostly in a reactive prompt-and-accept loop. You type what you want, Claude produces it, you accept or reject. You want to level up to the plan-first, quality-enforced workflow.

What you'll set up: a lean CLAUDE.md, a starter MCP stack of three servers, formatting and safety hooks, and the plan-first workflow discipline.

Start with your CLAUDE.md and keep it under a hundred lines. Include your build and test commands so Claude knows how to verify its own work. Include your architecture at a high level with pointers to key directories so Claude knows where things live. Include code style rules that your linter can't enforce, the things that are conventions rather than syntax. And include your workflow expectations: plan before implementing, write tests first, commit at checkpoints. Leave out generic advice, long style guides, and inline file contents. If Claude needs to read a file, it can read the file.

Install three MCP servers at user scope: Context7 for accurate library docs, Sequential Thinking for architectural decisions, and Brave Search for when Claude needs to look something up. That's your starter stack. You can add more servers when you have a specific need for them, but these three cover the most common gaps.

Set up three hooks in your project settings: a PostToolUse hook that auto-formats every file Claude edits, a PreToolUse hook that blocks writes to protected files like environment configs and lockfiles, and a Notification hook that pings you when Claude is waiting for input. These three hooks cover the most common failure modes at this level.
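The protected-files hook is the one most worth seeing in code. The sketch below makes several assumptions: that the hook receives JSON on stdin, that the target path lives at tool_input.file_path, and that exiting with code 2 blocks the tool call and reports stderr back to Claude. The list of protected names is hypothetical.

```python
import json
import sys
from pathlib import PurePosixPath

# Hypothetical protected names; adjust to your project.
PROTECTED_NAMES = {".env", "package-lock.json", "yarn.lock", "poetry.lock"}


def is_protected(file_path: str) -> bool:
    """True if the final path component is a protected file or a .env variant."""
    name = PurePosixPath(file_path).name
    return name in PROTECTED_NAMES or name.startswith(".env.")


def main() -> int:
    event = json.load(sys.stdin)
    # "tool_input.file_path" is assumed to hold the target of an edit or write.
    path = event.get("tool_input", {}).get("file_path", "")
    if path and is_protected(path):
        print(f"Blocked: {path} is protected.", file=sys.stderr)
        return 2  # assumed convention: exit code 2 blocks the tool call
    return 0
```

A wrapper that calls sys.exit(main()) makes this usable as the hook command. Matching on the file name rather than the full path means the rule still applies inside nested packages.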

Then practice the workflow on a real task. Pick an actual feature from your backlog. Clear context. Enter Plan Mode. Describe what you need. Push back on the plan. Have Claude create a TODO with milestones. Switch to implementation mode. Work through the milestones one at a time. Run the test suite at the end. The key habit you're building is resisting the urge to skip planning, because skipping planning feels faster but consistently isn't.

When you've internalized this, when planning feels natural, hooks handle formatting automatically, and you're comfortable with the approve-and-redirect rhythm, you're ready for Setup 2.

Setup 2: The Parallel Operator

You've hit the ceiling of one agent doing one thing at a time. You want to run multiple agents in parallel, automate PR creation, and have Claude handle your GitHub workflow end to end.

What you'll add: GitHub MCP, worktrees for parallel agents, custom slash commands, and the GitHub Action for review automation.

Add GitHub MCP to your project configuration so Claude can interact with your repository's issues and pull requests directly. Create three slash commands in your project's .claude/commands directory. The first is a "feature" command that kicks off the plan-first workflow from a feature description. The second is a "pr" command that runs tests, runs the linter, fixes any issues it finds, and opens a PR via GitHub MCP. The third is a "review" command that diffs against main and analyzes for correctness, security, missing tests, and style violations.

Start running parallel agents with worktrees. When you have a feature to build, a bug to fix, and a refactor to do, open three worktrees and give each one its own task. Cycle between terminals throughout the day. Close each with your PR command when the work is done. Three PRs land in roughly the time one would take serially, which can triple your throughput for independent tasks.
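Opening those worktrees is a one-line git command per task, and a small helper keeps branch and directory names consistent. This is a minimal sketch. The naming scheme (a sibling directory named after the repo plus a slug of the task) is an assumption, not a convention the tooling requires.

```python
from pathlib import PurePosixPath


def worktree_command(repo: str, task: str) -> list[str]:
    """Build a `git worktree add` call that creates a new branch and directory."""
    slug = "-".join(task.lower().split())
    repo_path = PurePosixPath(repo)
    target = repo_path.parent / f"{repo_path.name}-{slug}"
    # -C runs git inside the repo; -b creates the branch for the new worktree.
    return ["git", "-C", str(repo_path), "worktree", "add", "-b", slug, str(target)]
```

Passing each result to subprocess.run(..., check=True) for "new billing page", "fix login bug", and "refactor auth" would give you three isolated checkouts, each ready for its own Claude session.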

Set up the GitHub Action for automated review. Create a workflow file that triggers on PR creation and issue comments. Now anyone, including you from your phone, can comment @claude on any PR for analysis or on any issue for implementation guidance. This also means your own PRs get an automated first-pass review before you ask a human to look at them.

The daily rhythm at this level: check what Claude's GitHub Action reviewed overnight. Triage issues for the day. Open two or three worktrees for the day's work. Cycle through them, approving and redirecting. Close each with PRs. Review and merge in the evening.

Setup 3: The Autonomous Pipeline

You've got the interactive workflow down cold. Now you want the full pipeline from Part 3: agents running without you present, triggered by events, executing headless, and reporting results.

What you'll add: the automated ticket-to-PR pipeline, headless scripts for common operations, the auto-review and auto-merge loop, and a human gate before production.

Start by setting up the GitHub Action to trigger on issue labeling. When an issue gets labeled "claude-autofix," Claude reads it, implements a fix, and opens a PR, entirely unattended. This is your first taste of fully autonomous operation, and it's low risk because you still review and merge the PR manually.
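If you ever move this trigger out of a GitHub Action and into your own webhook receiver, the gating logic is small. The sketch below assumes the standard shape of GitHub's issues webhook payload, with an "action" of "labeled" and the label's name under "label". The function name is hypothetical.

```python
TRIGGER_LABEL = "claude-autofix"


def should_trigger(event_name: str, payload: dict) -> bool:
    """True only when an issue has just been given the trigger label."""
    if event_name != "issues" or payload.get("action") != "labeled":
        return False
    return payload.get("label", {}).get("name") == TRIGGER_LABEL
```

Checking the label that was just applied, rather than every label on the issue, avoids re-triggering the pipeline each time someone adds an unrelated label later.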

Enable auto-fix and auto-merge in Claude Code's desktop app for your PRs against the staging branch. Now when Claude opens a PR, it monitors CI in the background, auto-fixes any failures, and auto-merges when everything passes. You've automated Stages 2 through 5 of the pipeline.

Write headless scripts for recurring tasks. A nightly maintenance script that checks for outdated dependencies and security vulnerabilities. A weekly audit script that reviews code quality metrics and opens issues for anything that's degraded. A triage script you run manually that takes an issue number and produces a PR.

If you want triggers beyond GitHub events, set up n8n with an SSH connection to a server running Claude Code. Now a Slack message, a cron schedule, a webhook from Sentry, or an email can all kick off the pipeline. For maximum flexibility with more security overhead, OpenClaw can act as the always-on orchestrator, responding to messages on any platform and coordinating Claude Code sessions.

Establish your human gate. Set a protected staging branch as the target for all automated PRs. Review staging periodically, daily or after a batch of changes lands. When you're satisfied, merge to main, which triggers your deployment pipeline. The agents handle implementation. You handle judgment.

The daily rhythm at this level: wake up and review what agents did overnight. Check the staging branch. Merge what's good. Leave feedback on what needs adjustment, either as comments for the GitHub Action to pick up or as refinements to your CLAUDE.md and hooks. Spend your active hours on the work that actually requires you: system design, architecture decisions, product direction, and improving the automation itself.

What's Next

Claude Code is on a trajectory where the tool itself becomes increasingly invisible. The web version runs in Anthropic-managed cloud sandboxes, so you connect a GitHub repo and you're coding from a browser with no local setup. Sessions sync across desktop, web, and mobile. The Chrome integration lets Claude interact with your browser directly. Remote sessions continue running even if you close your laptop.

The more interesting trajectory is what Karpathy flagged: we're early in the decade of agents. Today's power-user setup, the worktrees and hooks and headless scripts and auto-review and auto-merge, is the manual version of what will increasingly be automated and abstracted away. The developers who understand the principles of agentic engineering (plan before implementing, enforce quality mechanically, manage context ruthlessly, and maintain human oversight on judgment calls) will adapt as the tooling evolves underneath them. The ones who memorize specific commands and configurations will be starting over every six months when the tools change.

The biggest shift isn't technical, though. It's about role. The engineer who automates their own job doesn't lose it. They graduate from it. They stop being the person who writes the code and start being the person who decides what gets built, how it's architected, and whether the result is good enough. That's always been the higher-leverage job. The automation just makes it accessible to a solo developer working from their kitchen table.

This guide is already a snapshot. The setups will change. The discipline won't.
