Tutorial Part 3: When the Agent Goes Off the Rails
Part 3 of the Command-Line LLM Tools tutorial series. Previously: Part 2 — When the Agent Forgets — managing context, compaction patterns, files-as-memory.
Part 2 solved memory. The agent stays sharp now — context doesn't rot, instructions don't get diluted, sessions reset cleanly. This article solves the next problem: the agent that does too much.
You asked it to fix one bug. It refactored four files. You asked it to draft an email. It rewrote three other documents you didn't ask about. You said "fix the test" and it deleted the test. The output is plausible. It's also not what you asked for, and you only notice when you're already invested in the response.
The solve has three layers. Git as the time machine that lets you recover. Approval modes as the soft guarantee that the agent asks before doing risky things. Hooks as the hard guarantee that certain things just can't happen, period.
This is the trust-to-enforcement transition. Most beginners run their CLI agent in trust mode and graduate to enforcement when it bites them. The point of this article is to install the discipline before it bites.
The three layers of "safe"
Trust → Soft Guarantee → Hard Guarantee, with git underneath as recovery. Each one fails in a specific way.
| Layer | Mechanism | Failure mode |
|---|---|---|
| Trust | Instructions file rules | Agent forgot the rule, or you have too many rules |
| Soft guarantee | Approval modes ("ask before X") | Agent asks but you say yes too fast |
| Hard guarantee | Hooks (deterministic blocks that run before the agent acts) | Only catches categorical violations; someone has to write and maintain the hook |
| Recovery | Git | Doesn't prevent damage, but reverses it |
Layered together, these form an envelope. The agent can do real work inside the envelope. When something goes wrong (and it will), the envelope makes the damage recoverable.
Git as the time machine
Most readers know what git is in the abstract. Here's how to make it tactical for AI-assisted work.
The micro-commit habit
Commit after every successful step the agent takes — not at the end of the day, not after a full feature ships. After each individual change you've reviewed and accepted.
Why: when something goes wrong, you want to recover to a few minutes ago, not yesterday. The granularity of your commits is the granularity of your safety net.
The commits will be messy. Lots of them. Half-baked messages, partial work, things you'll squash later or never. That's fine. Commit fearlessly. The point isn't the history — it's the safety net.
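One way to keep the habit cheap is a small shell helper. This is a minimal sketch, assuming bash and a repo whose .gitignore already excludes secrets and generated files; `checkpoint` is a hypothetical name, not a git built-in.

```shell
#!/usr/bin/env bash
# checkpoint: hypothetical helper that records the current state as a throwaway commit.
# It stages everything that .gitignore does not exclude and does nothing on a clean tree.
checkpoint() {
  local msg="${1:-wip}"
  git add -A || return 1
  # `git diff --cached --quiet` exits 0 when nothing is staged.
  if git diff --cached --quiet; then
    echo "nothing to checkpoint"
    return 0
  fi
  git commit -q -m "$msg"
}

# Usage, after reviewing and accepting one change from the agent:
#   checkpoint "wip: agent fixed the null check"
```

Because it stages everything that is not ignored, the helper is only as safe as the ignore rules behind it.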
The "rewind" recovery
When something goes wrong:
git reset --hard <recent-commit>
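Goes back to the last known good state. Uncommitted changes to tracked files are gone; untracked files stay where they are until you delete them or run git clean. Anything committed is recoverable, including commits made after the target, which stay reachable through git reflog until git eventually garbage-collects them.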
git checkout HEAD~1 -- <file>
Recovers one specific file as it was one commit before HEAD. Useful when the agent corrupted one file but other recent work is fine. If the bad change has not been committed yet, use HEAD instead of HEAD~1.
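A hard reset is easier to run without hesitation when the reset itself can be undone. The sketch below wraps `git reset --hard` so the current commit is tagged first. `rewind_to` and the `before-rewind-` tag prefix are hypothetical names chosen for illustration, and uncommitted changes to tracked files are still discarded.

```shell
#!/usr/bin/env bash
# rewind_to: hypothetical wrapper around `git reset --hard`.
# It tags the current commit first, so the rewind can be reversed by resetting to the tag.
rewind_to() {
  local target="$1"
  [ -n "$target" ] || { echo "usage: rewind_to <commit>" >&2; return 2; }
  local tag
  tag="before-rewind-$(date +%Y%m%d-%H%M%S)"
  git tag "$tag" || return 1
  git reset -q --hard "$target" || return 1
  echo "rewound to $target; previous state saved as tag $tag"
}

# Usage:
#   rewind_to HEAD~3
```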
The mental shift this enables: every change is reversible if you committed first. That changes what you're willing to let the agent attempt. The riskier the work, the more you commit beforehand.
What to commit (and what not to)
Commit: source files, brain files, decision notes, knowledge files. Anything that represents the work or its rules.
Don't commit: secrets, generated artifacts, anything in .gitignore, large output files. Make this a brain-file rule explicitly — the agent has a tendency to want to commit everything it's just produced.
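A rule in the brain file depends on the agent remembering it, so it can help to have git check as well. The following is a sketch of a check that could serve as the body of a `.git/hooks/pre-commit` script. The function name and the path patterns (`*.env`, `secrets/`, `output/`) are assumptions for illustration; substitute whatever your project actually treats as off-limits.

```shell
#!/usr/bin/env bash
# staged_forbidden: hypothetical check for staged files that should never be committed.
# Prints each offending path to stderr and returns 1 if any were found, 0 otherwise.
staged_forbidden() {
  local bad=0 path
  # A here-document (not a pipe) keeps the loop in the current shell so `bad` survives.
  while IFS= read -r path; do
    case "$path" in
      *.env|secrets/*|*/secrets/*|output/*)
        echo "refusing to commit: $path" >&2
        bad=1 ;;
    esac
  done <<EOF
$(git diff --cached --name-only)
EOF
  return "$bad"
}

# Usage inside .git/hooks/pre-commit:
#   staged_forbidden || exit 1
```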
Branching as work isolation
For experimental work, use branches:
git checkout -b experiment-rewrite-auth
The agent operates on the branch. If the experiment fails, throw the branch away. Your main branch only ever has work you've validated.
This is especially useful when you're going to let the agent try something you suspect won't work. Failure on a branch is free.
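The full life cycle of a throwaway branch can be sketched as two small helpers. The function names are hypothetical; the git commands inside them are standard.

```shell
#!/usr/bin/env bash
# start_experiment: create a new branch from the current commit and switch to it.
start_experiment() {
  git checkout -q -b "$1"
}

# abandon_experiment: $1 = experiment branch, $2 = branch to return to (for example main).
# --force discards uncommitted changes to tracked files;
# -D deletes the branch even though it was never merged.
abandon_experiment() {
  git checkout -q --force "$2" && git branch -q -D "$1"
}

# Usage:
#   start_experiment experiment-rewrite-auth
#   ...let the agent work, micro-committing on the branch...
#   abandon_experiment experiment-rewrite-auth main
```

If the experiment succeeds instead, merge the branch rather than deleting it.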
What this doesn't solve
- Files outside the repo (the agent might still touch them)
- External system actions (sent emails, API calls, posts to services)
- Operations that don't show in diffs (running scripts that change other systems' state)
For everything inside the repo, git is the safety net. For everything outside, you need the next two layers.
Approval modes — soft guarantees
Most CLI agents have approval settings. The shapes:
- Ask for everything. Every file write, every command. Slow, safe.
- Auto-approve scoped operations. File reads always auto, file writes ask, shell commands ask.
- Auto-approve everything (yolo mode). Fast, dangerous.
Where to start
Default to "ask before writes and commands" for the first few weeks. Read every plan the agent proposes. It's slow and annoying. It also works.
You're not just using the tool. You're learning the agent's tendencies — when it over-delivers, when it skips reading, when it interprets ambiguous requests in surprising ways. You can't learn any of that on yolo mode, because you never see the choices it almost made.
When to upgrade trust
Specific kinds of operations on specific kinds of files become safe to auto-approve once you've built confidence. "Always auto-approve writes inside this scratch directory" is reasonable — the worst case is throwing away the directory. "Always auto-approve shell commands" is not — the worst case is much worse.
Move toward auto-approval one scope at a time. Each upgrade should be a deliberate choice, not a tired-Andy-at-midnight choice.
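Scoped approval can be pictured as a small decision function. This is an illustration of the idea only, not the configuration format of any particular agent; `decide` and the `scratch/` directory are hypothetical.

```shell
#!/usr/bin/env bash
# decide: hypothetical approval policy. Prints "auto" or "ask".
# $1 = operation (read | write | shell), $2 = path or command.
decide() {
  case "$1" in
    read) echo auto ;;
    write)
      case "$2" in
        *..*)      echo ask ;;   # any path containing ".." could escape the scratch directory
        scratch/*) echo auto ;;  # worst case: throw the directory away
        *)         echo ask ;;
      esac ;;
    *) echo ask ;;               # shell commands and anything unrecognized
  esac
}

# Usage:
#   decide write scratch/notes.md
#   decide shell "npm test"
```

Each trust upgrade then becomes one new, reviewable branch of the `case` statement rather than a global switch.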
The gotcha
The "ask first" guarantee is soft because you are the bottleneck. Tired-Andy at midnight says yes to everything. The agent asked. You approved. Damage done.
This is why hooks exist.
Hooks — hard guarantees
The transition from "trust the agent to follow the rules" to "code enforces the rules."
What a hook is
A small script that runs before certain agent actions. It can inspect the action and refuse it. Refusal is non-negotiable — the agent cannot proceed.
Examples: pre-tool-use hook (runs before any tool call), pre-write hook (before writes), pre-commit hook (before commits).
The shift in posture
- Trust: "Agent please don't delete files in /sacred/."
- Soft guarantee: "Agent ask me before deleting files in /sacred/."
- Hard guarantee: "Hook blocks any delete in /sacred/, regardless of approval."
The third one survives tired-Andy at midnight. The first two don't.
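The hard-guarantee version of the /sacred/ rule might look like the sketch below. How a hook receives the action and signals refusal depends on the tool, so check your agent's documentation; this example assumes the path arrives as the first argument, already absolute and normalized, and that a non-zero return means "refuse". `pre_delete_hook` is a hypothetical name.

```shell
#!/usr/bin/env bash
# pre_delete_hook: hypothetical hook body. Returns 0 to allow, 1 to refuse.
# $1 = path the agent wants to delete (assumed absolute and normalized).
pre_delete_hook() {
  case "$1" in
    /sacred|/sacred/*)
      echo "blocked: deletes under /sacred/ are not allowed" >&2
      return 1 ;;
  esac
  return 0
}

# Usage:
#   pre_delete_hook /sacred/ledger.db || exit 1
```

There is no approval branch anywhere in the function, which is the point: nothing the agent or a tired reviewer says changes the outcome.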
Examples worth writing
- Block writes to specific paths — config files, secrets directories, anything you know shouldn't be touched
- Refuse shell commands matching dangerous patterns (rm -rf, git push --force, anything that touches production)
- Validate file shape before write (must be valid JSON, must have required frontmatter, etc.)
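The three kinds of check listed above can be sketched as shell functions. The names, the protected paths, and the return-code convention (non-zero refuses) are all assumptions for illustration. The command check is a naive substring match: it misses variants such as `rm -fr` and also flags `git push --force-with-lease`, so a production hook would want real argument parsing.

```shell
#!/usr/bin/env bash
# check_command: refuse shell commands that contain a dangerous substring.
check_command() {
  case "$1" in
    *"rm -rf"*|*"git push --force"*|*"git push -f"*)
      echo "blocked command: $1" >&2
      return 1 ;;
  esac
  return 0
}

# check_write: refuse writes to protected paths. $1 = target path.
check_write() {
  case "$1" in
    secrets.env|*/secrets.env|secrets/*|*/secrets/*)
      echo "blocked write: $1" >&2
      return 1 ;;
  esac
  return 0
}

# check_frontmatter: the proposed file content arrives on stdin
# and must begin with a '---' line.
check_frontmatter() {
  local first
  IFS= read -r first || true
  [ "$first" = "---" ] || { echo "blocked: missing frontmatter" >&2; return 1; }
}

# Usage:
#   check_command "$proposed_command" || exit 1
#   check_write "$target_path" && check_frontmatter < "$proposed_file" || exit 1
```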
What hooks don't solve
They protect against agent mistakes. They don't protect against your mistakes — if you bypass the hook deliberately, that's on you.
They don't help with subtle damage — the agent writes plausible but wrong code, the hook can't tell. Hooks catch categorical violations, not quality problems.
They have to be written and maintained. They're code. Treat them like any other code: review them, test them, remove them when they're no longer load-bearing.
Putting it together
A real walkthrough: the agent is doing a multi-file refactor.
- Brain-file rule says "plan before doing." Agent drafts a plan. You read and approve.
- Approval mode says "ask before writes." Agent confirms each write before making it.
- Hook blocks writes to secrets.env. Plan didn't include touching it, but if a misinterpretation suggested it, the hook would refuse.
- After each successful write, micro-commit.
- Halfway through, you realize the plan had a structural bug. git reset --hard to before the refactor started.
- New plan. Same protections. Second attempt succeeds.
This is what mature CLI use feels like. Not paranoid. Not yolo. Structurally protected.
Three habits that cement the pattern
Make commits cheap
git commit -m "wip" is fine. Commit fearlessly. Squash later if you care about history. The cost of a messy commit history is approximately zero. The cost of an unrecoverable mistake is non-zero.
Read the plan
Every plan. Every time. Even when it's tedious. The agent's plan is the contract — what it's about to do. If you skip reading it because the agent has been good lately, you'll skip reading the one that mattered.
Reading takes thirty seconds. Recovering from an unread plan takes an hour.
Write a hook the moment you notice a class of mistake
The agent did something destructive once. Decide whether it was a one-off or a class. If it's a class, write a hook so it can't happen again — not "won't if it remembers the rule," but can't.
Hooks accumulate slowly. Don't try to write them all upfront. The right rate is "one new hook per week" until you have the obvious safety surface covered.
Closing: what's next
The agent now operates inside a safety envelope. It can't easily damage your work. It asks before risky operations. Anything you committed inside the repo is recoverable via git.
But you've started noticing a different problem: the agent doesn't know enough. It doesn't have access to your database, can't see your specific docs, doesn't have specialist knowledge for the work you do most. Part 4 covers extending the agent's capability — skills and sub-agents.
The Command-Line LLM Tools Tutorial Series
- Why Use Command-Line LLM Tools — the case for moving from chat to terminal
- Part 1: Getting Started with Claude Code & Codex CLI — install, instructions file, day-one rules
- Part 2: When the Agent Forgets — managing memory, context, and compaction
- Part 3: When the Agent Goes Off the Rails (you are here)
- Part 4: When the Agent Doesn't Know Enough — extending capability with skills and sub-agents
- Part 5: When Configuration Becomes Architecture — what happens when you outgrow a single instructions file