Treat Your Agent Logs As Training Data

If you use Codex or Claude Code as a dev companion, you already know the pattern:

  • you ask it to fix tests and clean up a file
  • it confidently claims everything is "done"
  • CI comes back red
  • now you are re-explaining the repo, stopping long commands, and telling it not to do the same dumb thing again

You get annoyed, close the session, and open a new one.

The agent never learns from that frustration. You do. And you are sitting on all of that data.

This post is about how I turned my session history into training data for the agent itself.

Instead of guessing how to tweak AGENTS.md or CLAUDE.md, I point Codex at its own logs, have it mine every frustration and interruption, and then let that drive concrete rules for how future sessions are allowed to behave.


The idea: session history as training data

Every user message in your logs is already labeled.

  • "you broke it" and "what are you doing" are negative labels
  • "stop, abort, undo that" are interruption labels
  • "thanks, that was perfect" are positive labels

If you aggregate that across your sessions you get something a lot better than vibes:

  • a distribution of failure modes (test failures, context drift, regressions, etc.)
  • a list of behaviors that tend to trigger interrupts (stop, abort, wait, undo commands)
  • a list of behaviors that tend to earn praise (clear plans, working code, tests passing)
  • a rough heuristic of what “done” actually looks like for you (concrete changes + tests + no open TODOs)
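
In code terms, the labeling can start as crude keyword matching over user messages. A minimal sketch (the phrase lists here are my own illustrative guesses, not what any agent actually derives):

```python
# Rough sketch: label a user message by keyword matching.
# The phrase lists are illustrative; tune them to your own vocabulary.

NEGATIVE = ("you broke", "what are you doing", "still failing", "regression")
INTERRUPT = ("stop", "abort", "undo", "wait", "revert")
POSITIVE = ("thanks", "perfect", "good work", "nice")

def label(message: str) -> str:
    text = message.lower()
    # Interrupts first: "stop, undo that" should not count as mere negativity.
    if any(p in text for p in INTERRUPT):
        return "interrupt"
    if any(p in text for p in NEGATIVE):
        return "negative"
    if any(p in text for p in POSITIVE):
        return "positive"
    return "neutral"
```

Substring matching is deliberately sloppy here; the point is that even this gets you a usable first-pass distribution.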

From there the move is straightforward:

  1. Treat your Codex and Claude logs as a dataset.
  2. Ask an agent to analyze that dataset using basic text and counting tools.
  3. Turn the resulting patterns into a hard checklist in AGENTS.md / CLAUDE.md.

The rest of this post is just one implementation of that pattern.


Where the logs live (Codex and Claude)

On my setup the relevant paths look like this:

# Codex sessions
~/.codex/sessions/2025/


# Claude Code projects
~/.claude/projects/<project-name>/

Each of those directories contains a bunch of *.jsonl files that represent sessions. They include:

  • user messages
  • assistant messages
  • some metadata like cwd and timestamps
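
To pull just the user messages out, a small parser is enough. This is a sketch that assumes each line is a JSON object with a role and some text content, possibly nested under a `message` key; the real schemas vary by tool and version, so it skips anything that does not match:

```python
import json
from pathlib import Path

def user_messages(path: Path):
    """Yield user-role message texts from one session .jsonl file.

    Schema is an assumption: a flat {"role", "content"} object, or the
    same shape nested under a "message" key. Unparseable lines are skipped.
    """
    for line in path.read_text().splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = record.get("message", record)
        if isinstance(msg, dict) and msg.get("role") == "user":
            content = msg.get("content")
            if isinstance(content, str):
                yield content
            elif isinstance(content, list):  # content-block style messages
                for block in content:
                    if isinstance(block, dict) and block.get("type") == "text":
                        yield block.get("text", "")
```

The tolerant shape-checking matters more than elegance: two months of logs will contain several schema variants, and a reviewer script should shrug them off rather than crash.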

The review flow is:

  1. cd into a logs directory.
  2. Launch a fresh Codex or Claude session with that directory as the working directory.
  3. Paste in a “session review” prompt that tells the agent how to mine the logs and where to write its report.

You can run this per project, per year, or across everything. For the case study below I pointed it at a big pile of Codex sessions under ~/.codex/sessions/2025/10 and ~/.codex/sessions/2025/11.


The review prompt

Here is the exact prompt I use when I run a review from inside a log directory.

Goal:
Review all session logs in this directory (most recent have priority but all are in scope) to find user pain points and positives across projects. Write results to /tmp/session-review.md. Commands may be project specific so keep that in mind as you craft your findings and recommendations.

Scope:
- All *.jsonl files in this directory.
- Scan subdirectories.
- Prioritize most recent files.
- Ignore agent*.jsonl files.

What to extract (user-role messages only):
- Detect projects via fields in the .jsonl file or via filepath.
- Pain themes: tests/checks complaints, tool/command misuse, regressions or destructive edits, context drift (wrong repo/dir/deprecated stuff), over-communication or hesitation, and profanity.
- Positives: praise, thanks, or explicit "good work"; capture what the agent delivered and the proof shown.
- Interruptions: any stop, undo, abort, wait, or revert language; find what the agent was doing that triggered it.
- Silent acceptance: sessions ending with an assistant last message; judge "likely satisfied" only if the final turn shows concrete changes + commands run + green checks/tests + no open TODOs.
- Context-injection moments: where the user had to supply cwd, path, env, or repro midstream; note what the agent failed to ask.
- Break or fix loops: user reports something broke; link to edits and tests that would have caught it.

Outputs (in /tmp/session-review.md):
1) Scope and method summary (files, date range, project detection).
2) General findings (cross-project) with counts where useful.
3) Project-specific findings (per repo) with "Do" and "Don't" and any positive examples.
4) Interruption insights (why users stopped you and how to prevent that).
5) Silent-acceptance heuristic and guidance.
6) Positive patterns: behaviors that earned praise (with brief examples).
7) Agent checklist (ready to drop into AGENTS.md / CLAUDE.md).

Provide detail and examples. Brevity is not a priority for the findings report.

Working style:
- Use Python one-liners or short scripts to scan and count; prefer rg for quick searches.
- Do not modify source session files. Only write /tmp/session-review.md.
- Summaries, not long quotes. No need for apply_patch unless editing the report.
- If you sample sessions, mention which files (basename) in the findings.
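
The scope rules in the prompt translate directly into a few lines of Python. A sketch of the file-collection step (recursive, `agent*.jsonl` excluded, newest first):

```python
from pathlib import Path

def session_files(root: str = ".") -> list[Path]:
    """Collect session logs the way the Scope section asks:
    all *.jsonl files, subdirectories included, agent*.jsonl
    excluded, most recently modified first."""
    files = [
        p for p in Path(root).rglob("*.jsonl")
        if not p.name.startswith("agent")
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```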

Important details:

  • It only looks at user messages. I care what I said about the agent, not what it claimed it did.
  • It forces the agent to write a real markdown report to /tmp/session-review.md.
  • It defines concrete output sections so the report is reusable, not a one-off wall of text.

You can copy this verbatim and only tweak the scope rules if your log layout is different.
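
The silent-acceptance rule is the fuzziest part of the prompt, so here is roughly what it cashes out to in code. A sketch with hand-picked signal words (all of them assumptions, not what the agent actually derives from your logs):

```python
def likely_satisfied(final_assistant_message: str) -> bool:
    """Sketch of the silent-acceptance heuristic: call a session
    'likely satisfied' only if the final assistant turn shows concrete
    changes, commands run, green checks, and no open TODOs.
    The signal words below are illustrative guesses."""
    text = final_assistant_message.lower()
    has_changes = any(w in text for w in ("edited", "updated", "added", "changed"))
    ran_commands = "ran" in text or "$" in final_assistant_message
    green_checks = any(w in text for w in ("passed", "all green", "0 failed"))
    open_todos = "todo" in text
    return has_changes and ran_commands and green_checks and not open_todos
```

The conjunction is the point: a session that ends with "I think it should work now" fails three of the four conditions and should not be counted as success.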


What the review actually found in my logs

For this run I pointed Codex at two months of its own sessions:

  • 1,237 rollout-*.jsonl files
  • under ./10 and ./11
  • covering Oct 1, 2025 to Nov 27, 2025

Just from my user messages, the report surfaced:

  • 2,366 mentions of failing tests, builds, or lint
  • 2,640 instances of context drift where I had to restate repo, dir, or correct the scope
  • 1,155 hard interrupts where I told it to stop, undo, abort, or wait
  • 551 messages about regressions or “you broke it” situations
  • 340 bits of praise or “good work” acknowledgments
  • 159 messages with profanity, mostly concentrated in a couple of frustration-heavy projects
  • 41 explicit “context-injection” moments where I had to provide cwd or paths midstream
  • 846 sessions that ended with the assistant speaking last, of which only 209 looked “likely satisfied” by the heuristic in the prompt

That is not a vibe check. That is a profile of how my agents actually fail in practice.

At the project level it broke things down further. For example, one of my main repos had:

  • test failures as the dominant complaint
  • a huge number of context drift nudges where I had to remind it what repo we were in
  • a couple of long-running scripts that triggered “stop this” messages after 30-plus minutes

For each project the report gave me a short list of “Do” and “Don’t” based on actual events. That became the seed for the rules that now live in AGENTS.md.


From findings to AGENTS.md and CLAUDE.md

Running the review is interesting, but it only pays off if you turn the findings into hard constraints for future sessions.

The last section of the report is an Agent checklist. I dropped that almost verbatim into the AGENTS.md for my main repos and mirrored the key points into CLAUDE.md for Claude Code.

Here is an example of what that checklist looks like after I cleaned it up a bit:

## Agent checklist

- Confirm scope up front: repo, branch, environment, and approval policy.
- Echo cwd and repo before any write operation.
- State a plan in 4 bullets or fewer before large changes.
- Time-box expensive or long-running commands and say how you will report progress.
- Prefer read and analysis before write; use dry runs or `--check` flags when available.
- Always run tests, lint, and typecheck when they exist and show the commands and pass or fail summary.
- Never end a session on a guess. End with what changed, proof (tests or logs), and ask if anything else should be verified.
- When interrupted or told to stop, immediately summarize what was in flight and what rollback or cleanup steps are available.
- Reuse any paths, env vars, or context the user already gave you. Do not make them repeat it.
- Watch for “broke” or “regression” language and propose a test that would have caught it.

Each line maps back to the numbers from the review:

  • high test related pain → “always run tests and show output”
  • high context drift → “always echo cwd and repo before writes”
  • lots of silent endings that were not real success → “never end on a guess”
  • many interrupts around long scripts → “time-box long commands and surface progress”

The difference before vs after is simple:

  • Before: generic “be careful” style guidelines and prompt hacks.
  • After: a small list of non-negotiables derived from my own data, per project.


How to run this yourself

You do not need my exact tools or paths to use this pattern.

The recipe is:

  1. Find your logs. Wherever your agent stores chat or session history, that is the dataset. For Codex that is ~/.codex/sessions/2025/. For Claude Code that is ~/.claude/projects/<project-name>/.

  2. Start a review session in that directory. cd into a logs folder and open a fresh Codex or Claude session with that as the working directory.

  3. Paste the review prompt. Use the prompt above. Tweak the scope section if your log structure uses different filenames or subdirs.

  4. Let it write /tmp/session-review.md. When the agent is done, open that file in your editor. It should contain the scope, general findings, project specific notes, interruption insights, and the agent checklist.

  5. Copy the checklist into AGENTS.md / CLAUDE.md. Paste, adjust wording for your environment, and commit it into your repo so it travels with the code.

  6. Re-run on a schedule. Every few weeks, run the review again. Drop the new report into something like docs/agent-reviews/2025-11-session-review.md and compare it to the previous one. You should see failure modes drop or shift over time if the agent is actually following the new rules.
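
The comparison in step 6 can be as mechanical as diffing theme counts between two runs. A hypothetical helper, assuming you keep per-theme counts from each report:

```python
def compare_runs(previous: dict, current: dict) -> dict:
    """Delta of per-theme counts between two review runs.
    Negative numbers mean a failure mode is shrinking."""
    themes = set(previous) | set(current)
    return {t: current.get(t, 0) - previous.get(t, 0) for t in themes}
```

If the new rules are working, the interrupt and context-drift deltas should trend negative while praise holds steady or grows.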

You can also point the review at a single project’s logs to get per-repo behavior, or at everything at once to get a global view of how your agents are doing.


Why bother

There are a million “prompt engineering tips” floating around. Most of them are generic. They are not tuned to you, your repos, or your tolerance for risk.

Your session history is different. It is:

  • your commands
  • your repos
  • your test setup
  • your interruptions
  • your profanity

If you are going to keep yelling at your agents, you might as well let them read the transcripts and learn from it.

Point them at their own logs, mine the pain, and let the data write the spec.