The Agent Audit • September 1, 2026 Cohort

You’re not bad at AI. That’s the problem.

You cannot compound what no instrument has measured. That sentence is the entire trap. The Audit is the instrument that ends it.

The Agent Audit is a two-week measurement engagement that scores your real AI-assisted work against four research-validated key risk indicators (KRIs) and hands you the instrument — not an opinion — for knowing whether your AI work is compounding month over month.

The operators who got the most out of Claude in the last twelve months are the ones now most frustrated by it — because they hit a ceiling that better prompting cannot break through.

→ Cart opens Tuesday, September 1, 2026. Two weeks of measurement. Sixty-minute walkthrough. The four numbers that have never been measured on you before. $2,500.

Open-source binary · MCP Registry-listed · 91 peer-reviewed papers · Four research-validated KRIs · No testimonials — read the code instead.

01 Here Is What Is Actually Happening

The Question You Cannot Answer At 4:47 On A Thursday

You have a job already. A team. A Linear backlog. A co-founder who, at some point this quarter, is going to ask, around 4:47 on a Thursday afternoon: is the AI thing actually getting better month over month, or are we just paying the bill?

You are going to start that answer with “well, it’s hard to say, but…”

Because you can feel it is better. You cannot prove it.

The standards you taught the agent last night are gone by Wednesday morning. The fix you wrote into a system message in March is somewhere, but not here. The same em-dash, the same function signature, the same utility where you said “use the existing one” and it wrote a new one anyway: corrected on Tuesday, again on Friday, again next week. Nobody tallies that. You don’t tally it. The work you do at night is not building anything that is still there in the morning.

The agent has no system around it. It has you around it.

That’s the wall you have hit.

02 The Mechanism

The Trap Your Competence Built — And The Tax It Charges

People who never got useful work out of Claude do not feel any of this. They quit at week two and went back to typing by hand.

You did not. You learned which prompts work, which contexts to load, which tasks to delegate. You wired up MCP servers. You wrote a couple of skills. You are top-decile at this.

And then it stopped getting better.

The same agent that solved a hard problem on Monday makes a stupid mistake on Friday. The patterns you taught it last month are gone. The corrections you made last week have to be made again this week.

That is not a skill gap. You already did the skill work.

That is a structural gap — between what you have taught the agent and a system that captures what you have taught it. Past this ceiling, more skill does not help. More skill is what built the ceiling.

First Face

The Operator Tax

The hours of your week the agent charges you for being the only system around it — paid in re-teaching and re-supervising and the same standard re-issued every Tuesday.

Second Face

The Restart Tariff

The same fee measured per session instead of per week. Same root cause, different unit. Both stop when one instrument measures what each one costs you on your machine.

03 Two Ways To Live With AI

You’re In Column One. Here’s Column Two.

AI Tool User versus Agent Architect comparison
	AI Tool User	Agent Architect
Pattern	Tries every new model, prompt pack, framework	Builds one compounding system, deepens it weekly
When the agent fails	Blames the model. Upgrades. Rewrites the prompt.	Captures the mistake. Encodes the fix. Measures it doesn’t repeat.
Investment model	Spending to get a result	Building an asset that compounds
Measure of progress	“Did this output look good?”	“Did this KRI improve this week?”
Pain	“I keep re-teaching my AI the same things” (Operator Tax / Restart Tariff)	Past that — the loop is built

You are excellent at the left column. That is exactly why the right column is visible to you, and why the gap reads as a reproach. It is not. The right column is the architecture for what you have already done with skill. Same skill set. Instruments on it. Captured into a system that ratchets.

You do not need to become a different operator. You need to put instruments on the operator you already are.

04 The Four Numbers

Why No Instrument Existed Until Now

You make a change — a new system prompt, a new MCP server, a different rule file — and a week later you have a vague impression things are better. Or worse. Or the same. So the whole year ends up as one uncontrolled experiment with no measurement plane.

That is not your fault. Until late 2025, no instrument existed because the literature had not consolidated. The 91 peer-reviewed papers on agent reliability and continuous-learning architectures clustered into measurement agreement during the 2025–2026 window. Before that consolidation the field disagreed about which numbers to measure; after it, the four below were the ones the literature had converged on. The instrument is new because the agreement is new.

You install one binary — curiochat-audit. Open-source. MIT-licensed. Listed in the MCP Registry. You can read every line. So can your security team. Five minutes. You configure nothing. You change nothing about how you work.

You work normally for two weeks. The hooks watch silently. They see what you correct, how often, what you re-correct, how much you supervise, what the agent retains.

At the end of fourteen days you have something you have never had: a baseline.

Attempts ↓

How many times you push back, rephrase, or correct before the agent gives you what you wanted. You remember the wins. The number does not.

Recurrence ↓

Corrections this month that you also made last month, because nothing learned. This is the Restart Tariff with a unit attached.

Presence ↓

How much of your week you spent sitting beside the agent, reviewing every output, intervening mid-task. This is the Operator Tax with a unit attached. (Multiply by your hourly rate when you are feeling brave.)

Conversion ↑

The percentage of your corrections that actually became persistent learnings the system remembers next session. With no architecture in place, this number is approximately zero. Approximately. Zero.

That last sentence is the whole reason this product exists.
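As a rough illustration of how the four definitions above could be turned into numbers, here is a minimal Python sketch. The `Task` record, its field names, and the two sets of correction names are hypothetical; the actual curiochat-audit event schema is not shown on this page, so treat this as one plausible reading of the definitions rather than the binary's own logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical record shape, invented for this sketch.
    attempts: int                  # tries before the output was usable
    hands_on_minutes: float        # time spent supervising or intervening
    total_minutes: float           # wall-clock time for the task
    corrections: tuple             # names of correction patterns made in this task


def kri_baseline(tasks, previous_corrections, persisted_corrections):
    """Compute the four KRIs for one measurement window.

    previous_corrections: pattern names already corrected in the prior window.
    persisted_corrections: pattern names that became persistent learnings.
    """
    if not tasks:
        raise ValueError("need at least one task")
    all_corrections = [c for t in tasks for c in t.corrections]
    n = len(all_corrections)
    total_minutes = sum(t.total_minutes for t in tasks)
    return {
        # Attempts: mean tries per task (lower is better).
        "attempts": sum(t.attempts for t in tasks) / len(tasks),
        # Recurrence: share of corrections that repeat an earlier one (lower is better).
        "recurrence": sum(c in previous_corrections for c in all_corrections) / n if n else 0.0,
        # Presence: share of task time spent hands-on (lower is better).
        "presence": sum(t.hands_on_minutes for t in tasks) / total_minutes if total_minutes else 0.0,
        # Conversion: share of corrections that persisted (higher is better).
        "conversion": sum(c in persisted_corrections for c in all_corrections) / n if n else 0.0,
    }
```

With nothing persisted, `persisted_corrections` is an empty set and Conversion comes out at zero, which is the case the paragraph above describes.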

What It Looks Like Rendered

This is the kind of summary block your dashboard prints on Day 14. The numbers below are illustrative — yours will be yours, and they will not be these:

Day 14 · baseline dashboard

curiochat-audit · 14-day baseline · sessions: 47 · events: 12,381
─────────────────────────────────────────────────────────────
Attempts (mean/task) ........  4.3   target 1
Recurrence (re-correction)  .  38%   target 0
Presence (hands-on share) ...  71%   target ≤ 30%
Conversion (corrections→lore)  3%    target ≥ 60%
─────────────────────────────────────────────────────────────
Trust Calibration: HIGH PRESENCE × HIGH ATTEMPTS
  → Broken Workflow quadrant
  → primary leak: Recurrence (re-correction within 14 days)

You read that and you stop arguing with yourself about whether the AI thing is working. The argument is over. The numbers are on the table.

05 Privacy

Before You Go Further

Almost every prospective buyer stops here and asks: what exactly leaves my machine?

The answer is in eighty lines of Python called filter.py. You can clone the repository in a second tab and read it before you read another paragraph on this page. You should.

Here is what one reader already wrote on the pull request that landed it.

Code-review comment

curiochat-audit PR #14 · filter.py v0.3.1

Reviewer: appsec engineer (handle redacted at their request, public PR thread on GitHub)

Read the filter end-to-end. Three things I checked:

(1) Conversation content — never reaches the upload payload. The redactor short-circuits on the message.content field before any aggregation step. Confirmed by running the upload path locally against a corpus of test sessions with embedded fake credentials, fake PII, and fake client names. None of it surfaced in the outgoing JSON. The diff written to disk before upload makes this auditable per-session — I can see what is about to leave before it leaves.

(2) Credential patterns — caught by the regex layer plus the entropy heuristic. I tried to slip an OpenAI key, an AWS secret, and a Postgres URI through the test corpus. All three blocked.

(3) Hook performance — event-driven, no polling. The on_session_stop handler completes in ≤ 18ms on the sample sessions in tests/fixtures/. I would notice 18ms a year before I would notice a memory leak; this binary is the opposite of that.

Approving. I would run this on my own production sessions.

That is not a testimonial. It is a public code-review comment on a public pull request, on a public repository, on a public filter you can read in five minutes yourself. The voice is not mine. The verification is not mine. The reviewer is one git log away from your terminal.

The same git clone runs in §11.
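To make the reviewer's three checks concrete, here is a minimal Python sketch of a regex layer plus an entropy heuristic sitting behind a content-dropping step. It is not the contents of filter.py: the patterns, the thresholds (`min_length=20`, `min_entropy=4.0`), the allow-listed field names, and the allow-list approach itself are assumptions made for illustration. Read the real file for the real rules.

```python
import math
import re
from collections import Counter

# Illustrative credential shapes only (assumed, not taken from filter.py).
CREDENTIAL_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),           # OpenAI-style key shape
    re.compile(r"AKIA[0-9A-Z]{16}"),                # AWS access key ID shape
    re.compile(r"postgres(?:ql)?://\S+:\S+@\S+"),   # Postgres URI with a password
]


def shannon_entropy(token):
    """Bits of entropy per character of the token."""
    counts = Counter(token)
    total = len(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token, min_length=20, min_entropy=4.0):
    # Long, high-entropy tokens are treated as probable secrets.
    return len(token) >= min_length and shannon_entropy(token) >= min_entropy


def is_sensitive(value):
    """Regex layer first, entropy heuristic second."""
    if any(p.search(value) for p in CREDENTIAL_PATTERNS):
        return True
    return any(looks_like_secret(tok) for tok in re.split(r"[\s\"'=:,]+", value))


def build_payload(events):
    """Keep an allow-list of metadata fields; conversation content never qualifies."""
    allowed = {"event_type", "session_id", "duration_ms"}  # hypothetical field names
    payload = []
    for event in events:
        kept = {k: v for k, v in event.items() if k in allowed}
        # Defensive second pass: drop string values that still look sensitive.
        kept = {k: v for k, v in kept.items()
                if not (isinstance(v, str) and is_sensitive(v))}
        payload.append(kept)
    return payload
```

An allow-list is used here because it fails closed: a field nobody thought about is dropped by default instead of uploaded by default.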

06 The Trust Calibration Matrix

Plot Yourself, Right Now

Two of those numbers — Presence (the Tax) and Attempts — combine into the most diagnostic chart in the report.

Trust Calibration Matrix: Presence axis versus Attempts axis
	Low Attempts	High Attempts
Low Presence	Calibrated Trust	Over-Trust — silent errors shipping under your name
High Presence	Under-Trust — paying yourself to over-supervise	Broken Workflow

Almost every operator I talk to is sure they are in the top-left box. The data almost never agrees.

You do not have to wait for the audit to get an indication. Six yes/no questions. The questions are the cheap version of the instruments the binary uses; the binary measures, the questions estimate.

Axis 1 — Presence

How much of the work is you holding it together?

In the last week, did you sit through the agent’s output token-by-token for more than half of your sessions?
Did you intervene mid-task (cancel, rephrase, hand-correct) at least once per task on average?
If you stepped away for an hour mid-session, would the work in flight come back wrong more often than right?

Axis 2 — Attempts

How many tries does it take to get to a usable output?

Do you routinely rephrase the same request two or more times before the agent lands it?
In the last five tasks, did at least three need more than one pass to be shippable?
Do you keep a “good prompts” file precisely because the agent does not remember what worked yesterday?

Three “yes” on an axis → High on that axis. Two → likely High. Zero or one → Low.

Plot the dot. The audit will tell you whether you were right.
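The scoring rule above is small enough to write down. This Python sketch maps three yes/no answers per axis to a level and then to a quadrant of the matrix. Collapsing "likely High" into High when placing the dot is an assumption of the sketch; the page does not say how the middle case is plotted.

```python
# Quadrant names keyed by (presence_level, attempts_level), as in the matrix.
QUADRANTS = {
    ("Low", "Low"): "Calibrated Trust",
    ("Low", "High"): "Over-Trust",
    ("High", "Low"): "Under-Trust",
    ("High", "High"): "Broken Workflow",
}


def axis_level(answers):
    """Three yes -> High, two -> likely High, zero or one -> Low."""
    if len(answers) != 3:
        raise ValueError("each axis has exactly three questions")
    yes = sum(bool(a) for a in answers)
    if yes == 3:
        return "High"
    if yes == 2:
        return "likely High"
    return "Low"


def estimate_quadrant(presence_answers, attempts_answers):
    """Place the self-estimated dot on the Trust Calibration Matrix."""
    def collapse(level):
        # Assumption: treat "likely High" as High for plotting purposes.
        return "Low" if level == "Low" else "High"

    presence = collapse(axis_level(presence_answers))
    attempts = collapse(axis_level(attempts_answers))
    return QUADRANTS[(presence, attempts)]
```

For example, `estimate_quadrant([False, False, True], [True, True, True])` lands in Over-Trust: little supervision, many tries.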

You will be wrong about which quadrant you are in. That is not a put-down. It is the third-most-reliable finding in the literature — operators systematically misjudge their own Presence×Attempts position because the misjudgement is the same cognitive bias that produces the ceiling. The off-diagonals are where the diagnostic value lives. Over-Trust is the quietly dangerous one. Under-Trust is the expensive one. The matrix does not tell you what to do. It tells you which problem you actually have, so the fix is targeted instead of speculative.

07 The Five Days

That Put Instruments On Your Work

You do five things. None of them are configuration.

Day 0
You install. One binary. One command. Five minutes from now, the measurement that ends the guessing is running on your machine. Nothing leaves your machine yet.
Day 1–14
You work. Normally. You build what you build. You correct what you would correct. You forget the audit is running. The fourteenth time this fortnight you fix that em-dash, the system already counted it.
Day 14
You upload. You run /upload-audit-data in Claude Code. The privacy filter you read in §5 strips PII, credentials, and conversation content. The diff is written to disk first; you inspect what is leaving before it leaves.
Day 15–18
I prepare. The dashboard auto-generates the report. I annotate the findings and build your Quick Win artifact — the single highest-leverage skill, rule, or knowledge-system seed your numbers point at. By the time you sit down on the call, it is already built.
Day 18
You walk through it with me. Sixty minutes. KRI baselines. Trust Calibration position. Intelligence Loop Readiness score across five dimensions. Custom architecture blueprint that names exactly which skills to write, which rules to set, which hooks to fire, which MCP servers to install — in what order, with effort estimates. You record the call. You keep everything.

08 Deliverables

What Lands In Your Inbox

1. KRI Baseline Report

Trend lines, autonomy distribution, full Trust Calibration Matrix with your dot plotted on it. The dot you estimated in §6 will be somewhere; the dot the data plots will be somewhere. The distance between them is itself a finding.

2. Intelligence Loop Readiness Assessment

Across five scored dimensions.

3. Custom Architecture Blueprint

Skills, rules, hooks, MCP servers, in build order, with effort estimates.

4. Branded PDF Report

Shareable with team, co-founder, board.

5. Recording of the 60-minute call

Yours forever.

09 Bonuses

Three Bonuses, Built Into The Build

Bonus #1

The Quick Win

The single highest-leverage skill, rule, or knowledge-system seed your data identifies, built during prep, handed over on the call, installed the same day. The audit pays you back inside the audit.

Bonus #2

The MCP Integration Map

Every system in your business — email, CRM, repo, invoicing, calendar, knowledge base — labelled with the MCP server that connects to it and an effort estimate.

Bonus #3

The 30-Day Re-Measurement

Thirty days after the call, you re-run the two-week collection. Same instrument, new baseline. Before-and-after KRIs side by side.
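A before-and-after comparison only reads correctly if each KRI's direction is respected: three of the four should fall, one should rise. The Python sketch below shows one way to lay the two baselines side by side. The dictionary keys and the "any movement in the right direction counts" criterion are assumptions; the page does not define how large a change must be to count as improvement.

```python
# Direction of improvement per KRI, taken from the arrows in §4.
LOWER_IS_BETTER = {
    "attempts": True,
    "recurrence": True,
    "presence": True,
    "conversion": False,
}


def kri_delta(before, after):
    """Compare two baselines and flag which KRIs moved in the right direction."""
    report = {}
    for name, lower_is_better in LOWER_IS_BETTER.items():
        change = after[name] - before[name]
        # Assumed criterion: any strict movement in the good direction counts.
        improved = change < 0 if lower_is_better else change > 0
        report[name] = {
            "before": before[name],
            "after": after[name],
            "change": change,
            "improved": improved,
        }
    return report
```

A real report would probably want a noise threshold as well, since two fortnights of normal work will never produce identical numbers.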

10 The Stacked Guarantee

Bound To Numbers You Generated

Most guarantees are promises. This one ships with the qualifier the reader computed, in the reader’s terminal, before the cart even opened.

Layer 1

Unconditional Refund Bound To Your §11 Pillar Four Count

When you run the Pillar Four command (in §11 below), it returns a number. Call it N. The Architecture Blueprint I hand you on Day 18 is contractually obligated to surface in its Section 1 header at least N distinct recurring correction patterns indexed by name. Not “the audit may find some patterns.” Not “we usually find around three.” A specific count, generated by your terminal, against your git log, this fortnight, and printed back to you on the deliverable’s first page.

If Section 1 surfaces fewer than N, full refund. You keep the blueprint, the KRI data, the recording, the Quick Win artifact. The §11 Pillar Four command is the qualifier. The blueprint’s Section 1 is the binding.

Layer 2

Performance Extension

Implement the top recommendation. If your KRIs do not show measurable improvement inside thirty days, I do a follow-up architecture review at no charge. The 30-Day Re-Measurement runs the same instrument on your work and prints the delta in the same format as the illustrative dashboard in §4. The number you compute on Day 30 is the qualifier on whether the extension fires.

You cannot lose money on this if it does not work, and you cannot get stranded on it if you implement it and the numbers do not move. Both layers are bound to numbers your terminal computes. Neither is a number I assert.

11 Why $2,500 Holds

Five Things You Verify In Another Tab

Two of them on your own machine. One of which binds the guarantee above.

Five things hold the price. Three live in repositories you can read this minute. Two run on your own data, in your own terminal. These five reader-run checks are the only proof stack on this page, and Pillar Four feeds the guarantee in §10 above.

Pillar One — clone `curiochat-audit` and read `filter.py`

git clone https://github.com/boutquin/curiochat-audit
cd curiochat-audit && less filter.py

Eighty lines. The hook handlers in hooks/ are forty more. The instrument that measures your work is shorter than your average prompt library. (The reviewer in §5 already did this. You can second-source them.)

Pillar Two — clone `mcp-server-email` and run `dotnet test`

git clone https://github.com/boutquin/mcp-server-email
cd mcp-server-email && dotnet test

2.3-to-1 test-to-source ratio, 77% coverage, fuzz targets, v1.0.2 in production, MIT-licensed, MCP Registry-listed. The bar the audit binary is built to is the bar that just printed green in your terminal.

Pillar Three — read what the Pierre OS Intelligence Loop actually captured this month

Pierre OS runs my business. Twenty-nine production skills, four rule sets, and a knowledge repo that captures every correction I make as a learning entry, scores it, deduplicates it by hash, and (when a single pattern recurs three or more times across distinct sessions) promotes it to lore — a higher tier that propagates to every active rule file and stays there.

That is not a hypothetical. As of the last propagation pass on 2026-05-15, lore entry S-001 reads:

“Working-tree state ≠ HEAD state in a multi-session project — defend every read, edit, and stage against concurrent-session entanglement.”

That lore entry was promoted from a recurring learning pattern — slug x-concurrent-session-worktree-entanglement, learning L-0037 — once it reached three occurrences across distinct sessions. Its evidence trail names five related learnings (L-0017, L-0020, L-0055, L-0057, L-0058) where the same pattern surfaced under different shapes. The promotion is now propagated to two active rule files (operational-safety.md and session-lifecycle.md) — every session I open from now on loads that rule before reading its first instruction.

The rule promoted alongside that lore entry is the one Pierre OS calls Plan-State Table Propagation. Reproduced verbatim from ~/.claude/rules/operational-safety.md line 323:

Plan-State Table Propagation

Closing a checklist item = update BOTH the audit log AND the plan-state table.

Specs, sprint plans, roadmaps, and gate documents almost always carry two parallel surfaces that reflect the same closure event. When a checklist / prerequisite / gate / hygiene item closes, both surfaces need to update in the same edit.

Audit-log entries are additive (new row); plan-state entries are flips (existing cell rewritten). Different operations, both required.

That rule entered the rule set after Session 2026-05-04-2232 produced learning L-0037 — a session where six §2 hygiene items got closed by recording results in the §10 decision log but the §2 status column kept showing the original instructions. The pattern recurred. The instrument caught it. The rule now fires across every session I open.

git clone https://github.com/boutquin/pierre-os
cd pierre-os && cat knowledge/learnings/global/README.md

As of 2026-05-16 the index shows 67 total learnings, 38 high-priority, three recurring patterns. The Conversion KRI on this page is literally the rate at which your corrections become entries that walk this same path. This is what “proof by example” means: not “the log exists” but here is what the log says, here is the rule it produced, here is the date that rule was born.
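The capture path described in this pillar (record a learning, deduplicate by hash, promote a pattern once it recurs in three or more distinct sessions) can be sketched in a few lines of Python. This is an illustration of the mechanism as the text describes it, not code from the pierre-os repository; the tuple layout, the normalization step, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

PROMOTION_THRESHOLD = 3  # "three or more times across distinct sessions"


def fingerprint(text):
    """Hash a normalized learning so trivially different copies deduplicate."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ingest(learnings):
    """learnings: iterable of (session_id, pattern_slug, text).

    Returns (entries, promoted): the deduplicated entries keyed by hash, and
    the slugs seen in at least PROMOTION_THRESHOLD distinct sessions.
    """
    entries = {}
    sessions_by_slug = defaultdict(set)
    for session_id, slug, text in learnings:
        # Deduplicate stored entries by content hash; keep the first wording.
        entries.setdefault(fingerprint(text), text)
        # Recurrence is counted per distinct session, not per entry.
        sessions_by_slug[slug].add(session_id)
    promoted = sorted(
        slug for slug, sessions in sessions_by_slug.items()
        if len(sessions) >= PROMOTION_THRESHOLD
    )
    return entries, promoted
```

Counting distinct sessions instead of raw occurrences keeps one noisy session from promoting a pattern on its own.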

Pillar Four — run this on your own work and watch the number land

This is also the qualifier that binds §10 Layer 1.

cd <your-main-project>
git log --since="2 weeks ago" --grep="don't\|use the existing\|no, " --oneline | wc -l

That number is the count of times, in the last fortnight, that your commit messages recorded a moment you had to correct the agent’s reflex with phrasing the agent then forgot. It is not a number I produced. It lives in your repository whether or not you ever read this page. If the count is above three, the audit will measure the same pattern on Day 14 against a richer event stream and put a baseline on it.

That is the count. Look at it. If it landed above three, the second thing you feel is not analytical. It is the same feeling you had in §1, except now it has a digit attached.

Call that number N. Forward it to yourself before you scroll. On Day 18, the Architecture Blueprint I hand you opens with Section 1 — and Section 1’s header names at least N distinct recurring correction patterns you exhibited during the 14-day window. If it surfaces fewer, refund. Your own git log becomes the contract.
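If you would rather see which commits the grep counted, not just how many, the same check can be approximated in Python. This is a sketch under assumptions: the function names are made up, and it searches whole commit messages with Python's `re` module, so edge cases may differ slightly from `git log --grep`. Like the grep, it is a case-sensitive substring match, so treat the count as a rough signal.

```python
import re
import subprocess

# The three phrases from the Pillar Four grep, as one case-sensitive regex.
CORRECTION_RE = re.compile(r"don't|use the existing|no, ")


def matching_commits(messages):
    """Return the commit messages that contain at least one correction phrase."""
    return [m for m in messages if CORRECTION_RE.search(m)]


def count_corrections(messages):
    """Each commit counts once, however many phrases it contains."""
    return len(matching_commits(messages))


def recent_commit_messages(repo=".", since="2 weeks ago"):
    """Read full commit messages from git, one string per commit."""
    out = subprocess.run(
        # %B is the raw message body; %x00 emits a NUL byte as a separator.
        ["git", "-C", repo, "log", f"--since={since}", "--format=%B%x00"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [m.strip() for m in out.split("\x00") if m.strip()]
```

Run `count_corrections(recent_commit_messages("path/to/your/project"))` for the number, or print `matching_commits(...)` to read the commits behind it.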

Pillar Five — measure how long it takes you to get the agent to working context, three times this week

# In your terminal, before each session-start:
date +"%H:%M:%S"
claude   # or your usual launch command
# Then, the moment the agent finally has the working context it needs:
date +"%H:%M:%S"

Subtract. Three sessions. Take the median. That number, minutes-to-working-context, is your private version of an inverse Conversion KRI: it is how much of each session you spend re-establishing context the previous session should have remembered. If the median is over four minutes, the audit will surface the same leak in wall-clock hours per month.
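The subtraction and the median can be done by hand, but a short Python sketch avoids arithmetic slips. It assumes you noted each pair of timestamps in the `HH:MM:SS` format printed by the `date` command above; the function names are illustrative.

```python
from datetime import datetime, timedelta
from statistics import median

TIME_FORMAT = "%H:%M:%S"  # matches the output of: date +"%H:%M:%S"


def minutes_between(start, end):
    """Minutes elapsed between two same-day (or overnight) clock readings."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    if delta < timedelta(0):
        # The session crossed midnight, so the end time belongs to the next day.
        delta += timedelta(days=1)
    return delta.total_seconds() / 60


def median_minutes_to_context(sessions):
    """sessions: list of (start, end) timestamp pairs, one per session."""
    return median(minutes_between(start, end) for start, end in sessions)
```

For instance, three sessions that took 3.5, 6, and 5 minutes give a median of 5 minutes.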

Pillar Four is the count in your git log. Pillar Five is the wall-clock cost on your own laptop. Both numbers belong to you before you spend a dollar. The first one binds the guarantee. The second one calibrates the urgency.

Underneath all five: ninety-one peer-reviewed papers, consolidated 2025–2026. Every KRI on this page traces to at least three of them.

11.5 The One Belief

Write The Sentence You Walk In With

It becomes the header on what you walk out with.

Before you reserve a slot, the ThriveCart form asks you one question that no agency intake form has ever asked you. It is one sentence long.

“What is the one belief you cannot currently prove about your own AI work?”

You type the answer in your own words. The form gives you no template, no dropdown, no example. Most operators type some version of “I think the AI thing is working but I cannot show it.” Some type “I think I’m in Calibrated Trust but I’m actually in Over-Trust.” Some type “I think most of my corrections compound but the count is closer to zero.” The page’s One Belief — you cannot compound what no instrument has measured — is mine; your version is yours.

On Day 14, when the dashboard PDF lands in your inbox, your sentence is the header above the four KRIs. Not “KRI Baseline Report.” Your sentence. In quotation marks. With your name beneath it.

Because the four numbers under that header are not abstract. They are the answer to the exact question you asked yourself before the cart opened. The Recurrence figure is the proof or the disproof of the belief you typed. The Conversion figure is the proof or the disproof of the belief you typed. The data does not arrive headless; it arrives indexed by your own thesis.

This is the difference between a generic agency report and an instrument. A generic report tells you what it found. An instrument tells you what you asked about and then prints the number that answers it. The header is the question; the page below is the answer. You authored both ends of the document before I touched any of the data.

12 The Letter

Your Future Self Already Wrote

The next hundred or so words are not in my voice.

Hey. It is November 2027 here. I am writing from a session that already knows. The em-dash standard you taught the agent eighteen months ago is encoded in a skill called house-style.md, propagated across every session, and it has not been re-taught since the spring of last year. The KRIs have a year of green bars. The Slack question about whether the AI thing is working stopped being asked in October because the answer became obvious from the dashboard you can hand a co-founder. I do not remember what I spent the $2,500 on, but I remember exactly what I spent the year before it on, and it cost more.

— you, on the timeline where you clicked the button.

That document is not abstract. It is the only thing on this page you cannot get any other way.

The cost of the timeline where it gets written is $2,500. The cost of the timeline where it does not is two more years of model subscriptions, two more years of unbilled correction hours, and the slow erosion of believing you are good at this work — because the work keeps not getting better and you keep being the only thing keeping it on-rails.

You are good at this work. That is exactly the reason it is time to stop being the only system around your agent.

13 The Offer

Here Is What Happens When You Click

You click the button below. ThriveCart opens in a new tab. You enter your card. You type your One Belief into the single intake field (§11.5). You pay $2,500 — one time, no subscription, no upsell, no Build tier (there isn’t one), no Retainer (there isn’t one). Inside ten minutes you receive an email from me with the install command, the privacy-filter source link, and a one-page Getting Started doc.

You install in five minutes. You work normally for fourteen days. On Day 14 you run one command. On Day 18 we are on a call for sixty minutes. By the time the call ends, the Quick Win is on your machine, the blueprint is on your laptop (Section 1 header bound to your Pillar Four count, your One Belief printed across the cover page), the recording is downloading, and you have started compounding.

The Agent Audit

Two-Week Instrumentation · Sixty-Minute Walkthrough

One-time engagement · No subscription · No upsell

$2,500

KRI Baseline Report headed by your own One Belief sentence
Intelligence Loop Readiness Assessment across five scored dimensions
Custom Architecture Blueprint whose Section 1 is bound to your §11 Pillar Four count
Branded PDF · Recording of the 60-minute call
Quick Win artifact installed on Day 18
MCP Integration Map · 30-Day Re-Measurement
Stacked guarantee bound to numbers your terminal computed

Cart opens Tuesday, September 1, 2026

Join the waitlist below — first to know when the cart opens.

Cart opens Tuesday, September 1, 2026. The first cohort is sized to the number of audits I can read and annotate by hand per fortnight. There is no countdown timer on this page because there is no fake scarcity. There is only the actual constraint, named honestly: I read every audit myself.

Join the audit waitlist

Drop your email and you’ll be first in line when the cart opens on Tuesday, September 1, 2026. No drip sequence, no upsell — one email when the seats are live.

14 Questions

That Come Up Before The Click

Will the binary slow my machine down? Negligible overhead. Event-driven hooks, not polling. The reviewer in §5 measured on_session_stop at ≤ 18ms on a public test corpus and approved it; the binary is open-source so you can re-measure it on your own corpus the same way.
What about privacy and my client data? Metadata and aggregate metrics only. Not your code. Not your prompts. Not your conversations. The filter is open-source: you read the review in §5, you clone it in §11. The upload step is opt-in; the diff is written to disk first.
I am not a developer. Does this apply? Yes. The KRIs measure how you use the agent. None require code. Skip the dotnet test step in §11; the other four pillars still hold. The §11 Pillar Four grep adapts to whatever repo holds your working artifacts: copy, prompts, research notes, contracts, anything that ships through git.
What if my Pillar Four count comes back below three? Close the tab. The audit is for operators whose own git log already shows the pattern this fortnight. If yours doesn’t, you may already be in the band where the instrument is unnecessary. I am not selling you something your data says you don’t need.
What if the audit shows I am already in Calibrated Trust? Then you walk into your next co-founder, client, or board meeting with a Trust Calibration plot, a measured KRI baseline with trend lines, and a custom blueprint for the ceiling above this one. Most operators cannot produce any of that. You can.
What if my One Belief turns out to be wrong on Day 14? That is the most valuable possible outcome. You typed the belief; the data answered it; the answer was no. You now know which assumption was load-bearing for your year of work, and that assumption is the one the blueprint’s Section 1 will name first. A belief disproved by your own data is worth more than a belief confirmed.
Can I share the report with my team? Yes. The branded PDF is built for sharing. The One Belief header is yours; you can strip it before sharing if it is too personal.
Is there a Build tier or a Retainer? No. The Audit is the only consulting tier.

From Pierre

The Last Thing

You spent the last twelve months learning how to use the tool. You did the work. You got good. You earned the ceiling you are now standing under.

The next twelve are about building the system around the tool.

You cannot compound what no instrument has measured. The Audit is the instrument that tells you exactly where you are, and the blueprint that tells you exactly what to build next.

Three things on this page are bound to numbers your terminal computes, not promises I make. The guarantee’s first layer is bound to your Pillar Four count. The blueprint’s first section is bound to that same count. The dashboard PDF’s headline is bound to the One Belief you type before paying. Every load-bearing claim on the deliverable side of this page is qualified by data you authored, not by language I asserted. That is what an instrument-based offer looks like.

September 1. $2,500. Cart open.

Join the audit waitlist →

P.S. — Two voices on this page are not mine. §5 is a code-review comment a public reviewer wrote on a public pull request. §12 is a letter from your future self that does not yet exist. One sentence on this page is also not mine, but it is not anyone else’s either — it is the sentence you will type into the intake form, and it will be the header on the Day-14 dashboard PDF that comes back to you. Everything between those voices is verifiable in a second tab before you spend a dollar — three repositories, one Pierre OS lore entry you can read by ID, one rule quoted verbatim from line 323, and two commands you run on your own machine. Run them. Type your sentence. Then come back and click.

Excelsior,
Pierre
Founder, Curio Chat Academy