Pulse | Premium | June 3, 2026 | 7 min read

The Agent Audit System That Tells Me When Something Breaks Before a Client Does

Most agent failures are silent. Here's the monitoring layer I built to catch them first.

At some point, every automated workflow fails in a way you don't catch for hours. Maybe days. A client sends a follow-up email wondering why they haven't heard back. You dig in and realize the agent that was supposed to handle onboarding stopped firing three days ago because an API response shape changed and nobody noticed.

That's not a technical problem. That's a trust problem. And once it happens a few times, you start second-guessing every workflow you've built.

I built an audit layer specifically to solve this. Not just error logging, not just uptime checks. An actual system that tells me when agent behavior drifts, when outputs stop looking right, and when something is technically "working" but producing garbage.

Here's how it's structured.

The Problem With Standard Error Logging

Most people set up basic error logging and call it monitoring. A webhook fails, you get a Slack ping. A timeout hits, you see it in the logs. That's fine for catching hard failures.

But agent workflows fail softly all the time. The agent runs. It completes. It returns a 200. And the output is completely wrong because the input data was malformed, the context was stale, or the model responded in a format the downstream step couldn't parse cleanly.

Standard logging doesn't catch that. You need output validation baked into the workflow itself, not just error trapping around the execution.

There are three categories of failure I care about:

  • Hard failures: Execution errors, timeouts, API failures. Easy to catch.
  • Soft failures: The workflow completes but produces wrong output. Hardest to catch.
  • Drift failures: The workflow was producing good output, then gradually started producing worse output over time. Almost impossible to catch without baselines.

The audit system has to handle all three.

The Audit Layer Architecture

Every agent workflow I run now has four components layered on top of the core logic: an input validator, an output scorer, a run logger, and a comparison engine.

Input Validator

Before any agent runs, the input gets checked against a schema. Not just type checking. Semantic checking. If an agent is supposed to summarize a client intake form and that form arrives with three of five required fields empty, I want to know before the agent produces a half-baked summary that goes out to someone.

I use a lightweight JSON schema validation step in n8n that runs as the first node in every agent workflow. If it fails, the workflow halts and a notification fires. The agent never sees bad input. This alone kills probably 30 percent of the soft failure scenarios I used to deal with.
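A minimal sketch of the "three of five required fields empty" check, written in plain Python rather than as an n8n node. The field names and the `validate_intake` helper are hypothetical, chosen only for illustration. The point is that a field counts as missing when it is absent, null, or just whitespace, which a type-only check would let through.

```python
from typing import Any

# Hypothetical required fields for a client intake form (not from the article).
REQUIRED_FIELDS = ["client_name", "email", "company", "goals", "budget"]


def validate_intake(form: dict[str, Any], max_missing: int = 0) -> tuple[bool, list[str]]:
    """Return (ok, missing), where missing lists fields that are absent, None, or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = form.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return len(missing) <= max_missing, missing


form = {"client_name": "Ada", "email": "ada@example.com", "company": " ", "goals": "", "budget": None}
ok, missing = validate_intake(form)
if not ok:
    # In a real workflow this is where the run would halt and a notification would fire.
    print(f"Halting: missing required fields {missing}")
```

Returning the list of missing fields, not just a boolean, makes the notification useful on its own: you can see what was wrong with the input without opening the run.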

Output Scorer

This is the piece most people skip and it's the most valuable. After an agent produces output, a second lightweight model pass evaluates that output against a rubric before it gets used downstream.

The rubric is specific to the workflow. For a client-facing email draft, I check: Does it address the right person? Is the length within expected range? Does it contain the required sections? Does it include anything that looks like a template placeholder that didn't get filled in?

For a data extraction workflow: Did it return the expected number of fields? Are the values in expected formats? Does anything look like hallucinated data?

The scorer doesn't need to be a heavy model. I run most of these checks with GPT-4o mini because speed matters here. The scorer returns a pass/fail plus a confidence score. Anything below 0.85 gets flagged for human review before it moves forward.

This catches the silent failures that would otherwise make it all the way to a client.
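Some of the rubric questions for an email draft are mechanical enough to check without a model at all. The sketch below shows what those deterministic checks plus the 0.85 routing rule could look like. The placeholder patterns, word limits, and function names are assumptions for illustration, and the model-based scorer itself is left out because it depends on your provider and prompt.

```python
import re

REVIEW_THRESHOLD = 0.85  # from the article: anything below this goes to human review

# Assumed placeholder styles: {{name}}, [NAME], <name>. Adjust to match your templates.
PLACEHOLDER_RE = re.compile(r"\{\{.*?\}\}|\[[A-Z_ ]+\]|<[a-z_]+>")


def rubric_checks(draft: str, recipient: str, min_words: int = 50, max_words: int = 300) -> dict[str, bool]:
    """Cheap deterministic checks that can run before or alongside the model scorer."""
    word_count = len(draft.split())
    return {
        "addresses_recipient": recipient.lower() in draft.lower(),
        "length_in_range": min_words <= word_count <= max_words,
        "no_unfilled_placeholders": PLACEHOLDER_RE.search(draft) is None,
    }


def route(passed: bool, confidence: float) -> str:
    """Decide whether an output moves downstream or waits for a human."""
    if not passed or confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "proceed"
```

A failed deterministic check can be treated as an automatic `passed=False`, which saves a model call on outputs that are already known to be broken.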

Run Logger

Every workflow execution gets logged to a simple table in Supabase with five fields: workflow ID, timestamp, input hash, output hash, and scorer result. That's it. I don't store the full content in the log table, just the hashes. The actual content goes to a separate storage bucket and the log just points to it.

This keeps the log table fast and queryable. When I want to investigate a specific run, I pull the hash, find the content, and look at it. When I want to see patterns across runs, I query the log table.

The input hash is useful for deduplication. If the same input fires twice, I can spot it immediately instead of wondering why a client got two emails.
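Here is one way the five-field log record and the duplicate check could be sketched. The article does not say which hash is used, so SHA-256 over canonical JSON is an assumption, as are the `RunLog`, `make_log`, and `is_duplicate` names. Writing the row to Supabase is omitted.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def content_hash(payload: dict) -> str:
    """Stable SHA-256 over a JSON payload (sorted keys, so key order does not matter)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class RunLog:
    workflow_id: str
    timestamp: datetime
    input_hash: str
    output_hash: str
    scorer_result: float


def make_log(workflow_id: str, agent_input: dict, agent_output: dict, score: float) -> RunLog:
    """Build the log row; the full content would be stored elsewhere, keyed by hash."""
    return RunLog(
        workflow_id,
        datetime.now(timezone.utc),
        content_hash(agent_input),
        content_hash(agent_output),
        score,
    )


def is_duplicate(new: RunLog, recent: list[RunLog]) -> bool:
    """True if the same workflow has already processed this exact input."""
    return any(r.workflow_id == new.workflow_id and r.input_hash == new.input_hash for r in recent)
```

Sorting keys before hashing matters: without it, two payloads with identical content but different key order would hash differently and the duplicate check would miss them.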

Comparison Engine

This is how I catch drift. Once a week, a separate workflow pulls the last 50 runs for each agent and compares output quality scores over time. If the average scorer result for a workflow drops more than 10 percent week over week, I get a report.

Drift usually means one of three things: the model got updated and behavior changed, the input data quality degraded, or the prompt is getting stale relative to how the use case has evolved. The comparison engine doesn't tell me which one. It just tells me something changed. Then I go look.

Usually takes five minutes to find the root cause once I know where to look.
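The week-over-week comparison comes down to a small calculation. This sketch assumes "drops more than 10 percent" means a relative drop in the mean scorer result between two windows of runs; the function name and return shape are illustrative.

```python
from statistics import mean

DRIFT_DROP = 0.10  # from the article: flag a drop of more than 10 percent


def drift_report(previous: list[float], current: list[float], max_drop: float = DRIFT_DROP) -> dict:
    """Compare mean scorer results between two non-empty windows of runs."""
    prev_avg, curr_avg = mean(previous), mean(current)
    # Relative drop: what fraction of last week's average was lost this week.
    relative_drop = (prev_avg - curr_avg) / prev_avg if prev_avg > 0 else 0.0
    return {
        "previous_avg": prev_avg,
        "current_avg": curr_avg,
        "relative_drop": relative_drop,
        "drifted": relative_drop > max_drop,
    }
```

For example, a fall from an average of 0.90 to 0.78 is a relative drop of about 13 percent and would be flagged, while 0.90 to 0.85 (about 6 percent) would not.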

The Notification Stack

Notifications have to be tiered or you start ignoring them. I have three levels.

Level 1, immediate ping to Slack: Hard failure, output scorer below 0.7, same workflow failing three times in a row. These need eyes on them within an hour.

Level 2, daily digest: Scorer results between 0.7 and 0.85 that got routed to human review. Workflows that ran but had input validation warnings that didn't halt execution. These I look at each morning.

Level 3, weekly report: Drift detection results. Volume stats. Any workflows that didn't fire at all during the week, which sometimes means a trigger broke silently. This one has caught more bugs than anything else in the stack.
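The per-run part of this tiering can be written as a single function. This is a sketch, not the actual routing logic: the function name is hypothetical, and treating a score of exactly 0.7 as Level 2 is an assumption about where the boundary falls. Level 3 is not included because it is computed weekly across runs, not per run.

```python
from typing import Optional


def notification_level(hard_failure: bool, score: Optional[float], consecutive_failures: int = 0) -> Optional[int]:
    """Map one run to a tier: 1 = immediate Slack ping, 2 = daily digest, None = no alert."""
    if hard_failure or consecutive_failures >= 3:
        return 1
    if score is not None and score < 0.7:
        return 1
    if score is not None and score < 0.85:
        return 2
    return None
```

Keeping the thresholds in one place like this makes it harder for the Slack alert and the review queue to drift out of agreement about what 0.85 means.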

The weekly "didn't fire" check is underrated. A scheduled workflow that stopped running because a cron trigger got corrupted won't throw errors. It just won't run. Without explicitly checking for expected run frequency, you'll never know.
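A sketch of the expected-run-frequency check, assuming you can get the most recent run timestamp per workflow from the log table and that you keep a hypothetical mapping of how often each workflow should run. The `grace` multiplier is also an assumption, added so a slightly late run does not trigger an alert.

```python
from datetime import datetime, timedelta, timezone


def silent_workflows(
    last_run: dict[str, datetime],
    expected_every: dict[str, timedelta],
    now: datetime,
    grace: float = 1.5,
) -> list[str]:
    """Return workflows that never ran or whose last run is older than grace * interval."""
    overdue = []
    for workflow_id, interval in expected_every.items():
        seen = last_run.get(workflow_id)
        if seen is None or now - seen > interval * grace:
            overdue.append(workflow_id)
    return sorted(overdue)


now = datetime(2026, 6, 3, tzinfo=timezone.utc)
last_run = {"onboarding": now - timedelta(days=3), "digest": now - timedelta(hours=20)}
expected_every = {
    "onboarding": timedelta(days=1),
    "digest": timedelta(days=1),
    "weekly_report": timedelta(days=7),
}
print(silent_workflows(last_run, expected_every, now))
```

The loop iterates over the expected schedule, not over the logged runs, which is what lets it notice a workflow that has no log rows at all.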

Setting Up Baselines

The comparison engine only works if you have baselines. When I deploy a new workflow, I let it run for two weeks before I turn on drift detection. During those two weeks, I'm just collecting data and manually reviewing a sample of outputs to calibrate the scorer rubric.

Rushing this part is a mistake. If you set up drift detection on week one, you're comparing against a baseline that might not represent healthy behavior. Let the workflow stabilize first.

After two weeks, I look at the distribution of scorer results across all runs. If 90 percent of runs are scoring above 0.85, that's my baseline. I set the drift threshold at 10 percent below that. If scorer results start clustering lower, I want to know.
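The baseline arithmetic can be read more than one way. The sketch below takes one reading as an explicit assumption: the baseline is the share of runs scoring above 0.85, and the alert level is that share reduced by a relative 10 percent. If you read "10 percent below" as ten percentage points, change `drift_threshold` accordingly.

```python
def baseline_pass_share(scores: list[float], cutoff: float = 0.85) -> float:
    """Share of runs in the baseline window that scored above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)


def drift_threshold(baseline_share: float, tolerance: float = 0.10) -> float:
    """Alert level: the baseline share reduced by a relative tolerance (assumed reading)."""
    return baseline_share * (1 - tolerance)


# Two weeks of hypothetical scorer results: 90 runs above the cutoff, 10 below.
scores = [0.9] * 90 + [0.8] * 10
baseline = baseline_pass_share(scores)
threshold = drift_threshold(baseline)
```

With these numbers the baseline share is 0.90 and the alert level is about 0.81, so an alert would fire once fewer than roughly 81 percent of runs clear 0.85.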

The Human Review Queue

Anything the output scorer flags below 0.85 goes into a review queue. This is a simple Supabase table with a status column. New items show up as "pending." I look at them each morning, decide if they're actual problems or false positives, and mark them accordingly.

False positives inform rubric refinement. If I keep marking the same kind of flag as a false positive, the rubric is wrong and I need to update the scorer prompt. After about a month of tuning, most workflows have a false positive rate under 5 percent on flagged runs.
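Tracking that rate is a one-liner over the status column. In this sketch the status values other than "pending" are hypothetical ("confirmed" and "false_positive"), and the rate is computed over reviewed items only, so unreviewed flags do not dilute it.

```python
from typing import Optional


def false_positive_rate(statuses: list[str]) -> Optional[float]:
    """Fraction of reviewed items marked false_positive, or None if nothing is reviewed yet."""
    reviewed = [s for s in statuses if s != "pending"]
    if not reviewed:
        return None
    return reviewed.count("false_positive") / len(reviewed)
```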

The review queue also serves as a training dataset. When I eventually want to improve an agent, I have a labeled set of runs that shows exactly where it struggled.

What This Actually Costs to Run

The scorer model calls add cost. For a high-volume workflow, that's real money. For most workflows I run, the volume is low enough that the scorer adds maybe a few dollars a month per workflow. Worth it without question.

For higher-volume workflows, I batch the scoring. Instead of scoring every output in real time, I queue outputs and score them in batches of 20 every 15 minutes. This takes the scoring pass off the critical path, so it adds no latency to the workflow itself, and it still catches problems within a reasonable window.
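The batching itself is simple chunking. A minimal sketch, assuming queued outputs are held in a list and a scheduler (not shown) calls this every 15 minutes; the `batches` helper is illustrative.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 20  # from the article: batches of 20


def batches(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items; the last one may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


queued_outputs = [f"output-{i}" for i in range(45)]
for batch in batches(queued_outputs):
    # Each batch would be sent to the scorer in one pass.
    print(len(batch))
```

With 45 queued outputs this produces batches of 20, 20, and 5, so nothing waits for a full batch before being scored.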

The logging infrastructure is essentially free. Supabase free tier handles the volume for most of what I run. Storage for content is a few cents a month.

The Mindset Shift

Building this system changed how I think about shipping workflows. Before, I'd build something, test it manually a few times, and call it done. Now, a workflow isn't done until the audit layer is in place. The core logic and the monitoring are part of the same delivery.

That doubles the build time up front. It cuts the maintenance burden by probably 80 percent. For anything that touches a client, that trade-off is obvious.

If you're running a solo operation on agents, the thing that kills you isn't the complexity of the workflows. It's the time you spend firefighting invisible failures. Build the audit layer first and you get that time back.

Written by KZZY

47 Industries has been home since the beginning, from 3D printing operations to leading all software development across MotoRev, BookFade, and the 47 platform.
