Stop measuring token costs, start measuring outcomes

One silent error can wipe out the savings from a dozen clean runs. This is why the question worth asking is not whether agents work. It is whether anyone is measuring both sides of the arithmetic.

The technology changes. The arithmetic does not.

01 · The arithmetic that doesn't change

If the value per task is lower than the cost per task, you stop running the task. That is true of every automation decision ever made. It was true of the first spreadsheet macro, and it is true of the most advanced agentic system running in production today.

Most firms deploying agentic tools can tell you the cost per task to the penny, because the token cost arrives on an invoice at the end of the month. The value per task is harder to see. It sits in hours that were never billed, in work that did not need a second pass, in errors that were never made.

No invoice arrives for errors that never happened.

Until a firm measures both sides, the comparison that decides whether automation pays cannot happen.
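The decision rule above fits in a few lines. What follows is a minimal sketch, not a costing model: the class name, field names, and dollar figures are illustrative assumptions, and the only logic it carries is the comparison the text describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEconomics:
    """Both sides of the arithmetic for one automated task (hypothetical names)."""
    value_per_task: float  # what a successful run is worth, in dollars
    cost_per_task: float   # what a run costs to deliver, in dollars

    @property
    def net_per_task(self) -> float:
        return self.value_per_task - self.cost_per_task

    def keep_running(self) -> bool:
        # The rule from the text: stop when value is lower than cost.
        return self.value_per_task >= self.cost_per_task


# Placeholder figures for illustration, not measurements.
profitable = TaskEconomics(value_per_task=16.0, cost_per_task=1.0)
unprofitable = TaskEconomics(value_per_task=0.5, cost_per_task=1.0)

print(profitable.keep_running(), profitable.net_per_task)
print(unprofitable.keep_running(), unprofitable.net_per_task)
```

The sketch cannot be filled in from the invoice alone. The `value_per_task` field has to come from somewhere else.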

02 · Where the value hides

Accounts payable is a useful place to see this clearly, because the value side is unusually easy to price.

Take invoice processing. A finance team receives an invoice, extracts the data, matches it to a purchase order, routes it for approval, and posts it to the ERP. Done by hand, the fully loaded cost of that work sits somewhere between $12 and $22 per invoice once you count labour, approval routing, and the time spent correcting the inevitable errors. Industry benchmarks have held in that range for years. That number is your value per task. It is what a successful autonomous run is worth, because it is the cost you no longer pay.

03 · How Gysho closes the gap

Knowing the value per task is only half the equation. The other half is knowing, with evidence, that the agent will actually capture that value before it ever touches a live invoice.

At Gysho, we do not deploy agents on promise. We deploy them on a tested, measured convergence pattern that separates what an agent can reliably own from what still requires human judgement.

First, we build a gold-standard test stack. Before the agent processes a single production invoice, we assemble a representative test set drawn from real documents: different vendors, formats, currencies, edge cases, and PO match scenarios. This is the benchmark that defines success.

Next, we run parallel automated test cycles at scale. The agent configuration is iterated through multiple pipeline variations, prompt strategies, and validation rules. Each variation is run against the gold-standard set autonomously. We are not looking for a single lucky clean run. We are measuring consistency: how often the extraction is complete, the PO match is correct, the routing decision is accurate, and the ERP post is valid.
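One way to turn those four checks into a consistency figure is sketched below. This is an illustration under assumptions, not Gysho's test harness: the `RunResult` record, its field names, and the toy data are all hypothetical. It scores each check separately and also reports the share of runs that passed every check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent run against one gold-standard invoice (hypothetical)."""
    extraction_complete: bool
    po_match_correct: bool
    routing_accurate: bool
    erp_post_valid: bool

    @property
    def clean(self) -> bool:
        # A run counts as clean only if every check passes.
        return all((self.extraction_complete, self.po_match_correct,
                    self.routing_accurate, self.erp_post_valid))


def consistency(results: list[RunResult]) -> dict[str, float]:
    """Pass rate per check, plus the share of runs that passed all four."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    return {
        "extraction": sum(r.extraction_complete for r in results) / n,
        "po_match": sum(r.po_match_correct for r in results) / n,
        "routing": sum(r.routing_accurate for r in results) / n,
        "erp_post": sum(r.erp_post_valid for r in results) / n,
        "clean": sum(r.clean for r in results) / n,
    }


# Toy data: ten runs, one with a wrong PO match, one with an invalid ERP post.
results = [RunResult(True, True, True, True)] * 8
results += [RunResult(True, False, True, True),
            RunResult(True, True, True, False)]

scores = consistency(results)
print(scores)
```

In this toy set each individual check passes at least 90 percent of the time, yet only 80 percent of runs are clean end to end. Scoring the whole run, not each step in isolation, is what exposes that gap.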

Then we converge on a threshold. The configuration does not go to production until it hits a target consistency threshold (typically 90 percent or higher) against that test stack. Below that threshold, the economics do not hold; more than one failure in ten invoices erodes the margin too fast. The testing process itself runs in the background, often for extended periods, until the numbers prove the agent is ready.
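The gate itself is simple to express. The sketch below assumes each configuration has already been scored as a count of clean runs out of total runs; the function names and the configuration names are hypothetical, and the 0.90 default simply mirrors the typical threshold mentioned above.

```python
from typing import Optional


def ready_for_production(clean_runs: int, total_runs: int,
                         threshold: float = 0.90) -> bool:
    """True when the measured clean-run rate meets the target threshold."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return clean_runs / total_runs >= threshold


def pick_configuration(scores: dict[str, tuple[int, int]],
                       threshold: float = 0.90) -> Optional[str]:
    """Best-scoring configuration that clears the threshold, or None."""
    passing = {
        name: clean / total
        for name, (clean, total) in scores.items()
        if ready_for_production(clean, total, threshold)
    }
    return max(passing, key=passing.get) if passing else None


# Hypothetical results: (clean runs, total runs) per configuration.
scores = {
    "baseline": (172, 200),           # 86.0 percent, below the gate
    "two_pass_extract": (181, 200),   # 90.5 percent
    "strict_validation": (187, 200),  # 93.5 percent
}

print(pick_configuration(scores))
print(pick_configuration({"baseline": (172, 200)}))
```

Returning `None` when nothing clears the gate is deliberate in this sketch: the absence of a passing configuration is itself the result, and testing continues.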

Finally, we define the hand-off. Anything the test cycles identify as outside the consistency threshold (ambiguous vendor details, non-standard line items, PO mismatches that fall outside validated rules) is not left to the agent's best guess. It is automatically flagged and routed to a human operator. The agent handles production volume; the human handles judgement at the edge. That boundary is not an afterthought. It is engineered into the workflow from the start, giving the finance team both control and an audit trail for every decision the system makes.
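A hand-off rule of this kind might look like the sketch below. It is an assumption-laden illustration: the `Invoice` flags, the `Decision` record, and the in-memory `audit_log` are hypothetical stand-ins for whatever the real workflow uses. The point is that the boundary is explicit code, and every decision leaves a record with its reasons.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    AGENT = "agent"
    HUMAN = "human"


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    vendor_resolved: bool        # vendor details matched unambiguously
    standard_line_items: bool    # line items fit the validated formats
    po_match_within_rules: bool  # PO match falls inside validated rules


@dataclass(frozen=True)
class Decision:
    invoice_id: str
    route: Route
    reasons: list[str] = field(default_factory=list)


def route_invoice(inv: Invoice) -> Decision:
    """Send anything outside the validated boundary to a human operator."""
    reasons = []
    if not inv.vendor_resolved:
        reasons.append("ambiguous vendor details")
    if not inv.standard_line_items:
        reasons.append("non-standard line items")
    if not inv.po_match_within_rules:
        reasons.append("PO mismatch outside validated rules")
    route = Route.HUMAN if reasons else Route.AGENT
    return Decision(inv.invoice_id, route, reasons)


# Every decision is appended to the log, whichever way it is routed.
audit_log: list[Decision] = []
for inv in [
    Invoice("INV-001", True, True, True),
    Invoice("INV-002", False, True, False),
]:
    audit_log.append(route_invoice(inv))

for decision in audit_log:
    print(decision.invoice_id, decision.route.value, decision.reasons)
```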

This changes the question from "Will the agent work?" to "What, exactly, has the agent proven it can do, and what is explicitly reserved for human review?" When you can answer that, you have control. When you cannot, you are experimenting with your general ledger.

04 · When the run fails

Now put a converged agent on the invoice-processing task from section 02. A clean run, where the agent extracts the fields, matches the PO, and posts without a human touching it, costs well under a dollar in tokens. Against $12 to $22 of avoided cost, the return is not a close call. It is the easiest decision a finance leader will make this year.

But this math only holds when the agent succeeds.

Without the test stack and the threshold, the agent pulls the wrong total, or the PO match does not resolve, and the error lands in the ERP unnoticed. Now you are not comparing under a dollar of tokens against roughly $16 of avoided labour. You are comparing the tokens, plus the retries, plus the roughly $53 it costs to find and fix a single posted error, plus the downstream cost of a payment that went out wrong.

One silent failure can wipe out the savings from a dozen clean runs. The unit economics did not change. Your measurement of them did, the moment you started counting outcomes instead of attempts.

Scenario	Cost	Value
Clean run	<$1 tokens	$12–22 saved
Failed run (unmeasured)	<$1 + $53 fix + downstream	Negative
Failed run (with hand-off)	<$1 + human review at $8–12	Controlled, visible

The third row is the one most deployments miss. When the agent knows its own boundary and escalates, the failure becomes a known, bounded cost, not a silent liability.
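The three rows can be worked through with the figures quoted in this section. The sketch below is a rough calculation under stated assumptions: token cost is rounded up to $1, retry costs are left out, a failed run is assumed to capture no value, and the downstream cost, which the text does not price, is a parameter whose example value is hypothetical.

```python
import math

# Figures quoted in the text. Token cost is rounded up to its stated ceiling.
TOKEN_COST = 1.00                      # a clean run costs "well under a dollar"
VALUE_LOW, VALUE_HIGH = 12.00, 22.00   # avoided manual cost per invoice
FIX_COST = 53.00                       # finding and fixing one posted error
REVIEW_LOW, REVIEW_HIGH = 8.00, 12.00  # human review after a hand-off


def clean_run_net(value: float) -> float:
    """Net saving from one clean autonomous run."""
    return value - TOKEN_COST


def silent_failure_cost(downstream: float = 0.0) -> float:
    """Cost of one unnoticed failure; downstream cost is an assumption."""
    return TOKEN_COST + FIX_COST + downstream


def handed_off_cost(review: float) -> float:
    """Cost of one run the agent escalates to a human."""
    return TOKEN_COST + review


def clean_runs_to_offset(value: float, downstream: float = 0.0) -> int:
    """How many clean runs it takes to pay for one silent failure."""
    return math.ceil(silent_failure_cost(downstream) / clean_run_net(value))


print(clean_run_net(VALUE_LOW), clean_run_net(VALUE_HIGH))        # 11.0 21.0
print(silent_failure_cost())                                      # 54.0
print(handed_off_cost(REVIEW_LOW), handed_off_cost(REVIEW_HIGH))  # 9.0 13.0
print(clean_runs_to_offset(16.0))                                 # 4
print(clean_runs_to_offset(16.0, downstream=120.0))               # 12
```

At $16 of value per invoice, the fix cost alone cancels four clean runs. Add a hypothetical $120 of downstream cost and one silent failure cancels twelve. A handed-off run, by contrast, costs between $9 and $13 and stops there.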

05 · What most deployments miss

Token tracking tells you what you spent. It tells you nothing about whether the task succeeded. A system that completes 95 percent of runs cleanly and a system that completes 70 percent can show the same token bill and have completely different economics. The difference only appears when you track outcomes with the same discipline you track cost.
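A short calculation makes the gap concrete. The sketch below assumes every failure is silent and costs the $53 fix quoted earlier, takes $16 as the value of a clean run and $1 as the token cost per run, and ignores retries and downstream costs. The function names are illustrative.

```python
def expected_net_per_run(success_rate: float, value: float = 16.0,
                         token_cost: float = 1.0,
                         failure_cost: float = 53.0) -> float:
    """Expected net value of one run, weighting clean and failed outcomes."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    clean = success_rate * (value - token_cost)
    failed = (1.0 - success_rate) * (token_cost + failure_cost)
    return clean - failed


def token_bill(runs: int, token_cost: float = 1.0) -> float:
    """What the invoice shows: spend, with no view of outcomes."""
    return runs * token_cost


runs = 1000
for rate in (0.95, 0.70):
    net = expected_net_per_run(rate)
    print(f"success rate {rate:.0%}: token bill ${token_bill(runs):,.0f}, "
          f"expected net per run ${net:,.2f}")
```

Both systems show a $1,000 token bill for a thousand runs. Under these assumptions the first nets about $11.55 per run and the second loses about $5.70 per run.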

Most teams skip the test-stack phase. They prompt, they eyeball a few results, and they switch on the tap. That is how you end up with a production system that looks automated but is actually generating a hidden queue of corrections, rework, and quietly compounding errors. The token bill is low. The true cost is invisible until reconciliation.

06 · The standard we hold ourselves to

So the standard we hold ourselves to at Gysho is simple. Before we put an agent into production, we want three numbers on the table: what a successful task is worth, what consistency threshold the agent has proven against a representative test set, and what the system actually costs to deliver that success, including the human hand-off for everything that sits outside the threshold.

If we cannot state all three, we are not deploying an automation. We are running an experiment.

Most firms are still only watching the cost side. The ones who win the next few years will be the ones who learned to measure the value, prove the accuracy, and design the boundary where automation ends and human judgement begins.

07 · Your next step

If you're deploying agents today, audit one workflow. Write down the value per task, the cost per successful completion, and the consistency threshold your agent has actually proven against real data. If you can't fill in all three numbers, you're not ready for production.
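The audit can be as small as a record with three fields and a check for blanks. The sketch below is one possible shape, with hypothetical names and placeholder figures. It also shows cost per successful completion computed as total spend, tokens plus human review, divided by the number of tasks that actually succeeded.

```python
from dataclasses import dataclass
from typing import Optional


def cost_per_successful_completion(total_cost: float, successes: int) -> float:
    """Total spend divided by the tasks that actually succeeded."""
    if successes <= 0:
        raise ValueError("no successful completions to divide by")
    return total_cost / successes


@dataclass
class WorkflowAudit:
    workflow: str
    value_per_task: Optional[float] = None
    cost_per_success: Optional[float] = None
    proven_consistency: Optional[float] = None  # measured against real data

    def missing(self) -> list[str]:
        fields = {
            "value per task": self.value_per_task,
            "cost per successful completion": self.cost_per_success,
            "proven consistency threshold": self.proven_consistency,
        }
        return [name for name, number in fields.items() if number is None]

    def ready_for_production(self) -> bool:
        # The rule from the text: all three numbers, or not ready.
        return not self.missing()


# Placeholder figures: 200 runs at $1 of tokens, 14 hand-offs at $10 of review.
audit = WorkflowAudit("invoice processing", value_per_task=16.0)
print(audit.ready_for_production(), audit.missing())

audit.cost_per_success = cost_per_successful_completion(200 * 1.0 + 14 * 10.0, 186)
audit.proven_consistency = 186 / 200
print(audit.ready_for_production(), audit.missing())
```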