Building an Evaluation Mechanism for Long-Running Ad Agents

Ad buying is a textbook sequential decision-making problem: budgets, bids, audiences, and creatives are adjusted over and over inside a non-stationary environment, where feedback is sparse, delayed, and noisy, and where the ground keeps shifting as platform traffic structure and competition evolve. Once you hand this over to an agent for the long haul, the bottleneck stops being the model and becomes a deceptively simple question: how do you decide whether a change actually made things better? That is evaluation.

A sound evaluation mechanism has to answer three things: how to pull signal out of the noise, how to govern a judge that is itself biased, and how to keep the evaluation surface from being gamed by the very thing it optimizes. This post walks through the research consensus on those three, then grounds them in a real system that re-evolves itself every single day.

1. Why a scalar reward is the wrong primitive

The intuitive move is to hand the agent a scalar reward — say ROAS — and let it maximize. In long-horizon, high-noise settings this almost always breaks.

Gao, Schulman, and Hilton's Scaling Laws for Reward Model Overoptimization (2210.10760) gives the clean empirical law: as you optimize harder against a proxy reward, the proxy score keeps climbing, while the true (gold) objective first improves, then peaks, and eventually degrades. This is Goodhart's law, made measurable. Anthropic's Sycophancy to Subterfuge (2406.10162) pushes further: in gameable environments, models can generalize from mild specification gaming all the way to tampering with their own reward. Reward hacking isn't a corner case; it is the predictable product of a misspecified signal.
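For reference, the functional forms that Gao et al. fit can be written roughly as below, as I read the paper. Here d is the square root of the KL divergence between the optimized policy and the initial policy, and the α and β coefficients are fitted per setup; check the original before relying on the exact form.

```latex
\begin{align*}
d &:= \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})} \\
R_{\mathrm{bo}n}(d) &= d\,(\alpha_{\mathrm{bo}n} - \beta_{\mathrm{bo}n}\, d) \\
R_{\mathrm{RL}}(d) &= d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)
\end{align*}
```

Under the best-of-n form, the gold score peaks at d = α / (2β) and falls beyond that point, which is the "climbs, then degrades" shape in one line.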

In ad buying this is especially sharp. What the business actually cares about (CPA, ROAS, payback period, order quality) are downstream outcomes: slow to return, noisy, distorted by attribution lag. So the system is forced to pre-filter strategies on upstream proxies like CTR and engagement. Letting upstream proxies stand in for downstream outcomes is a live instance of Goodhart: a "pseudo-win" pattern that merely inflates a local proxy metric gets sedimented as experience by an evaluation that can't self-correct, then replayed round after round.

So the problem isn't "how to compute a score." It is: how to decompose the signal, how to govern it, and how to keep it from being gamed.

2. A good evaluation signal: decompose the judgment, score the trajectory not the endpoint

Over the last two years agent-evaluation work has converged on two orthogonal directions, both pointing at one principle — don't summarize a judgment with a single scalar.

First, decompose the judgment into discrete, weighted criteria. OpenAI's HealthBench (2505.08775) grades multi-turn conversations against 48,562 physician-written, importance-weighted rubric criteria, using a model-based grader whose agreement with physicians approaches inter-physician agreement. Scale's Rubrics as Rewards (2507.17746) takes the same checklist-style rubric and uses it directly as an RL reward, beating a single Likert LLM-judge score by a wide margin in domains with no ground truth. Compressing a fuzzy, continuous judgment into a structured, interpretable, weighted set of discrete signals is the precondition for everything deterministic downstream.
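A minimal sketch of what "weighted, discrete criteria" can look like in code. It assumes a HealthBench-style aggregation (points of met criteria divided by the maximum achievable positive points, clipped to [0, 1]). The `Criterion` class, the `rubric_score` function, and the three ad-buying criteria are hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line. Negative points penalize undesirable behavior."""
    description: str
    points: int


def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Sum the points of met criteria, normalize by the best achievable
    positive total, and clip the result to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("one met/not-met verdict is required per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for one bid-adjustment decision.
rubric = [
    Criterion("Stays within the daily budget cap", 5),
    Criterion("Explains why the bid was changed", 3),
    Criterion("Resets the learning phase without need", -4),
]
score = rubric_score(rubric, [True, True, True])  # (5 + 3 - 4) / 8
```

Each criterion is a yes/no check, so the judge (human or model) never has to produce a free-floating number; the weighting lives in the rubric, where it can be reviewed.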

Second, evaluate the trajectory, not just the endpoint. Lightman et al.'s Let's Verify Step by Step (2305.20050) shows a process reward model (PRM, step-wise scoring) beats an outcome reward model (ORM, final answer only), and a PRM can localize exactly which step went wrong. Recent work carries this onto agent trajectories, scoring each step's "promise / progress" rather than only the terminal result. Mapped to ad buying: don't only stare at final ROAS — evaluate the intermediate decision chain (learning-phase state, the bid-adjustment path, the timing of budget reallocation) on its own, so you get signal earlier and can attribute failure to a specific action.
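To make the trajectory idea concrete, here is a small sketch that scores each step and reports the earliest weak one. It assumes per-step scores in [0, 1] are already available from some step-level judge, and it aggregates them by taking their product, which is one common choice rather than the only one. The step names, the 0.5 threshold, and `judge_trajectory` are hypothetical.

```python
from math import prod
from typing import NamedTuple, Optional


class TrajectoryVerdict(NamedTuple):
    score: float                    # product of per-step scores
    first_weak_step: Optional[str]  # earliest step below the threshold, if any


def judge_trajectory(steps: list[tuple[str, float]],
                     threshold: float = 0.5) -> TrajectoryVerdict:
    """Aggregate step-level scores and localize the first step that went wrong."""
    for name, p in steps:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score for step {name!r} must be in [0, 1]")
    score = prod(p for _, p in steps)
    first_weak = next((name for name, p in steps if p < threshold), None)
    return TrajectoryVerdict(score, first_weak)


# Hypothetical decision chain for one campaign.
trajectory = [
    ("exit_learning_phase", 0.9),
    ("raise_bid", 0.8),
    ("reallocate_budget", 0.3),
    ("pause_creative", 0.9),
]
verdict = judge_trajectory(trajectory)
```

An endpoint-only evaluation would report a single poor ROAS; the step-level view also says that the budget reallocation is where to look first.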

Beyond these two, three engineering disciplines are worth stating:

  • If it can be verified deterministically, don't ask a model. τ-bench (2406.12045) judges success by comparing the final database state to an annotated goal state; SWE-bench and WebArena use test suites or executable validators outright. Anything in ad buying with a hard definition (attribution events, budget constraints, compliance boundaries) should be anchored on a deterministic verifier, leaving the LLM judge only for the unverifiable residual.

  • Treat the LLM judge as a biased instrument. Zheng et al. (2306.05685) systematically document position bias, verbosity bias, and self-enhancement bias in LLM judges. If you use one, use a generative, rationale-citing, pairwise judge with position randomization and a swap-consistency check — not a bare scalar.

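  • Measure reliability, not single-shot success. τ-bench proposes pass^k, the probability that all k independent rollouts of the same task succeed, to measure a strategy's stability. Agents that look acceptable on a single pass drop sharply by pass^8 (the paper reports under 25% for gpt-4o on its retail domain). For a system you mean to run unattended, consistency matters far more than one high score.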
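The three disciplines above can be sketched together in a few lines. This is an illustration under assumptions, not any benchmark's actual code: `states_match` stands in for a state-comparison verifier, `swap_consistent_winner` checks both presentation orders of a pairwise judge (a stricter variant of position randomization), and `pass_hat_k` uses the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where c is the number of successes out of n rollouts. All function names and the toy judges are hypothetical.

```python
from math import comb
from typing import Callable, Optional


def states_match(final_state: dict, goal_state: dict) -> bool:
    """Deterministic verifier: success iff the end state equals the annotated goal."""
    return final_state == goal_state


def swap_consistent_winner(judge: Callable[[str, str], str],
                           a: str, b: str) -> Optional[str]:
    """Query a pairwise judge in both orders and keep the verdict only if it
    survives the swap. `judge(first, second)` returns "first" or "second"."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    if forward == "first" and backward == "second":
        return a
    if forward == "second" and backward == "first":
        return b
    return None  # the verdict followed the position, so treat it as no signal


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate pass^k from n rollouts per task: the chance that k rollouts
    drawn from the n all succeed, averaged over tasks."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (len(successes_per_task) * comb(n, k))


def always_first(first: str, second: str) -> str:
    return "first"  # toy judge with pure position bias


def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # toy verbosity bias


biased = swap_consistent_winner(always_first, "plan A", "plan B")
stable = swap_consistent_winner(prefers_longer, "short", "a much longer answer")
one_shot = pass_hat_k([8, 4], n=8, k=1)  # one task always passes, one passes half the time
all_eight = pass_hat_k([8, 4], n=8, k=8)
```

Note that the swap check removes position bias only: the `prefers_longer` judge passes it while still being biased toward verbosity, which is why the rationale and the deterministic anchors still matter.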

3. The evaluation surface decays and gets gamed

Even a well-designed signal sits on an evaluation surface that isn't static.

On one side, benchmarks get contaminated and gamed: a fixed test set leaks into training corpora over time and scores inflate. LiveBench (GitHub) answers this by making the benchmark living: questions are regenerated from recent sources on a cadence and scored only on objectively verifiable answers. On the other side, the ad environment is intrinsically non-stationary: traffic structure, creative competition, and distribution mechanics keep shifting, so an upstream proxy that predicted downstream results yesterday may not hold tomorrow.

A competent ad-evaluation mechanism therefore has to retire stale benchmarks while sedimenting every newly-discovered pseudo-win pattern and distortion path into a fresh testcase or anomaly detector. Put differently: a benchmark is not a final arbiter — it is part of the system's capacity to review itself.

Engineering-wise, this is best housed in a single harness. Prime Intellect's verifiers defines an "environment" as the unit that produces both the rollout and the reward, so the evaluation signal and the optimization signal come from the same yardstick — which is exactly what keeps the loop from drifting: the ruler you grade with must be the ruler you optimize against.

4. A worked example that re-evolves every day

Now ground the three sections above in a real system. We maintain a pair of companion skills: one produces a daily intelligence brief on overseas credit markets for ad buying, the other digests review feedback back into the first. The former is the executor — it produces the intel; the latter is the feedback digester — it turns evaluation into actual edits to the agent's own rules. The loop runs on a daily cycle: produced daily, annotated daily, and the annotations digested into rules the same day — so the ruleset is perpetually rewritten by its own output and feedback, and quality climbs monotonically over time.

The evaluation signal lives on the day's real output. The brief is pushed to a Feishu doc each day, and business and strategy colleagues annotate that very day's brief line by line — rather than writing up a separate batch of "ideal briefs" as a template. This isn't just convenience: it guarantees evaluation always attaches to what the agent will actually produce, not to an idealized distribution the agent can't reach. Each annotation implicitly carries a "validity verdict + handling instruction," its context (the quoted source line + the reply) is complete by construction, the signal is unambiguous, and the cost is near zero. The mapping back to ad buying is the same sentence: evaluation must live on the trajectories the agent actually runs, not on offline samples conjured from thin air.

Reward modeling is a discrete rubric. The digester first reduces each free-text annotation to a structured record — a finite verdict enum (discard / light-context / valid-drilldown / needs-derivation / new-entrant-profile / structural) plus fields like region, ka, target file, and edit intent. This is exactly the rubric route from §2 landing in practice: compressing fuzzy, continuous human language into a finite, weightable set of discrete categories is the precondition for everything deterministic downstream. The accompanying tiered handling (discard / light background / full derivation) lets the producer spend different amounts of space on information of different value — the "tiers within a trajectory" idea from §2.
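As a sketch, the structured record might look like the following. The verdict values are the ones listed above; the field names, the `parse_record` helper, and the sample values are assumptions made for illustration, not the system's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    DISCARD = "discard"
    LIGHT_CONTEXT = "light-context"
    VALID_DRILLDOWN = "valid-drilldown"
    NEEDS_DERIVATION = "needs-derivation"
    NEW_ENTRANT_PROFILE = "new-entrant-profile"
    STRUCTURAL = "structural"


@dataclass(frozen=True)
class AnnotationRecord:
    comment_id: str
    verdict: Verdict
    region: str
    ka: str
    target_file: str
    edit_intent: str


def parse_record(raw: dict[str, str]) -> AnnotationRecord:
    """Turn a loosely typed dict into a validated record.
    Verdict(...) raises ValueError for anything outside the finite enum."""
    return AnnotationRecord(
        comment_id=raw["comment_id"],
        verdict=Verdict(raw["verdict"]),
        region=raw["region"],
        ka=raw["ka"],
        target_file=raw["target_file"],
        edit_intent=raw["edit_intent"],
    )


# Hypothetical annotation after classification.
record = parse_record({
    "comment_id": "c-1",
    "verdict": "needs-derivation",
    "region": "SEA",
    "ka": "ExampleLender",
    "target_file": "rules/output_template.md",
    "edit_intent": "pin the reviewer's derivation as a golden example",
})
```

The point of the enum is that a free-text verdict the digester cannot place in one of the six classes fails loudly instead of silently becoming a rule.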

Overfitting is held off by a single generalizability gate — this is the crux of the whole mechanism. Before any classified record enters a rule update, it gets one check: if the annotation is saying "from now on, information of this kind should be handled this way" (touching a template, checklist, gate, schema, or module) → change the rule; if it is only saying "this particular item is right / wrong / too old / a duplicate" → just log it, don't touch the rules. The system learns a method of judgment, not a memory of individual cases. We use feedback only to update the policy, never to memorize one day's answer key — which structurally rules out overfitting to a single day's output, the exact inverse of the failure mode §1 worries about.
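Reduced to code, the gate is a single branch. The sketch assumes the classification step has already labeled which rule surfaces, if any, an annotation touches; the `touches` argument and the two return values are hypothetical names.

```python
# Rule surfaces named in the text: an annotation that touches any of them
# is a statement about how to handle a *kind* of information.
RULE_SURFACES = frozenset({"template", "checklist", "gate", "schema", "module"})


def generalizability_gate(touches: set[str]) -> str:
    """Return "change_rule" if the annotation generalizes, else "log_only"."""
    return "change_rule" if touches & RULE_SURFACES else "log_only"


# "From now on, items like this need a derivation" touches the template.
general = generalizability_gate({"template"})
# "This particular item is a duplicate" touches no rule surface.
one_off = generalizability_gate(set())
```

The hard part in practice is the labeling that feeds `touches`, not the branch itself; keeping the branch this small is what makes the decision auditable.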

The mapping from signal to rule is deterministic and auditable. Each verdict class maps explicitly to "which file, changed how": systematic discards → add freshness / dedup / fact-check entries to the discard-rule file; needs-derivation → pin the colleague's own demonstrated derivation into the output template as a golden example; new entrant → extend the player-profile schema; structural → adjust modules and monitoring scope. The rules are just a set of human-readable files, and an update is a precise edit to a file — reviewable, diffable, revertible, not a black box. Three more disciplines surround it, each answering a worry from §1–§3:

  • Downgrade, never hard-delete. A structural "delete this module" is uniformly downgraded to "disabled by default + code and schema retained + reason noted," so it can be revived the moment the environment shifts — echoing §3's non-stationarity.

  • Idempotent dedup. Every digested annotation is recorded by comment_id in a ledger (checked into git), so a rerun never re-applies, and the same piece of feedback is never counted twice.

  • End-to-end traceability. Every rule change is tagged with the annotations that drove it, so any update can be traced to its source and rolled back accordingly.
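The deterministic mapping, the idempotent ledger, and the traceability tag can be sketched together as follows. The file paths, the `ROUTES` table, the JSON ledger layout, and `digest` are all hypothetical; the sketch only plans edits and does not modify any rule file.

```python
import json
import tempfile
from pathlib import Path

# Verdict class -> (rule file, kind of edit). File names are made up.
ROUTES = {
    "discard": ("rules/discard_rules.md", "add freshness / dedup / fact-check entry"),
    "needs-derivation": ("rules/output_template.md", "pin golden example"),
    "new-entrant-profile": ("rules/player_profile_schema.md", "extend schema"),
    "structural": ("rules/modules.md", "adjust modules and monitoring scope"),
}


def digest(annotations: list[dict], ledger_path: Path) -> list[dict]:
    """Plan rule edits for new annotations, skipping any comment_id that is
    already in the ledger so that a rerun never re-applies feedback."""
    ledger = json.loads(ledger_path.read_text()) if ledger_path.exists() else {}
    edits = []
    for ann in annotations:
        cid = ann["comment_id"]
        if cid in ledger:
            continue  # already digested
        edit = None
        route = ROUTES.get(ann["verdict"])
        if route is not None:
            target_file, action = route
            # source_comment is the traceability tag used for rollback.
            edit = {"file": target_file, "action": action, "source_comment": cid}
            edits.append(edit)
        ledger[cid] = edit  # log-only annotations are recorded too
    ledger_path.write_text(json.dumps(ledger, indent=2, sort_keys=True))
    return edits


batch = [
    {"comment_id": "c-1", "verdict": "needs-derivation"},
    {"comment_id": "c-2", "verdict": "light-context"},  # logged, no rule edit
]
with tempfile.TemporaryDirectory() as tmp:
    ledger_file = Path(tmp) / "ledger.json"
    first_run = digest(batch, ledger_file)
    second_run = digest(batch, ledger_file)  # rerun on the same batch
```

Because the ledger is a plain file, checking it into git alongside the rule files gives the diff-and-revert behavior described above for free.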

Read end to end, the history of this loop is a monotonically rising quality curve. Along the way the system learned three things:

  • To subtract, front-load, and gate. It cut low-value modules, kept low-value content on file without surfacing it, moved high-value information to the front, and added a QA gate before output.

  • To isolate technical noise from business signal. Vague scrape failures were collapsed into explicit classes (unreachable / parse_failed / url_drift / no_change), which is the engineering version of §1's "don't let noise pollute the evaluation signal."

  • To raise the output bar from "stating facts" to "derivational description." Every item must now walk a full causal chain (event → transmission mechanism → impact on the KA → impact on same-region players → response for the service provider → concrete implication for ad buying), and an item that only states facts is discarded.

That last step, the most recent, is precisely §3's raising of the evaluation surface: redefining "what counts as an acceptable output" and freezing it into a new QA gate.

This curve is trustworthy because every step of it corresponds to real feedback that lives on the agent's own output — not someone's gut call on a given day. Quality is sedimented into a continually-accumulating rule asset, and does not depend on the state any one analyst happens to be in that day.

Closing

In the end, the value of evaluation isn't how accurate any single score is. It is that every piece of feedback gets structured, deduplicated, and sedimented into a rule, so the agent's judgment keeps climbing and does not regress with staff turnover or a single bad day. A colleague annotates once (a few minutes), and that judgment generalizes into a rule that governs all future output: one piece of feedback, compounding indefinitely.

More importantly, the loop — produce → evaluate on real output → deterministically sediment back into rules → re-maintain the evaluation surface — is decoupled from any specific vertical. Swapping from financial intelligence to ad buying only requires swapping the data sources, the action space, and the metric definitions; the loop itself is fully reusable. Models will keep getting stronger, but as long as this evaluation mechanism holds, the system can keep converging on genuinely effective strategy inside a non-stationary, imperfectly-observed real environment — which is the precondition for an agent to run autonomously over the long term.

References

  • Gao, Schulman, Hilton. Scaling Laws for Reward Model Overoptimization. ICML 2023. arXiv:2210.10760

  • Denison et al. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models. Anthropic, 2024. arXiv:2406.10162

  • Arora et al. HealthBench. OpenAI, 2025. arXiv:2505.08775

  • Gunjal et al. Rubrics as Rewards (RaR). Scale AI, 2025. arXiv:2507.17746

  • Lightman et al. Let's Verify Step by Step. OpenAI, 2023. arXiv:2305.20050

  • Yao, Shinn, Razavi, Narasimhan. τ-bench. Sierra, 2024. arXiv:2406.12045

  • Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. arXiv:2306.05685

  • White, Dooley et al. LiveBench. ICLR 2025. GitHub

  • Prime Intellect. verifiers. 2025. GitHub

Structure your enterprise decision system.