An agent stores yesterday's winning creative combination as "experience." The traffic structure shifts overnight, and it keeps reusing the combo anyway. That isn't a bad memory — it's too good a memory. When you hand long-running ad operations to an agent, the bottleneck usually isn't that it can't learn. It's that it learns everything: single-day noise, false wins inflated by upper-funnel proxies, strategies that have already gone stale — all of it gets laid down as experience and replayed, over and over, in an environment that never stops changing. This is the eleventh piece in our "AI Growth Decisions" series. The last one was about how to tell whether a change actually made things better. This one is about the other half — once you've judged something, how do you accumulate it, and how do you throw it away.

Once an Agent Learns to Remember, the Trouble Starts
Over the past two years, adding memory to agents has gone from "should we" to "how do we do it well." One number shows how much it matters: on a benchmark built to test multi-turn, cross-session tasks, swapping a memory-managing agent for one that relies on long context alone drops completion on the interdependent tasks from over 80% to around 45%. Put differently, the gap between having memory and not having it is often wider than the gap between two different underlying models.
But memory cuts both ways. Over the same two years, the field has come to recognize a nagging open problem: memory goes stale. A frequently recalled fact about a user's employer becomes a confidently wrong fact the moment they change jobs. And most memory designs have no clean mechanism for detecting and handling that staleness.
In ad buying this gets amplified. Ad operations is already a sparse-, delayed-, and noisy-feedback environment that drifts continuously with platform traffic and competitive dynamics. When an agent runs autonomously in a place like that, every piece of experience it writes down carries an unwritten expiration date. If that memory isn't governed, the agent's judgment doesn't trend upward by default — it slowly poisons itself by default.
This piece breaks down what it means to govern memory: what deserves to be written down, when it should be thrown out, and on what basis you dare throw it out.
Instance vs. Rule: Before Writing to Long-Term Memory, Pass a Gate
The first question is a plain one — what actually deserves to be written into long-term memory?
Start with something counterintuitive. The biggest trap the field has fallen into on agent memory is the most natural move of all: storing raw trajectories. One line of work on skill libraries (SkillRL) puts it bluntly — most memory schemes just store raw trajectories, and raw trajectories are long, messy, and noisy, which actually prevents the agent from distilling anything reusable. In plain terms: the more completely you remember, the less you learn.
This is glaring in ad buying. A creative won yesterday, so the agent stores "this creative + this bid + this audience" as a single block of experience. It looks like learning, but what it has stored is a point, not a rule. The moment the traffic structure shifts, that point is dead, yet it's still being replayed. Last time we covered Goodhart's law: the distortion of downstream results by upper-funnel metrics gets laid down as experience by an evaluation loop that can't self-correct. What this section adds is the back half of that sentence: the act of laying down experience itself needs a gate.
From "Is This One Right" to "How Should All of This Type Be Handled"
To explain the gate, you first have to accept that memory is layered.
The academic lineage here is fairly clear by now. Early on, reflections got jotted into a scratchpad (the Reflexion family). That grew into sedimenting skills into callable libraries (Voyager), and then into abstracting successful and failed episodes into natural-language insights and constraints, retrieved when needed (ExpeL). By 2026 the line has converged on three easy-to-remember layers: what specifically happened → the rule induced from many such happenings → the high-confidence rule promoted into something that directly takes effect.
One paper (AEL) has an example that captures it perfectly: what it sediments isn't "on some date, some indicator failed for some stock," but "momentum indicators are reliable in trending regimes but misleading in reversals." Notice the shape of that sentence — it carries conditions of applicability. It's a rule, not a lone instance.
That is exactly what the gate in our own skill setup does. We mentioned it last time; here's the criterion on its own:
If a comment is saying "from now on, all information of this type should be handled this way" (touching a template, a checklist, a gate, a schema, a module), then change the rule. If it's only saying "this particular item is right / wrong / too old / a duplicate," then just log it — don't touch the rule.
Drop that into the three layers and it's obvious: the gate governs whether a piece of feedback can climb one level, from "an evaluation of one specific output" up to "a rule that takes effect on all future outputs." The single test is whether it describes a point or a rule with conditions attached. What the system should learn is a method of judgment, not the memorized answer key from one particular day.
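The criterion above can be sketched as a small routing function. This is only an illustration, assuming each comment arrives already labeled with what it is about; `Feedback`, `scope`, and `route_feedback` are hypothetical names, not the actual schema of the setup described here.

```python
from dataclasses import dataclass

# Assumed labels. How a comment gets its label (human or classifier) is not shown.
RULE_SCOPES = {"template", "checklist", "gate", "schema", "module"}
INSTANCE_VERDICTS = {"right", "wrong", "too_old", "duplicate"}


@dataclass(frozen=True)
class Feedback:
    comment_id: str
    text: str
    scope: str  # one of RULE_SCOPES or INSTANCE_VERDICTS (hypothetical labels)


def route_feedback(fb: Feedback) -> str:
    """Return 'rule' if the comment should change a rule file, 'ledger' if it is only logged."""
    if fb.scope in RULE_SCOPES:
        return "rule"
    if fb.scope in INSTANCE_VERDICTS:
        return "ledger"
    # Unknown scope: stay conservative and leave the rules untouched.
    return "ledger"
```

The default branch matters more than it looks: when the gate can't tell, the safe failure is an extra ledger line, not an unearned rule change.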
Why You Can't Record Every Win
The easiest mistake is treating every "good result" as a "right move."
A creative posts a stellar first-day ROI. That could be a genuinely good strategy, or it could be an illusion stitched together by that day's traffic, a competitor's temporary absence, and attribution delay. If you record once for every win you see, what you sediment isn't strategy — it's overfitting. You've mistaken noise for a rule.
This is why "is it stable" matters far more than "was it high that one time." Whether a change is truly better or just lucky shows up in whether it holds across repeated trials, not in its single most glamorous moment. So behind the gate is a very dumb piece of discipline: feedback is used to improve the method of judgment, not to memorize the right answer for one particular day. A piece of experience earns its way into the rules only if it still holds in another time, another market — not because it precisely described one past point.
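One way to make "holds across repeated trials" concrete is a check over several independent windows (days, or markets) instead of one. The sketch below assumes ROI is measured per window against a baseline; `min_windows` and `min_win_rate` are illustrative thresholds chosen for the example, not values from this series.

```python
from typing import Sequence


def is_stable_win(
    roi_by_window: Sequence[float],
    baseline_roi: float,
    min_windows: int = 5,       # assumed: fewer observations means "not yet known"
    min_win_rate: float = 0.8,  # assumed: share of windows that must beat the baseline
) -> bool:
    """True only if the change beats the baseline in most of several windows."""
    if len(roi_by_window) < min_windows:
        return False  # a single glamorous day never qualifies
    wins = sum(1 for roi in roi_by_window if roi > baseline_roi)
    return wins / len(roi_by_window) >= min_win_rate
```

A stellar first day followed by four flat ones fails this check, which is the point: the function ignores how high the peak was and asks only how often the result held.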
Tellingly, you often can't even tell in the moment whether a preference is stable. Codex's memory design admits this openly — it keeps "possibly useful but not-yet-confirmed-durable preferences" separate from "rules validated repeatedly," holding the former for observation while only the latter gets promoted. That itself is a "don't be in a hurry to trust it" layered design.
On Disk: Rules on One Side, a Ledger on the Other
All this judgment ultimately has to land in two concrete places, or it's just talk.
One is the rules: human-readable files whose changes you can diff like code and roll back. An update is a precise edit to a file. What gets in here is the rule that passed the gate — it genuinely changes the agent's default action next time.
The other is the ledger: written down, but it doesn't change behavior. Every digested comment is logged by ID under version control, so rerunning doesn't re-edit. The items judged to be "just this one right/wrong" live here — queryable, but never polluting the rules.
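"Logged by ID, so rerunning doesn't re-edit" is an idempotency property. A minimal sketch, assuming the ledger is a JSON Lines file kept under version control; the file layout and the field names `comment_id` and `verdict` are assumptions made for illustration.

```python
import json
from pathlib import Path


def log_to_ledger(ledger_path: Path, comment_id: str, verdict: str) -> bool:
    """Append one entry per comment ID; return False if the ID was already digested."""
    seen = set()
    if ledger_path.exists():
        for line in ledger_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                seen.add(json.loads(line)["comment_id"])
    if comment_id in seen:
        return False  # rerun: nothing is rewritten
    with ledger_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"comment_id": comment_id, "verdict": verdict}) + "\n")
    return True
```

Because the file is append-only and keyed by ID, a rerun produces an empty diff, which makes accidental double-processing visible instead of silent.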
This split looks like engineering fastidiousness, but it's the crux of the whole thing not poisoning itself. Compare two systems you may know better. Claude Code records as it goes — fast to respond, but precisely because it's fast, its setup is stuffed with "what not to record" and "treat recalled memory with suspicion." In essence it substitutes a pile of constraints for one explicit gate. Codex mines after the fact, and before mining it passes a checkpoint: "will a future agent do better because I recorded this? If not, don't write it." Both paths, however they wind around, are asking the same question: what not to record. Our gate puts that question out in the open, and keeps "recorded as a rule" and "only logged to the ledger" physically separated in storage.
One last note: these three layers aren't three separate modules in the system. They're different states of the same rule files being rewritten by feedback over and over. A piece of information may sit in the ledger as an instance for a long time, until a comment of the same kind shows up a third and fourth time; only then is it deemed a rule and promoted. So memory is never a write-once action; it's a process of continuous re-evaluation. Promotion is one half of that process. The other half is the subject of the next section: once a memory has made it into the rules, how does it get demoted and retired after the environment shifts?
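The promotion path described here, from ledger instance to rule once the same kind of comment keeps recurring, can be sketched as a count over the ledger. This assumes each ledger entry carries a `kind` tag grouping comments of the same type; the tag and the `threshold` of 3 are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def kinds_ready_for_promotion(
    ledger: Iterable[Tuple[str, str]],  # (comment_id, kind) pairs; `kind` is an assumed tag
    threshold: int = 3,                 # assumed cut-off for "keeps showing up"
) -> List[str]:
    """Return kinds whose distinct comments have reached the threshold."""
    distinct = set(ledger)  # the same (comment_id, kind) pair logged twice counts once
    counts = Counter(kind for _, kind in distinct)
    return sorted(kind for kind, n in counts.items() if n >= threshold)
```

Counting distinct comment IDs, not raw log lines, keeps a rerun or a duplicated entry from pushing an instance over the line.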
Memory Expires: In an Ever-Changing Environment, Old Experience Is Harmful
The previous section dealt with what deserves to be written. But however strict that gate is, a thornier problem remains: a rule being right today doesn't mean it will still be right three months from now.
There's a turn here a lot of people miss. In a static environment, an old piece of experience is at worst useless — it sits there, neither helping nor hurting. But ad buying isn't that kind of environment. Traffic structures change, competitors switch tactics, platform distribution mechanisms get reworked every so often. In a place like that, an expired piece of experience isn't politely "useless" — it's harmful. Because it still looks valid, the agent treats it as grounds for action and actively steers budget down a path that has already failed. A useless memory just wastes a slot; an expired memory will actively drive you into the ditch.
This Is Precisely What the Whole Industry Does Worst
Worth flagging: this isn't a difficulty unique to us. It's the shared soft underbelly of nearly every memory scheme out there.
One industry survey says it plainly: a memory system has three jobs — what to record, how to retrieve it, and how to make an invalidated fact cleanly disappear — and while everyone competes hard on the first two, almost no one seriously designs the last. It points to a very specific failure mode: many schemes dump memory into a vector store and retrieve by "semantic similarity." The trouble is that an expired fact and a current fact often sit equally close in vector space. So you search "how should I run this audience now," and the system may serve up a six-month-old, long-invalid piece of experience alongside one validated yesterday — together, and with equal confidence. It can't tell new from old, because it never stored "newness" in the first place.
So the schemes doing this more seriously are all finding ways to give memory a time dimension. Some (the Zep family, for instance) simply stamp every fact with a validity window — when it started holding, when it was overturned — so retrieval can tell whether it's still alive. Behind this is a plain recognition: a memory is "something that was true at a point in time," not "something true forever."
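A validity window can be sketched in a few lines. This is not the API of Zep or any other product; `Fact`, `valid_from`, `invalidated_at`, and `alive_at` are hypothetical names used only to show the idea of filtering retrieval results by time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    text: str
    valid_from: datetime                       # when the fact started holding
    invalidated_at: Optional[datetime] = None  # None means not yet overturned


def alive_at(facts: List[Fact], when: datetime) -> List[Fact]:
    """Keep only facts whose validity window contains `when`."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.invalidated_at is None or when < f.invalidated_at)
    ]
```

Applied after a similarity search, a filter like this is what separates the six-month-old answer from yesterday's: the two may be equally close in vector space, but only one has a window that is still open.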
Claude Code's setup reflects this too — it auto-attaches a note to any memory older than a day: this is an observation from a moment in time, not live state; verify against current conditions before using. A memory naming a specific field or a specific bid is only saying "it was this way the moment I was written," not "it's still this way now."
The Approach: Demote, Don't Hard-Delete
So you've found a piece of strategy experience that's expired. What do you do with it? Deleting is the most intuitive move, but our approach is demote, don't hard-delete — mark it "no longer used by default," but keep the whole thing in the library, together with "the conditions under which it was originally validated, and why it later failed."
Why not delete it clean? Because the ad environment swings back. A platform distribution tweak may revert in a couple of months; an angle users got tired of may come back to life with a fresh wrapper after a while. If you'd deleted it outright, then the next time that window reopens, the agent has to re-test the whole path from scratch and pay the tuition all over again. But if it was only shelved with the reason written beside it, then the moment the environment swings back, that experience can be re-awakened at nearly zero cost. This matches the instinct of any veteran buyer: when a tactic stops working, you don't erase it from your head — you put it away and pull it back out when the timing is right.
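Demotion without deletion amounts to a status flag plus the reason kept beside the entry. A minimal sketch under that assumption; `StrategyEntry`, the two-state `status`, and the helper names are hypothetical, not the library's real structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StrategyEntry:
    name: str
    validated_under: str     # conditions under which it originally held
    status: str = "active"   # "active" or "shelved" (assumed two-state model)
    shelved_reason: str = "" # why it stopped working


def demote(entry: StrategyEntry, reason: str) -> None:
    """Shelve the entry but keep it, with the failure reason written beside it."""
    entry.status = "shelved"
    entry.shelved_reason = reason


def reawaken(entry: StrategyEntry) -> None:
    """Bring a shelved entry back into default use; the old reason stays for context."""
    entry.status = "active"


def default_playbook(library: List[StrategyEntry]) -> List[StrategyEntry]:
    """Only active entries steer default behavior; shelved ones remain queryable."""
    return [e for e in library if e.status == "active"]
```

Re-awakening is a one-field change because nothing was thrown away: the original validation conditions and the failure reason are both still on the entry when the window reopens.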
The Two Disciplines Holding It Up
For the library to live this way, two dumb-but-critical disciplines have to sit beside it.
One is this: don't let technical noise impersonate strategy failure. Suppose the account tied to a piece of experience suddenly has no data. Maybe the strategy really has stopped working, or maybe a scrape failed and the callback dropped. These two cases must be told apart, or the agent will misread "I didn't get data" as "this strategy is dead" and demote a perfectly fine piece of experience. Technical "no results" must be explicitly classified and kept out of strategy experience. This is last time's "don't let noise pollute the evaluation signal," landed on the memory side.
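Keeping technical "no results" out of strategy experience can be sketched as an explicit outcome type. The names below (`Outcome`, `classify`, `roi_floor`) and the idea that a fetch reports its own success are assumptions for illustration only.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TECHNICAL_NO_DATA = "technical_no_data"  # pipeline problem; says nothing about the strategy
    STRATEGY_FAILED = "strategy_failed"
    STRATEGY_OK = "strategy_ok"


def classify(fetch_ok: bool, roi: Optional[float], roi_floor: float) -> Outcome:
    """Separate 'I did not get data' from 'the strategy underperformed'."""
    if not fetch_ok or roi is None:
        return Outcome.TECHNICAL_NO_DATA
    return Outcome.STRATEGY_OK if roi >= roi_floor else Outcome.STRATEGY_FAILED


def counts_toward_demotion(outcome: Outcome) -> bool:
    """Only a genuine strategy failure may push an entry toward demotion."""
    return outcome is Outcome.STRATEGY_FAILED
```

The design choice is that missing data gets its own value instead of being coerced to zero ROI, so a dropped callback can never be mistaken for a dead strategy downstream.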
The other is end-to-end traceability. Every promotion and demotion of every entry — which buying results and which feedback drove it up or down — must be marked. That way, when an entry underperforms, you can trace back exactly what got it written in, and roll it back precisely if the call was wrong. The library shouldn't be a black box; it should be like git, every entry queryable and reversible. It's precisely because of this that "demote" is safe to do — you always know how a piece of strategy experience got to where it is today.
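Traceability of this kind can be sketched as an append-only history on each entry, where a rollback is itself a recorded transition. `TracedEntry`, `Transition`, and the status strings are hypothetical names; a real setup might lean on version control for the same effect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Transition:
    old_status: str
    new_status: str
    evidence: str  # which buying result or feedback ID drove the change


@dataclass
class TracedEntry:
    name: str
    status: str = "ledger"
    history: List[Transition] = field(default_factory=list)

    def move(self, new_status: str, evidence: str) -> None:
        """Change status and record what drove the change."""
        self.history.append(Transition(self.status, new_status, evidence))
        self.status = new_status

    def rollback(self, evidence: str) -> None:
        """Undo the latest transition; the undo is recorded, nothing is erased."""
        if not self.history:
            raise ValueError("nothing to roll back")
        last = self.history[-1]
        self.move(last.old_status, evidence)
```

For example, an entry promoted to a rule, wrongly shelved, and then restored ends with three history records, and the last one says why the demotion was reversed.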
Lay this library's history out flat and it isn't a curve that only adds things — it's a curve with ups and downs, both leaving a trace. Behind every step is the machinery above: the gate decides which tested-out finding is worthy of promotion to a rule, the time dimension and demotion decide which faded tactic should sink, and traceability lets every rise and fall be traced to its source. Together, the three make the agent's ad-buying judgment trend upward over time, rather than getting dragged down by the expired tactics it accumulated.
And this machinery is decoupled from any specific platform or vertical: today it governs creative angles in one market; tomorrow, on another platform, another vertical, all you swap is the data source, action space, and metric definitions — the gate, demotion, and traceability skeleton ports over wholesale.
Closing
If the last piece was about how to judge whether a change actually made things better, this one is about how to bank what you've judged — and how to throw it away. One hands you the ruler, the other the ledger; together they're the foundation that makes an agent's long-term autonomy actually stand up.
For all its detours, this piece really only wants to make one thing clear: whether a long-running agent's judgment trends upward over time turns not on how much it remembers, but on how well its memory is governed. Let it sediment what generalizes, retire what expires, and keep every entry traceable to its source. The models will only get stronger, but if this memory-governance machinery doesn't hold, even the strongest model will be slowly dragged down by the expired experience it accumulated, in a real environment that never stops changing and whose feedback is never perfect.
Remembering a lot was never the skill. Knowing what not to record, what not to trust, and when to forget — that is.
This is the eleventh piece in our "AI Growth Decisions" series. From defining the category, through last time's eval mechanism and this time's memory governance, we keep taking apart the same thing: in the age of industrialized AI, how an enterprise growth-decision system should be built.