On Scrapers That Heal Themselves: Building AI Ownership Into the Data That Feeds Andri

April 17, 2026

On Scrapers That Heal Themselves

I was two months into Andri when I learned that the thing keeping me up at night wasn't the model.

It was a public court publisher that had decided, some time between Friday evening and Sunday morning, to change the way it structured its responses. Our scraper ran on its usual cron, got a 200 back, parsed zero documents, wrote nothing, and declared success. By the time a lawyer asked us why a case from Monday wasn't in the index, we'd silently skipped three publication cycles.

Nothing was broken. That was the worst part. The site worked. Our code worked. The two of them just didn't agree about the shape of the world anymore.

I think about that weekend a lot.


The thing nobody says about legal AI

Every company in this space will tell you about their model. About their retrieval pipeline. About how they chunk and embed and rerank. Nobody mentions the scrapers.

But the scrapers are the whole thing. If the data isn't there, none of the clever retrieval matters. If the data is wrong, you've built a confidently wrong search engine. If the data is stale, you're letting lawyers cite law that's been amended.

The scrapers are also where all the drudgery lives. Every legal publisher has their own quirks. Some run on infrastructure from the early 2000s and occasionally deploy proof-of-work bot protection. Others publish XML that technically violates the spec. Some are SharePoint deployments of varying vintages, with search APIs that return 503s more often than not.

None of this is anyone's fault. These are public services run on modest budgets by people who are not, on the whole, building for downstream AI consumption. The only reason we can use any of this data is that it's public and the rules for accessing it are published in robots.txt files and terms-of-use pages, both of which we treat as binding.

So. Dozens of scrapers. Dozens of ways for things to quietly go wrong.


What we tried first

We did what everyone does. CloudWatch alarms. A Slack channel called #scraper-alerts. Wake up the next morning, fix it, move on.

This worked. It also didn't scale. A few failures a week across dozens of sources adds up to a Slack fire drill on most working days, and the investigations all followed the same steps:

  1. Pull the last orchestration run.
  2. Read the final step's input.
  3. Scroll through logs until something says ERROR.
  4. Cross-reference with object storage to see if documents were written.
  5. Cross-reference with the search index to see if they were indexed.
  6. Form a theory. Confirm or reject with a second query. Patch.
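The checklist above is mechanical enough to sketch in code. This is a hypothetical illustration, not our implementation: the `run`, `error_lines`, `objects_written`, and `docs_indexed` inputs stand in for whatever your orchestrator, log, storage, and index clients actually return.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    step: str
    ok: bool
    detail: str

def investigate(run, error_lines, objects_written, docs_indexed):
    """Run the same cross-checks a human on-call engineer would, in order."""
    return [
        Finding("orchestration", run["status"] == "SUCCEEDED", run["status"]),
        Finding("logs", not error_lines, f"{len(error_lines)} ERROR lines"),
        Finding("storage", objects_written > 0, f"{objects_written} objects"),
        Finding("index", docs_indexed > 0, f"{docs_indexed} indexed"),
    ]

# The silent failure from the opening story: the run reports SUCCEEDED,
# but storage and index both show zero.
silent = investigate({"status": "SUCCEEDED"}, [], 0, 0)
```

The point of writing it this way is that the control flow never changes between scrapers; only the inputs do, which is exactly what makes it automatable.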

The problem with this workflow isn't that it's hard. The problem is that it's identical every time. The shape of the investigation doesn't depend on which scraper broke. What depends on the scraper is the specific error, the specific file path, the specific selector that needs changing.

An engineer doing the same investigation twelve times a month is an engineer about to introduce a subtle regression out of boredom. We've all been that engineer. Nothing good happens after hour three.

So I asked the only useful question: what if the investigation ran itself?


A note on what we're not building

There's a growing body of writing on self-healing software. Most of it is about services that fix themselves at runtime: a pod crashes, the orchestrator restarts it; a node degrades, the load balancer routes around it; an anomaly detector trips, a rollback kicks in. Closed-loop, internal, human-free. That is a real and useful discipline. It is not what we're doing.

Our problem lives one layer up. We don't need a service to restart itself. We need the repetitive human parts of incident response (reading logs, cross-referencing storage and indexes, writing a ticket, opening a PR) to stop consuming engineer-hours. The "healing" ultimately lands as a normal code change, reviewed by a human, merged through normal CI. Nobody's pod is restarting itself. An agent is doing the on-call grunt work a tired engineer used to do at 2 a.m., and doing it during office hours instead.

Calling this "self-healing" is generous. It's closer to agentic incident response with a human holding the merge button. We use the term because it's the vocabulary people reach for, not because we think we're building the thing the literature describes.


The thing the runtime literature misses

Here's where our problem diverges from the one most self-healing writing addresses.

Runtime self-healing assumes the system doing the healing is the system you own. Your service is down, your AI diagnoses it, your agent restarts the pod. Clean. Internal. Nobody outside your VPC participates.

Our problem isn't like that. Half our failure modes live on someone else's infrastructure. A publisher adds an access page that wasn't there last week. Another switches document formats over a weekend. Another quietly retires an archive. None of that is a bug in our code. Strictly speaking, none of it is broken. The world just moved.

A system that only looks inward will diagnose every one of those as an outage on our side, spend compute trying to restart something that isn't crashed, and never close the loop. The fix has to reach into the code itself: a new parser, an updated selector, a different extraction path. Writing that fix is a cognitive task, not a restart.

The other thing the runtime framing misses is more uncomfortable, so I'll just write it down: nobody talks about permission. The assumption is that the system you're healing is one you own. Scraping is different. The data we collect is produced by other people, published under rules we did not write, and we only get to keep doing this because we respect those rules.

A system that "repairs" itself by hammering a blocked endpoint is worse than a broken one.

I'll come back to this.


What we actually built

Every scraper gets a state machine. Every state machine gets a doctor.

The doctor is an agent. One per scraper, running as a nightly job, responsible for the health of its own state machine. It doesn't touch the scraper's code. It observes.

Each morning at 07:00 UTC it wakes up, pulls the registry of every scraper we run, and fans out across the fleet. For each one it does the six-step investigation I listed above, but in parallel, using a small toolkit:

  • A health check that pulls the last N orchestration runs, counts files written across 24h/7d/all-time, and counts indexed documents for the same periods. Three signals in one request.
  • A log query tool scoped to a single scraper's log streams, biased toward our structured agent log lines before falling back to raw error text.
  • Infrastructure sanity checks: is the schedule still enabled, is anything piling up in a dead-letter queue, is there a state machine running that isn't in our registry.
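A minimal sketch of the first tool, assuming the three signal families described above. The classification thresholds here are illustrative, not our production rules:

```python
from dataclasses import dataclass

@dataclass
class HealthSnapshot:
    recent_runs: list    # status of the last N orchestration runs
    files_written: dict  # {"24h": n, "7d": n, "all": n}
    docs_indexed: dict   # same periods, counted in the search index

    def classify(self) -> str:
        """Combine the three signals into OK / WARNING / CRITICAL."""
        failed = sum(1 for s in self.recent_runs if s != "SUCCEEDED")
        wrote_nothing = self.files_written["24h"] == 0
        indexed_nothing = self.docs_indexed["24h"] == 0
        if wrote_nothing and indexed_nothing:
            # Runs failing AND no output: clearly broken. Runs "succeeding"
            # with no output: the silent-failure case, worth a look.
            return "CRITICAL" if failed else "WARNING"
        return "WARNING" if failed else "OK"
```

Bundling all three counts into one snapshot matters: a scraper that "succeeds" while writing nothing is only visible when you hold the signals side by side.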

The agent classifies each scraper as OK, WARNING, or CRITICAL, and for every finding with a fixable root cause, it's required to send a remediation ticket to a second agent.

The two agents are deliberately separated. Diagnosis and repair are different cognitive tasks and shouldn't share context. In practice it's also about accountability. The health agent owns "something is wrong and here's the proof." The fix-it agent owns "here's the code change that addresses it." When they're fused into one step, neither is good at their job.


The rule that changed everything

The first version of our health agent produced confident-sounding tickets with thin evidence. We spent a while debating whether it was hallucinating. It wasn't. It was just reporting a first guess as a diagnosis, because we hadn't told it not to.

The fix was structural, not a prompt tweak.

The agent is now required to gather evidence from at least two independent sources before sending a ticket. Orchestration runs plus logs. Or object storage plus search index. Or logs plus dead-letter queue. If it only has one signal, it doesn't ticket. If the search index shows zero documents but the object store has files, the scraper is fine: the indexer is broken. That's a different ticket, to a different component, with different remediation. The rule catches that distinction automatically.
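The rule works because it is a hard gate, not a prompt instruction. A sketch of what that gate looks like, with hypothetical labels for our signal families:

```python
# Signal families the agent can draw evidence from (illustrative names).
INDEPENDENT_SOURCES = {"orchestration", "logs", "object_storage",
                       "search_index", "dead_letter_queue"}

def may_ticket(evidence: list[dict]) -> bool:
    """A ticket is allowed only if evidence spans >= 2 independent sources.
    Two log lines are one source; a log line plus a storage count is two."""
    sources = {e["source"] for e in evidence} & INDEPENDENT_SOURCES
    return len(sources) >= 2
```

One signal alone, however damning it looks, sends the agent back to gather more evidence rather than to the ticket queue.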

Ticket content is a strict schema: scraper name, state machine identifier, failure category, one-sentence root cause, evidence quoted directly from logs (not paraphrased, not summarised), remediation steps, affected components, and a diagnostic blob tailored to the failure category.
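The schema above, sketched as a dataclass. Field names are illustrative stand-ins; the real schema lives in our codebase:

```python
from dataclasses import dataclass, field

@dataclass
class RemediationTicket:
    scraper_name: str
    state_machine_id: str
    failure_category: str           # one of nine defined categories
    root_cause: str                 # exactly one sentence
    evidence: list[str]             # quoted log lines, never paraphrased
    remediation_steps: list[str]
    affected_components: list[str]
    diagnostics: dict = field(default_factory=dict)  # category-specific blob

    def __post_init__(self):
        # Structural enforcement of the evidence rule: a ticket without
        # quoted evidence never gets constructed, let alone sent.
        if not self.evidence:
            raise ValueError("a ticket without quoted evidence is a guess")
```

Making the evidence field mandatory at construction time is the same idea as the two-sources rule: constraints live in the schema, not in a prompt.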

We've defined nine categories. They cover every scraper incident we've seen in eighteen months:

  • SOURCE_CHANGED: website structure or API changed. Scraper returns 200, extracts zero documents.
  • BOT_PROTECTION: site added a WAF, access challenge, captcha, or similar control. Always routes to human review, never auto-bypass.
  • IAM_PERMISSION: a job tried to do something the cloud provider didn't grant.
  • CONFIG_ERROR: missing environment variable, wrong identifier, stale bucket name.
  • BATCH_FAILURE: LLM batch job didn't complete.
  • INDEXING_BREAK: documents in storage, absent from the search index.
  • CODE_BUG: runtime throws. We want the traceback.
  • SCHEDULE_DISABLED: the scheduler rule is off. We've done this to ourselves during deploys.
  • INFRA_FAILURE: OOM, timeout, dead-letter queue accumulation, orchestrator history limits.

Each category tells the fix-it agent which playbook to follow and which part of the system to touch. SOURCE_CHANGED means the scraper module for that jurisdiction. IAM_PERMISSION means infrastructure-as-code. INDEXING_BREAK means the indexer, not the scraper. The category is a routing decision before it's a content decision, and getting the routing right is sixty percent of getting the fix right.
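The routing table can be sketched directly. The component names on the right are illustrative, but the mapping follows the playbooks described above:

```python
from enum import Enum, auto

class FailureCategory(Enum):
    SOURCE_CHANGED = auto()
    BOT_PROTECTION = auto()
    IAM_PERMISSION = auto()
    CONFIG_ERROR = auto()
    BATCH_FAILURE = auto()
    INDEXING_BREAK = auto()
    CODE_BUG = auto()
    SCHEDULE_DISABLED = auto()
    INFRA_FAILURE = auto()

# Which part of the system each category touches (names are illustrative).
ROUTES = {
    FailureCategory.SOURCE_CHANGED: "scraper_module",
    FailureCategory.BOT_PROTECTION: "human_review",    # never auto-bypass
    FailureCategory.IAM_PERMISSION: "infrastructure_as_code",
    FailureCategory.CONFIG_ERROR: "configuration",
    FailureCategory.BATCH_FAILURE: "batch_pipeline",
    FailureCategory.INDEXING_BREAK: "indexer",         # not the scraper
    FailureCategory.CODE_BUG: "scraper_module",
    FailureCategory.SCHEDULE_DISABLED: "infrastructure_as_code",
    FailureCategory.INFRA_FAILURE: "infrastructure_as_code",
}
```

Keeping the routes in a dict rather than as enum values avoids Python's enum aliasing (two categories that share a destination would otherwise collapse into one member) and keeps the routing table reviewable in a single diff.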

The difference shows up in the tickets. Our first version said things like "scraper failing, investigate." That's detection with a sprinkle of diagnosis. Our current version reads more like: "scraper X is now receiving a redirect to an access page that didn't exist last week (see evidence block). The previous fetch path no longer applies for this source. Fix: flag for human review, do not auto-bypass, reach out to the publisher to confirm the new intended access pattern."

That is a diagnosis. The fix-it agent can act on it. A human can review the resulting PR in ninety seconds.


The boundary we chose

The fix-it agent writes a code change. It opens a pull request. A human reviews it. The merge goes through our normal deployment pipeline.

That's the whole design. No runtime auto-remediation, no self-configuration, no agent-initiated deploys. If the system changed itself at runtime, it would be bypassing code review. Requiring a human to review every change, even an agent-authored one, preserves the property that our infrastructure-as-code and our application code are the source of truth for what production is doing. That's the boring but correct answer to "how much autonomy do you give an agent." Our version of oversight is GitHub's review button.

The fix-it agent can add a new parser. Fix a selector. Handle a new document type. Adjust an infrastructure permission. Update an environment variable. What it cannot do, under any circumstance, is change our rate limits.

That brings me back to the uncomfortable part.


The rule nobody in the literature talks about

Our scrapers touch other people's servers. That's not a detail. It's the whole operating constraint.

A self-healing system that doesn't respect public data rules is, in legal tech, a self-destructing system. So our agents have an extra rule that doesn't appear anywhere in the self-healing literature we've read, and which we had to work out ourselves:

The agent never heals by pushing harder on a source that has asked us to back off.

In practice this means several specific things.

The health agent will never recommend increasing scrape frequency. If a daily scraper is missing documents, the remediation is to parse better, not to poll more. A scraper that finds zero new documents for two days on a source that usually produces dozens is flagged CRITICAL and assumed to be broken, not behind.

The fix-it agent is explicitly blocked from changing rate-limit constants. Our limits are set by the engineering team with the source's operator in mind, with conservative gaps between requests for most sources, wider gaps for backfill, and tighter ones only for lightweight probes. If the agent thinks it needs to tighten them, the ticket goes to a human channel instead.

Backfill-awareness is explicit. If a state machine has zero all-time executions, it is classified as "not yet activated," not broken. No ticket. This rule came from an early false-positive where the agent paged us for every scraper in a newly-deployed jurisdiction because we hadn't run the backfills yet.

Any SOURCE_CHANGED remediation ticket for a site with a restrictive robots.txt flags for human review rather than automated PR. We don't want an agent "helpfully" extending our coverage into paths the site owner has asked us not to crawl. Those decisions get made by people, in email, with the source.
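The four rules above live in the tool schema, not in a system prompt. A hedged sketch of what that gate looks like; every name here is illustrative:

```python
def route_ticket(category: str, *, touches_rate_limits: bool = False,
                 restrictive_robots: bool = False,
                 all_time_runs: int = 1) -> str:
    """Decide whether a remediation ticket may become an automated PR."""
    if all_time_runs == 0:
        # Backfill never ran: the scraper is not yet activated, not broken.
        return "not_yet_activated"
    if touches_rate_limits:
        # Agents never tighten rate limits; humans set those with the
        # source's operator in mind.
        return "human_channel"
    if category == "BOT_PROTECTION":
        return "human_review"          # never auto-bypass an access control
    if category == "SOURCE_CHANGED" and restrictive_robots:
        return "human_review"          # talk to the publisher, don't crawl
    return "automated_pr"
```

Because the decision is a function, not a sentence in a prompt, there is no model temperature at which it gets forgotten.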

None of this is hard to state. All of it is trivial to forget once you're measuring yourself on mean-time-to-resolution. Cost-benefit on a data centre is one thing; cost-benefit on a relationship with a public records office is another. Same shape. Different stakes.


What this actually gets us

The scraper health agent runs every morning. Dozens of state machines, investigated in parallel across five or six iterations. Produces a Slack health report that reads like a radiologist's note: "Fifteen of eighteen active scrapers healthy, three need attention, nine not yet activated." Sends between zero and six remediation tickets. Logs a full audit trail.

The full run costs a few cents in model tokens; prompt caching means the shared context is paid for once and reused in every iteration after the first.

The fix-it agent picks up tickets, opens PRs against the right components, runs tests, and assigns reviewers. A human reviews, merges, and the next scheduled run verifies the fix. The round trip that used to be a 2 a.m. page followed by three hours of log diving is now a GitHub notification at breakfast.

We are not at zero human interventions. We don't want to be. A scraper whose source has genuinely changed is a piece of institutional knowledge we need to absorb consciously. The goal was never to eliminate humans from the loop. The goal was to eliminate the repetitive human steps, the ones that look identical scraper-to-scraper, and reserve human attention for the things that require judgment.


The things I'd do differently

If I were starting this over, I'd write the failure taxonomy first and the code second. The nine categories we ended up with are not arbitrary. They map to concrete playbooks. But we discovered them by watching incidents for six months before naming them. That was too slow. A team starting today could read our nine and use them as a floor.

I'd commit on day one to the evidence-from-two-sources rule. Most of what looked like "the agent is hallucinating" turned out to be the agent reporting a first guess because we hadn't told it not to. Structural constraints beat prompt engineering every time.

And I would draw the permission boundary first, before writing any remediation logic. Not as a policy document. As code. If a failure category could conceivably lead to changing how we interact with a source, that category goes to a human channel by default, and the default has to be set in the tool schema, not in a system prompt you're hoping the model remembers.


Anyway

The biological metaphor is tempting, and it's partly right, just not in the "one clever immune system" way. Human bodies aren't healed by one brain. They're healed by many specialised systems, each one narrowly competent, each one operating in parallel, each one reporting to a central nervous system that knows how to triage.

That's the architecture we're moving toward. Not one giant self-healing brain. A hospital.

The scraper health agent is the first doctor. The fix-it agent is the first surgeon. Next will be a search-quality agent that reviews the relevance of what we're indexing. A metadata agent that catches enrichment regressions. A cost agent that notices when a batch provider starts charging differently than we expect. Small, well-scoped doctors. Each one owning a single dimension of system health, reporting to the same review channel, all of them ultimately accountable to humans reading pull requests.

I genuinely don't know if this will stay the right shape in a year. The technical and ethical ground keeps moving. What I can say is that we made a deliberate choice to build healing infrastructure where the human always gets the last word, where the system respects the sources it depends on, and where "the agent fixed it" always means "the agent proposed a fix, a human approved it, and we have the audit trail to prove it."

Whether any of this is right, time will tell. The people using Andri every day, and the people running the sites we scrape, are the real test.

If it helps lawyers do better work, catches data problems before they matter, and maintains the goodwill of the public publishers we depend on, it worked.

If not, we'll learn something and try differently. That's also what self-healing looks like, I suppose.


A note on framing: most writing on "self-healing software" is about runtime systems that recover themselves without a human. What we've built is different: agents that do the repetitive human parts of incident response and propose code changes for a human to merge. We use the vocabulary because it's what people reach for, but the important property of our system is that a human always gets the last word, and the source systems we depend on are public records published by people who didn't ask to be scraped.

Read also: why legal AI needs to think, not just respond, why agentic reasoning is the only path to production legal AI, on thinking slowly: why we built Andri to deliberate, and why we had Fox-IT pentest Andri. For more on how we handle document metadata, see forensic AI metadata. Explore all Andri features.