Why Agentic Reasoning Is the Only Path to Production Legal AI

December 7, 2025

Here's the thesis: if you're not building an agentic system—a loop of planning, tool use, reasoning and document generation grounded in a specific matter—you're not building production legal AI. You're building autocomplete with a legal skin.

We operate in the UK and the Netherlands, with lawyers working on real disputes and transactions. That gives us a brutal feedback loop on what actually works. Let me walk through it.


The Data Point That Should Make You Suspicious

On Andri, 95%+ of queries are grounded in documents from an actual matter. Fewer than 5% happen without case-specific context.

Think about what that means. Lawyers aren't waking up asking "What does contract law say about misrepresentation in general?" They're asking: "My client signed this SaaS agreement with these limitation clauses. The system went down for three days. Given the emails and change logs in this matter, what is our exposure and how do we defend?"

If your core thesis is "we're building better generic legal research", you're already pointed the wrong way. You're solving "what legal frameworks intersect with this question?" when the real question is "given this messy, specific fact pattern, under this legal system, what is a defensible position?"

Generic legal research is a feature. Case-centric reasoning is the product.


Search Is Plural

Here's where most architectures quietly reveal their mental model.

There is no single thing called "search". There's a toolbox:

  • Lexical (BM25): Your bread-and-butter for exact terms, citations, statute sections—"6:162 BW", "CPR 3.9", "section 49 CRA 2015"
  • Semantic: Concepts that don't share vocabulary—"duty of care of IT suppliers in large implementations"
  • Graph traversal: "What has limited, followed, or criticised this decision? What's the line of authority?"
  • Metadata filters: Slice by court, year, jurisdiction, outcome
  • Regex: Pull specific clauses, amounts, dates from contracts and pleadings

Each has different failure modes. Semantic search hallucinates on statute numbers. BM25 misses conceptual similarity. Graphs need maintenance. The common mistake is picking one and building your identity around it.

The key insight: expose these as tools the agent chooses between—not the user. Nobody wants a toggle saying "BM25 or semantic?" But the reasoning loop needs access to all of them.
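
A minimal sketch of what that might look like, in the same pseudocode style as the loop later in this post. The names here (bm25_index, embedding_index, citation_graph, metadata_index, regex_extractor) are placeholders for whatever backs each strategy, not a real API:

SEARCH_TOOLS = {
    "lexical": bm25_index.search,        # exact terms, citations, statute sections
    "semantic": embedding_index.search,  # concepts that don't share vocabulary
    "graph": citation_graph.traverse,    # followed / limited / criticised
    "metadata": metadata_index.filter,   # court, year, jurisdiction, outcome
    "regex": regex_extractor.find,       # clauses, amounts, dates in the matter documents
}

def run_search(tool_name, query, **filters):
    # The agent, not the user, decides which strategy to call at each step
    return SEARCH_TOOLS[tool_name](query, **filters)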

A sensible agent handling a UK consumer dispute might use lexical search to grab the exact sections of CRA 2015, semantic search to surface factually similar cases, metadata filters to focus on post-2020 England & Wales decisions, then citation graphs to check whether anything's been quietly undermined on appeal.

If your architecture has a single box labelled "Search" and that box is "Vector DB", you're doing AI theatre.


The Agentic Threshold

Most legal AI products operate in endpoint mode: input text → LLM → output text. Maybe a retrieval step beforehand, maybe prompt templates, but fundamentally one forward pass through a model.
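
In pseudocode, the endpoint pattern amounts to something like this (illustrative only; vector_db and llm are placeholders):

def endpoint_legal_ai(question):
    # One retrieval step, one forward pass; whatever comes out is the answer
    chunks = vector_db.search(question, top_k=10)
    return llm.answer(question, context=chunks)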

An agentic system is structurally different. The model isn't "the place where the answer lives". It's the controller of a loop that thinks, acts, observes, and thinks again.

def legal_agent(case_context, question):
    # Decompose the task
    plan = llm.plan(case_context, question)

    # Iterate: reason → use tools → update understanding
    while not plan.is_satisfied():
        tool_call = plan.next_tool_call()
        result = call_tool(tool_call)
        plan = llm.update_plan(plan, result)

    # Synthesise into structured reasoning
    reasoning = llm.synthesise(case_context, question, plan.evidence)

    # Self-check / counter-argument pass
    checked = llm.verify(reasoning, plan.evidence)

    # Render into actual work products
    return render_documents(checked, case_context)

The system decomposes questions into sub-tasks, has multiple tools at its disposal, and chooses tools based on intermediate state rather than a hard-coded pipeline. It can change its mind, ask a different question, dig deeper. And crucially—it ends with documents, not chat bubbles.

"Chat with your documents" is a subroutine. "RAG over a vector database" is a subroutine. A knowledge graph query is a subroutine. The agent is the outer loop that decides what to do given this case in this jurisdiction.

If your system doesn't have that outer loop, you've imposed a low ceiling on what it can ever do.


Progressive Disclosure: Legal Research as Tree Search

Watch how an experienced barrister actually works on a hard point.

They don't read "all of contract law" then answer. They skim the file and build a mental map. Identify the axes of dispute—breach vs causation, limitation, quantum, procedure. Pull a small set of anchor authorities. Check subsequent history. Compare the fact pattern to the tests. Explore counter-arguments. Draft. Bounce between draft, law and facts until the narrative is coherent.

This is progressive disclosure. Each step reveals what the next step should be. The process is effectively tree search in legal space—nodes are "partial understanding of the problem", edges are "read this case", "check this statute", "verify this procedural point". You don't expand the whole tree. You choose branches dynamically as you learn.

Agentic systems match this pattern. Start broad: "what are the relevant areas of law?" Narrow to doctrines. Drill into specific cases and statutory sections. Loop between matter documents and external law until the answer stabilises.
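
A rough sketch of that coarse-to-fine trajectory, again in illustrative pseudocode (identify_areas_of_law, pick_most_promising, expand and answer_has_stabilised are placeholder names):

def research(case_context, question, max_depth=4):
    # Start broad, then let each round of findings decide where to dig next
    frontier = llm.identify_areas_of_law(case_context, question)
    findings = []
    for _ in range(max_depth):
        branch = llm.pick_most_promising(frontier, findings, case_context)
        result = call_tool(branch)                       # read a case, check a statute, verify history
        findings.append(result)
        frontier = llm.expand(branch, result, frontier)  # new branches revealed by what we just learned
        if llm.answer_has_stabilised(findings, question):
            break
    return findings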

A single-shot "semantic search over case law, then one LLM call over top-k snippets" can't do this. You're betting that the right cases, the right paragraphs, and the correct subsequent history all sit neatly in a few retrieved chunks. Sometimes you get lucky. Often you don't. And the system has no way to realise it hasn't thought hard enough.


Scaling Laws and Why Think-Time Matters

Zoom out to the modelling perspective.

At inference, you're always trading between speed, cost, and accuracy. Most non-agentic systems implicitly pick speed and low cost—one wide forward pass, return whatever comes out. If they want higher accuracy, they reach for a larger model and accept higher per-query cost.

Agentic systems make a different trade: accuracy and cost-efficiency, with variable speed. They treat deliberation as a budget.

Instead of one massive call, they break the problem into pieces, use deterministic tools wherever possible (BM25, regex, citation graphs—these are cheap and precise), use smaller model calls for local reasoning, and spend more steps only on hard queries.

Think of it as trading width for depth. A non-agentic system does one wide, expensive pass. An agentic system does many narrower passes, interleaved with tools, with the ability to stop when it's "good enough".
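
As an illustrative sketch of the same idea (quick_pass, deep_pass, the confidence threshold and the step cap are all placeholders, not a prescription):

def answer_with_budget(case_context, question, budget="fast"):
    # Cheap, mostly deterministic pass first: lexical search, regex, metadata filters
    draft = quick_pass(case_context, question)
    if budget == "fast" or draft.confidence >= 0.8:
        return draft
    # Hard query: spend more steps (deeper retrieval, smaller model calls per sub-task,
    # a verification pass) and stop as soon as the answer is good enough
    return deep_pass(case_context, question, max_steps=20, stop_when="good_enough")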

Here's why this matters: as base models improve, agentic systems get compound gains. Planning improves. Tool selection improves. Local reasoning improves. Verification improves. These gains multiply along the reasoning chain. The gap between "single-shot + vector DB" and "agent with tools and deliberate loops" doesn't close—it widens.


Knowledge Graphs and Semantic Search: Tools, Not Architectures

Two fashionable patterns that need to be put in their place.

Semantic search as "the whole product"

The pitch: "We embed all case law. We run semantic search. We feed top-k chunks into an LLM. Done."

This is a decent component. We use semantic search extensively. But alone it has no concept of matter context, no multi-step plan, no way to know if it's looked far enough, and no ability to dynamically decide "this is tricky, go deeper".

Knowledge graphs as "the platform"

The pitch: "We've structured law into a knowledge graph—cases, statutes, concepts, relationships. Query the graph; the graph is your legal brain."

As a tool, fine. We use graph structure for citations and doctrinal clusters. As an architecture, it misses the core reality: legal reasoning is contextual. Does this client's conduct fall within that three-limb test? Does this email chain amount to acceptance? The graph knows abstract relationships. It doesn't know this matter.

Worse: the law changes daily. New decisions, new regulations. Keeping the graph fresh becomes its own project. And whoever defines the schema hard-codes their interpretation of legal relationships. But that interpretation is exactly what you want to do dynamically at query time, with access to the full case context.

The only sane positioning: knowledge graph = one tool among many, used by an agent when it's helpful. If the graph is your platform, you've baked reasoning into a static structure that can't adapt fast enough and can't see the file.


Production Systems, Not Advisory Tools

Here's a fundamental confusion in legal AI about what product category we're in.

Most platforms position as advisory: "Ask a question, get an answer, do the rest yourself." But watch what happens when a lawyer gets a good answer. They still have to draft the advice memo, write the client email, prepare the skeleton argument, mark up the contract. The AI gave intelligence. They did 70% of the work.

Wrong product category.

The shift that happened in software: in the 2010s, tools advised developers—linters, docs, Stack Overflow. "Here's what's wrong, you fix it." In the 2020s, tools started executing—Copilot writes the code, Cursor completes the implementation. "Here's the artefact, you approve it."

The productivity gap isn't incremental. It's 10x.

Legal AI is stuck in Phase 1. The system tells you what to write. You write it.

A production system doesn't give you "the answer". It gives you the deliverable. The drafted Grounds of Defence with numbered paragraphs and your firm's formatting. The redlined SaaS agreement with tracked changes. The client advice memo with executive summary and risk assessment.

Simple test: "If the AI does its job perfectly, how much work does the lawyer still have to do?"

  • Advisory system: "Draft the memo, email, contract. 1-2 hours."
  • Production system: "Review, tweak 2-3 paragraphs, approve. 10-15 minutes."

If your answer is closer to the first, you've built a very expensive research interface.


Where Moats Actually Live

If your platform is semantic search over public case law, or "RAG nicely packaged", or a static knowledge graph with a chat wrapper—your moat is thin. Embeddings are a commodity. Vector databases are a commodity. Legal texts are largely public. Prompt templates leak. A capable team replicates you in months.

Advisory tools have weak lock-in: intermittent usage, easy switching, work product lives in Word anyway.

Production systems have structural lock-in: templates customised to the firm, matter files native to the platform, entire drafting workflow runs through it.

Where defensibility actually accumulates:

  • Tool breadth: Multiple search strategies, jurisdiction-aware indexation, robust parsers for contracts and judgments, DMS and email integrations
  • Reasoning policies: How you decompose different question types, when to escalate from quick answer to deep dive, how to handle conflicts between sources
  • Context handling: Matters as first-class objects, persistent context across weeks, multi-jurisdiction and bilingual workflows
  • Document pipeline: Templates matching how lawyers actually draft, outputs ready to file without rewriting

These compound. Every tool makes the agent smarter. Every reasoning improvement makes each tool more valuable. Once a team's matters and processes are embedded, they're not switching for "slightly better semantic search". They'd be ripping out their workflow.


Minimum Bar

If you're building in this space and want to be taken seriously:

  1. Case-first design. Every meaningful interaction anchored to a matter with documents, not a blank chat.
  2. Multi-step agent loop. Planning, tool calls, reflection, verification—not one-shot LLM invocation.
  3. Plural search. BM25, semantic, graph, metadata, regex—all available as tools.
  4. Progressive disclosure. Coarse-to-fine research trajectories baked in, not bolted on via "ask follow-up questions".
  5. Production-native outputs. Word, PDF, Excel as first-class outputs with templates.
  6. Think-time as a resource. Ability to budget deliberation—"fast and rough" vs "slower but deep".

If most of these are missing, the problem isn't that you're "early". You're optimising the wrong thing.


Closing

A lot of legal AI is autocomplete with better marketing. Prompt template + LLM + vector search = "AI research assistant". Demos well. Answers easy questions.

But legal work is messy, adversarial, and heavily contextual. It ends in filed documents and real decisions, not chat transcripts. It requires multi-step reasoning over evolving law and incomplete facts.

Agentic systems—with diverse tools, deep matter context, deliberate reasoning, and real document generation—are the only architecture that matches that reality.

The usage data is in: ~95% of work is case-specific. When lawyers get a complete draft they can refine rather than an answer they have to rewrite, engagement goes up 4-5x. They're not using a research tool. They're using a production system that happens to be intelligent.

The question is whether the people building "legal AI platforms" are paying attention.