
On Thinking Slowly: Why We Built Andri to Deliberate
January 5, 2026
I remember watching an experienced barrister work through a genuinely hard problem. This was years ago, before I was anywhere near AI.
What struck me wasn't the brilliance, though she was brilliant. It was the willingness to be wrong. She went down one line of argument for nearly an hour, pulling cases, building a structure, and then just... stopped. Said it wouldn't hold. Started again from somewhere else entirely.
Most people, when they're good at something, want to look good at it. She didn't seem to care about that. She cared about getting it right, and getting it right meant being willing to throw away an hour's work when it wasn't leading anywhere.
I think about that a lot now.
The thing about AI
Here's how most AI works: you ask a question, it generates an answer. One pass through the model. Done.
And look, this is often fine. For summarising documents, translating text, answering factual questions, it works surprisingly well. The model has seen millions of examples, it pattern-matches to something reasonable, you get your answer.
The problem comes when you need actual reasoning. When the first plausible-sounding answer might be wrong. When you need to check your work, or try a different approach, or recognise that you're confused.
The model can't do any of that. It doesn't know whether its answer is good. It just produces output with the same confident fluency regardless of whether the underlying reasoning holds up.
This bothers me for legal work specifically. Law is a domain where confidence and correctness are poorly correlated. I've seen junior lawyers write memos that sound completely convincing but miss something crucial. Seniors read the same facts and immediately spot three problems. Not because they're smarter exactly, but because they've learned to be suspicious of easy answers.
Can you build a system that's learned to be suspicious of its own easy answers? That's the question we kept coming back to.
Some things we've read
I should be honest: the science here is young. A lot of what seems promising now might turn out to be wrong. But there are findings that have shaped how we think about this.
There's a body of research on what happens when you ask language models to plan over multiple steps. Not just answer a question, but navigate towards a goal through a sequence of decisions. The consistent finding is that single-pass generation falls apart. The model reasons about individual steps fine, but loses coherence over the whole thing. Gets stuck in loops. Forgets constraints it established earlier.
What helps is giving the model a process for thinking more carefully. Let it take a step, observe what happened, revise its understanding, think again. Researchers call this "ReAct", short for Reasoning and Acting: the model reasons about what to do next, acts, observes the result, and reasons again. Success rates jump dramatically on problems that single-pass approaches couldn't solve.
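To make that concrete, here's a toy version of the loop in Python. It's a sketch under assumptions, not our implementation: call_model stands in for whatever language model client you use, tools is just a dictionary of functions the model is allowed to call, and the reply format is invented for the example.

```python
# Minimal ReAct-style loop: reason, act, observe, repeat.
# `call_model` and `tools` are placeholders supplied by the caller.

def react_loop(question, call_model, tools, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next move, given everything that has happened so far.
        step = call_model(
            transcript
            + "\nThink step by step, then reply with either\n"
              "ACTION: <tool> <input>\n"
              "or\n"
              "ANSWER: <final answer>\n"
        )
        transcript += step + "\n"
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        if step.startswith("ACTION:"):
            tool_name, _, tool_input = step[len("ACTION:"):].strip().partition(" ")
            # Run the tool and feed the observation back in before the next thought.
            observation = tools[tool_name](tool_input)
            transcript += f"OBSERVATION: {observation}\n"
    return None  # gave up inside the step budget
```

The shape is the point: the model's output isn't the answer, it's the next move, and whatever came back from the last move is part of what it thinks about next.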
There's related work on what's called "Tree of Thought", the idea that when a problem has multiple possible approaches and you don't know which will work, you should explore systematically. Try path A, see where it goes. Hit a wall, back up, try path B. Prune dead ends.
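In code, the same idea looks roughly like a beam search over partial lines of reasoning. Again a toy sketch under assumptions: propose and score are model-backed helpers that exist only for this example (one suggests a few candidate next steps, the other rates how promising a partial path looks), and is_solution decides whether a path is finished.

```python
# Toy Tree-of-Thought-style search: branch, evaluate, prune, repeat.
# `propose`, `score`, and `is_solution` are placeholder callables for the example.

def tree_of_thought(problem, propose, score, is_solution, beam_width=3, max_depth=4):
    frontier = [problem]  # partial lines of reasoning we haven't given up on
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            # Branch: try several continuations of this line of reasoning.
            candidates.extend(propose(state, beam_width))
        finished = [c for c in candidates if is_solution(c)]
        if finished:
            return max(finished, key=score)
        # Prune: keep only the most promising partial paths, abandon the dead ends.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score) if frontier else None
```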
This is just... how people solve hard problems? It's how that barrister worked. It's how scientists do research. The insight from the AI literature is that it doesn't emerge automatically from training. You have to build it in.
None of this is settled. These are controlled experiments on narrow tasks. We don't really know how well it generalises. We're making a bet.
Why this matters for law
Law has properties that make this stuff particularly relevant.
Legal problems branch. You're not following one thread. You're considering breach, limitation, causation, quantum, procedure, all at once, and they interact in complicated ways. The right answer often depends on exploring several paths before you can see which is strongest.
Legal reasoning is adversarial. Whatever you argue, someone's going to attack it. If your AI thinks once and produces an answer, that answer might have weaknesses a human opponent will spot immediately. The point of deliberation isn't just finding the right answer. It's stress-testing your answer before someone else does.
Legal work involves long tasks. Drafting a skeleton argument isn't one question. It's understanding the claim, identifying issues, researching each one, structuring a response, drafting, checking consistency, verifying citations. Each step feeds into the next. Errors compound.
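A crude back-of-the-envelope makes the compounding point, with a made-up number (the 95% is an assumption for illustration, not a measurement):

```python
# If each stage of a seven-step task comes out right 95% of the time,
# the whole chain comes out right only about 70% of the time.
steps = ["understand the claim", "identify issues", "research each one",
         "structure a response", "draft", "check consistency", "verify citations"]
per_step_reliability = 0.95
print(f"{per_step_reliability ** len(steps):.0%}")  # ~70%
```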
We've built systems that do some of this. Systems that plan research, evaluate what they find, decide whether to dig deeper. Systems that simulate opposing perspectives. Systems that handle long documents without losing the thread.
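To give a flavour of the "simulate opposing perspectives" idea, here's a deliberately simplified sketch. It isn't Andri's architecture or its prompts; call_model is again a stand-in for a model client, and the wording of the prompts is invented for the example.

```python
# Draft, attack, revise: a simplified adversarial self-critique loop.
# `call_model` is a placeholder for an LLM client; prompts are illustrative only.

def draft_with_opposition(task, call_model, rounds=2):
    draft = call_model(f"Draft an argument for the following task:\n{task}")
    for _ in range(rounds):
        # Have the model argue against its own draft, as opposing counsel would.
        objections = call_model(
            "You are opposing counsel. Identify the weakest points in this argument:\n"
            + draft
        )
        # Revise to address the objections before a human ever sees the draft.
        draft = call_model(
            "Revise the argument to address these objections.\n"
            f"Argument:\n{draft}\n\nObjections:\n{objections}"
        )
    return draft
```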
But they still fail in ways that surprise us. That's the honest truth. The science of reliable reasoning is hard.
What we're actually trying to do
There's a temptation when building AI products to optimise for demos. A system that thinks once and produces a fluent answer looks impressive. Feels like magic.
A system that deliberates, that sometimes backtracks, that sometimes says "hang on, let me look at this differently", is less dramatic. But it's more honest about how hard problems get solved.
I think this matters beyond product quality. It matters for what kind of relationship people can have with AI.
If systems always sound confident, users either stop trusting them entirely (and verify everything themselves, which defeats the point) or they trust when they shouldn't, and mistakes get through. Neither is good.
If systems show their reasoning, flag uncertainty, check their work, users can engage differently. They can see where the system is sure and where it's guessing. They can step in at decision points.
That's what we're trying to build. Not an oracle. A tool that reasons in a way you can follow and push back on.
I genuinely don't know if we'll get there. The technical challenges are real. We're making bets that might be wrong.
But it seems like the right thing to try.
Anyway
I want to be careful not to overclaim. The research we're drawing on is promising but not definitive. Our systems are better than simpler approaches on many tasks but can still fail. What works in controlled experiments doesn't always translate.
What I can say is we made a deliberate choice to build systems that think more than once. We did it because legal work (branching, adversarial, long-horizon) seems to demand it. And because we think trust between lawyers and AI depends on the AI being honest about its process, not just fluent in its output.
Whether any of this is right, time will tell. The people using Andri every day are the real test.
If it helps them do better work, if it catches problems before they matter, if it earns trust through reliability rather than demanding it through confidence, then it worked.
If not, we'll learn something and try differently. That's also what deliberation looks like, I suppose.
Papers that shaped our thinking: Zhou et al. on spatial-temporal reasoning, Long on Tree-of-Thought, and Yao et al. on Tree of Thoughts. For accessible explanations of these techniques, the Prompting Guide is excellent. The translation from academic findings to production systems involves a lot of choices that are ours, not theirs.