How do you stop it hallucinating findings?

The pipeline is retrieval-grounded. The model answers from the documents in front of it and cites the passage. Reviewers check the citation, and we sample and measure recall before anyone relies on it.

Is this defensible if challenged?

Yes, when it is built for it: a documented protocol, source citations on every finding, sampling to validate, and a human reviewer of record. We set it up so the methodology survives scrutiny.

What about privileged material?

Privileged and confidential documents are segregated and access-controlled, and never enter a training path. We deploy on terms with training opt-out and a DPA.

AI Document Review | Eutin Warren

Document review is where AI saves a firm the most hours, and where it is most often deployed badly. The difference between a review pipeline that survives scrutiny and one that quietly introduces risk is not the model. It is whether the output is checkable, the accuracy is measured rather than assumed, and a named human stands behind every call. We build review pipelines for due diligence, disclosure, and contract analysis that a junior can defend in front of a partner, and a partner can defend in front of a court. It is one strand of our Legaltech & AI practice, and it leans on the same system integration and governance discipline as the rest.

Accuracy is not one number

Vendors quote “accuracy” as a single figure. For legal review, the number that matters is the balance of two: recall (of the issues that exist, how many the system caught) and precision (of the issues it flagged, how many were real). The combined measure is the F1 score, the harmonic mean of the two, and best-in-class legal tools target 90% and above. For diligence and disclosure, recall is usually the one that keeps you up at night. A missed change-of-control clause, or a missed privileged document, is the expensive error.

Measure	Question it answers	What a miss costs you
Recall	Of the real issues, how many were caught	A missed clause or privileged doc reaches the other side
Precision	Of the flags raised, how many were real	Reviewer time wasted clearing false positives
F1 score	The balance of the two	The honest headline number, on your documents

So the pipeline is tuned and validated against recall, with sampling to prove the rate before anyone relies on the result. Skipping that validation is the second of the common pitfalls we are brought in to fix.
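To make the three measures concrete, here is a minimal Python sketch of how they fall out of a comparison between pipeline flags and a human-reviewed set. The `ReviewCounts` structure and the numbers are illustrative assumptions, not figures from any matter or the output of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewCounts:
    """Counts from comparing pipeline flags with a human-reviewed set (hypothetical structure)."""
    true_positives: int   # real issues the pipeline flagged
    false_positives: int  # flags that turned out not to be real issues
    false_negatives: int  # real issues the pipeline missed


def recall(c: ReviewCounts) -> float:
    """Of the real issues, the share that was caught."""
    real_issues = c.true_positives + c.false_negatives
    return c.true_positives / real_issues if real_issues else 0.0


def precision(c: ReviewCounts) -> float:
    """Of the flags raised, the share that was real."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def f1(c: ReviewCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative numbers only: 120 real issues, 90 caught, 10 false flags.
counts = ReviewCounts(true_positives=90, false_positives=10, false_negatives=30)
print(f"recall={recall(counts):.2f} precision={precision(counts):.2f} f1={f1(counts):.2f}")
# prints: recall=0.75 precision=0.90 f1=0.82
```

In this made-up case nine flags in ten are real, yet one real issue in four was missed, which is the gap a single headline figure can hide.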

Defensibility: build it so a junior can stand behind every call

An AI review is only useful if its output is trusted, and it is only trusted if it is traceable. We design for four things from the start: every finding cites the source passage so reviewers verify rather than trust, a full audit trail records what the system did, sampling validates recall and precision against a human-reviewed set, and there is a human reviewer of record. Those four are also exactly what makes a methodology hold up if it is challenged by the other side or the court.
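As a rough illustration of the first of those four, the sketch below checks that the passage a finding quotes actually appears in the document it cites. The `Finding` schema, the field names, and the sample clause are assumptions made up for this example. A check like this can only catch citations that do not exist in the source; it does not replace the reviewer reading the passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One flagged issue plus the passage it relies on (hypothetical schema)."""
    document_id: str
    issue_code: str
    quoted_passage: str


def normalise(text: str) -> str:
    # Collapse whitespace and case so line wraps do not cause false mismatches.
    return " ".join(text.split()).lower()


def citation_holds(finding: Finding, documents: dict[str, str]) -> bool:
    """True only if the quoted passage appears in the cited document."""
    source = documents.get(finding.document_id)
    if source is None or not finding.quoted_passage.strip():
        return False
    return normalise(finding.quoted_passage) in normalise(source)


# Invented clause text, for illustration only.
documents = {
    "spa-014": "The Supplier may terminate this Agreement\nupon a Change of Control of the Customer."
}
good = Finding("spa-014", "change_of_control", "terminate this Agreement upon a Change of Control")
bad = Finding("spa-014", "change_of_control", "terminate upon any assignment")
print(citation_holds(good, documents), citation_holds(bad, documents))
# prints: True False
```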

The tool landscape, honestly

The right tool depends on the work, and matters less than how it is set up. We are not a reseller and hold no preferred-vendor relationships, so the recommendation follows the matter, not a commission.

Tool	Best for	Approach	Watch-out
Kira	High-volume M&A extraction	Trained clause models	Setup and training effort
Luminance	Anomalies in unfamiliar sets	Unsupervised pattern analysis	Less prescriptive output
Harvey	Q&A, summary, drafting	Broad legal copilot	Not diligence-specialised
Relativity aiR	Disclosure at eDiscovery scale	AI in the eDiscovery flow	eDiscovery-centric
Claude-based pipeline	Bespoke review protocols	Retrieval-grounded plus governance	You build the pipeline

Privilege and confidentiality in the pipeline

Privileged and confidential material is segregated, access-controlled, and kept out of any training path. We deploy on terms with training opt-out and a data processing agreement, document the data flow, and keep the audit trail in the matter file. Hallucination is contained by grounding the model in the documents in front of it. It answers from the set and cites the passage, and reviewers check the citation. The account-tier and data-path discipline is the same one we apply when we deploy Claude across a firm.
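One way to picture the data-path discipline is as a gate that every document passes before it moves. The sketch below is an assumption-laden illustration: the `Destination` and `Document` classes, their fields, and the `may_send` rule are hypothetical names invented here, and a real deployment would enforce this in access controls and contract terms rather than in one function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    """A system documents could be sent to (hypothetical fields)."""
    name: str
    training_opt_out: bool  # terms exclude submitted data from model training
    dpa_signed: bool        # a data processing agreement is in place


@dataclass(frozen=True)
class Document:
    document_id: str
    restricted: bool                              # privileged or confidential
    allowed_users: frozenset = frozenset()        # who may handle a restricted document


def may_send(doc: Document, dest: Destination, user: str) -> bool:
    """Allow a transfer only if the destination terms and the access list both permit it."""
    if not (dest.training_opt_out and dest.dpa_signed):
        return False
    if doc.restricted and user not in doc.allowed_users:
        return False
    return True


# Invented names, for illustration only.
governed = Destination("review-pipeline", training_opt_out=True, dpa_signed=True)
consumer = Destination("consumer-chat-tool", training_opt_out=False, dpa_signed=False)
memo = Document("memo-7", restricted=True, allowed_users=frozenset({"reviewer-a"}))

print(may_send(memo, governed, "reviewer-a"))  # prints: True
print(may_send(memo, governed, "paralegal-b"))  # prints: False
print(may_send(memo, consumer, "reviewer-a"))  # prints: False
```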

Common pitfalls we are brought in to fix

Trusting one accuracy number. A high headline figure can hide poor recall on the issues that matter. Measure both, on your documents.
No validation step. Without sampling against a human-reviewed set, you cannot state a recall rate, so the result is not defensible.
No reviewer of record. AI is an input, not the signatory. Someone owns the call.
Privilege leakage. Confidential sets get fed to a training-enabled tool. Design the data path before any document moves.
Generic protocol. A template review misses what is specific to the matter. We start from your protocol.

A worked example

On a recent acquisition, a data room of several thousand contracts had to be triaged in days, not weeks. We built a retrieval-grounded pipeline against the deal’s own issue list, validated recall on a partner-reviewed sample, and produced a ranked, issue-coded output where every flag cited its source clause. Senior time went to the flagged minority instead of the whole set. The first pass landed faster, and because every finding was traceable, the team could defend the methodology to the buyer’s counsel without re-reading the room.

The review type changes the build

One pipeline does not fit every review. M&A due diligence is extraction and issue-coding against a known checklist, where recall on a fixed clause set is the whole game. Litigation disclosure is relevance and privilege screening at far larger scale, where defensibility and a sampling protocol matter most. Contract intake and triage is high-volume, repeatable classification where speed and consistency win. We build the pipeline to the review, not the other way round, and we are explicit about which type we are solving for before any document moves.

How we prove the recall rate

A recall number you cannot show is a recall number nobody should trust. We take a statistically meaningful sample, have a human reviewer code it independently, and compare the pipeline’s output against that ground truth. That gives a measured recall and precision on the actual documents, not a vendor benchmark on someone else’s corpus. If the rate is not good enough for the matter, we tune and re-measure before go-live. The sample, the method, and the result go into the matter file, so the methodology is defensible the day it is questioned rather than reconstructed afterwards.
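The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the boolean "issue / no issue" coding, and the choice of a Wilson score interval to express uncertainty are ours for the example, not a description of a fixed protocol. The tiny data set is invented.

```python
import math
import random


def draw_sample(document_ids: list[str], size: int, seed: int) -> list[str]:
    """Simple random sample; the recorded seed makes the draw reproducible for the matter file."""
    rng = random.Random(seed)
    return rng.sample(document_ids, min(size, len(document_ids)))


def measure(human: dict[str, bool], pipeline: dict[str, bool]) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) over the human-coded sample."""
    tp = sum(1 for d, issue in human.items() if issue and pipeline.get(d, False))
    fn = sum(1 for d, issue in human.items() if issue and not pipeline.get(d, False))
    fp = sum(1 for d, issue in human.items() if not issue and pipeline.get(d, False))
    return tp, fp, fn


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Invented coding decisions for five sampled documents.
human = {"d1": True, "d2": True, "d3": False, "d4": True, "d5": False}
pipeline = {"d1": True, "d2": False, "d3": True, "d4": True, "d5": False}

tp, fp, fn = measure(human, pipeline)
low, high = wilson_interval(tp, tp + fn)
print(tp, fp, fn)  # prints: 2 1 1
print(f"recall={tp / (tp + fn):.2f}, interval=({low:.2f}, {high:.2f})")
```

With only three real issues in the sample the interval is very wide, which is the practical reason the sample has to be large enough before a recall rate is worth stating.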

When not to use AI review

Sometimes the honest answer is no. A handful of bespoke, high-stakes documents are faster and safer read by a senior lawyer than wrapped in a pipeline. A set too sensitive to leave the building may not have a deployment that meets the bar in the time available. We say so. The point of the practice is a defensible result, and occasionally the defensible result is human review.

How we set up a review pipeline

We start from your review protocol, not a generic template, and build a repeatable pipeline from data room or disclosure set to a ranked, summarised, issue-coded output. We validate with sampling before go-live and prove the recall rate, and integrate the pipeline with your DMS or data room so reviewers work where they already work, under the governance framework that keeps it defensible. The result is a first pass that focuses senior time on the few percent that actually need it.

What AI Document Review delivers

Review pipeline

Checkable output

Privilege controls

Integration

Representative AI Document Review work

M&A due diligence

Disclosure review

Common AI Document Review questions

Have an AI Document Review project in mind?