When an LLM fabricates a case citation, someone files it in court. Here are the 6 engineering techniques that prevent this from happening.
In 2023, a New York attorney submitted a brief citing six cases that did not exist. ChatGPT had generated them. The attorney was sanctioned, and the incident became the defining cautionary tale for legal AI adoption. But the problem isn't that LLMs were used. The problem is that no engineering safeguards existed between the LLM output and the court filing. This guide covers the 6 techniques that legal AI systems must implement to prevent hallucinations from reaching production.
The Stakes:
Legal hallucinations are not just wrong answers. They are fabricated legal authorities that can result in sanctions, malpractice claims, and client harm. The standard for legal AI is not "mostly right." It is verifiably correct or explicitly flagged as uncertain.
LLMs hallucinate because they are trained to produce plausible text, not verified facts: when no real authority fits, the model generates a citation that looks right, complete with reporter, volume, and page number. Understanding this root cause helps you engineer better defenses:
Technique 1: Ground Every Citation in Retrieval
Never let the LLM cite cases from its parametric memory. Ground every legal reference in a verified retrieval source. This means connecting your RAG pipeline to authoritative databases such as Westlaw, LexisNexis, or your firm's internal case management system.
The prompt must explicitly instruct the model: "Only cite cases that appear in the provided context. If the context does not contain a relevant case, state that no relevant authority was found." This is the single most important safeguard. For enterprise RAG architectures that support this, read about multi-tenant RAG with Pinecone.
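The grounding instruction above can be wired into the prompt itself. A minimal sketch, where the `retrieved_cases` shape and function names are illustrative rather than any specific framework's API:

```python
# The instruction text mirrors the safeguard above; retrieved_cases would
# come from your RAG pipeline (the dict shape here is an assumption).

GROUNDING_INSTRUCTION = (
    "Only cite cases that appear in the provided context. "
    "If the context does not contain a relevant case, state that "
    "no relevant authority was found."
)

def build_grounded_prompt(question: str, retrieved_cases: list[dict]) -> str:
    """Assemble a prompt whose only citable authorities are retrieved ones."""
    if not retrieved_cases:
        context = "No cases were retrieved for this question."
    else:
        context = "\n".join(
            f"- {c['citation']}: {c['holding']}" for c in retrieved_cases
        )
    return f"{GROUNDING_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Because the instruction and the context travel together in every request, a prompt with an empty retrieval set explicitly tells the model that no authority exists, rather than leaving a gap for it to fill from memory.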
Technique 2: Verify Every Citation Post-Generation
After the LLM generates a response, a separate post-processing pipeline extracts every citation and verifies it against a legal database API. This pipeline checks:
Does this case actually exist? Query the case database with the citation. If it returns no results, flag the citation as unverified.
Is this case still good law? Check for subsequent history (overruled, reversed, distinguished). Westlaw's KeyCite and LexisNexis's Shepard's provide this data via API.
Does the case actually stand for the proposition the LLM claims? Pull the headnotes or holding from the database and compare it to the LLM's characterization of the case.
This three-layer verification catches the vast majority of hallucinated citations before they reach a human reviewer.
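The first two checks can be sketched compactly. This is a simplified illustration: the citation regex covers only a few federal reporters, and a dictionary stands in for the legal database API; a production pipeline would use a dedicated citation parser (such as the open-source eyecite library) and the Westlaw or LexisNexis APIs.

```python
import re

# Simplified pattern for a handful of federal reporter citations,
# e.g. "410 U.S. 113" or "123 F.3d 456". Illustrative only.
CITATION_RE = re.compile(r"\b\d+\s+(?:U\.S\.|F\.\d?d|F\. Supp\.(?: \d?d)?)\s+\d+\b")

def verify_citations(llm_output: str, case_db: dict[str, dict]) -> list[dict]:
    """Extract citations and flag any the database cannot confirm."""
    results = []
    for cite in CITATION_RE.findall(llm_output):
        record = case_db.get(cite)  # stand-in for a database API query
        results.append({
            "citation": cite,
            "exists": record is not None,                       # check 1
            "good_law": bool(record and not record.get("overruled")),  # check 2
        })
    return results
```

Check 3 (does the case stand for the claimed proposition?) requires comparing the retrieved holding against the LLM's characterization, typically with a second, separate model call, so it is omitted from this sketch.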
Technique 3: Score Confidence at the Claim Level
Every statement in the LLM output should carry a confidence indicator. Implement this by asking the model to self-assess: "For each legal claim, indicate whether it is directly supported by the provided sources, partially supported, or based on general legal knowledge."
Statements flagged as "general knowledge" (not grounded in retrieved sources) should be highlighted in the UI with a warning badge. This transparency lets the reviewing attorney know exactly which parts need manual verification.
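The self-assessment labels can drive the warning badges mechanically. A sketch, assuming the model's per-claim labels have already been parsed into dicts (the field names are illustrative):

```python
# Labels that require manual attorney verification before use.
NEEDS_REVIEW = {"partially_supported", "general_knowledge"}

def flag_claims(claims: list[dict]) -> list[dict]:
    """Attach a warning badge to any claim not directly source-supported."""
    return [
        {**claim, "warning_badge": claim["support"] in NEEDS_REVIEW}
        for claim in claims
    ]
```

The UI then renders the badge wherever `warning_badge` is true, so the reviewing attorney's attention goes straight to the ungrounded statements.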
Technique 4: Constrain Output to a Legal Ontology
Instead of free-form text generation, constrain the LLM's output to match a legal ontology. For contract analysis, define the output schema: issue, applicable clause, relevant case law (from retrieval only), and risk assessment. The model fills in structured fields rather than writing prose, which dramatically reduces the hallucination surface area.
# Structured output schema for contract review
from typing import Literal, Optional

from pydantic import BaseModel

class ContractIssue(BaseModel):
    clause_reference: str  # e.g., "Section 4.2(b)"
    issue_type: Literal["risk", "ambiguity", "missing_term"]
    description: str
    supporting_authority: Optional[str] = None  # only from retrieval
    confidence: Literal["high", "medium", "low"]
    recommended_action: str
Technique 5: Keep a Human in the Loop
For high-stakes legal work, the AI should draft and the human should approve. Build the workflow so that the AI output is presented as a "draft" with inline citations, confidence indicators, and flagged uncertainties. The attorney reviews, edits, and explicitly approves before any output reaches a client or court.
This is where LangGraph's interrupt/resume capability is valuable. The agent pauses at the review step and resumes only after human approval.
Technique 6: Regression-Test Against a Benchmark
Maintain a benchmark dataset of 200+ legal questions with verified correct answers and citations. Run your system against this benchmark weekly and track hallucination rates over time. Validate every model update, prompt change, or retrieval modification against this benchmark before deployment.
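The weekly run reduces to one metric plus a deployment gate. A sketch, where `run_system` stands in for your full retrieval-plus-generation pipeline and each benchmark item carries its verified citations (the data shapes are assumptions for illustration):

```python
def hallucination_rate(benchmark: list[dict], run_system) -> float:
    """Fraction of benchmark questions where the system cites anything unverified."""
    failures = 0
    for item in benchmark:
        produced = set(run_system(item["question"]))   # citations the system emitted
        verified = set(item["verified_citations"])     # ground-truth authorities
        if produced - verified:                        # any fabricated/unverified cite
            failures += 1
    return failures / len(benchmark)

def gate(rate: float, threshold: float = 0.0) -> bool:
    """Block deployment on any regression; for citations the threshold is zero."""
    return rate <= threshold
```

Wiring `gate` into CI means a prompt tweak that reintroduces hallucinated citations fails the build rather than reaching production.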
For evaluation methodology that goes beyond subjective assessment, see our guide on evaluating LLMs properly.
Legal AI systems must also satisfy regulatory requirements. The ABA's formal opinions on AI use require attorneys to understand the technology's limitations and maintain supervisory responsibility. Your system should support this with compliance-ready audit trails: what was retrieved, what the model generated, which citations were verified, and which attorney approved the final output.
For healthcare-specific compliance requirements, see our companion piece on agentic workflows and HIPAA compliance.
Can LLMs be trusted for legal work?
Yes, with engineering safeguards. An LLM with RAG grounding, citation verification, and human review is a powerful accelerator. The key is using it as a research assistant, not an autonomous legal authority. The attorney remains responsible.
What hallucination rate is acceptable?
Zero for citations. Any fabricated case citation is unacceptable. For legal analysis and reasoning, the standard is that every claim must be traceable to a source. If the system cannot ground a claim, it must say so.
How do you protect client confidentiality?
Use self-hosted models or enterprise API tiers with data processing agreements (DPAs) that guarantee your data is not used for training. For multi-client deployments, enforce strict data isolation with multi-tenant RAG using per-client namespaces.
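Namespace scoping can be enforced at the retrieval layer so the tenant boundary never depends on user input. A sketch (the store interface here is illustrative; Pinecone, for example, accepts a namespace on queries):

```python
def query_client_namespace(store, client_id: str,
                           embedding: list[float], top_k: int = 5):
    """Scope every retrieval to the authenticated client's namespace."""
    # Derived server-side from the auth context, never from the request body,
    # so one client's query can never touch another client's documents.
    namespace = f"client-{client_id}"
    return store.query(vector=embedding, top_k=top_k, namespace=namespace)
```

The design choice worth keeping: the namespace string is computed from the authenticated identity, so cross-tenant leakage would require an auth failure, not merely a crafted prompt.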
We build legal AI systems with citation verification, grounding pipelines, and compliance-ready audit trails.
Discuss Your Legal AI Project