Structured AI extraction for historical legal records — court proceedings, land grants, edictos judiciales, probate records, and registry documents — searchable by case, party, date, and concept.
Historical legal records combine the worst of both worlds for OCR: dense, formulaic repeated text (legal notices share boilerplate phrases across dozens of entries on a single page) interleaved with unique case-specific names, dates, and amounts that must be captured precisely. Early attempts at hallucination detection in AI pipelines often falsely flag repeated legal formulae as hallucination and delete them — a failure mode that destroys legally significant content. Separately, many historical legal records appear as embedded sections within newspapers (edictos judiciales, public land notices) rather than as standalone documents, requiring content-type classification to extract them correctly.
Historical legal records are primary evidence for property law disputes, inheritance claims, indigenous land rights cases, and constitutional history research. Errors in transcription aren't just academic inconveniences — they can affect ongoing litigation referencing historical precedent. Formulaic repetition in edictos judiciales, foreclosure notices, and land registry records is legally meaningful: each notice is a distinct legal act, even if the surrounding boilerplate is identical.
Hallucination detection tuned specifically for legal documents — distinguishes genuine formulaic repetition (legal notices sharing boilerplate) from AI-generated hallucination using content-ratio analysis rather than naïve deduplication
Content classification identifies and separately stores legal notices, public announcements, and classified legal ads from editorial articles on the same page
Semantic search across case names, party names, property descriptions, and legal concepts — finds relevant records without knowing exact historical spelling
Entity extraction (Research Lab) identifies and maps parties, dates, locations, and legal terms across an entire archive of records
Export as structured JSON preserving case metadata, parties, dates, and legal text as separate fields for integration with legal research databases
ALTO/XML export for preservation-grade archiving compatible with national archive and court record repository standards
State bar association digitizing 100 years of its proceedings and disciplinary records for historical and precedent research
Land registry preservation project extracting 19th-century property transfer records, organized by location and date for historical GIS mapping
University law library making its collection of 18th-century colonial court records searchable for legal historians and indigenous rights researchers
Newspaper-embedded legal notice extraction — separating edictos judiciales from editorial content across 50 years of a regional newspaper
Law school libraries and court archive programs typically fit the Professional ($149/month) or Institution ($499/month) tier; individual legal historians generally start on the Researcher tier ($79/month).
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Request Beta Access
Approved accounts receive hands-on onboarding support to validate results on your own documents.
Early hallucination detection algorithms naïvely flagged any text with repeated phrases, destroying pages of legitimate legal notices that share formulaic language. ArchiveLM's pipeline instead uses a unique-text ratio, the share of the page text that is not repeated. If the ratio exceeds 0.40 (more than 40% of the page text is unique), repetition detection is bypassed. Real hallucination produces ratios of 0.15–0.35; legitimate legal pages with formulaic notices score 0.45–0.75. This distinction was discovered through production testing and is documented in the platform's extraction quality reports.
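To make the threshold concrete, here is a minimal sketch of a unique-text ratio check. It is an illustration under stated assumptions, not ArchiveLM's actual implementation: the functions `unique_text_ratio` and `should_bypass_repetition_check` are hypothetical names, and measuring the ratio as distinct normalized lines divided by total lines is an assumption. Only the 0.40 threshold comes from the description above.

```python
import re

# From the description above: a ratio above 0.40 bypasses repetition detection.
UNIQUE_RATIO_THRESHOLD = 0.40


def unique_text_ratio(page_text: str) -> float:
    """Fraction of non-empty lines that are distinct after normalization.

    Assumption: lines are the unit of comparison. A real pipeline might
    compare sentences or n-grams instead.
    """
    # Collapse whitespace and lowercase so near-identical copies count as repeats.
    lines = [re.sub(r"\s+", " ", ln).strip().lower() for ln in page_text.splitlines()]
    lines = [ln for ln in lines if ln]
    if not lines:
        return 1.0  # an empty page has nothing repeated
    return len(set(lines)) / len(lines)


def should_bypass_repetition_check(page_text: str) -> bool:
    """True when enough of the page is unique to trust the repetition as genuine."""
    return unique_text_ratio(page_text) > UNIQUE_RATIO_THRESHOLD
```

Under this sketch, a page where one line loops ten times scores 0.10 and stays subject to repetition detection, while a page of five notices that each pair one shared boilerplate line with one case-specific line scores 0.60 and is left intact.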
Yes. The Research Lab's Entity Map tool extracts named entities (people, organizations, locations, dates) across your selected corpus. For systematic extraction of specific fields (case number, plaintiff, defendant, date), you can post-process the JSON export using the extracted article structure. Custom structured extraction for specific legal record types is available under Enterprise agreements.
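As a rough illustration of that post-processing step, the sketch below pulls case fields out of an export with a regular expression. Everything about the export's shape is an assumption made for the example: the `articles` list, the `type` and `text` keys, the `legal_notice` label, and the English-style caption format are hypothetical, so adapt them to the structure of your actual export.

```python
import json
import re
from typing import Dict, List

# Hypothetical caption format, e.g. "Case No. 1887-42: Garcia v. Lopez".
CAPTION = re.compile(
    r"Case No\.\s*(?P<case_number>[\w-]+):\s*"
    r"(?P<plaintiff>.+?)\s+v\.\s+(?P<defendant>[^.,;]+)"
)


def extract_case_fields(export_json: str) -> List[Dict[str, str]]:
    """Return one dict of case fields per legal notice whose caption matches."""
    data = json.loads(export_json)
    rows = []
    for article in data.get("articles", []):  # assumed top-level key
        if article.get("type") != "legal_notice":  # assumed content-type label
            continue
        match = CAPTION.search(article.get("text", ""))
        if match:
            rows.append({key: value.strip() for key, value in match.groupdict().items()})
    return rows
```

Notices whose captions do not match the pattern are skipped rather than guessed at, which keeps the output conservative for records that may be cited as evidence.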
Yes. All uploaded documents are stored in your private account and are never shared with other users. Row-level security ensures complete data isolation. For institutional programs handling sensitive historical records, the Institution tier includes a private deployment option and data processing agreement.
Layout-aware OCR that reads historical broadsheets as they were typeset — column by column, ad by ad — and makes every article semantically searchable.
Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
OCR and semantic search platform built on Spanish-language Latin American primary sources — colonial-era typography, 19th-century broadsheets, and archaic orthography handled natively.