Layout-aware OCR that reads historical broadsheets as they were typeset — column by column, ad by ad — and makes every article semantically searchable.
Historical newspapers were typeset in five- to seven-column layouts, with rotated advertisements along page edges, dense classified sections, mastheads, legal notices, and tables all competing for the same physical space. Standard OCR pipelines read each page left to right as a single text block and produce scrambled, unreadable output, merging columns mid-sentence and dropping rotated content entirely. Layout-aware extraction with proper column segmentation has historically required expensive proprietary platforms or custom engineering that most archives can't afford.
Researchers cite passages from digitized newspapers in peer-reviewed academic work, legal filings, and government reports. Scrambled reading order isn't just inconvenient — it makes the extracted text unusable for semantic search and produces nonsensical context in RAG chat. Self-healing verification catches the column-detection failures that single-pass OCR produces, ensuring the archive you build is trustworthy enough to publish.
3-segment OCR with layout-aware column detection — left, center, and right segments processed independently then merged with cross-deduplication
Auto-classifies each item as article, classified advertisement, display advertisement, legal notice, public announcement, masthead, or business directory
Cross-language semantic search — English-language queries find Spanish, French, or Portuguese source documents
ALTO/XML v4 export for IIIF and DPLA interoperability with existing library infrastructure
Searchable PDF export — original scan image with invisible OCR text layer for standard PDF readers
Self-healing verification (patent pending) automatically re-processes pages where extraction gaps are detected
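The three-segment merge with cross-deduplication can be pictured roughly as follows. This is a minimal sketch, not ArchiveLM's implementation: the function names, the 0.9 similarity threshold, and the sample lines are all assumptions made for illustration, and fuzzy string matching from Python's standard library stands in for whatever comparison the platform actually uses.

```python
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so near-identical lines compare closely."""
    return " ".join(line.lower().split())


def merge_segments(segments: list[list[str]], threshold: float = 0.9) -> list[str]:
    """Merge per-segment OCR lines in order, skipping near-duplicates.

    `segments` is ordered left, center, right; each entry holds the text
    lines extracted from that segment. A line is dropped when it is at
    least `threshold` similar to a line already kept (hypothetical rule).
    """
    merged: list[str] = []
    seen: list[str] = []
    for segment in segments:
        for line in segment:
            key = normalize(line)
            if not key:
                continue  # ignore blank lines
            if any(SequenceMatcher(None, key, s).ratio() >= threshold for s in seen):
                continue  # same text already captured by an overlapping segment
            merged.append(line.strip())
            seen.append(key)
    return merged


# Made-up sample lines; adjacent segments overlap and re-read the same text.
left = ["El vapor llegó ayer al puerto.", "Se vende una casa en la calle Palma."]
center = ["Se vende una casa en la calle Palma.", "Aviso judicial: remate público."]
right = ["Aviso judicial: remate publico.", "Farmacia abierta toda la noche."]

merged = merge_segments([left, center, right])
```

Fuzzy matching matters here because two passes over the same overlap region rarely produce byte-identical text; in the sample, the right segment misses an accent that the center segment captured, and the earlier reading is the one kept.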
85,000-page Paraguayan broadsheet collection spanning 1870-1930, processed at ~$0.15-0.20 per page
University student-newspaper retrospective covering 60 years of campus issues for a centennial anniversary project
State library newspaper morgue digitization program serving public researchers and genealogists
Local historical society scrapbook and clipping archive made searchable for the first time
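For budgeting, the per-page rate quoted for the 85,000-page collection above translates into a total processing cost as sketched here. The helper name is hypothetical, and the calculation assumes the quoted rate applies uniformly to every page.

```python
def cost_range_dollars(pages: int, low_cents: int, high_cents: int) -> tuple[int, int]:
    """Return (low, high) total cost in whole dollars.

    Rates are passed as integer cents per page to avoid floating-point drift.
    """
    return pages * low_cents // 100, pages * high_cents // 100


low, high = cost_range_dollars(85_000, 15, 20)
print(f"${low:,} to ${high:,}")  # $12,750 to $17,000
```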
Most newspaper digitization programs fit the Professional tier ($149/month, 300 pages) or Institution tier ($499/month, unlimited pages) depending on batch size.
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Request Beta Access
Approved accounts receive hands-on onboarding support to validate results on your own documents.
Yes. The platform was built on Spanish-language Latin American newspapers from the 19th century. OCR works across languages, and cross-language semantic search means English queries return Spanish, French, or Portuguese articles by meaning — not just keyword match.
The layout detection stage identifies rotated content along page edges and assigns it to dedicated extraction segments. The OCR model is prompted to extract rotated and sideways text explicitly. Display ads are classified separately from editorial content so they don't pollute article search results.
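The effect of classifying display ads separately can be illustrated with a type filter applied before matching. This is only a sketch: the `Item` class, the label spellings, and the sample texts are assumptions, and plain substring matching stands in for the platform's semantic search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_type: str  # e.g. "article" or "display_advertisement" (assumed labels)
    text: str


def search_items(items, query, allowed_types=frozenset({"article"})):
    """Return items of an allowed type whose text contains the query.

    Substring matching is a stand-in for semantic search; the point is
    that the type filter runs before any matching happens.
    """
    needle = query.lower()
    return [
        item
        for item in items
        if item.item_type in allowed_types and needle in item.text.lower()
    ]


# Made-up sample items.
items = [
    Item("article", "The river steamer arrived with mail from Buenos Aires."),
    Item("display_advertisement", "Steamer tickets sold here at the lowest prices."),
    Item("legal_notice", "Notice of public auction of the steamer Esperanza."),
]

article_hits = search_items(items, "steamer")
wider_hits = search_items(items, "steamer", frozenset({"article", "legal_notice"}))
```

With the default filter only the article comes back, even though the advertisement also mentions the query term; widening `allowed_types` brings other item types in deliberately.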
ALTO/XML v4 (standard for newspaper digitization, compatible with IIIF viewers and DPLA), searchable PDF (scan + OCR layer), JSON (full structured data), CSV (tabular article index), and Markdown. ALTO export is included on Professional and Institution tiers.
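For readers unfamiliar with ALTO, the sketch below builds a minimal ALTO v4 document with Python's standard library. It shows the general shape of the format (Layout, Page, PrintSpace, TextBlock, TextLine, String with pixel coordinates) and is not ArchiveLM's actual export; the function name, IDs, and sample values are hypothetical, and the output has not been validated against the ALTO schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def build_alto(lines, page_width, page_height):
    """Build a minimal ALTO v4 tree: one TextBlock, one TextLine per input line.

    `lines` is a list of (text, hpos, vpos, width, height) tuples in pixels.
    Each line is written as a single String element for brevity; real
    exports usually emit one String per word.
    """
    ET.register_namespace("", ALTO_NS)  # serialize ALTO as the default namespace

    def tag(name):
        return f"{{{ALTO_NS}}}{name}"

    alto = ET.Element(tag("alto"))
    description = ET.SubElement(alto, tag("Description"))
    ET.SubElement(description, tag("MeasurementUnit")).text = "pixel"
    layout = ET.SubElement(alto, tag("Layout"))
    page = ET.SubElement(layout, tag("Page"), ID="P1", PHYSICAL_IMG_NR="1",
                         WIDTH=str(page_width), HEIGHT=str(page_height))
    space = ET.SubElement(page, tag("PrintSpace"))
    block = ET.SubElement(space, tag("TextBlock"), ID="TB1")
    for i, (text, hpos, vpos, width, height) in enumerate(lines, start=1):
        box = {"HPOS": str(hpos), "VPOS": str(vpos),
               "WIDTH": str(width), "HEIGHT": str(height)}
        text_line = ET.SubElement(block, tag("TextLine"), ID=f"TL{i}", **box)
        ET.SubElement(text_line, tag("String"), CONTENT=text, **box)
    return alto


# Made-up masthead line on a 4000 x 6000 pixel scan.
root = build_alto([("LA NACION", 120, 80, 900, 140)], 4000, 6000)
xml_text = ET.tostring(root, encoding="unicode")
```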
Accuracy depends on scan quality and print condition. On clean 300 DPI scans of 19th-century newspapers, ArchiveLM's Gemini 2.5 Pro pipeline consistently captures 90%+ of readable text. The built-in image preprocessing API can apply grayscale conversion, denoising, and deskewing before OCR to improve results on difficult scans.
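To make two of the preprocessing steps concrete, here is a dependency-free sketch of grayscale conversion and a simple 3x3 median denoise on a tiny in-memory image. It does not call ArchiveLM's preprocessing API, whose interface is not described here; the function names are hypothetical, the median filter is one common denoising choice rather than the platform's confirmed method, and deskewing is left out for brevity. A production pipeline would normally use an imaging library instead of nested lists.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values.

    Uses the common 0.299/0.587/0.114 weights in integer arithmetic,
    rounding to the nearest whole value.
    """
    return [
        [(299 * r + 587 * g + 114 * b + 500) // 1000 for (r, g, b) in row]
        for row in rgb_rows
    ]


def median_denoise(gray_rows):
    """Replace each pixel with the median of its 3x3 neighborhood.

    Neighborhoods are clipped at the image border, so edge and corner
    pixels use fewer neighbors.
    """
    height, width = len(gray_rows), len(gray_rows[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(
                gray_rows[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            row.append(window[len(window) // 2])
        out.append(row)
    return out


# A 3x3 light-gray patch with one dark speck in the middle (made-up data).
page = [[(200, 200, 200)] * 3 for _ in range(3)]
page[1][1] = (0, 0, 0)

gray = to_grayscale(page)
clean = median_denoise(gray)
```

The isolated speck disappears after the median pass because it is outvoted by its neighbors, which is the behavior that helps OCR on foxed or dusty scans.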
Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
OCR and semantic search platform built on Spanish-language Latin American primary sources — colonial-era typography, 19th-century broadsheets, and archaic orthography handled natively.
OCR pipeline optimized for the linear, single-column structure of historical books and manuscripts — from 16th-century printed books to 19th-century institutional records — with semantic search over the full text.