OCR pipeline optimized for the linear, single-column structure of historical books and manuscripts — from 16th-century printed books to 19th-century institutional records — with semantic search over the full text.
Historical books and manuscripts present a different OCR challenge from newspapers: the layouts are simpler, but the degradation is often more severe. Foxing, water damage, bleeding ink, gutter shadow in bound volumes, inconsistent marginalia, interleaved manuscript corrections, and early-modern typefaces and scripts (blackletter type, secretary hand) all defeat tools trained on clean contemporary documents. Manually scanning page by page into a generic OCR tool produces text that needs extensive correction before it is usable, and that correction often consumes more archivist time than the scanning itself.
Special collections holdings are frequently the only surviving copies of their content. A failed or low-quality digitization does more than inconvenience researchers: it can leave a primary source effectively inaccessible to anyone outside the physical reading room. High-quality OCR with semantic search transforms a closed-stacks collection into an open research resource, extending the institution's mission and justifying digitization investment to funders.
Document pipeline with full-page single-pass OCR — optimized for linear book and manuscript layouts without newspaper-style column segmentation overhead
Image preprocessing API applies grayscale conversion, denoising, and deskew before OCR to improve results on foxed, stained, or warped pages
Quality check stage detects under-captured pages (fewer than 1,500 characters on a text-dense scan) and routes them to a higher-accuracy fallback model
Text-based PDF detection: digitally created PDFs from post-1980s publications skip OCR entirely; their embedded text is extracted directly at zero cost, with no recognition errors
Research Lab enables cross-volume analysis: generate summaries, find recurring themes, map named entities, and build data tables across an entire monograph collection
Searchable PDF export produces a layered file with original scan image + invisible text, preserving visual fidelity for scholarly citation
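To make the preprocessing step in the feature list more concrete, here is a minimal sketch of grayscale conversion followed by denoising, written in plain Python on nested lists. It is an illustration only, not ArchiveLM's implementation: the function names, the luma weights (the common ITU-R BT.601 values), and the choice of a 3x3 median filter are all assumptions, and the deskew pass is left out.

```python
from statistics import median_low


def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values.

    Uses ITU-R BT.601 luma weights (an assumption; the product's
    actual conversion is not documented here).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def median_denoise(gray_rows):
    """Apply a 3x3 median filter to remove isolated specks.

    Pixels on the border use whichever neighbours exist.
    """
    height = len(gray_rows)
    width = len(gray_rows[0]) if height else 0
    out = []
    for y in range(height):
        new_row = []
        for x in range(width):
            window = [
                gray_rows[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value present in the window
            new_row.append(median_low(window))
        out.append(new_row)
    return out
```

A median filter is a reasonable stand-in for this kind of sketch because it removes small dark specks, such as light foxing, while keeping letter edges sharper than a blur would. A production pipeline would more likely rely on an image library than on nested lists.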
51-page pilot digitization of three rare 19th-century volumes for a partner researcher, completed in under an hour
Monastic library digitizing its complete holdings of 18th-century theological manuscripts for a Vatican-funded preservation project
University special collections processing a 200-volume institutional history collection for a centennial digital exhibition
Antiquarian estate archive — 400 years of bound family records, ledgers, and correspondence scanned and made searchable for genealogical research
Individual researchers and small collections typically start on the Researcher tier ($79/month, 100 pages); university special collections departments generally fit the Professional ($149/month) or Institution ($499/month) tier, depending on digitization volume.
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Approved accounts receive hands-on onboarding support to validate results on your own documents.
ArchiveLM's current OCR pipeline performs best on typeset historical documents (printed books, broadsides, newspapers). Handwritten manuscripts in secretary hand or other pre-modern scripts are significantly more difficult; for cursive handwriting, Transkribus has the advantage of specialized trained models. For typed or printed historical documents, ArchiveLM's pipeline is well suited.
The image preprocessing API includes a deskew pass that can reduce the effect of page curvature at the spine. For severe gutter shadow, we recommend re-scanning with a book scanner that uses glass platen flattening. ArchiveLM's quality check stage will flag pages with low character yield so you know which pages may need rescanning.
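As a rough illustration of how low character yield can be turned into a rescan list, the sketch below applies the 1,500-character threshold from the feature list to per-page character counts. The function name, the input shapes, and the idea that the caller supplies the set of text-dense pages are all hypothetical; this is not ArchiveLM's API.

```python
MIN_CHARS_TEXT_DENSE = 1500  # threshold stated in the feature list


def pages_to_rescan(char_counts, text_dense_pages, min_chars=MIN_CHARS_TEXT_DENSE):
    """Return sorted page numbers whose OCR character yield is too low.

    char_counts: dict mapping page number -> characters recognised by OCR.
    text_dense_pages: set of page numbers judged to be text-dense
        (how that judgement is made is outside this sketch).
    Only text-dense pages are checked, so title pages and plates
    are not flagged just for being short.
    """
    return sorted(
        page
        for page, count in char_counts.items()
        if page in text_dense_pages and count < min_chars
    )


# Example: page 1 is a title page, page 3 falls just under the threshold.
counts = {1: 320, 2: 2100, 3: 1499, 4: 1500}
dense = {2, 3, 4}
flagged = pages_to_rescan(counts, dense)
```

In this example only page 3 is flagged: page 1 is short but not text-dense, and page 4 sits exactly at the threshold rather than below it.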
Yes. All uploaded documents share a single searchable corpus within your account. Semantic search queries all volumes simultaneously. The Research Lab allows you to select specific volumes or all volumes as the source for research tools like summaries, timelines, and entity maps.
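The difference between searching all volumes and searching a selected subset can be sketched as follows. This assumes, purely for illustration, that each passage has already been turned into a numeric vector and that results are ranked by cosine similarity; the data layout, function names, and ranking method are hypothetical and are not taken from ArchiveLM.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, passages, volumes=None, top_k=3):
    """Rank passages by similarity to the query vector.

    passages: list of dicts with 'volume', 'page', and 'vector' keys.
    volumes: optional set of volume names to restrict the search;
        None means every volume in the corpus is searched.
    Returns up to top_k (volume, page) pairs, best match first.
    """
    candidates = [
        p for p in passages if volumes is None or p["volume"] in volumes
    ]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(query_vec, p["vector"]),
        reverse=True,
    )
    return [(p["volume"], p["page"]) for p in ranked[:top_k]]


# Toy corpus: two volumes, two-dimensional vectors chosen by hand.
corpus = [
    {"volume": "A", "page": 1, "vector": [1.0, 0.0]},
    {"volume": "B", "page": 7, "vector": [0.9, 0.1]},
    {"volume": "B", "page": 8, "vector": [0.0, 1.0]},
]
all_hits = search([1.0, 0.0], corpus, top_k=2)
only_b = search([1.0, 0.0], corpus, volumes={"B"}, top_k=1)
```

Passing `volumes=None` corresponds to querying the whole account corpus at once, while passing a set of volume names corresponds to choosing specific volumes as the source.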
Layout-aware OCR that reads historical broadsheets as they were typeset — column by column, ad by ad — and makes every article semantically searchable.
Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
OCR and semantic search platform built on Spanish-language Latin American primary sources — colonial-era typography, 19th-century broadsheets, and archaic orthography handled natively.