OCR and semantic search platform built on Spanish-language Latin American primary sources — colonial-era typography, 19th-century broadsheets, and archaic orthography handled natively.
19th-century Spanish-language documents use orthographic conventions and typefaces that predate, and are largely absent from, modern OCR training corpora. Ligatures, archaic letter forms (long-s, ç variants), inconsistent accent marks, period-correct abbreviations, and colonial-era legal formulae all confuse models trained on contemporary text. Most commercial OCR platforms are optimized for English and Western European languages. Spanish-language historical documents are an afterthought, and the resulting error rates make the extracted text unusable for serious research.
Latin American colonial and 19th-century primary sources are the evidentiary foundation for entire subfields of history: labor history, independence movements, Indigenous rights, land tenure, religious history. When OCR fails on these documents, research stops or reverts to expensive manual transcription. Cross-language semantic search is particularly critical: researchers writing in English need to find Spanish sources by concept, not keyword. A platform that can't bridge that language gap serves only researchers who already read and search in Spanish.
Gemini 2.5 Pro OCR trained on multilingual corpora handles archaic Spanish orthography, period ligatures, and colonial legal formulae with high fidelity
Cross-language semantic search — English queries conceptually match Spanish source documents using shared multilingual embedding space
AI Librarian (RAG chat) answers research questions in English and returns cited passages in Spanish with context
Built-in translation — translate button on any article converts Spanish extraction to English for non-specialist users
AI enrichments add historical context (in English or Spanish) to each extracted article, grounded only in actual source text
Self-healing verification compares extracted text against the source image to catch OCR failures on difficult typography
Paraguayan government newspaper archive from 1870-1930 — 85,000 pages of Spanish-language broadsheets processed for a historical society
Colonial-era legal record digitization for a Latin American national archive, covering land grants and ecclesiastical records
19th-century Buenos Aires press collection at a university library, made searchable for labor history researchers
Rare colonial-era manuscript collection at a European Iberian studies center, transcribed and semantically indexed
Academic departments and university libraries processing smaller pilot collections typically start on the Researcher tier ($79/month); larger institutional programs move to Professional or Institution.
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Request Beta Access
Approved accounts receive hands-on onboarding support to validate results on your own documents.
The OCR model (Gemini 2.5 Pro) is a large multimodal model trained on diverse historical text corpora. It generally handles archaic letterforms better than specialized legacy OCR tools because its training included varied historical typography. The AI enrichment stage can optionally normalize archaic forms to modern equivalents in the display text, while preserving the original in the raw extraction field used for search.
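The idea of normalizing the display text while keeping the raw extraction intact can be illustrated with a minimal Python sketch. This is not ArchiveLM's implementation: the `Extraction` record, the `normalize` function, and the small character mapping are all hypothetical, and real normalization of archaic Spanish would need far more context than a character table.

```python
from dataclasses import dataclass

# Hypothetical mapping of archaic letterforms to modern equivalents.
# Assumption for illustration only; real rules depend on period and context.
ARCHAIC_TO_MODERN = {"ſ": "s", "ç": "z", "Ç": "Z"}


@dataclass(frozen=True)
class Extraction:
    raw: str      # original transcription, kept untouched for search
    display: str  # modernized form shown to readers


def normalize(raw: str) -> Extraction:
    """Return the raw text alongside a display copy with archaic forms replaced."""
    table = str.maketrans(ARCHAIC_TO_MODERN)
    return Extraction(raw=raw, display=raw.translate(table))


record = normalize("ſeñor de la cabeça")
print(record.raw)
print(record.display)
```

Keeping both fields means a search for the original spelling and a search for the modern spelling can each be served without losing what the page actually says.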
Yes — this is one of ArchiveLM's core capabilities. The vector embedding model maps concepts across languages in a shared semantic space. An English query for 'land rights disputes' will surface Spanish articles about 'disputas de tierras' without any translation step. The AI Librarian can also answer questions in English and pull citations from Spanish source documents.
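As a rough illustration of how a shared embedding space makes this possible, the Python sketch below ranks Spanish passages against an English query by cosine similarity. The three-dimensional vectors are invented by hand to stand in for embeddings; they do not come from any real model, and the `cosine` and `rank` helpers are hypothetical names rather than part of ArchiveLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, docs):
    """Sort (text, vector) pairs from most to least similar to the query."""
    return sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)


# Made-up stand-in for the embedding of the English query 'land rights disputes'.
QUERY = [0.9, 0.1, 0.0]

# Made-up stand-ins for embeddings of Spanish passages.
DOCS = [
    ("precios del mercado", [0.1, 0.9, 0.2]),
    ("disputas de tierras", [0.8, 0.2, 0.1]),
    ("sermón dominical", [0.0, 0.2, 0.9]),
]

results = rank(QUERY, DOCS)
for text, vec in results:
    print(f"{cosine(QUERY, vec):.3f}  {text}")
```

In this toy setup the query and "disputas de tierras" point in nearly the same direction, so that passage ranks first even though the two strings share no words. A production system would obtain the vectors from a multilingual embedding model and use a vector index rather than a full sort.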
Yes. The underlying models are multilingual and handle Portuguese with comparable fidelity to Spanish. Cross-language semantic search works across Portuguese documents as well.
Layout-aware OCR that reads historical broadsheets as they were typeset — column by column, ad by ad — and makes every article semantically searchable.
Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
OCR pipeline optimized for the linear, single-column structure of historical books and manuscripts — from 16th-century printed books to 19th-century institutional records — with semantic search over the full text.