Purpose-built pipeline for Hansard and legislative records — extracts speaker-attributed debates, committee proceedings, and legislative journals into a fully searchable, citable corpus.
Historical Hansard volumes and parliamentary records present a distinct digitization challenge: complex speaker-attribution formatting, dense multi-column layouts in older volumes, inconsistent typefaces across decades, and procedural annotations interspersed with debate text. Generic OCR tools treat a parliamentary page as undifferentiated text, collapsing speaker names into speeches and losing the structural metadata that makes Hansard legally and historically meaningful. Researchers end up with raw text walls they can't efficiently query across sessions.
Legislative debates are the primary evidence base for constitutional historians, public policy researchers, legal scholars, and journalists investigating how laws were made. Losing speaker attribution or mixing up procedural text with debate content corrupts the historical record. An accurately structured Hansard corpus enables queries like 'every speech mentioning Indigenous land rights between 1890 and 1940' — impossible with undifferentiated text.
Document pipeline specifically tuned for linear legislative formatting — single-pass full-page OCR with layout preservation for Hansard's standard two-column style
Structured JSON output preserves document hierarchy: session, speaker, speech, procedural annotation
Semantic search across decades of debates — find thematic threads without knowing exact terminology
RAG chat (AI Librarian) answers cross-session research questions with source citations to specific debate pages
Research Lab generates summaries, speaker timelines, entity maps, and key-theme analyses across selected sessions
ALTO/XML export for integration with institutional repositories and existing parliamentary archive systems
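To make the session, speaker, speech, and procedural-annotation hierarchy concrete, here is a minimal Python sketch of what such a structured record might look like and how a date-bounded query could run over it. Every class name, field name, and sample record below is hypothetical and invented for illustration; this is not the platform's actual JSON schema. The query shown is plain keyword matching, not the semantic search described above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Tuple


@dataclass
class Speech:
    speaker: str   # attributed speaker, kept separate from the speech text
    text: str
    page: int      # page number used for citation


@dataclass
class ProceduralAnnotation:
    text: str      # e.g. a division or adjournment note, kept out of the debate text
    page: int


@dataclass
class Session:
    date: str      # ISO date string, e.g. "1905-03-14" (assumed format)
    speeches: List[Speech] = field(default_factory=list)
    annotations: List[ProceduralAnnotation] = field(default_factory=list)


def speeches_mentioning(
    sessions: List[Session], phrase: str, start_year: int, end_year: int
) -> List[Tuple[str, str, int]]:
    """Return (date, speaker, page) for speeches containing `phrase`
    in sessions dated within [start_year, end_year]."""
    needle = phrase.lower()
    hits = []
    for session in sessions:
        year = int(session.date[:4])
        if start_year <= year <= end_year:
            for speech in session.speeches:
                if needle in speech.text.lower():
                    hits.append((session.date, speech.speaker, speech.page))
    return hits


# Made-up sample records, for illustration only.
sessions = [
    Session(
        date="1905-03-14",
        speeches=[Speech("Mr. EXAMPLE", "The question of land rights must be settled.", 212)],
        annotations=[ProceduralAnnotation("The House divided.", 213)],
    ),
    Session(
        date="1951-06-02",
        speeches=[Speech("Ms. SAMPLE", "Land rights remain unresolved.", 88)],
    ),
]

# Serialize one session to JSON and run a date-bounded query.
print(json.dumps(asdict(sessions[0]), indent=2))
print(speeches_mentioning(sessions, "land rights", 1890, 1940))
```

Because speaker and procedural text live in separate fields, the query cannot mistake a division note for a speech, which is the failure mode of undifferentiated OCR output.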
23,686-page Hansard corpus from a Commonwealth parliamentary library, piloted on 50 pages before full-scale commitment
Provincial legislature digitization program covering 50 years of session records for public policy research access
Academic research project tracking the evolution of language rights legislation across 30 years of federal debate
Truth and reconciliation research center searching for Indigenous affairs mentions across a century of parliamentary record
Parliamentary digitization programs at institutional scale typically fit the Institution tier ($499/month, unlimited pages) or the Enterprise tier for multi-year bulk processing commitments.
ArchiveLM is in private beta. We review each request and typically respond within 1–3 business days.
Request Beta Access
Approved accounts receive hands-on onboarding support to validate results on your own documents.
Yes. The document pipeline routes automatically based on scan characteristics. Older typeset volumes use OCR with image preprocessing. Digitally printed PDFs (post-1990s) are detected as text-based and extracted directly at zero cost. No OCR errors are introduced, since the text is already embedded in the PDF.
The platform auto-detects document type on upload. Older Hansard volumes printed in two-column broadsheet format are routed to the newspaper pipeline with layout-aware column segmentation. Linear single-column volumes use the document pipeline. The routing decision is visible in the dashboard and can be overridden.
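The routing rules described in these answers can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the platform's implementation: the `ScanProfile` fields, the pipeline labels, and the `override` parameter are all hypothetical names, and real detection would rely on richer signals than a column count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanProfile:
    has_embedded_text: bool  # True for PDFs that already carry a text layer
    column_count: int        # columns detected on a typical page (assumed signal)


def route(profile: ScanProfile, override: Optional[str] = None) -> str:
    """Pick a processing path for an uploaded volume.

    A manual override always wins, mirroring the dashboard override
    described above.
    """
    if override is not None:
        return override
    if profile.has_embedded_text:
        # Text is already in the PDF, so no OCR is needed.
        return "direct_text_extraction"
    if profile.column_count >= 2:
        # Multi-column scans go to layout-aware column segmentation.
        return "newspaper_pipeline"
    return "document_pipeline"


print(route(ScanProfile(has_embedded_text=False, column_count=2)))
print(route(ScanProfile(has_embedded_text=False, column_count=1)))
print(route(ScanProfile(has_embedded_text=True, column_count=2)))
```

Checking for embedded text first reflects the ordering implied above: a text-based PDF skips OCR entirely, regardless of its column layout.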
Yes. Once extracted, speaker names become searchable entities. Semantic search finds speeches by a given speaker. The Research Lab's Entity Map tool can extract all mentions of a named individual across your selected corpus and map their appearances chronologically.