About

The HTR knowledge exchange meetings aim to share and gather knowledge about various aspects of digitizing, organizing, and analyzing large collections of historical sources. Topics include HTR and OCR, document classification, layout recognition, and the use of metadata to improve these processes. These meetings are an initiative of Marijn Koolen (Huygens Institute) and Milan van Lange (NIOD).

This repository serves as a collection point for tips and tricks, useful links, discussion topics, and questions.

Meetings

Meeting notes are mostly in Dutch:

22 October 2025 Hands-On ATR Workshop: Loghi, Pagexml-Tools
17 September 2024 How do I find my way in ATR?
2 April 2024 Evaluation, HTR projects, round table
16 January 2024 Spelling variation and Named Entities
24 October 2023 Loghi, OCR/HTR quality
15 March 2023 Working with PageXML, PageXML task and tools
11 January 2023 Registration cards, Nansen-project
16 November 2022 Structuring & classifying

Resources

Collections with HTR transcriptions

HTR-Hub: https://htrhub.dekok.xyz/about
Gemeentearchief Schiedam: https://app.transkribus.org/sites/archief-schiedam
Lokale Kronieken: https://kronieken.transkribus.eu
Noord-Hollands Archief: https://app.transkribus.org/sites/noord-hollandsarchief
Regionaal Archief Tilburg: https://app.transkribus.org/sites/jacob
Stadsarchief Amsterdam: https://amsterdam-city-archives.transkribus.eu/#/
Utrechts Archief: https://app.transkribus.org/sites/hetutrechtsarchief

Software

Loghi: HTR and layout analysis toolkit
LayPa: tool for text detection - i.e. where is the text in a scan?
HTR quality classifier: basic layout and HTR quality classifier
OCReval
PageXML tools: generic functionality (Python) for working with PageXML data
FuzzySearch: Python module for fuzzy searching of keywords and phrases
Formula Detection: Python module for discovering formulaic language in text corpora
Transkribus: software for HTR and manual transcription, correction, and ground truth creation

Datasets

Mibudera
Mibudera Zenodo community
Mibudera Tracker/Client
HTR-United: catalog of metadata on training datasets (ground truth datasets) available for creating transcription or segmentation models. https://htr-united.github.io/

ATR-Models

OCR/HTR model repository: https://zenodo.org/communities/ocr_models/
Huggingface OCR/HTR models

Best Practices

Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Re-Using Ground Truth-Data, Referencing Models, and Acknowledging Contributions. https://docs.google.com/document/d/11EBSHRoRteZC2ulIimDy0tsdN3ZO1v3h/edit

Projects focussing on HTR/OCR

Oorlog voor de Rechter https://oorlogvoorderechter.nl
REPUBLIC https://republic.huygens.knaw.nl
GLOBALISE https://globalise.huygens.knaw.nl
Oorlog uit Eerste Hand (NIOD) https://www.niod.nl/nl/projecten/oorlog-uit-eerste-hand
HAICu https://www.haicu.science
Golden Agents https://www.goldenagents.org