About
The HTR knowledge exchange meetings aim to share and gather knowledge about various aspects of digitizing, organizing, and analyzing large collections of historical sources. Topics include HTR and OCR, document classification, layout recognition, and the use of metadata to improve these processes. These meetings are an initiative of Marijn Koolen (Huygens Institute) and Milan van Lange (NIOD).
This repository serves as a collection point for tips and tricks, useful links, discussion topics, and questions.
Meetings
Meeting notes are mostly in Dutch:
- 22 October 2025 Hands-On ATR Workshop: Loghi, Pagexml-Tools
- 17 September 2024 How do I find my way in ATR?
- 2 April 2024 Evaluation, HTR projects, round table
- 16 January 2024 Spelling variation and Named Entities
- 24 October 2023 Loghi, OCR/HTR quality
- 15 March 2023 Working with PageXML, PageXML task and tools
- 11 January 2023 Registration cards, Nansen-project
- 16 November 2022 Structuring & classifying
Resources
- Collections with HTR transcriptions
- Software
- Datasets
- ATR Models
- Best practices
- Projects focussing on HTR/OCR
Collections with HTR transcriptions
- HTR-Hub: https://htrhub.dekok.xyz/about
- Gemeentearchief Schiedam: https://app.transkribus.org/sites/archief-schiedam
- Lokale Kronieken: https://kronieken.transkribus.eu
- Noord Hollands Archief: https://app.transkribus.org/sites/noord-hollandsarchief
- Regionaal Archief Tilburg: https://app.transkribus.org/sites/jacob
- Stadsarchief Amsterdam: https://amsterdam-city-archives.transkribus.eu/#/
- Utrechts Archief: https://app.transkribus.org/sites/hetutrechtsarchief
Software
- Loghi: HTR and layout analysis toolkit
- LayPa: tool for text detection - i.e. where is the text in a scan?
- HTR quality classifier: basic layout and HTR quality classifier
- OCReval
- PageXML tools: generic functionality (Python) for working with PageXML data
- FuzzySearch: Python module for fuzzy searching of keywords and phrases
- Formula Detection: Python module for discovering formulaic language in text corpora
- Transkribus: software for HTR and manual transcription, correction and ground truth creation
Datasets
- Mibudera
- Mibudera Zenodo community
- Mibudera Tracker/Client
- HTR-United is a catalog of metadata on training datasets (ground truth datasets) available for the creation of transcription or segmentation models. https://htr-united.github.io/
ATR Models
- OCR/HTR model repository: https://zenodo.org/communities/ocr_models/
- Huggingface OCR/HTR models
Best practices
- Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Re-Using Ground Truth-Data, Referencing Models, and Acknowledging Contributions. https://docs.google.com/document/d/11EBSHRoRteZC2ulIimDy0tsdN3ZO1v3h/edit
Projects focussing on HTR/OCR
- Oorlog voor de Rechter https://oorlogvoorderechter.nl
- REPUBLIC https://republic.huygens.knaw.nl
- GLOBALISE https://globalise.huygens.knaw.nl
- Oorlog uit Eerste Hand (NIOD) https://www.niod.nl/nl/projecten/oorlog-uit-eerste-hand
- HAICu https://www.haicu.science
- Golden Agents https://www.goldenagents.org