Project Details
PERO - Advanced extraction and recognition of the content of printed and handwritten digitized documents to increase their accessibility and usability
Project Period: 1. 3. 2018 - 31. 12. 2022
Project Type: grant
Code: DG18P02OVV055
Agency: Ministry of Culture of the Czech Republic
Program: Programme to support applied research and experimental development of national and cultural identity for the years 2016 to 2022 (NAKI II)
Keywords: Optical character recognition, handwriting recognition, natural language processing, quality enhancement, language model, convolutional neural networks, recurrent neural networks
The project aims to create technology and tools that improve the accessibility of digitized historic documents. These tools, based on state-of-the-art methods from computer vision, machine learning, and language modeling, will enable existing digital archives and libraries to provide full-text search and content extraction for low-quality historic printed documents and for all handwritten documents, which cannot be processed automatically by currently available tools. The project extends the automation and capabilities of the digitization pipeline by providing tools for automated quality assessment and control, quality improvement, automated transcription of historic printed documents, semi-automated transcription of handwritten text, and automatic extraction of semantic information from semi-structured documents (e.g. library catalogs and birth records). The created tools and techniques will be validated by processing selected collections of digitized materials and through a pilot operation in cooperation with the Moravian Library.
Bařina David, Ing., Ph.D. (UPGM FIT VUT), team leader
Hradiš Michal, Ing., Ph.D. (UPGM FIT VUT), team leader
Juránek Roman, Ing., Ph.D. (UPGM FIT VUT), team leader
Zemčík Pavel, prof. Dr. Ing. (UPGM FIT VUT), team leader
Beneš Karel, Ing. (UPGM FIT VUT)
Hájková Gabriela, Mgr. (Dean's Office, FIT VUT)
Hříbek David, Ing. (UPGM FIT VUT)
Kodym Oldřich, Ing., Ph.D. (UPGM FIT VUT)
Kopeczinski Daniela, Mgr. (Dean's Office, FIT VUT)
2022
- KIŠŠ Martin, KOHÚT Jan, BENEŠ Karel and HRADIŠ Michal. Importance of Textlines in Historical Document Classification. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. Lecture Notes in Computer Science, vol. 13237. La Rochelle: Springer Nature Switzerland AG, 2022, pp. 158-170. ISBN 978-3-031-06554-5.
- DVOŘÁKOVÁ Martina, HRADIŠ Michal, ŽABIČKA Petr, KOHÚT Jan, KIŠŠ Martin and BENEŠ Karel. Využití PERO OCR při přepisu rukopisů. Archivní časopis, vol. 72, no. 1, 2022, pp. 14-27. ISSN 0004-0398.
2021
- KIŠŠ Martin, BENEŠ Karel and HRADIŠ Michal. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science, vol. 12824. Lausanne: Springer Nature Switzerland AG, 2021, pp. 463-477. ISBN 978-3-030-86336-4.
- KODYM Oldřich and HRADIŠ Michal. Page Layout Analysis System for Unconstrained Historic Documents. In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021, pp. 492-506. ISBN 978-3-030-86330-2.
- KODYM Oldřich and HRADIŠ Michal. TG2: text-guided transformer GAN for restoring document readability and perceived quality. International Journal on Document Analysis and Recognition (IJDAR), vol. 2021, no. 1, pp. 1-14. ISSN 1433-2825.
- KOHÚT Jan and HRADIŠ Michal. TS-Net: OCR Trained to Switch Between Text Transcription Styles. In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science, vol. 12824. Lausanne: Springer Nature Switzerland AG, 2021, pp. 478-493. ISBN 978-3-030-86336-4. ISSN 0302-9743.
2020
- KIŠŠ Martin, HRADIŠ Michal and KODYM Oldřich. Brno Mobile OCR Dataset. In: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. Sydney: Institute of Electrical and Electronics Engineers, 2020, pp. 1352-1357. ISBN 978-1-7281-3015-6.
2022
- Information extraction from semi-structured documents, software, 2022
Authors: Hradiš Michal, Kišš Martin, Kohút Jan, Beneš Karel, Kostelník Martin
2021
- Interactive semi-automatic handwritten text recognition, software, 2021
Authors: Hradiš Michal, Kišš Martin, Kohút Jan, Beneš Karel, Kodym Oldřich, Buchal Petr, Hříbek David
2020
- Adaptive OCR for older printed documents, software, 2020
Authors: Hradiš Michal, Kišš Martin, Kodym Oldřich, Kohút Jan, Beneš Karel, Buchal Petr
- Scanner for damaged documents, specimen, 2020
Authors: Hradiš Michal