Welcome to the website of the IMPACT working group at the Centrum für Informations- und Sprachverarbeitung, University of Munich.

Members of the group are:

We have worked on the IMPACT project for more than three years (since 2008). In the near future, this page will contain more information on what we have been doing and what we have achieved. If you have any questions, please don't hesitate to contact us (mail: impact_team[at]cis[dot]uni-muenchen[dot]de).


Analysis and Post-Correction of OCRed Documents

Text and Error Profiling

In the work package "Language Modelling and Dictionaries in OCR" of the IMPACT project, the working group of the University of Munich has developed software that analyses OCR results for historical documents, with the aim of using the inherent characteristics of each document to improve OCR success. This analysis addresses both the language of the document and the errors introduced by the OCR engine. Using the raw OCR output alone, without depending on manual interaction or ground-truth data, the analysis returns a consistent statistical model which gives estimates for the scenarios below. This kind of information can be used to enhance a number of downstream operations, e.g. quality control, post-correction, retrieval, or a second OCR run that is highly adapted to the document at hand.
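To make the idea of error profiling from raw OCR output more concrete, here is a deliberately simplified sketch in Python. It is not the profiler developed in the project: it only counts single-character substitutions that would turn an unknown token into a word from a small lexicon. The function name `profile_substitutions`, the toy lexicon, and the sample tokens are all hypothetical and chosen for illustration.

```python
from collections import Counter


def profile_substitutions(tokens, lexicon):
    """Count (intended, observed) character pairs that explain unknown tokens.

    Toy assumption: an OCR error is a single-character substitution, and a
    token is "explained" if undoing one substitution yields a lexicon word.
    """
    alphabet = {ch for word in lexicon for ch in word}
    patterns = Counter()
    for token in tokens:
        if token in lexicon:
            continue  # token is already a known word, nothing to explain
        for i, observed in enumerate(token):
            for intended in alphabet:
                if intended == observed:
                    continue
                candidate = token[:i] + intended + token[i + 1:]
                if candidate in lexicon:
                    patterns[(intended, observed)] += 1
    return patterns


# Hypothetical data: three tokens in which "n" was read as "u".
lexicon = {"nun", "man", "kann", "und"}
tokens = ["uun", "mau", "kanu", "und", "man"]
profile = profile_substitutions(tokens, lexicon)
print(profile.most_common(1))
```

Even this crude count needs no ground truth: the dominant pair in `profile` hints at which confusion is systematic in the document, which is the kind of signal a downstream step could use.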

Interactive Post-Correction of OCRed documents

The technology explained above is used extensively in an application for interactive post-correction of OCRed documents. Using the information obtained by the Text and Error Profiler, the whole correction process adapts to the document being processed. This adaptation influences all areas of the tool, including error detection and the presentation of correction candidates for erroneous tokens.

In addition, the document-specific knowledge allows for batch processing of erroneous words in cases where the system can find the correct substitute with very high confidence. In this way, large numbers of systematic errors can usually be corrected with just a few keystrokes.
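Batch correction driven by a single systematic error pattern can be sketched as follows. This is an illustrative Python example under stated assumptions, not the tool's actual implementation: the helper names `revert_pattern` and `batch_correct`, the toy lexicon, and the reading of a pattern as "intended character was recognised as observed character" are all assumptions made here. A token is only changed when reverting the pattern leads to exactly one lexicon word, which stands in for the "very high confidence" condition.

```python
from itertools import product


def revert_pattern(token, intended, observed, lexicon):
    """Return lexicon words reachable by replacing one or more occurrences
    of `observed` in `token` with `intended`."""
    positions = [i for i, ch in enumerate(token) if ch == observed]
    results = set()
    # Try every non-empty subset of the positions holding `observed`.
    for choice in product([False, True], repeat=len(positions)):
        if not any(choice):
            continue
        chars = list(token)
        for pos, flip in zip(positions, choice):
            if flip:
                chars[pos] = intended
        candidate = "".join(chars)
        if candidate in lexicon:
            results.add(candidate)
    return results


def batch_correct(tokens, lexicon, intended, observed):
    """Apply one error pattern to all tokens, correcting only unambiguous cases."""
    corrected = []
    for token in tokens:
        if token in lexicon:
            corrected.append(token)
            continue
        candidates = revert_pattern(token, intended, observed, lexicon)
        # Leave the token untouched unless there is exactly one substitute.
        corrected.append(candidates.pop() if len(candidates) == 1 else token)
    return corrected


# Hypothetical data: "n" systematically recognised as "u".
lexicon = {"nun", "man", "kann", "und"}
tokens = ["uun", "mau", "kauu", "und", "xyz"]
result = batch_correct(tokens, lexicon, intended="n", observed="u")
print(result)
```

The subset enumeration grows exponentially with the number of occurrences of the observed character in a token, which is acceptable for a sketch on short words but would need a smarter search in practice.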

A complete view of the post-correction graphical interface

Batch-processing of a systematically misspelled word

Batch-processing of all errors based on the systematic error pattern n->u