IMPACT @ CIS
Welcome to the website of the IMPACT working group at the Centrum für Informations- und Sprachverarbeitung, University of Munich.
Members of the group are:
- Prof. Klaus Schulz
- Dr. Christoph Ringlstetter
- Thorsten Vobl
- Annette Gotscharek (until 2011)
- Ulrich Reffle (until 2011)
We have worked on the IMPACT project for more than three years (since 2008). In the near future, this page will contain more information on what we have been doing and what we have achieved. If you have any questions, please don't hesitate to contact us (mail: impact_team[at]cis[dot]uni-muenchen[dot]de).
Analysis and Post-Correction of OCR'd Documents
Text and Error Profiling
In the work package "Language Modelling and Dictionaries in OCR" of the IMPACT project, the working group of the University of Munich has developed software that analyses OCR results for historical documents, with the aim of using the inherent characteristics of each document to improve OCR results. This analysis addresses both the language of the document and the errors introduced by the OCR engine. Using the raw OCR output alone, without manual interaction or ground truth data, the analysis returns a consistent statistical model that provides estimates for the questions listed below. This kind of information can be used to enhance a number of downstream operations, e.g. quality control, post-correction, retrieval, or a second OCR run that is highly adapted to the document at hand.
- What is the vocabulary of the source document? Are large parts covered by modern lexica, or do historical orthography and vocabulary play a dominant role? To what extent are languages other than the primary language (e.g. Latin) present in the text?
- In particular: Which rules or patterns explain the orthographical variation in this specific document? Examples of such rules in German are k->c, t->th, i->y.
- What can be said about the word error rate of the document? Which kinds of OCR errors are systematically introduced? Which words are repeatedly misrecognised?
- Again, in particular: Which error patterns occur with high probability? Prominent examples for many font types and OCR classifiers are i->l, m->in, n->u.
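As a rough illustration of how patterns such as t->th or n->u can be used to interpret a token, the following Python sketch reverses a single pattern occurrence and checks whether the result is a lexicon word. The toy lexicon, the function names, and the restriction to one pattern application per token are assumptions made for this example; it is not the statistical model developed in the project.

```python
from collections import Counter

# Patterns are written "left->right": the expected string `left` appears as
# `right` in the document, following the examples given in the text.
SPELLING_PATTERNS = [("k", "c"), ("t", "th"), ("i", "y")]  # modern -> historical
OCR_PATTERNS = [("i", "l"), ("m", "in"), ("n", "u")]       # correct -> misrecognised


def interpretations(token, lexicon, patterns):
    """Return (lexicon word, pattern) pairs that explain `token`.

    Assumption for this sketch: at most one pattern is applied per token.
    """
    found = []
    for left, right in patterns:
        start = token.find(right)
        while start != -1:
            # Undo the pattern at this position and look the result up.
            candidate = token[:start] + left + token[start + len(right):]
            if candidate in lexicon:
                found.append((candidate, f"{left}->{right}"))
            start = token.find(right, start + 1)
    return found


def pattern_counts(tokens, lexicon, patterns):
    """Count how often each pattern explains a token that is not in the lexicon."""
    counts = Counter()
    for token in tokens:
        if token not in lexicon:
            for _, pattern in interpretations(token, lexicon, patterns):
                counts[pattern] += 1
    return counts


if __name__ == "__main__":
    lexicon = {"und", "mann", "teil", "der"}  # hypothetical toy lexicon
    print(interpretations("theil", lexicon, SPELLING_PATTERNS))  # [('teil', 't->th')]
    print(interpretations("uud", lexicon, OCR_PATTERNS))         # [('und', 'n->u')]
    print(pattern_counts(["uud", "manu", "der"], lexicon, OCR_PATTERNS))
```

Counting which patterns explain the unknown tokens of a document gives a very crude picture of its dominant spelling rules and OCR errors. A real profiler would additionally have to weigh competing explanations against each other.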
Interactive Post-Correction of OCR'd Documents
The technology explained above is used extensively in an application for interactive post-correction of OCR'd documents. Using the information obtained by the Text and Error Profiler, the whole correction process adapts to the document being processed. This adaptation influences all areas of the tool, including error detection and the presentation of correction candidates for erroneous tokens.
In addition, the document-specific knowledge allows batch processing of erroneous words for which the system can find the correct substitute with very high confidence. In this way, large numbers of systematic errors can usually be corrected with just a few keystrokes.
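The idea of batch-correcting all errors that follow one systematic pattern can be sketched in Python as follows. The function name, the toy lexicon, and the rule "change a token only if exactly one lexicon word results" are assumptions standing in for the tool's confidence estimate, which is not described here.

```python
def batch_correct(tokens, lexicon, pattern):
    """Apply one systematic OCR error pattern to all tokens in a single pass.

    `pattern` is a pair (correct, misrecognised), e.g. ("n", "u") for n->u.
    A token is changed only if it is not in the lexicon and undoing the
    pattern at a single position yields exactly one lexicon word (a simple
    stand-in for "very high confidence").
    """
    correct, wrong = pattern
    result, changed = [], 0
    for token in tokens:
        candidates = set()
        if token not in lexicon:
            start = token.find(wrong)
            while start != -1:
                candidate = token[:start] + correct + token[start + len(wrong):]
                if candidate in lexicon:
                    candidates.add(candidate)
                start = token.find(wrong, start + 1)
        if len(candidates) == 1:
            result.append(candidates.pop())
            changed += 1
        else:
            # Known word, no candidate, or ambiguous: leave for manual review.
            result.append(token)
    return result, changed


if __name__ == "__main__":
    lexicon = {"und", "der", "mann", "hund"}  # hypothetical toy lexicon
    tokens = ["uud", "der", "manu", "und", "hund"]
    print(batch_correct(tokens, lexicon, ("n", "u")))
    # (['und', 'der', 'mann', 'und', 'hund'], 2)
```

Tokens that are already lexicon words, such as "hund", are left untouched even though they contain the letter u, so the pattern is only applied where it actually explains an unknown token.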
A complete view of the post-correction graphical interface
Batch-processing of a systematically misspelled word
Batch-processing of all errors based on the systematic error pattern n->u