Measuring the Correctness of Double-Keying: Error Classification and Quality Control in a Large Corpus of TEI-Annotated Historical Text

Among mass digitization methods, double-keying is considered to be the one with the lowest error rate. This method requires two independent transcriptions of a text by two different operators. It is particularly well suited to historical texts, which often exhibit deficiencies like poor master copies or other difficulties such as spelling variation or complex text structures. Providers of data entry services using the double-keying method generally advertise very high accuracy rates (around 99.95% to 99.98%). These advertised percentages are generally estimated on the basis of small samples, and little if anything is said about either the actual amount of text or the text genres which have been proofread, about error types, proofreaders, etc. In order to obtain significant data on this problem it is necessary to analyze a large amount of text representing a balanced sample of different text types, to distinguish the structural XML/TEI level from the typographical level, and to differentiate between various types of errors which may originate from different sources and may not be equally severe. This paper presents an extensive and complex approach to the analysis and correction of double-keying errors which has been applied by the DFG-funded project "Deutsches Textarchiv" (German Text Archive, hereafter DTA) in order to evaluate and preferably to increase the transcription and annotation accuracy of double-keyed DTA texts. Statistical analyses of the results gained from proofreading a large quantity of text are presented, which verify the common accuracy rates for the double-keying method.
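The abstract above describes the core of the double-keying method: two operators transcribe the same text independently, and discrepancies between the two keyings are flagged for review, with the agreement between them serving as the basis for accuracy estimates. A minimal sketch of that comparison step is given below; it is an illustration only, not the DTA project's actual tooling, and the helper name `compare_keyings` is invented for this example.

```python
from difflib import SequenceMatcher

def compare_keyings(a: str, b: str):
    """Align two independently keyed transcriptions and report
    the spans where they disagree, plus a character-level
    agreement ratio (analogous to an accuracy estimate)."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
    return matcher.ratio(), disagreements

# Two hypothetical keyings of the same line, each with one typo.
key_a = "Among mass digitization methods, double-keying is considered"
key_b = "Among mass digitizatlon methods, douhle-keying is considered"

ratio, diffs = compare_keyings(key_a, key_b)
print(f"agreement ratio: {ratio:.4f}")
for op, left, right in diffs:
    print(op, repr(left), "->", repr(right))
```

In practice, spans where the two keyings disagree would be routed to a proofreader, while the agreement ratio over a large, genre-balanced sample gives the kind of accuracy figure the paper examines.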

Bibliographic Details
Main Authors: Susanne Haaf, Frank Wiegand, Alexander Geyken
Format: article
Language: DE, EN, ES, FR, IT
Published: OpenEdition, 2015
Subjects: double-keying, quality control, error classification, digitization, tools, transcription accuracy, Computer engineering. Computer hardware (TK7885-7895)
Online Access: https://doaj.org/article/9e6bdc86bec445808f47bf218c864a24
Published in: Journal of the Text Encoding Initiative, Vol 4 (2015)
ISSN: 2162-5603
DOI: 10.4000/jtei.739
Publication date: 2015-03-01
Full text: http://journals.openedition.org/jtei/739