The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources
In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed tex...
Saved in:
Main Authors: | Susanne Haaf, Alexander Geyken, Frank Wiegand |
---|---|
Format: | article |
Language: | DE EN ES FR IT |
Published: | OpenEdition 2015 |
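Subjects: | TEI customization; historical corpora; corpus annotation; interoperability; interchange; schema design; Computer engineering. Computer hardware; TK7885-7895 |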
Online Access: | https://doaj.org/article/446055617fd9463c89716d80d26a7998 |
id |
oai:doaj.org-article:446055617fd9463c89716d80d26a7998 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:446055617fd9463c89716d80d26a7998 2021-12-02T11:30:44Z The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources 2162-5603 10.4000/jtei.1114 https://doaj.org/article/446055617fd9463c89716d80d26a7998 2015-04-01T00:00:00Z http://journals.openedition.org/jtei/1114 https://doaj.org/toc/2162-5603 In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text. Susanne Haaf Alexander Geyken Frank Wiegand OpenEdition article TEI customization historical corpora corpus annotation interoperability interchange schema design Computer engineering. Computer hardware TK7885-7895 DE EN ES FR IT Journal of the Text Encoding Initiative, Vol 8 (2015) |
institution |
DOAJ |
collection |
DOAJ |
language |
DE EN ES FR IT |
topic |
TEI customization historical corpora corpus annotation interoperability interchange schema design Computer engineering. Computer hardware TK7885-7895 |
spellingShingle |
TEI customization historical corpora corpus annotation interoperability interchange schema design Computer engineering. Computer hardware TK7885-7895 Susanne Haaf Alexander Geyken Frank Wiegand The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
description |
In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text. |
format |
article |
author |
Susanne Haaf Alexander Geyken Frank Wiegand |
author_facet |
Susanne Haaf Alexander Geyken Frank Wiegand |
author_sort |
Susanne Haaf |
title |
The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
title_short |
The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
title_full |
The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
title_fullStr |
The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
title_full_unstemmed |
The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources |
title_sort |
dta “base format”: a tei subset for the compilation of a large reference corpus of printed text from multiple sources |
publisher |
OpenEdition |
publishDate |
2015 |
url |
https://doaj.org/article/446055617fd9463c89716d80d26a7998 |
work_keys_str_mv |
AT susannehaaf thedtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources AT alexandergeyken thedtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources AT frankwiegand thedtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources AT susannehaaf dtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources AT alexandergeyken dtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources AT frankwiegand dtabaseformatateisubsetforthecompilationofalargereferencecorpusofprintedtextfrommultiplesources |
_version_ |
1718395874323726336 |