Open Information Extraction from real Internet texts in Spanish using constraints over part-of-speech sequences: Problems of the method, their causes, and ways for improvement

Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as...

Bibliographic Details
Main authors:	Zhila, Alisa; Gelbukh, Alexander
Language:	English
Published:	Pontificia Universidad Católica de Valparaíso. Instituto de Literatura y Ciencias del Lenguaje 2016
Subjects:	Open Information Extraction; relation extraction; error analysis; Spanish; Internet texts
Online access:	http://www.scielo.cl/scielo.php?script=sci_arttext&pid=S0718-09342016000100006

id	oai:scielo:S0718-09342016000100006
record_format	dspace
spelling	oai:scielo:S0718-09342016000100006; 2016-03-14; info:eu-repo/semantics/openAccess; Pontificia Universidad Católica de Valparaíso. Instituto de Literatura y Ciencias del Lenguaje; Revista signos v.49 n.90 2016; 2016-03-01; text/html; http://www.scielo.cl/scielo.php?script=sci_arttext&pid=S0718-09342016000100006; en; 10.4067/S0718-09342016000100006
institution	Scielo Chile
collection	Scielo Chile
language	English
topic	Open Information Extraction relation extraction error analysis Spanish Internet texts
description	Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints over part-of-speech sequences has been shown to achieve high performance with lower computational and implementation cost. Recently, this approach has gained spread and popularity. However, Open IE is prone to certain errors that have not yet been analyzed in the literature. Detailed analysis of the errors and their causes will allow for faster and more focused improvement of the methods for Open IE based on this approach. In this paper, we analyze and classify the main types of errors in relation extraction that are specific to Open IE based on heuristic constraints over part-of-speech sequences. We identify the causes of the errors of each type and suggest ways for preventing such errors with corresponding analysis of their cost and scale of impact. The analysis is performed for extractions from two Spanish-language text datasets: the FactSpaCIC dataset of grammatically correct and verified sentences and the RawWeb dataset of unedited text fragments from the Internet. Extraction is performed by the ExtrHech system.
author	Zhila,Alisa Gelbukh,Alexander
title	Open Information Extraction from real Internet texts in Spanish using constraints over part-of-speech sequences: Problems of the method, their causes, and ways for improvement
publisher	Pontificia Universidad Católica de Valparaíso. Instituto de Literatura y Ciencias del Lenguaje
publishDate	2016
url	http://www.scielo.cl/scielo.php?script=sci_arttext&pid=S0718-09342016000100006
