Automatic consistency assurance for literature-based gene ontology annotation

Abstract. Background: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between G...

Bibliographic Details
Main Authors:	Jiyu Chen, Nicholas Geard, Justin Zobel, Karin Verspoor
Format:	article
Language:	EN
Published:	BMC 2021
Subjects:	Biological database quality; Gene ontology annotation; Text mining; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5
Online Access:	https://doaj.org/article/f3c71bf6afee4d0a844694bbd99c4df8

id	oai:doaj.org-article:f3c71bf6afee4d0a844694bbd99c4df8
record_format	dspace
spelling	oai:doaj.org-article:f3c71bf6afee4d0a844694bbd99c4df8; 2021-11-28T12:11:01Z; Automatic consistency assurance for literature-based gene ontology annotation; 10.1186/s12859-021-04479-9; 1471-2105; https://doaj.org/article/f3c71bf6afee4d0a844694bbd99c4df8; 2021-11-01T00:00:00Z; https://doi.org/10.1186/s12859-021-04479-9; https://doaj.org/toc/1471-2105; BMC Bioinformatics, Vol 22, Iss 1, Pp 1-22 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Biological database quality; Gene ontology annotation; Text mining; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5
description	Abstract. Background: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. Results: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide a detailed error analysis demonstrating that the method achieves high precision on more confident predictions. Conclusions: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.
format	article
author	Jiyu Chen, Nicholas Geard, Justin Zobel, Karin Verspoor
author_facet	Jiyu Chen, Nicholas Geard, Justin Zobel, Karin Verspoor
author_sort	Jiyu Chen
title	Automatic consistency assurance for literature-based gene ontology annotation
title_sort	automatic consistency assurance for literature-based gene ontology annotation
publisher	BMC
publishDate	2021
url	https://doaj.org/article/f3c71bf6afee4d0a844694bbd99c4df8
work_keys_str_mv	AT jiyuchen automaticconsistencyassuranceforliteraturebasedgeneontologyannotation AT nicholasgeard automaticconsistencyassuranceforliteraturebasedgeneontologyannotation AT justinzobel automaticconsistencyassuranceforliteraturebasedgeneontologyannotation AT karinverspoor automaticconsistencyassuranceforliteraturebasedgeneontologyannotation
_version_	1718408175166685184
