Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network

Abstract The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem which specifies the seman...

Bibliographic Details
Main authors:	Rakesh David, Rhys-Joshua D. Menezes, Jan De Klerk, Ian R. Castleden, Cornelia M. Hooper, Gustavo Carneiro, Matthew Gilliham
Format:	article
Language:	EN
Published:	Nature Portfolio, 2021
Subjects:	Medicine (R); Science (Q)
Online access:	https://doaj.org/article/a2f75ba1c52b427b9b5afad5f74509d0

id	oai:doaj.org-article:a2f75ba1c52b427b9b5afad5f74509d0
record_format	dspace
spelling	oai:doaj.org-article:a2f75ba1c52b427b9b5afad5f74509d0; 2021-12-02T13:48:41Z; Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network; 10.1038/s41598-020-80441-8; 2045-2322; https://doaj.org/article/a2f75ba1c52b427b9b5afad5f74509d0; 2021-01-01T00:00:00Z; https://doi.org/10.1038/s41598-020-80441-8; https://doaj.org/toc/2045-2322; Rakesh David; Rhys-Joshua D. Menezes; Jan De Klerk; Ian R. Castleden; Cornelia M. Hooper; Gustavo Carneiro; Matthew Gilliham; Nature Portfolio; article; Medicine; R; Science; Q; EN; Scientific Reports, Vol 11, Iss 1, Pp 1-11 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine (R); Science (Q)
description	Abstract The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
format	article
author	Rakesh David; Rhys-Joshua D. Menezes; Jan De Klerk; Ian R. Castleden; Cornelia M. Hooper; Gustavo Carneiro; Matthew Gilliham
author_sort	Rakesh David
title	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
publisher	Nature Portfolio
publishDate	2021
url	https://doaj.org/article/a2f75ba1c52b427b9b5afad5f74509d0
