Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can ex...

Bibliographic Details
Main Authors:	Halima Alachram, Hryhorii Chereda, Tim Beißbarth, Edgar Wingender, Philip Stegmaier
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2021
Subjects:	Medicine; R; Science; Q
Online Access:	https://doaj.org/article/72ddc5ce3d944699be3e4b096cf57a7a

id	oai:doaj.org-article:72ddc5ce3d944699be3e4b096cf57a7a
record_format	dspace
spelling	oai:doaj.org-article:72ddc5ce3d944699be3e4b096cf57a7a 2021-12-02T20:16:51Z Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks. 1932-6203 10.1371/journal.pone.0258623 https://doaj.org/article/72ddc5ce3d944699be3e4b096cf57a7a 2021-01-01T00:00:00Z https://doi.org/10.1371/journal.pone.0258623 https://doaj.org/toc/1932-6203 Halima Alachram Hryhorii Chereda Tim Beißbarth Edgar Wingender Philip Stegmaier Public Library of Science (PLoS) article Medicine R Science Q EN PLoS ONE, Vol 16, Iss 10, p e0258623 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
description	Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Using the word2vec approach, we generated word vector representations based on a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms such as word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
format	article
author	Halima Alachram Hryhorii Chereda Tim Beißbarth Edgar Wingender Philip Stegmaier
title	Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.
publisher	Public Library of Science (PLoS)
publishDate	2021
url	https://doaj.org/article/72ddc5ce3d944699be3e4b096cf57a7a
