The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from ma...

Bibliographic Details
Main authors: Irene van den Bent, Stavros Makrodimitris, Marcel Reinders
Format: article
Language: EN
Published: SAGE Publishing 2021
Subjects: Evolution, QH359-425
Online access: https://doaj.org/article/6517605b2ea149948d0f7ac12058a55c
id oai:doaj.org-article:6517605b2ea149948d0f7ac12058a55c
record_format dspace
datestamp 2021-12-03T23:03:40Z
issn 1176-9343
doi 10.1177/11769343211062608
doi_url https://doi.org/10.1177/11769343211062608
journal Evolutionary Bioinformatics, Vol 17 (2021)
journal_toc https://doaj.org/toc/1176-9343
date 2021-12-01T00:00:00Z
institution DOAJ
collection DOAJ
language EN
topic Evolution
QH359-425
description Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
format article
author Irene van den Bent
Stavros Makrodimitris
Marcel Reinders
title The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction
publisher SAGE Publishing
publishDate 2021
url https://doaj.org/article/6517605b2ea149948d0f7ac12058a55c