PaperBLAST: Text Mining Papers for Information about Homologs

ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific liter...

Bibliographic Details
Main authors: Morgan N. Price, Adam P. Arkin
Format: article
Language: EN
Published: American Society for Microbiology 2017
Subjects: annotation; text mining; Microbiology; QR1-502
Online access: https://doaj.org/article/9ba677cd6ce54424bb3541f7d99afce9
id oai:doaj.org-article:9ba677cd6ce54424bb3541f7d99afce9
record_format dspace
spelling oai:doaj.org-article:9ba677cd6ce54424bb3541f7d99afce9
2021-12-02T18:39:33Z
PaperBLAST: Text Mining Papers for Information about Homologs
10.1128/mSystems.00039-17
2379-5077
https://doaj.org/article/9ba677cd6ce54424bb3541f7d99afce9
2017-08-01T00:00:00Z
https://journals.asm.org/doi/10.1128/mSystems.00039-17
https://doaj.org/toc/2379-5077
mSystems, Vol 2, Iss 4 (2017)
institution DOAJ
collection DOAJ
language EN
topic annotation
text mining
Microbiology
QR1-502
description ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.

IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.
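The workflow described above (mine article text for gene identifiers, link them to protein sequences, then look up similar proteins and show snippets) can be illustrated with a minimal Python sketch. This is not PaperBLAST's code. The identifier pattern, the article texts, the identifiers `XYZ_0001` and `XYZ_0002`, and the sequences are all invented for illustration, and a k-mer Jaccard score stands in for BLAST, which the real tool uses for the similarity search.

```python
import re
from collections import defaultdict

# Hypothetical locus-tag pattern; real identifier formats vary widely.
LOCUS_TAG = re.compile(r"\b[A-Za-z]{1,4}_?\d{4,5}\b")


def snippet(text, start, end, width=40):
    """Return the matched identifier with up to `width` characters of context per side."""
    return text[max(0, start - width):end + width].strip()


def index_articles(articles):
    """Map each identifier found in article text to (article_id, snippet) pairs."""
    index = defaultdict(list)
    for article_id, text in articles.items():
        for m in LOCUS_TAG.finditer(text):
            index[m.group()].append((article_id, snippet(text, m.start(), m.end())))
    return index


def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets: a toy stand-in for a BLAST score."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def find_papers(query, proteins, index, min_sim=0.3):
    """Return (score, identifier, mentions) for similar proteins that appear in the index."""
    hits = []
    for ident, seq in proteins.items():
        score = similarity(query, seq)
        if score >= min_sim and ident in index:
            hits.append((score, ident, index[ident]))
    return sorted(hits, key=lambda h: h[0], reverse=True)


# Invented toy data, for illustration only.
articles = {
    "article_A": "Deletion of XYZ_0001 abolished growth on minimal medium.",
    "article_B": "Expression of XYZ_0002 increased under iron limitation.",
}
proteins = {
    "XYZ_0001": "MRVLKFGGTSVANAERFLRVADILESNARQGQV",
    "XYZ_0002": "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
}
query = "MRVLKFGGTSVANAERFLRVADILESNAKQGQV"  # one substitution relative to XYZ_0001

index = index_articles(articles)
hits = find_papers(query, proteins, index)
for score, ident, mentions in hits:
    for article_id, text in mentions:
        print(f"{ident} ({score:.2f}) in {article_id}: {text}")
```

Building the identifier index once and querying it per sequence mirrors the split the abstract describes between constructing the database and searching it; the threshold `min_sim` is an arbitrary choice for this sketch.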
format article
author Morgan N. Price
Adam P. Arkin
title PaperBLAST: Text Mining Papers for Information about Homologs
publisher American Society for Microbiology
publishDate 2017
url https://doaj.org/article/9ba677cd6ce54424bb3541f7d99afce9