Large scale identification and categorization of protein sequences using structured logistic regression.

Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems...

Bibliographic Details
Main authors: Bjørn P Pedersen, Georgiana Ifrim, Poul Liboriussen, Kristian B Axelsen, Michael G Palmgren, Poul Nissen, Carsten Wiuf, Christian N S Pedersen
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2014
Subjects:
R
Q
Online access: https://doaj.org/article/1ca32674a916416281ef7f72a391c25f
id oai:doaj.org-article:1ca32674a916416281ef7f72a391c25f
record_format dspace
spelling oai:doaj.org-article:1ca32674a916416281ef7f72a391c25f
2021-11-18T08:37:19Z
1932-6203
10.1371/journal.pone.0085139
2014-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24465495/?tool=EBI
https://doaj.org/toc/1932-6203
PLoS ONE, Vol 9, Iss 1, p e85139 (2014)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem.
Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.
Conclusions: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
format article
author Bjørn P Pedersen
Georgiana Ifrim
Poul Liboriussen
Kristian B Axelsen
Michael G Palmgren
Poul Nissen
Carsten Wiuf
Christian N S Pedersen
author_sort Bjørn P Pedersen
title Large scale identification and categorization of protein sequences using structured logistic regression.
publisher Public Library of Science (PLoS)
publishDate 2014
url https://doaj.org/article/1ca32674a916416281ef7f72a391c25f