Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.

Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains...
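The full description in this record names sequence alignment, residue conservation scoring and a table of associations between sequences and conserved regions as ingredients of the method. The sketch below illustrates that general idea only and is not the authors' algorithm. It assumes an already aligned toy set of sequences, swaps in a simple Shannon-entropy conservation score, and uses hypothetical names (`column_conservation`, `conserved_regions`, `association_table`) and arbitrary thresholds.

```python
import math
from collections import Counter

def column_conservation(column):
    """Return 1 minus the normalized Shannon entropy of the non-gap residues."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Normalize by the maximum entropy over the 20 standard amino acids.
    # Caveat: gaps are ignored, so a mostly-gap column can still score high.
    return 1.0 - entropy / math.log2(20)

def conserved_regions(alignment, threshold=0.9, min_len=3):
    """Find runs of at least min_len consecutive columns scoring >= threshold.

    Regions are returned as half-open (start, end) column ranges.
    The threshold and min_len defaults are illustrative assumptions.
    """
    length = len(alignment[0])
    scores = [column_conservation([seq[i] for seq in alignment]) for i in range(length)]
    regions, start = [], None
    for i, score in enumerate(scores + [0.0]):  # sentinel closes a trailing run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

def association_table(alignment, regions):
    """Map each region to the indices of sequences that have no gap inside it."""
    return {
        (start, end): [k for k, seq in enumerate(alignment) if "-" not in seq[start:end]]
        for start, end in regions
    }

# Toy alignment (hypothetical data): the last sequence lacks the first region,
# the third sequence lacks part of the second.
toy = [
    "MKVLAGHWTE",
    "MKVLSGHWTD",
    "MKVLTGHW--",
    "----QGHWTE",
]

regions = conserved_regions(toy)
table = association_table(toy, regions)
```

Here `table` links each conserved column range to the sequences that carry it, which mirrors in miniature the sequence-to-region association map the description mentions; a region found in only a subset of the sequences still appears in the table.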

Bibliographic Details
Main Authors:	Tunca Doğan, Bilge Karaçalı
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2013
Subjects:	Medicine R Science Q
Online Access:	https://doaj.org/article/284f83eec0db416899e3080a1922d44f

id	oai:doaj.org-article:284f83eec0db416899e3080a1922d44f
record_format	dspace
spelling	oai:doaj.org-article:284f83eec0db416899e3080a1922d44f | 2021-11-18T08:55:27Z | Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. | 1932-6203 | 10.1371/journal.pone.0075458 | https://doaj.org/article/284f83eec0db416899e3080a1922d44f | 2013-01-01T00:00:00Z | https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24069417/?tool=EBI | https://doaj.org/toc/1932-6203 | Tunca Doğan | Bilge Karaçalı | Public Library of Science (PLoS) | article | Medicine | R | Science | Q | EN | PLoS ONE, Vol 8, Iss 9, p e75458 (2013)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
spellingShingle	Medicine R Science Q Tunca Doğan Bilge Karaçalı Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.
description	Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
format	article
author	Tunca Doğan Bilge Karaçalı
author_facet	Tunca Doğan Bilge Karaçalı
author_sort	Tunca Doğan
title	Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.
title_sort	automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.
publisher	Public Library of Science (PLoS)
publishDate	2013
url	https://doaj.org/article/284f83eec0db416899e3080a1922d44f
