SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.

With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the...

Bibliographic Details
Main Authors:	Ming Fan, Ka-Chun Wong, Taewoo Ryu, Timothy Ravasi, Xin Gao
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2012
Subjects:	Medicine R Science Q
Online Access:	https://doaj.org/article/c665281c5d40454c9c64b3d7f684391a

id	oai:doaj.org-article:c665281c5d40454c9c64b3d7f684391a
record_format	dspace
spelling	oai:doaj.org-article:c665281c5d40454c9c64b3d7f684391a; 2021-11-18T07:14:08Z; SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.; 1932-6203; 10.1371/journal.pone.0039475; https://doaj.org/article/c665281c5d40454c9c64b3d7f684391a; 2012-01-01T00:00:00Z; https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/22761802/pdf/?tool=EBI; https://doaj.org/toc/1932-6203; Ming Fan; Ka-Chun Wong; Taewoo Ryu; Timothy Ravasi; Xin Gao; Public Library of Science (PLoS); article; Medicine; R; Science; Q; EN; PLoS ONE, Vol 7, Iss 6, p e39475 (2012)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
spellingShingle	Medicine R Science Q Ming Fan Ka-Chun Wong Taewoo Ryu Timothy Ravasi Xin Gao SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
description	With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
format	article
author	Ming Fan Ka-Chun Wong Taewoo Ryu Timothy Ravasi Xin Gao
author_facet	Ming Fan Ka-Chun Wong Taewoo Ryu Timothy Ravasi Xin Gao
author_sort	Ming Fan
title	SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
title_short	SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
title_full	SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
title_fullStr	SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
title_full_unstemmed	SECOM: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
title_sort	secom: a novel hash seed and community detection based-approach for genome-scale protein domain identification.
publisher	Public Library of Science (PLoS)
publishDate	2012
url	https://doaj.org/article/c665281c5d40454c9c64b3d7f684391a
_version_	1718423755273797632
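
The description field above outlines an indexing step: proteins are indexed by hash seeds, and shared seeds become edge weights in a graph whose nodes are protein sequences. The Python sketch below illustrates that idea only. It is not SECOM's implementation: contiguous length-k substrings stand in for the hash seed function, which the record does not specify, and the function names (`seed_index`, `shared_seed_graph`), the parameter `k`, and the toy sequences are all hypothetical. The community-finding and backward graph percolation steps are not shown.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=3):
    """Map each length-k substring (a stand-in seed) to the proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_seed_graph(proteins, k=3):
    """Return {(a, b): weight}, where weight counts distinct seeds shared by a and b."""
    weights = defaultdict(int)
    for names in seed_index(proteins, k).values():
        # Every pair of proteins that contains this seed gains one unit of weight.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy sequences, invented for illustration only (not real proteins).
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "GGTAYIAKLL",
    "p3": "WWWWPPPP",
}
graph = shared_seed_graph(toy, k=3)
```

Because pairs are enumerated per seed rather than per protein pair, proteins that share no seed are never compared, which is the sense in which an index of this kind avoids all-against-all alignment.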
