Two-stage clustering (TSC): a pipeline for selecting operational taxonomic units for the high-throughput sequencing of PCR amplicons.

Clustering 16S/18S rRNA amplicon sequences into operational taxonomic units (OTUs) is a critical step for the bioinformatic analysis of microbial diversity. Here, we report a pipeline for selecting OTUs with a relatively low computational demand and a high degree of accuracy. This pipeline is referr...

Bibliographic Details
Main authors: Xiao-Tao Jiang, Hai Zhang, Hua-Fang Sheng, Yu Wang, Yan He, Fei Zou, Hong-Wei Zhou
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2012
Subjects:
R
Q
Online access: https://doaj.org/article/310758fd66834024b6fa766db45ac085
id oai:doaj.org-article:310758fd66834024b6fa766db45ac085
record_format dspace
spelling oai:doaj.org-article:310758fd66834024b6fa766db45ac085
2021-11-18T07:30:24Z
ISSN 1932-6203
DOI 10.1371/journal.pone.0030230
https://doaj.org/article/310758fd66834024b6fa766db45ac085
2012-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/22253923/?tool=EBI
https://doaj.org/toc/1932-6203
PLoS ONE, Vol 7, Iss 1, p e30230 (2012)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Clustering 16S/18S rRNA amplicon sequences into operational taxonomic units (OTUs) is a critical step for the bioinformatic analysis of microbial diversity. Here, we report a pipeline for selecting OTUs with a relatively low computational demand and a high degree of accuracy. This pipeline is referred to as two-stage clustering (TSC) because it divides tags into two groups according to their abundance and clusters them sequentially. The more abundant group is clustered using a hierarchical algorithm similar to that in ESPRIT, which has a high degree of accuracy but is computationally costly for large datasets. The rarer group, which includes the majority of tags, is then heuristically clustered to improve efficiency. To further improve the computational efficiency and accuracy, two preclustering steps are implemented. To maintain clustering accuracy, all tags are grouped into an OTU depending on their pairwise Needleman-Wunsch distance. This method not only improved the computational efficiency but also mitigated the spurious OTU estimation from 'noise' sequences. In addition, OTUs clustered using TSC showed comparable or improved performance in beta-diversity comparisons compared to existing OTU selection methods. This study suggests that the distribution of sequencing datasets is a useful property for improving the computational efficiency and increasing the clustering accuracy of the high-throughput sequencing of PCR amplicons. The software and user guide are freely available at http://hwzhoulab.smu.edu.cn/paperdata/.
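The two-stage idea in the description above can be illustrated with a minimal Python sketch. It is not the authors' TSC software: the function names, the `min_abundance` cutoff, the complete-linkage rule in stage one, the greedy seed-matching rule in stage two, and the unit-cost global alignment standing in for the Needleman-Wunsch distance are all assumptions made for illustration. The two preclustering steps mentioned in the description are omitted.

```python
from collections import Counter
from itertools import combinations


def nw_distance(a: str, b: str) -> float:
    """Global-alignment distance with unit mismatch and gap costs, scaled to [0, 1].

    Assumption: a simplified stand-in for a Needleman-Wunsch pairwise distance.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def hierarchical(seqs, threshold):
    """Stage 1: naive complete-linkage agglomerative clustering (assumed linkage)."""
    clusters = [[s] for s in seqs]

    def linkage(c1, c2):
        return max(nw_distance(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters; stop once it exceeds the threshold.
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def greedy_assign(rare, clusters, threshold):
    """Stage 2: attach each rare tag to the first cluster whose seed is close enough."""
    for s in rare:
        for c in clusters:
            if nw_distance(s, c[0]) <= threshold:  # c[0] acts as the cluster seed
                c.append(s)
                break
        else:
            clusters.append([s])  # no seed within the threshold: start a new OTU


def two_stage_cluster(tags, min_abundance=2, threshold=0.03):
    """Split unique tags by abundance, then cluster the two groups in sequence."""
    counts = Counter(tags)
    ordered = [s for s, _ in counts.most_common()]  # most abundant first
    abundant = [s for s in ordered if counts[s] >= min_abundance]
    rare = [s for s in ordered if counts[s] < min_abundance]
    clusters = hierarchical(abundant, threshold)
    greedy_assign(rare, clusters, threshold)
    return clusters
```

Calling `two_stage_cluster(reads, min_abundance=2, threshold=0.03)` on a list of tag strings returns a list of OTUs, each a list of unique sequences with the most abundant member first. The stage-one loop recomputes all pairwise distances on every merge, which is acceptable only for a small abundant group; that cost is the reason the description gives for handling the far larger rare group heuristically.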
format article
author Xiao-Tao Jiang
Hai Zhang
Hua-Fang Sheng
Yu Wang
Yan He
Fei Zou
Hong-Wei Zhou
title Two-stage clustering (TSC): a pipeline for selecting operational taxonomic units for the high-throughput sequencing of PCR amplicons.
publisher Public Library of Science (PLoS)
publishDate 2012
url https://doaj.org/article/310758fd66834024b6fa766db45ac085