A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.

Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spac...

Bibliographic Details
Main authors: Ai-bing Zhang, Jie Feng, Robert D Ward, Ping Wan, Qiang Gao, Jun Wu, Wei-zhong Zhao
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2012
Subjects:
Medicine (R)
Science (Q)
Online access: https://doaj.org/article/1ac9dc6e59784a5985ddeaa9cf7e1fff
id oai:doaj.org-article:1ac9dc6e59784a5985ddeaa9cf7e1fff
record_format dspace
spelling 2021-11-18T07:27:34Z
1932-6203
10.1371/journal.pone.0030986
2012-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/22363527/?tool=EBI
https://doaj.org/toc/1932-6203
PLoS ONE, Vol 7, Iss 2, p e30986 (2012)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.
format article
author Ai-bing Zhang
Jie Feng
Robert D Ward
Ping Wan
Qiang Gao
Jun Wu
Wei-zhong Zhao
title A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.
publisher Public Library of Science (PLoS)
publishDate 2012
url https://doaj.org/article/1ac9dc6e59784a5985ddeaa9cf7e1fff