Training set selection for the prediction of essential genes.

Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poor...

Bibliographic Details
Main authors: Jian Cheng, Zhao Xu, Wenwu Wu, Li Zhao, Xiangchen Li, Yanlin Liu, Shiheng Tao
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2014
Subjects: Medicine (R); Science (Q)
Online access: https://doaj.org/article/f465b11fa31042e395d8d82c8b5d2aa0
id oai:doaj.org-article:f465b11fa31042e395d8d82c8b5d2aa0
record_format dspace
spelling oai:doaj.org-article:f465b11fa31042e395d8d82c8b5d2aa0
2021-11-18T08:36:20Z
Training set selection for the prediction of essential genes.
1932-6203
10.1371/journal.pone.0086805
https://doaj.org/article/f465b11fa31042e395d8d82c8b5d2aa0
2014-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24466248/?tool=EBI
https://doaj.org/toc/1932-6203
PLoS ONE, Vol 9, Iss 1, p e86805 (2014)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selecting appropriate training sets for predicting the essential genes of poorly studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. The results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) the essential genes in the selected training set should be reliable; (2) the growth conditions under which essential genes are defined should be consistent between the training and prediction sets; (3) the species used as the training set should be closely related to the target organism; and (4) the organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and of an integrated training set comprising multiple organisms. We found that the training set should contain at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited a remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of integrated training sets chosen according to the four criteria with that of randomly chosen sets. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.
format article
author Jian Cheng
Zhao Xu
Wenwu Wu
Li Zhao
Xiangchen Li
Yanlin Liu
Shiheng Tao
title Training set selection for the prediction of essential genes.
publisher Public Library of Science (PLoS)
publishDate 2014
url https://doaj.org/article/f465b11fa31042e395d8d82c8b5d2aa0
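
The four selection criteria and the 10% size threshold stated in the abstract can be read as a simple filter over candidate training organisms. The following is only a minimal sketch of that reading, not the authors' method: the Organism fields, the distance cutoff of 0.3, the condition and lifestyle labels, and the organism names are all hypothetical placeholders, and the record does not say how the paper actually measured reliability, relatedness, or lifestyle similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    """Hypothetical description of a candidate training organism."""
    name: str
    reliable_annotation: bool  # criterion 1: essential-gene set is reliable
    growth_condition: str      # criterion 2: condition used to define essentiality
    phylo_distance: float      # criterion 3: distance to the target (assumed units)
    lifestyle: str             # criterion 4: phenotype or lifestyle label


def select_training_set(candidates, target_condition, target_lifestyle,
                        max_distance=0.3):
    """Keep candidates that satisfy all four criteria, closest relatives first.

    max_distance is an illustrative cutoff, not a value from the paper.
    """
    chosen = [
        c for c in candidates
        if c.reliable_annotation
        and c.growth_condition == target_condition
        and c.phylo_distance <= max_distance
        and c.lifestyle == target_lifestyle
    ]
    return sorted(chosen, key=lambda c: c.phylo_distance)


def large_enough(n_training_genes, n_total_genes, min_fraction=0.10):
    """Check the abstract's rule that training genes cover at least 10% of all genes."""
    return n_training_genes / n_total_genes >= min_fraction


# Placeholder candidates; none of these correspond to real organisms or data.
CANDIDATES = [
    Organism("org_A", True, "rich medium", 0.10, "free-living"),
    Organism("org_B", False, "rich medium", 0.05, "free-living"),
    Organism("org_C", True, "rich medium", 0.20, "free-living"),
    Organism("org_D", True, "minimal medium", 0.10, "free-living"),
    Organism("org_E", True, "rich medium", 0.50, "free-living"),
    Organism("org_F", True, "rich medium", 0.15, "parasitic"),
]

if __name__ == "__main__":
    for org in select_training_set(CANDIDATES, "rich medium", "free-living"):
        print(org.name)
```

Treating every criterion as a hard filter is a design choice made here for brevity; a weighted score over the same four attributes would be an equally plausible reading of the abstract.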