Employing phylogenetic tree shape statistics to resolve the underlying host population structure
Abstract. Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in info...
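Main authors: | Hassan W. Kayondo, Alfred Ssekagiri, Grace Nabakooza, Nicholas Bbosa, Deogratius Ssemwanga, Pontiano Kaleebu, Samuel Mwalili, John M. Mango, Andrew J. Leigh Brown, Roberto A. Saenz, Ronald Galiwango, John M. Kitayimbwa |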
---|---|
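Format: | article |
Language: | EN |
Published: | BMC, 2021 |
Subjects: | Structured; non-structured; Host population; Phylogenetic tree; Simulation; Tree statistics; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5 |
Online access: | https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53 |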
id |
oai:doaj.org-article:f921303b4ec8402c97778b1b59c8ee53 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:f921303b4ec8402c97778b1b59c8ee53; 2021-11-14T12:13:09Z; Employing phylogenetic tree shape statistics to resolve the underlying host population structure; 10.1186/s12859-021-04465-1; 1471-2105; https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53; 2021-11-01T00:00:00Z; https://doi.org/10.1186/s12859-021-04465-1; https://doaj.org/toc/1471-2105; Abstract. Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure. Results: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees: the number of cherries; the Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width; and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as coming from either a non-structured or a structured population using decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. We incorporate the basic reproductive number ($R_0$) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choices of model parameters and to tree size. Cross-validated results for the area under the curve (AUC) of receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models.
Conclusions: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models were more robust to changes in model parameters and tree size compared to KNN and DT classifiers. Our classification procedure was applied to real-world data and the structured population was revealed with a high accuracy of 92.3% using the SVM-polynomial classifier. Hassan W. Kayondo; Alfred Ssekagiri; Grace Nabakooza; Nicholas Bbosa; Deogratius Ssemwanga; Pontiano Kaleebu; Samuel Mwalili; John M. Mango; Andrew J. Leigh Brown; Roberto A. Saenz; Ronald Galiwango; John M. Kitayimbwa; BMC; article; Structured; non-structured; Host population; Phylogenetic tree; Simulation; Tree statistics; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5; EN; BMC Bioinformatics, Vol 22, Iss 1, Pp 1-20 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Structured; non-structured; Host population; Phylogenetic tree; Simulation; Tree statistics; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5 |
spellingShingle |
Structured; non-structured; Host population; Phylogenetic tree; Simulation; Tree statistics; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5; Hassan W. Kayondo; Alfred Ssekagiri; Grace Nabakooza; Nicholas Bbosa; Deogratius Ssemwanga; Pontiano Kaleebu; Samuel Mwalili; John M. Mango; Andrew J. Leigh Brown; Roberto A. Saenz; Ronald Galiwango; John M. Kitayimbwa; Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
description |
Abstract. Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure. Results: In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees: the number of cherries; the Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width; and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as coming from either a non-structured or a structured population using decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. We incorporate the basic reproductive number ($R_0$) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choices of model parameters and to tree size. Cross-validated results for the area under the curve (AUC) of receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models. Conclusions: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests and the box plots. SVM models were more robust to changes in model parameters and tree size compared to KNN and DT classifiers. Our classification procedure was applied to real-world data and the structured population was revealed with a high accuracy of 92.3% using the SVM-polynomial classifier. |
format |
article |
author |
Hassan W. Kayondo; Alfred Ssekagiri; Grace Nabakooza; Nicholas Bbosa; Deogratius Ssemwanga; Pontiano Kaleebu; Samuel Mwalili; John M. Mango; Andrew J. Leigh Brown; Roberto A. Saenz; Ronald Galiwango; John M. Kitayimbwa |
author_facet |
Hassan W. Kayondo; Alfred Ssekagiri; Grace Nabakooza; Nicholas Bbosa; Deogratius Ssemwanga; Pontiano Kaleebu; Samuel Mwalili; John M. Mango; Andrew J. Leigh Brown; Roberto A. Saenz; Ronald Galiwango; John M. Kitayimbwa |
author_sort |
Hassan W. Kayondo |
title |
Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_short |
Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_full |
Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_fullStr |
Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_full_unstemmed |
Employing phylogenetic tree shape statistics to resolve the underlying host population structure |
title_sort |
employing phylogenetic tree shape statistics to resolve the underlying host population structure |
publisher |
BMC |
publishDate |
2021 |
url |
https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53 |
work_keys_str_mv |
AT hassanwkayondo employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT alfredssekagiri employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT gracenabakooza employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT nicholasbbosa employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT deogratiusssemwanga employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT pontianokaleebu employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT samuelmwalili employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT johnmmango employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT andrewjleighbrown employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT robertoasaenz employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT ronaldgaliwango employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure AT johnmkitayimbwa employingphylogenetictreeshapestatisticstoresolvetheunderlyinghostpopulationstructure |
_version_ |
1718429383738261504 |
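
Illustrative note on the tree statistics named in the abstract: the sketch below shows how four of the eight statistics (number of cherries, Sackin index, Colless index, maximum depth) might be computed for a rooted binary tree. It is not the authors' code. The nested-tuple tree representation, the function name `shape_stats`, and the convention of counting depth in edges from the root are all assumptions made for this example; a real analysis would parse Newick trees with a phylogenetics library.

```python
from typing import Tuple, Union

# Hypothetical representation (an assumption for this sketch): a leaf is a
# string label and an internal node is a (left, right) tuple.
Tree = Union[str, Tuple["Tree", "Tree"]]


def shape_stats(tree: Tree, depth: int = 0) -> dict:
    """Return leaf count, cherries, Sackin, Colless and maximum depth.

    Depth is counted in edges from the root (an assumed convention).
    """
    if isinstance(tree, str):
        # A leaf contributes its own depth to the Sackin index.
        return {"leaves": 1, "cherries": 0, "sackin": depth,
                "colless": 0, "max_depth": depth}
    left, right = (shape_stats(child, depth + 1) for child in tree)
    # A cherry is an internal node whose two children are both leaves.
    is_cherry = left["leaves"] == 1 and right["leaves"] == 1
    return {
        "leaves": left["leaves"] + right["leaves"],
        "cherries": left["cherries"] + right["cherries"] + int(is_cherry),
        # Sackin index: sum of leaf depths.
        "sackin": left["sackin"] + right["sackin"],
        # Colless index: sum over internal nodes of |leaves_left - leaves_right|.
        "colless": left["colless"] + right["colless"]
        + abs(left["leaves"] - right["leaves"]),
        "max_depth": max(left["max_depth"], right["max_depth"]),
    }


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(shape_stats(balanced))
print(shape_stats(caterpillar))
```

On these two four-leaf toy trees, the balanced tree has two cherries, a Sackin index of 8 and a Colless index of 0, while the caterpillar tree has one cherry, a Sackin index of 9 and a Colless index of 3. Differences in imbalance of this kind are what a classifier trained on tree statistics could draw on.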