Employing phylogenetic tree shape statistics to resolve the underlying host population structure

Abstract Background Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in info...

Bibliographic Details
Main authors: Hassan W. Kayondo, Alfred Ssekagiri, Grace Nabakooza, Nicholas Bbosa, Deogratius Ssemwanga, Pontiano Kaleebu, Samuel Mwalili, John M. Mango, Andrew J. Leigh Brown, Roberto A. Saenz, Ronald Galiwango, John M. Kitayimbwa
Format: article
Language: EN
Published: BMC 2021
Subjects:
Online access: https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53
id oai:doaj.org-article:f921303b4ec8402c97778b1b59c8ee53
record_format dspace
spelling oai:doaj.org-article:f921303b4ec8402c97778b1b59c8ee53
2021-11-14T12:13:09Z
Employing phylogenetic tree shape statistics to resolve the underlying host population structure
10.1186/s12859-021-04465-1
1471-2105
https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53
2021-11-01T00:00:00Z
https://doi.org/10.1186/s12859-021-04465-1
https://doaj.org/toc/1471-2105
BMC
article
EN
BMC Bioinformatics, Vol 22, Iss 1, Pp 1-20 (2021)
institution DOAJ
collection DOAJ
language EN
topic Structured
non-structured
Host population
Phylogenetic tree
Simulation
Tree statistics
Computer applications to medicine. Medical informatics
R858-859.7
Biology (General)
QH301-705.5
description Abstract. Background: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools for revealing the population structure underlying an epidemic. Determining whether a population is structured is useful in informing the choice of phylogenetic methods for a given study. We employ tree statistics derived from phylogenetic trees, together with machine learning classification techniques, to reveal an underlying population structure. Results: We simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees: the number of cherries; the Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width; and width-to-depth ratio. Based on these tree statistics, we classify the simulated trees as coming from either a non-structured or a structured population using decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. We incorporate the basic reproductive number ($R_0$) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choices of model parameters and to tree size. Cross-validated area under the curve (AUC) values for the receiver operating characteristic (ROC) curves have means above 0.9 for most of the classification models. Conclusions: Our classification procedure distinguishes well between trees from structured and non-structured populations using the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests, and box plots. SVM models were more robust to changes in model parameters and tree size than the KNN and DT classifiers. Our classification procedure was applied to real-world data, and the structured population was revealed with a high accuracy of 92.3% using the SVM-polynomial classifier.
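As an illustration of the tree statistics named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes a rooted binary tree encoded as nested 2-tuples (any non-tuple value is a tip), and it assumes the commonly used definitions: Sackin index as the sum of tip depths, Colless index as the sum over internal nodes of the absolute difference in tip counts between the two subtrees, a cherry as an internal node with two tip children, and width as the number of nodes at a given depth. The function name `tree_shape_stats` and the tuple encoding are hypothetical; the total cophenetic index and ladder length are omitted for brevity.

```python
from collections import Counter


def tree_shape_stats(tree):
    """Shape statistics for a rooted binary tree given as nested 2-tuples.

    Assumption: any value that is not a tuple is treated as a tip (leaf).
    """
    stats = {"cherries": 0, "sackin": 0, "colless": 0}
    width = Counter()  # number of nodes (tips and internal) at each depth

    def visit(node, depth):
        """Return the number of tips below `node`, updating the statistics."""
        width[depth] += 1
        if not isinstance(node, tuple):
            stats["sackin"] += depth  # Sackin: sum of tip depths
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        stats["colless"] += abs(n_left - n_right)  # Colless: subtree imbalance
        if n_left == 1 and n_right == 1:
            stats["cherries"] += 1  # both children are tips
        return n_left + n_right

    stats["n_tips"] = visit(tree, 0)
    stats["max_depth"] = max(width)          # deepest level reached
    stats["max_width"] = max(width.values())  # most nodes on any one level
    # The ratio is undefined for a single-tip tree (depth 0).
    stats["width_to_depth"] = (
        stats["max_width"] / stats["max_depth"]
        if stats["max_depth"] else float("nan")
    )
    return stats


# Two hypothetical four-tip trees: fully balanced and fully ladder-like.
balanced = (("a", "b"), ("c", "d"))
ladder = ((("a", "b"), "c"), "d")

if __name__ == "__main__":
    print(tree_shape_stats(balanced))
    print(tree_shape_stats(ladder))
```

On these two toy trees the ladder-like shape has the larger Sackin and Colless values and fewer cherries, which is the kind of contrast a classifier could use as input features. The recursion is adequate for small trees; very deep trees would need an iterative traversal.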
format article
author Hassan W. Kayondo
Alfred Ssekagiri
Grace Nabakooza
Nicholas Bbosa
Deogratius Ssemwanga
Pontiano Kaleebu
Samuel Mwalili
John M. Mango
Andrew J. Leigh Brown
Roberto A. Saenz
Ronald Galiwango
John M. Kitayimbwa
title Employing phylogenetic tree shape statistics to resolve the underlying host population structure
publisher BMC
publishDate 2021
url https://doaj.org/article/f921303b4ec8402c97778b1b59c8ee53