Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions

ABSTRACT Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacter...

Bibliographic Details
Main authors: John A. Lees, T. Tien Mai, Marco Galardini, Nicole E. Wheeler, Samuel T. Horsfield, Julian Parkhill, Jukka Corander
Format: article
Language: EN
Published: American Society for Microbiology 2020
Subjects: elastic net; pangenome; phenotype prediction; Microbiology; QR1-502
Online access: https://doaj.org/article/39a9c15c29234af282ab753ee02f2acb
id oai:doaj.org-article:39a9c15c29234af282ab753ee02f2acb
record_format dspace
spelling oai:doaj.org-article:39a9c15c29234af282ab753ee02f2acb
2021-11-15T15:56:44Z
Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions
10.1128/mBio.01344-20
2150-7511
https://doaj.org/article/39a9c15c29234af282ab753ee02f2acb
2020-08-01T00:00:00Z
https://journals.asm.org/doi/10.1128/mBio.01344-20
https://doaj.org/toc/2150-7511
John A. Lees
T. Tien Mai
Marco Galardini
Nicole E. Wheeler
Samuel T. Horsfield
Julian Parkhill
Jukka Corander
American Society for Microbiology
article
elastic net
pangenome
phenotype prediction
Microbiology
QR1-502
EN
mBio, Vol 11, Iss 4 (2020)
institution DOAJ
collection DOAJ
language EN
topic elastic net
pangenome
phenotype prediction
Microbiology
QR1-502
description ABSTRACT Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method that finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. The variants selected by our joint modeling approach overlap substantially with those selected by previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold.
IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
format article
author John A. Lees
T. Tien Mai
Marco Galardini
Nicole E. Wheeler
Samuel T. Horsfield
Julian Parkhill
Jukka Corander
author_sort John A. Lees
title Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions
publisher American Society for Microbiology
publishDate 2020
url https://doaj.org/article/39a9c15c29234af282ab753ee02f2acb
_version_ 1718427081028665344
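
Illustrative note on the abstract: the description above says that variants are encoded as unitigs and that the model is fitted with elastic net penalization, an extension of standard multiple regression. The sketch below is a minimal, hypothetical illustration of that general idea and is not the authors' implementation. It assumes a glmnet-style objective, (1/2n) * sum of squared residuals + lam * (alpha * L1 norm + (1 - alpha)/2 * squared L2 norm), fitted by plain coordinate descent. The presence/absence matrix, the phenotype vector, and every function and parameter name (fit_elastic_net, lam, alpha, n_iter) are invented for the example.

```python
# Toy elastic net fitted by coordinate descent, standard library only.
# All data and names here are illustrative assumptions, not from the article.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, lam=0.05, alpha=0.5, n_iter=200):
    """Minimize (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    residual = [y[i] - intercept for i in range(n)]
    for _ in range(n_iter):
        # The unpenalized intercept absorbs the mean of the residuals.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(x * x for x in col) / n
            if z == 0.0:
                continue  # variant absent from every isolate: keep weight 0
            # Correlation of column j with the residual that excludes column j.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [residual[i] - col[i] * delta for i in range(n)]
                beta[j] = new_beta
    return intercept, beta


def predict(intercept, beta, row):
    """Linear predictor for one isolate's presence/absence row."""
    return intercept + sum(b * x for b, x in zip(beta, row))


# Rows are isolates; columns are presence (1) or absence (0) of four made-up
# unitigs. Column 0 matches the phenotype, column 1 is partly correlated with
# it, column 2 is unrelated, and column 3 is absent everywhere.
X = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # invented binary phenotype, e.g. resistant = 1

intercept, beta = fit_elastic_net(X, y)
print("intercept:", round(intercept, 3))
print("weights:", [round(b, 3) for b in beta])
```

In this toy setting all variants are fitted jointly, and the penalty pushes the weights of uninformative columns toward zero, so the nonzero weights can be read as the selected variants. A real analysis would involve far more unitigs than isolates, would need to account for population structure, and would choose lam by cross-validation; none of that is shown here.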