Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by...
Main Authors: | Yuval Bussi, Ruti Kapon, Ziv Reich |
---|---|
Format: | article |
Language: | EN |
Published: |
Public Library of Science (PLoS)
2021
|
Subjects: | Medicine, Science |
Online Access: | https://doaj.org/article/245f2896b0d84253a7e346dc3bf9ac4d |
id |
oai:doaj.org-article:245f2896b0d84253a7e346dc3bf9ac4d |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:245f2896b0d84253a7e346dc3bf9ac4d; 2021-12-02T20:16:55Z; Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.; 1932-6203; 10.1371/journal.pone.0258693; https://doaj.org/article/245f2896b0d84253a7e346dc3bf9ac4d; 2021-01-01T00:00:00Z; https://doi.org/10.1371/journal.pone.0258693; https://doaj.org/toc/1932-6203; Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.; Yuval Bussi; Ruti Kapon; Ziv Reich; Public Library of Science (PLoS); article; Medicine; R; Science; Q; EN; PLoS ONE, Vol 16, Iss 10, p e0258693 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Medicine R Science Q |
spellingShingle |
Medicine R Science Q Yuval Bussi Ruti Kapon Ziv Reich Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
description |
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity. |
format |
article |
author |
Yuval Bussi Ruti Kapon Ziv Reich |
author_facet |
Yuval Bussi Ruti Kapon Ziv Reich |
author_sort |
Yuval Bussi |
title |
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
title_short |
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
title_full |
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
title_fullStr |
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
title_full_unstemmed |
Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
title_sort |
large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. |
publisher |
Public Library of Science (PLoS) |
publishDate |
2021 |
url |
https://doaj.org/article/245f2896b0d84253a7e346dc3bf9ac4d |
work_keys_str_mv |
AT yuvalbussi largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy AT rutikapon largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy AT zivreich largescalekmerbasedanalysisoftheinformationalpropertiesofgenomescomparativegenomicsandtaxonomy |
_version_ |
1718374393166430208 |
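
The description field above refers to two quantities: the sequence space coverage of a genome at a given k-mer length, and the pairwise Jaccard similarity between the k-mer sets of two genomes. The sketch below illustrates both on toy strings. It is only an illustration under stated assumptions, not the authors' pipeline: the function names (`kmers`, `jaccard`, `coverage`) are hypothetical, k-mers are taken from the forward strand only (no canonical or reverse-complement handling), and windows containing characters other than A, C, G, T are skipped.

```python
# Minimal sketch of k-mer set comparison; all names here are illustrative
# assumptions and do not come from the catalogued article.

DNA = set("ACGT")


def kmers(seq: str, k: int) -> set:
    """Return the distinct k-mers of seq (forward strand only).

    Windows containing any character outside A/C/G/T are skipped.
    """
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA
    }


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def coverage(kmer_set: set, k: int) -> float:
    """Fraction of the 4**k possible k-mers that are present in kmer_set."""
    return len(kmer_set) / 4 ** k


if __name__ == "__main__":
    genome_a = "ACGTACGT"   # toy sequences, not real genomes
    genome_b = "ACGTTTTT"
    k = 3
    set_a, set_b = kmers(genome_a, k), kmers(genome_b, k)
    print("Jaccard similarity:", jaccard(set_a, set_b))
    print("Coverage of genome_a:", coverage(set_a, k))
```

For genome-scale data and the k-mer lengths named in the description (11 to 41), holding full k-mer sets as Python strings would be memory-hungry; a dedicated k-mer counter or a sketching approach would be the more practical choice, which is outside the scope of this illustration.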