EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality

Abstract Background To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the qual...

Bibliographic Details
Main authors: Madolyn L. MacDonald, Kelvin H. Lee
Format: article
Language: EN
Published: BMC 2021
Subjects: Genome assembly quality; Genome assembly; Machine learning; Chinese hamster; CHO cells
Online access: https://doaj.org/article/9149ddb6466743b6a1005ecd98f92813
id oai:doaj.org-article:9149ddb6466743b6a1005ecd98f92813
record_format dspace
spelling oai:doaj.org-article:9149ddb6466743b6a1005ecd98f92813
2021-11-28T12:11:15Z
10.1186/s12859-021-04480-2
1471-2105
2021-11-01T00:00:00Z
https://doi.org/10.1186/s12859-021-04480-2
https://doaj.org/toc/1471-2105
BMC Bioinformatics, Vol 22, Iss 1, Pp 1-26 (2021)
institution DOAJ
collection DOAJ
language EN
topic Genome assembly quality
Genome assembly
Machine learning
Chinese hamster
CHO cells
Computer applications to medicine. Medical informatics
R858-859.7
Biology (General)
QH301-705.5
description Abstract
Background: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment.
Results: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study.
Conclusions: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species.
format article
author Madolyn L. MacDonald
Kelvin H. Lee
title EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality
publisher BMC
publishDate 2021
url https://doaj.org/article/9149ddb6466743b6a1005ecd98f92813
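
The abstract states that the random forest regression model explained 86% of the variation in reference-based quality scores within the testing data. A figure of that kind is usually read as a coefficient of determination (R²), although the record itself does not name the statistic. The following is a minimal sketch of how such a value is computed, assuming that reading; the function name and the score values are hypothetical and are not taken from EvalDNA or the paper.

```python
def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    `reference` holds the reference-based quality scores and `predicted`
    the model's scores for the same assemblies (both hypothetical here).
    """
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two equally long sequences of at least 2 scores")
    mean_ref = sum(reference) / len(reference)
    # Total variation of the reference scores around their mean.
    ss_tot = sum((y - mean_ref) ** 2 for y in reference)
    if ss_tot == 0:
        raise ValueError("reference scores have no variation")
    # Variation left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(reference, predicted))
    return 1.0 - ss_res / ss_tot


# Illustrative values only, invented for this sketch.
reference_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted_scores = [12.0, 18.0, 33.0, 37.0, 52.0]
print(round(r_squared(reference_scores, predicted_scores), 3))
```

Under this definition, a value of 0.86 would mean that the squared error of the predictions is 14% of the total squared deviation of the reference scores from their mean.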