Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants
Abstract: Disease- and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regula...
Main authors: | Max Schubach, Matteo Re, Peter N. Robinson, Giorgio Valentini |
---|---|
Format: | article |
Language: | EN |
Published: | Nature Portfolio, 2017 |
Subjects: | Medicine (R); Science (Q) |
Online access: | https://doaj.org/article/cb143f66e0ab410fb2077b350ce7d69e |
id |
oai:doaj.org-article:cb143f66e0ab410fb2077b350ce7d69e |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:cb143f66e0ab410fb2077b350ce7d69e; 2021-12-02T11:53:08Z; Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants; 10.1038/s41598-017-03011-5; 2045-2322; https://doaj.org/article/cb143f66e0ab410fb2077b350ce7d69e; 2017-06-01T00:00:00Z; https://doi.org/10.1038/s41598-017-03011-5; https://doaj.org/toc/2045-2322; Abstract: Disease- and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.; Max Schubach; Matteo Re; Peter N. Robinson; Giorgio Valentini; Nature Portfolio; article; Medicine (R); Science (Q); EN; Scientific Reports, Vol 7, Iss 1, Pp 1-12 (2017) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Medicine (R); Science (Q) |
spellingShingle |
Medicine (R); Science (Q); Max Schubach; Matteo Re; Peter N. Robinson; Giorgio Valentini; Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants |
description |
Abstract: Disease- and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task. |
format |
article |
author |
Max Schubach; Matteo Re; Peter N. Robinson; Giorgio Valentini |
author_sort |
Max Schubach |
title |
Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants |
publisher |
Nature Portfolio |
publishDate |
2017 |
url |
https://doaj.org/article/cb143f66e0ab410fb2077b350ce7d69e |
_version_ |
1718394894045675520 |
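
The abstract describes imbalance-aware learning built on resampling and an ensemble of learners. The following is a minimal sketch of that general idea only, not the authors' method or software: it assumes a simple scheme in which the majority (non-deleterious) class is split into partitions, each partition is undersampled to the size of the minority class, and one learner is trained per partition. The function names, the nearest-centroid base learner, and the toy two-feature data are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_balanced_ensemble(positives, negatives, n_parts=5, seed=0):
    """Train one placeholder learner per partition of the majority class.

    Assumption: a nearest-centroid rule stands in for the base learner;
    a real application would plug in a stronger classifier here.
    """
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    # Split the majority class into n_parts disjoint partitions.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    models = []
    for part in parts:
        # Undersample: keep at most as many negatives as there are positives.
        k = min(len(part), len(positives))
        sample = rng.sample(part, k)
        models.append((centroid(positives), centroid(sample)))
    return models


def score(models, x):
    """Fraction of ensemble members placing x closer to the positive centroid."""
    votes = [sq_dist(x, pos) < sq_dist(x, neg) for pos, neg in models]
    return sum(votes) / len(votes)


# Toy data (hypothetical): 3 "disease-associated" points vs. 300 background points.
data_rng = random.Random(42)
positives = [[2.0, 2.0], [2.2, 1.8], [1.9, 2.1]]
negatives = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(300)]

models = train_balanced_ensemble(positives, negatives, n_parts=5)
print(score(models, [2.0, 2.0]))  # a point near the positive cluster
print(score(models, [0.0, 0.0]))  # a point inside the background region
```

Averaging votes across partitions lets every majority-class example contribute to some learner while each individual learner still trains on a balanced sample.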