Machine learning random forest for predicting oncosomatic variant NGS analysis

Abstract Since 2017, we have used the IonTorrent NGS platform in our hospital to diagnose and treat cancer. Analyzing the variants from each run requires considerable time, and we still struggle with some variants that appear correct on the metrics at first, but are found to be negative upon further inv...

Bibliographic Details
Main Authors:	Eric Pellegrino, Coralie Jacques, Nathalie Beaufils, Isabelle Nanni, Antoine Carlioz, Philippe Metellus, L’Houcine Ouafik
Format:	article
Language:	EN
Published:	Nature Portfolio 2021
Subjects:	Medicine (R); Science (Q)
Online Access:	https://doaj.org/article/f2cc208aa9504b5da4bddc6b83fc1c28

id	oai:doaj.org-article:f2cc208aa9504b5da4bddc6b83fc1c28
record_format	dspace
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine (R); Science (Q)
description	Abstract Since 2017, we have used the IonTorrent NGS platform in our hospital to diagnose and treat cancer. Analyzing the variants from each run requires considerable time, and we still struggle with some variants that appear correct on the metrics at first but are found to be negative upon further investigation. Can a machine learning (ML) algorithm help us classify NGS variants? This question led us to investigate which ML algorithm fits our NGS data and to develop a tool that can be routinely implemented to help biologists. Currently, one of the greatest challenges in medicine is processing large quantities of data. This is particularly true in molecular biology, given the advantages of next-generation sequencing (NGS) for the molecular profiling and identification of tumors and their treatment. In addition to bioinformatics pipelines, artificial intelligence (AI) can be valuable in helping to analyze mutation variants. Generating sequencing data from patient DNA samples has become easy in clinical trials. However, analyzing the massive quantities of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skills and a panel of bioinformatic and biostatistical tools, among which artificial intelligence is now proving successful for developing future routine diagnostics. In addition, cancer genome complexity and technical artifacts make identifying real variants challenging. We present a machine learning method for classifying pathogenic single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), multiple nucleotide variants (MNVs), insertions, and deletions detected by NGS in different types of tumor specimens, such as colorectal cancer, melanoma, lung cancer and glioma. We evaluated different machine learning algorithms, using k-fold cross-validation, and neural networks (deep learning) on our NGS data to measure their performance and determine which one is a valid model for confirming NGS variant calls in cancer diagnosis. We trained our models on 70% of our data samples, extracted from our local database (each record had 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description), and validated them on the remaining 30%. The model offering the best accuracy was chosen and implemented in the NGS analysis routine. The models were developed in the R language, version 3.6.0. We trained our model on 70% of 102,011 variants. The best error rate (0.22%) was obtained with a random forest (ntree = 500 and mtry = 4), with an AUC of 0.99. Neural networks also scored well: the final trained neural network achieved an accuracy of 98% and an ROC-AUC of 0.99 on the validation data. We used our RF model to interpret more than 2000 variants from our NGS database: 20 variants were misclassified (error rate < 1%). The errors were due to nomenclature problems and false positives. After we added the false positives to our training database and implemented our RF model routinely, our error rate was always < 0.5%. The RF model shows excellent results for oncosomatic NGS interpretation and can easily be implemented in other molecular biology laboratories. AI is becoming increasingly important in molecular biomedical analysis and can be very helpful in processing medical data. Neural networks show a good capacity for variant classification, and in the future, they may be useful in predicting more complex variants.
format	article
author	Eric Pellegrino, Coralie Jacques, Nathalie Beaufils, Isabelle Nanni, Antoine Carlioz, Philippe Metellus, L’Houcine Ouafik
title	Machine learning random forest for predicting oncosomatic variant NGS analysis
publisher	Nature Portfolio
publishDate	2021
url	https://doaj.org/article/f2cc208aa9504b5da4bddc6b83fc1c28
spelling	oai:doaj.org-article:f2cc208aa9504b5da4bddc6b83fc1c28 | 2021-11-14T12:17:39Z | Machine learning random forest for predicting oncosomatic variant NGS analysis | 10.1038/s41598-021-01253-y | 2045-2322 | https://doaj.org/article/f2cc208aa9504b5da4bddc6b83fc1c28 | 2021-11-01T00:00:00Z | https://doi.org/10.1038/s41598-021-01253-y | https://doaj.org/toc/2045-2322 | Eric Pellegrino, Coralie Jacques, Nathalie Beaufils, Isabelle Nanni, Antoine Carlioz, Philippe Metellus, L’Houcine Ouafik | Nature Portfolio | article | Medicine (R); Science (Q) | EN | Scientific Reports, Vol 11, Iss 1, Pp 1-14 (2021)
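
The evaluation protocol summarized in the abstract (a 70/30 hold-out split, k-fold cross-validation, an error rate and an ROC-AUC) can be illustrated with a minimal Python sketch. This is not the authors' R code and it trains no model: the function names, the fixed seed, the choice of k = 10 and the toy scores are assumptions made for illustration only. The abstract does not state the value of k or how the split was drawn.

```python
import random


def holdout_split(n_items, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n_items-1 and split them into train/validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary illustrative choice
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_items, k=10, seed=0):
    """Partition shuffled indices into k folds whose sizes differ by at most one."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def error_rate(n_misclassified, n_total):
    """Fraction of variants whose predicted class disagrees with the reference."""
    return n_misclassified / n_total


def roc_auc(scores, labels):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is acceptable for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# 102,011 is the variant count reported in the abstract.
train_idx, valid_idx = holdout_split(102_011)
folds = k_fold_indices(len(train_idx), k=10)  # k = 10 is an assumption
print(len(train_idx), len(valid_idx), [len(f) for f in folds])

# 20 errors over 2000 variants gives the upper bound on the reported rate,
# since the abstract says "more than 2000 variants".
print(f"{error_rate(20, 2000):.2%}")

# Hypothetical classifier scores, not data from the paper.
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(toy_auc)
```

In a real pipeline the index lists would select rows of the variant table (chromosome, position, exon, variant allele frequency, minor allele frequency, coverage, protein description), and a classifier would be fitted on the training rows before the error rate and AUC are computed on the held-out rows.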
