Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data

Background: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions – approaches which are susceptible to misrepresentation and...

Bibliographic Details
Main authors:	Robert J O’Shea, Sophia Tsoka, Gary JR Cook, Vicky Goh
Format:	article
Language:	EN
Published:	SAGE Publishing 2021
Subjects:	Neoplasms. Tumors. Oncology. Including cancer and carcinogens RC254-282
Online access:	https://doaj.org/article/d2c6613ebad14b04be7cce97ef0de93f

id	oai:doaj.org-article:d2c6613ebad14b04be7cce97ef0de93f
record_format	dspace
spelling	oai:doaj.org-article:d2c6613ebad14b04be7cce97ef0de93f; 2021-12-01T22:35:48Z; Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data; ISSN 1176-9351; DOI 10.1177/11769351211056298; https://doaj.org/article/d2c6613ebad14b04be7cce97ef0de93f; 2021-11-01T00:00:00Z; https://doi.org/10.1177/11769351211056298; https://doaj.org/toc/1176-9351; Cancer Informatics, Vol 20 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Neoplasms. Tumors. Oncology. Including cancer and carcinogens RC254-282
description	Background: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions – approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world data-driven approach for comparing performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, $L_0L_1$ penalisation and $L_0L_2$ penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets to algorithmic selection by internal cross-validation. Methods: Five large ($n \approx 4000$) genomic datasets were extracted from Gene Expression Omnibus. ‘Gold-standard’ regression models were trained on subspaces of these datasets ($n \approx 4000$, $p = 500$). Penalised regression models were trained on small samples from these subspaces ($n \in \{25, 75, 150\}$, $p = 500$) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty ‘preselection’ according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation. Results: $L_1L_2$-penalisation achieved the highest cosine similarity between estimated coefficients and those of gold-standard models. $L_0L_2$-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal:noise conditions. $L_0L_2$ also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation in each of 3 examined metrics. Conclusions: This analysis explores a novel approach for comparisons of model selection approaches in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of $L_0L_2$ penalisation for structural selection and $L_1L_2$ penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.
format	article
author	Robert J O’Shea Sophia Tsoka Gary JR Cook Vicky Goh
author_sort	Robert J O’Shea
title	Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data
publisher	SAGE Publishing
publishDate	2021
url	https://doaj.org/article/d2c6613ebad14b04be7cce97ef0de93f
_version_	1718404127611944960
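
The description field above says that estimated coefficients were compared with those of gold-standard models by cosine similarity, and that variable selection was scored by F1. The sketch below illustrates how two such metrics could be computed. It is not the authors' code: the function names, the `tol` threshold and the toy coefficient vectors are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(beta_hat, beta_gold):
    """Cosine of the angle between two coefficient vectors (0.0 if either is all zero)."""
    denom = np.linalg.norm(beta_hat) * np.linalg.norm(beta_gold)
    return float(beta_hat @ beta_gold / denom) if denom > 0 else 0.0


def selection_f1(beta_hat, beta_gold, tol=0.0):
    """F1 score of the estimated support (|coef| > tol) against the gold-standard support."""
    selected = np.abs(beta_hat) > tol
    gold = np.abs(beta_gold) > tol
    true_pos = np.sum(selected & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / selected.sum()
    recall = true_pos / gold.sum()
    return float(2 * precision * recall / (precision + recall))


# Hypothetical toy vectors: a "gold-standard" fit and a sparse estimate
# from a much smaller sample (values invented for illustration).
beta_gold = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
beta_hat = np.array([0.8, 0.0, -1.5, 0.3, 0.0])

cos = cosine_similarity(beta_hat, beta_gold)
f1 = selection_f1(beta_hat, beta_gold)
print(f"cosine similarity = {cos:.3f}, selection F1 = {f1:.3f}")
```

Cosine similarity rewards getting the direction of the coefficient vector right even when shrinkage changes its scale, while F1 looks only at which variables were selected. That difference may be why the description reports different penalties leading on each metric.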
