A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC.

Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the...

Bibliographic Details
Main Authors: Jason Bennett, Mikhail Pomaznoy, Akul Singhania, Bjoern Peters
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2021
Subjects: Biology (General)
Online Access: https://doaj.org/article/2b76c97381b7453c92f5082552e84bb9
id oai:doaj.org-article:2b76c97381b7453c92f5082552e84bb9
record_format dspace
spelling oai:doaj.org-article:2b76c97381b7453c92f5082552e84bb9
2021-11-25T05:40:32Z
A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC.
1553-734X
1553-7358
10.1371/journal.pcbi.1009459
https://doaj.org/article/2b76c97381b7453c92f5082552e84bb9
2021-10-01T00:00:00Z
https://doi.org/10.1371/journal.pcbi.1009459
https://doaj.org/toc/1553-734X
https://doaj.org/toc/1553-7358
Jason Bennett
Mikhail Pomaznoy
Akul Singhania
Bjoern Peters
Public Library of Science (PLoS)
article
Biology (General)
QH301-705.5
EN
PLoS Computational Biology, Vol 17, Iss 10, p e1009459 (2021)
institution DOAJ
collection DOAJ
language EN
topic Biology (General)
QH301-705.5
description Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to its analysis and interpretation. The main problem in analyzing transcriptomic data is that the number of independent samples is typically much lower (<100) than the number of genes whose expression is quantified (typically >14,000). To address this, it is desirable to reduce the dimensionality of the gathered data without losing information. Clustering genes into discrete modules is one of the most commonly used approaches to this task. While there are multiple clustering approaches, informative metrics for evaluating the biological quality of the resulting clusters are lacking. Here we present a metric that uses known ground truth gene sets to quantify the biological quality of gene clusters derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but also computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules from peripheral blood mononuclear cells (PBMC), which we show to be superior to modules previously generated from whole blood. Overall, GECO provides a rational metric to test and compare different clustering approaches for analyzing high-dimensional transcriptomic data.
format article
author Jason Bennett
Mikhail Pomaznoy
Akul Singhania
Bjoern Peters
title A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC.
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doaj.org/article/2b76c97381b7453c92f5082552e84bb9