Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Abstract. Background: The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the detection of rare cells has been greatly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challe...

Bibliographic Details
Main Authors:	Saptarshi Bej, Anne-Marie Galow, Robert David, Markus Wolfien, Olaf Wolkenhauer
Format:	article
Language:	EN
Published:	BMC 2021
Subjects:	Single-cell RNA-sequencing; Imbalanced datasets; Rare cell type detection; LoRAS algorithm; Automated cell annotation; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5
Online Access:	https://doaj.org/article/84da11c5e74a475d9c97338ac7622f7c

id	oai:doaj.org-article:84da11c5e74a475d9c97338ac7622f7c
record_format	dspace
spelling	oai:doaj.org-article:84da11c5e74a475d9c97338ac7622f7c; 2021-11-21T12:09:14Z; DOI 10.1186/s12859-021-04469-x; ISSN 1471-2105; https://doaj.org/article/84da11c5e74a475d9c97338ac7622f7c; 2021-11-01T00:00:00Z; https://doi.org/10.1186/s12859-021-04469-x; https://doaj.org/toc/1471-2105; BMC Bioinformatics, Vol 22, Iss 1, Pp 1-17 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Single-cell RNA-sequencing; Imbalanced datasets; Rare cell type detection; LoRAS algorithm; Automated cell annotation; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5
description	Abstract. Background: The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the detection of rare cells has been greatly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises because rare-cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as input to generate synthetic cells, which are then used to identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. Results: We demonstrate the effectiveness of our method for three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). This use case was designed to take a larger imbalance ratio (~1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq and scRNA-Seq data at a lower imbalance ratio (~1 to 26) for the training step, in order to investigate whether the algorithm can handle both single-cell capture procedures and the impact of “less” rare cell types. The third dataset is the murine data of the Allen Brain Atlas, including more than 1 million cells. For validation purposes only, all datasets have also been analyzed traditionally using common data analysis approaches, such as the Seurat workflow. Conclusions: In comparison to baseline testing without oversampling, our approach identifies rare cells with a robust precision-recall balance, including high accuracy and a low false positive detection rate. A practical benefit of our algorithm is that it can be readily implemented in other, existing workflows. The code base in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.
format	article
author	Saptarshi Bej; Anne-Marie Galow; Robert David; Markus Wolfien; Olaf Wolkenhauer
author_sort	Saptarshi Bej
title	Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling
publisher	BMC
publishDate	2021
url	https://doaj.org/article/84da11c5e74a475d9c97338ac7622f7c
_version_	1718419196390408192
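
The abstract in this record describes the oversampling idea only in words. As a rough illustration, the Python sketch below shows a simplified LoRAS-style oversampler: it adds small Gaussian noise to minority-class neighbours ("shadowsamples") and averages a few of them with random convex weights. It is not the authors' sc-SynO code. The function names `imbalance_ratio` and `loras_like_oversample`, every parameter name and default value, and the random toy data are assumptions made for this example; only the cell counts (17 of 8635) come from the record.

```python
import numpy as np


def imbalance_ratio(n_minority, n_total):
    """Majority cells per minority cell (hypothetical helper)."""
    return (n_total - n_minority) / n_minority


def loras_like_oversample(X_min, n_synthetic, k=5, n_shadow=10,
                          n_affine=3, sigma=0.005, seed=0):
    """Simplified LoRAS-style oversampling of minority samples X_min.

    All defaults are illustrative assumptions, not published settings.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    k = min(k, n)

    # Pairwise squared distances among minority samples only.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    # k nearest minority neighbours of each sample (the sample itself included).
    neighbours = np.argsort(d2, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        centre = rng.integers(n)
        hood = X_min[neighbours[centre]]
        # Shadowsamples: noisy copies of every point in the neighbourhood.
        shadows = np.repeat(hood, n_shadow, axis=0)
        shadows = shadows + rng.normal(0.0, sigma, size=shadows.shape)
        # Random convex combination of a few shadowsamples.
        m = min(n_affine, len(shadows))
        picked = shadows[rng.choice(len(shadows), size=m, replace=False)]
        weights = rng.dirichlet(np.ones(m))
        synthetic[i] = weights @ picked
    return synthetic


if __name__ == "__main__":
    n_minority, n_total = 17, 8635  # counts quoted in the abstract
    print(round(imbalance_ratio(n_minority, n_total)))  # roughly 507, i.e. ~1 to 500

    # Toy stand-in for an expression matrix: 17 rare cells x 20 features.
    toy_minority = np.random.default_rng(42).random((n_minority, 20))
    # Generate enough synthetic cells to match the majority class size.
    n_needed = (n_total - n_minority) - n_minority
    synthetic_cells = loras_like_oversample(toy_minority, n_needed)
    print(synthetic_cells.shape)
```

In a real workflow the synthetic cells would be appended to the training set with the minority label before fitting a classifier. The choice of classifier and of feature preprocessing is outside what this record states.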

Similar Items