Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing.

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probab...

Bibliographic Details
Main Authors:	Justin M Zook, Daniel Samarov, Jennifer McDaniel, Shurjo K Sen, Marc Salit
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2012
Subjects:	Medicine (R), Science (Q)
Online Access:	https://doaj.org/article/a35ecb62c8ca4ec2a8ace6b4145a01de

id	oai:doaj.org-article:a35ecb62c8ca4ec2a8ace6b4145a01de
record_format	dspace
spelling	oai:doaj.org-article:a35ecb62c8ca4ec2a8ace6b4145a01de; 2021-11-18T07:10:18Z; ISSN 1932-6203; DOI 10.1371/journal.pone.0041356; https://doaj.org/article/a35ecb62c8ca4ec2a8ace6b4145a01de; 2012-01-01T00:00:00Z; https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/22859977/?tool=EBI; https://doaj.org/toc/1932-6203; PLoS ONE, Vol 7, Iss 7, p e41356 (2012)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine (R), Science (Q)
description	While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
format	article
author	Justin M Zook Daniel Samarov Jennifer McDaniel Shurjo K Sen Marc Salit
title	Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing.
publisher	Public Library of Science (PLoS)
publishDate	2012
url	https://doaj.org/article/a35ecb62c8ca4ec2a8ace6b4145a01de
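
The description above reports improvements in "Phred-scaled quality score units". As a rough illustration of that scale, the sketch below converts an error probability to a Phred score and estimates an empirical quality from mismatch counts against a known reference such as a spike-in standard. The function names, the add-one pseudo-count, and the example counts are assumptions made for this sketch; it is not GATK's implementation and the numbers are not taken from the article.

```python
import math


def phred(p_error: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(probability of error)."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches: int, bases: int) -> float:
    """Empirical quality from mismatches observed against a known sequence.

    The (+1, +2) pseudo-count is an assumption of this sketch; it only
    keeps the estimate finite when no mismatches are observed.
    """
    return phred((mismatches + 1) / (bases + 2))


def quality_shift(reported_q: float, mismatches: int, bases: int) -> float:
    """Empirical minus reported quality.

    A negative value means the reported score overestimates base accuracy.
    """
    return empirical_quality(mismatches, bases) - reported_q


if __name__ == "__main__":
    # Hypothetical counts: 9 mismatches in 9998 bases that were reported at Q35.
    print(round(empirical_quality(9, 9998), 2))
    print(round(quality_shift(35.0, 9, 9998), 2))
```

In this hypothetical case the empirical error rate is about 1 in 1000, so bases reported at Q35 behave more like Q30, a shift of roughly 5 Phred units.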
