A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection

ABSTRACT Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral data...

Bibliographic Details
Main authors: Norman Goodacre, Aisha Aljanahi, Subhiksha Nandakumar, Mike Mikailov, Arifa S. Khan
Format: article
Language: EN
Published: American Society for Microbiology 2018
Subjects: RVDB; adventitious viruses; bioinformatics analysis; high-throughput sequencing; reference virus database; viral sequences; Microbiology
Online access: https://doaj.org/article/39ec26834cea474a9ecfb828cd3ab3ee
id oai:doaj.org-article:39ec26834cea474a9ecfb828cd3ab3ee
record_format dspace
spelling oai:doaj.org-article:39ec26834cea474a9ecfb828cd3ab3ee
2021-11-15T15:22:14Z
A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection
10.1128/mSphereDirect.00069-18
2379-5042
https://doaj.org/article/39ec26834cea474a9ecfb828cd3ab3ee
2018-04-01T00:00:00Z
https://journals.asm.org/doi/10.1128/mSphereDirect.00069-18
https://doaj.org/toc/2379-5042
mSphere, Vol 3, Iss 2 (2018)
institution DOAJ
collection DOAJ
language EN
topic RVDB
adventitious viruses
bioinformatics analysis
high-throughput sequencing
reference virus database
viral sequences
Microbiology
QR1-502
description ABSTRACT Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank.
IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.
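The first two steps described in the abstract (keyword-based selection of viral entries, then clustering at 98% identity with CD-HIT-EST) can be illustrated with a minimal sketch. This is not the authors' pipeline: the keyword patterns below are hypothetical stand-ins for the SEM-I/SEM-R criteria, which this record does not list, and the helper names `read_fasta`, `select_viral`, and `cd_hit_est_command` are invented for illustration. Only the 98% threshold comes from the abstract; the word size passed with `-n` is an assumption based on the CD-HIT guidance for high identity thresholds. The command is built as an argument list but not executed.

```python
import re

# Hypothetical keyword patterns; the real SEM-I / SEM-R criteria are not given here.
POSITIVE = re.compile(r"virus|viral|viroid|retrotransposon", re.IGNORECASE)
# Bacterial viruses are excluded; word boundaries avoid matching "macrophage".
NEGATIVE = re.compile(r"\b(?:bacterio)?phages?\b", re.IGNORECASE)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_viral(records):
    """Keep records whose header matches a positive term and no negative term."""
    return [
        (header, seq)
        for header, seq in records
        if POSITIVE.search(header) and not NEGATIVE.search(header)
    ]


def cd_hit_est_command(fasta_in, fasta_out, identity=0.98):
    """Build (but do not run) a cd-hit-est call clustering at the given identity.

    -c is the sequence identity threshold; -n 10 is an assumed word size.
    """
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", "10"]


if __name__ == "__main__":
    example = [
        ">X1 Human herpesvirus 1 partial genome", "ACGT", "GGCC",
        ">X2 Escherichia virus T4 bacteriophage", "TTTT",
        ">X3 Homo sapiens actin mRNA", "CCCC",
        ">X4 Drosophila copia retrotransposon", "AAAA",
    ]
    for header, seq in select_viral(read_fasta(example)):
        print(header, len(seq))
    print(" ".join(cd_hit_est_command("vdb.fasta", "vdb_clustered.fasta")))
```

A header-only filter of this kind is fast but coarse, which may be why the abstract describes a manual and computational review of the first-generation database followed by BLAST and HMMER confirmation of the clustered representatives.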
format article
author Norman Goodacre
Aisha Aljanahi
Subhiksha Nandakumar
Mike Mikailov
Arifa S. Khan
author_sort Norman Goodacre
title A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection
publisher American Society for Microbiology
publishDate 2018
url https://doaj.org/article/39ec26834cea474a9ecfb828cd3ab3ee
_version_ 1718428064105365504