Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We...

Descripción completa

Guardado en:
Detalles Bibliográficos
Autores principales: Yuansheng Liu, Jinyan Li
Formato: article
Lenguaje:EN
Publicado: Public Library of Science (PLoS) 2021
Materias:
Acceso en línea:https://doaj.org/article/762452c087c740ccbd501d129ae00806
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
id oai:doaj.org-article:762452c087c740ccbd501d129ae00806
record_format dspace
spelling oai:doaj.org-article:762452c087c740ccbd501d129ae008062021-12-02T19:57:24ZHamming-shifting graph of genomic short reads: Efficient construction and its application for compression.1553-734X1553-735810.1371/journal.pcbi.1009229https://doaj.org/article/762452c087c740ccbd501d129ae008062021-07-01T00:00:00Zhttps://doi.org/10.1371/journal.pcbi.1009229https://doaj.org/toc/1553-734Xhttps://doaj.org/toc/1553-7358Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 - 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.Yuansheng LiuJinyan LiPublic Library of Science (PLoS)articleBiology (General)QH301-705.5ENPLoS Computational Biology, Vol 17, Iss 7, p e1009229 (2021)
institution DOAJ
collection DOAJ
language EN
topic Biology (General)
QH301-705.5
spellingShingle Biology (General)
QH301-705.5
Yuansheng Liu
Jinyan Li
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
description Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 - 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
format article
author Yuansheng Liu
Jinyan Li
author_facet Yuansheng Liu
Jinyan Li
author_sort Yuansheng Liu
title Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
title_short Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
title_full Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
title_fullStr Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
title_full_unstemmed Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
title_sort hamming-shifting graph of genomic short reads: efficient construction and its application for compression.
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doaj.org/article/762452c087c740ccbd501d129ae00806
work_keys_str_mv AT yuanshengliu hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression
AT jinyanli hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression
_version_ 1718375847247740928