Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We...
Guardado en:
Autores principales: | , |
---|---|
Formato: | article |
Lenguaje: | EN |
Publicado: |
Public Library of Science (PLoS)
2021
|
Materias: | |
Acceso en línea: | https://doaj.org/article/762452c087c740ccbd501d129ae00806 |
Etiquetas: |
Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
|
id |
oai:doaj.org-article:762452c087c740ccbd501d129ae00806 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:762452c087c740ccbd501d129ae008062021-12-02T19:57:24ZHamming-shifting graph of genomic short reads: Efficient construction and its application for compression.1553-734X1553-735810.1371/journal.pcbi.1009229https://doaj.org/article/762452c087c740ccbd501d129ae008062021-07-01T00:00:00Zhttps://doi.org/10.1371/journal.pcbi.1009229https://doaj.org/toc/1553-734Xhttps://doaj.org/toc/1553-7358Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 - 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.Yuansheng LiuJinyan LiPublic Library of Science (PLoS)articleBiology (General)QH301-705.5ENPLoS Computational Biology, Vol 17, Iss 7, p e1009229 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Biology (General) QH301-705.5 |
spellingShingle |
Biology (General) QH301-705.5 Yuansheng Liu Jinyan Li Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
description |
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 - 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly. |
format |
article |
author |
Yuansheng Liu Jinyan Li |
author_facet |
Yuansheng Liu Jinyan Li |
author_sort |
Yuansheng Liu |
title |
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
title_short |
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
title_full |
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
title_fullStr |
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
title_full_unstemmed |
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. |
title_sort |
hamming-shifting graph of genomic short reads: efficient construction and its application for compression. |
publisher |
Public Library of Science (PLoS) |
publishDate |
2021 |
url |
https://doaj.org/article/762452c087c740ccbd501d129ae00806 |
work_keys_str_mv |
AT yuanshengliu hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression AT jinyanli hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression |
_version_ |
1718375847247740928 |