K‐mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Abstract: The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole‐genome alignment, promoter analysis, or pangenome exploration. Although homology‐based annotation methods are computationally expensive, k‐mer strategies for masking are orders of magnitude faster.
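
To make the idea of k‐mer counting for repeat masking concrete, the following is a minimal sketch and not the algorithm used by Repeat Detector (Red) or by the authors' scripts. It assumes a toy rule: any k‐mer seen at least `min_count` times in the sequence is treated as repetitive and soft‐masked (lowercased). The function names `count_kmers` and `soft_mask`, and the parameters `k` and `min_count`, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Lowercase every base covered by a k-mer seen >= min_count times.

    Toy assumption: raw frequency alone marks a k-mer as repetitive.
    Real tools apply more elaborate models on top of the counts.
    """
    upper = seq.upper()
    counts = count_kmers(upper, k)
    masked = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            # Mark all k positions spanned by this frequent k-mer.
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(upper, masked))


if __name__ == "__main__":
    # The tandem ACGT copies are masked; the unique tail stays uppercase.
    print(soft_mask("ACGTACGTACGTTTGCA", k=4, min_count=3))
```

In a two‐step protocol like the one the article benchmarks, the lowercase intervals from a step like this would then be compared against a curated repeat library to classify them; that second step is not sketched here.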


Bibliographic Details
Main authors: Bruno Contreras‐Moreira, Carla V Filippi, Guy Naamati, Carlos García Girón, James E Allen, Paul Flicek
Format: article
Language: EN
Published: Wiley 2021
Subjects: Plant culture, Genetics
Online access: https://doaj.org/article/b532ac8f92da477eab4920c1f846cd9b
id oai:doaj.org-article:b532ac8f92da477eab4920c1f846cd9b
record_format dspace
spelling oai:doaj.org-article:b532ac8f92da477eab4920c1f846cd9b
2021-12-05T07:50:12Z
K‐mer counting and curated libraries drive efficient annotation of repeats in plant genomes
1940-3372
10.1002/tpg2.20143
https://doaj.org/article/b532ac8f92da477eab4920c1f846cd9b
2021-11-01T00:00:00Z
https://doi.org/10.1002/tpg2.20143
https://doaj.org/toc/1940-3372
Bruno Contreras‐Moreira
Carla V Filippi
Guy Naamati
Carlos García Girón
James E Allen
Paul Flicek
Wiley
article
Plant culture
SB1-1110
Genetics
QH426-470
EN
The Plant Genome, Vol 14, Iss 3, Pp n/a-n/a (2021)
institution DOAJ
collection DOAJ
language EN
topic Plant culture
SB1-1110
Genetics
QH426-470
spellingShingle Plant culture
SB1-1110
Genetics
QH426-470
Bruno Contreras‐Moreira
Carla V Filippi
Guy Naamati
Carlos García Girón
James E Allen
Paul Flicek
K‐mer counting and curated libraries drive efficient annotation of repeats in plant genomes
description Abstract The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole‐genome alignment, promoter analysis, or pangenome exploration. Although homology‐based annotation methods are computationally expensive, k‐mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two‐step approach, where repeats were first called by k‐mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k‐mer‐based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red‐masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
format article
author Bruno Contreras‐Moreira
Carla V Filippi
Guy Naamati
Carlos García Girón
James E Allen
Paul Flicek
author_facet Bruno Contreras‐Moreira
Carla V Filippi
Guy Naamati
Carlos García Girón
James E Allen
Paul Flicek
author_sort Bruno Contreras‐Moreira
title K‐mer counting and curated libraries drive efficient annotation of repeats in plant genomes
publisher Wiley
publishDate 2021
url https://doaj.org/article/b532ac8f92da477eab4920c1f846cd9b