A Sparse Transformer-Based Approach for Image Captioning

Image captioning is the task of providing a natural language description for an image. It has attracted significant attention from both the computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performance. However, it is difficult to model knowledge about the relationships between pairs of input image regions in the encoder. Furthermore, a word in the decoder has little knowledge of its correlation with specific image regions. In this article, a novel deep encoder-decoder model is proposed for image captioning, built on a sparse Transformer framework. The encoder adopts a multi-level representation of image features based on self-attention to exploit both low-level and high-level features; since self-attention can be seen as a way of encoding pairwise relationships, the correlations between image region pairs are naturally and adequately modeled. The decoder improves the concentration of multi-head self-attention on the global context by explicitly selecting the most relevant segments in each row of the attention matrix. This helps the model focus on the most informative image regions and generate more accurate words in context. Experiments demonstrate that our model outperforms previous methods and achieves higher performance on the MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.
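To make the decoder's explicit sparse attention concrete, here is a minimal sketch of row-wise top-k attention: for every row of the attention matrix, only the k largest scores are kept before the softmax, so each query attends to at most k image regions. This is an illustrative reconstruction, not the authors' implementation; the function name topk_sparse_attention, the fixed top_k parameter, and the tensor shapes are assumptions, and the paper's local adaptive threshold for choosing how many entries to keep per row is not reproduced here.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=8):
    """Scaled dot-product attention that keeps only the top-k scores
    in each row of the attention matrix before the softmax.
    (Hypothetical sketch; not the paper's exact implementation.)

    q, k, v: (batch, heads, seq_len, dim) tensors.
    top_k:   number of key positions each query may attend to.
    """
    d = q.size(-1)
    # Raw attention scores: (batch, heads, q_len, k_len).
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # For every row, find the k-th largest score and mask out everything
    # below it, so the softmax spreads mass over at most top_k keys.
    top_k = min(top_k, scores.size(-1))
    kth_score = scores.topk(top_k, dim=-1).values[..., -1:]  # (..., q_len, 1)
    sparse_scores = scores.masked_fill(scores < kth_score, float('-inf'))

    attn = F.softmax(sparse_scores, dim=-1)
    return torch.matmul(attn, v)

# Tiny usage example with random features standing in for image regions.
q = torch.randn(2, 8, 10, 64)   # 10 decoder positions
k = torch.randn(2, 8, 36, 64)   # 36 image regions
v = torch.randn(2, 8, 36, 64)
out = topk_sparse_attention(q, k, v, top_k=8)
print(out.shape)  # torch.Size([2, 8, 10, 64])
```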

Bibliographic Details
Main Authors: Zhou Lei, Congcong Zhou, Shengbo Chen, Yiyong Huang, Xianrui Liu
Format: Article
Language: English
Published: IEEE, 2020
Published in: IEEE Access, Vol. 8, pp. 213437-213446 (2020)
DOI: 10.1109/ACCESS.2020.3024639
ISSN: 2169-3536
Subjects: Image captioning; self-attention; explicit sparse; local adaptive threshold; Electrical engineering. Electronics. Nuclear engineering (TK1-9971)
Online Access: https://doaj.org/article/8d37acadce0441f6b826f861c201713c
Full Text: https://ieeexplore.ieee.org/document/9199872/