TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Scene text recognition (STR) is an important bridge between images and text, attracting abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra context modeling module to help the CNN capture global dependencies, compensating for its inductive bias and strengthening the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling via its self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address the complexity challenge, and we replace the CNN with a transformer encoder so that a separate context modeling module is no longer needed. Furthermore, recent methods use a frozen initial embedding to guide the decoder in decoding the features into text, which leads to a loss of accuracy. We instead propose a learnable initial embedding, learned from the transformer encoder, that adapts to different input images. Building on these ideas, we introduce a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages: transformation, feature extraction, and prediction. Extensive experiments show that our approach achieves state-of-the-art accuracy on text recognition benchmarks.
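The abstract only outlines the architecture; to make its two central ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes full-height slicing for the 1-D split and mean pooling of the encoder output for the initial embedding; all module names, dimensions, and the pooling choice are hypothetical, and positional encodings and the transformation (rectification) stage are omitted.

# Minimal sketch of the two ideas in the abstract (assumed details, not the paper's code).
import torch
import torch.nn as nn

class TRIGSketch(nn.Module):
    def __init__(self, img_h=32, patch_w=4, d_model=256, nhead=8,
                 num_enc=6, num_dec=3, vocab_size=97, max_len=25):
        super().__init__()
        # 1-D split: each token is a full-height, patch_w-wide slice of the image,
        # so the sequence length is W / patch_w instead of (H / p) * (W / p).
        self.patch = nn.Conv2d(3, d_model, kernel_size=(img_h, patch_w),
                               stride=(img_h, patch_w))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_enc)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_dec)
        # Learnable per-step queries; the initial one is replaced below by an
        # image-dependent embedding instead of staying frozen.
        self.pos_queries = nn.Parameter(torch.randn(1, max_len, d_model))
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images):                      # images: (B, 3, 32, W)
        b = images.size(0)
        feats = self.patch(images)                  # (B, d_model, 1, W/patch_w)
        feats = feats.flatten(2).transpose(1, 2)    # (B, W/patch_w, d_model)
        memory = self.encoder(feats)                # global context via self-attention
        # Initial embedding guidance (assumed here as mean pooling): an embedding
        # derived from the encoder output, adaptive to each input image.
        init = memory.mean(dim=1, keepdim=True)     # (B, 1, d_model)
        queries = torch.cat([init, self.pos_queries[:, 1:].expand(b, -1, -1)], dim=1)
        out = self.decoder(queries, memory)         # (B, max_len, d_model)
        return self.head(out)                       # per-step character logits

For a 32x128 input with patch_w = 4, this 1-D split yields 32 tokens, whereas a 2-D 4x4 patch grid would yield 256; since self-attention cost grows quadratically with sequence length, this is the kind of efficiency gain the abstract refers to.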

Bibliographic Details
Main Authors: Yue Tao, Zhiwei Jia, Runze Ma, Shugong Xu
Format: article
Language: EN
Published: MDPI AG, 2021
Subjects: scene text recognition; transformer; self-attention; 1-D split; initial embedding; Electronics (TK7800-8360)
Online Access: https://doaj.org/article/5d61a56968d34aa99a6d35d3cd35b304
DOI: 10.3390/electronics10222780
ISSN: 2079-9292
Published online: 2021-11-01
Full text: https://www.mdpi.com/2079-9292/10/22/2780
Journal table of contents: https://doaj.org/toc/2079-9292
Source: Electronics, Vol 10, Iss 22, p 2780 (2021)