Text Classification Model Enhanced by Unlabeled Data for LaTeX Formula


Bibliographic Details
Main Authors: Hua Cheng, Renjie Yu, Yixin Tang, Yiquan Fang, Tao Cheng
Format: Article
Language: English
Published: MDPI AG, 2021
Subjects: unlabeled data, self-training, pretraining, BERT, LaTeX formula, Technology (T)
Online Access: https://doaj.org/article/2b40f93513304981a835803b16c74883
id oai:doaj.org-article:2b40f93513304981a835803b16c74883
record_format dspace
spelling oai:doaj.org-article:2b40f93513304981a835803b16c74883 2021-11-25T16:30:58Z
DOI: 10.3390/app112210536
ISSN: 2076-3417
Published online: 2021-11-01
Full text: https://www.mdpi.com/2076-3417/11/22/10536
Journal table of contents: https://doaj.org/toc/2076-3417
Source: Applied Sciences, Vol 11, Iss 22, p 10536 (2021)
institution DOAJ
collection DOAJ
language EN
topic unlabeled data
self-training
pretraining
BERT
LaTeX formula
Technology
T
Engineering (General). Civil engineering (General)
TA1-2040
Biology (General)
QH301-705.5
Physics
QC1-999
Chemistry
QD1-999
description Generic language models pretrained on large, unspecific domains are currently the foundation of NLP. Labeled data are limited for most model training because of the cost of manual annotation, especially in domains with many proper nouns, such as mathematics and biology, which affects the accuracy and robustness of model predictions. However, directly applying a generic language model to a specific domain does not work well. This paper introduces a BERT-based text classification model enhanced by unlabeled data (UL-BERT) for the LaTeX formula domain. A two-stage pretraining model based on BERT (TP-BERT) is pretrained on unlabeled data from the LaTeX formula domain. A double-prediction pseudo-labeling (DPP) method is introduced to obtain high-confidence pseudo-labels for unlabeled data by self-training. Moreover, a multi-round teacher–student training approach is proposed to train UL-BERT with few labeled data and a larger amount of pseudo-labeled unlabeled data. Experiments on classification in the LaTeX formula domain show that UL-BERT significantly improves classification accuracy, with the F1 score improved by up to 2.76%, while requiring fewer resources for model training. We conclude that the method may be applicable to other specific domains with abundant unlabeled data and limited labeled data.
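The method described above combines domain-specific pretraining, double-prediction pseudo-labeling, and multi-round teacher–student self-training. The short Python sketch below illustrates only the general self-training loop with confidence-filtered pseudo-labels, under stated assumptions: a TF-IDF plus logistic-regression pipeline stands in for the BERT-based classifier, two feature views supply the two predictions per unlabeled example, and the agreement rule, threshold, and round count are illustrative choices, not the paper's TP-BERT or DPP implementation.

# Minimal self-training sketch with confidence-filtered pseudo-labels.
# NOTE: illustrative stand-in only; not the paper's UL-BERT/TP-BERT/DPP code.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def make_classifier(char_ngrams=False):
    # Stand-in for the BERT-based classifier; two feature views (word tokens vs.
    # character n-grams) give two independent predictions per unlabeled example.
    vectorizer = (TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
                  if char_ngrams else TfidfVectorizer(token_pattern=r"\S+"))
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))

def pseudo_label(teachers, unlabeled_texts, threshold):
    # Keep an unlabeled example only if both teachers predict the same class
    # and both are confident (assumed stand-in for "double prediction" filtering).
    probs = [t.predict_proba(unlabeled_texts) for t in teachers]
    labels = [t.classes_[p.argmax(axis=1)] for t, p in zip(teachers, probs)]
    keep = ((labels[0] == labels[1])
            & (probs[0].max(axis=1) >= threshold)
            & (probs[1].max(axis=1) >= threshold))
    kept_texts = [x for x, k in zip(unlabeled_texts, keep) if k]
    kept_labels = list(labels[0][keep])
    return kept_texts, kept_labels

def self_train(labeled_texts, labels, unlabeled_texts, rounds=3, threshold=0.9):
    # Multi-round teacher-student loop: pseudo-labels accepted in one round are
    # added to the training set used to fit the next round's teachers and student.
    train_texts, train_labels = list(labeled_texts), list(labels)
    student = make_classifier().fit(train_texts, train_labels)
    for _ in range(rounds):
        teachers = [make_classifier(False).fit(train_texts, train_labels),
                    make_classifier(True).fit(train_texts, train_labels)]
        new_texts, new_labels = pseudo_label(teachers, unlabeled_texts, threshold)
        train_texts = list(labeled_texts) + new_texts
        train_labels = list(labels) + new_labels
        student = make_classifier().fit(train_texts, train_labels)
    return student

# Toy usage (real inputs would be LaTeX formula strings with domain labels):
X_lab = [r"\int_0^1 x^2 \, dx", r"\frac{d}{dx} e^x", r"\sum_{i=1}^{n} i", r"\prod_{k=1}^{n} k"]
y_lab = ["calculus", "calculus", "series", "series"]
X_unlab = [r"\lim_{x \to 0} \frac{\sin x}{x}", r"\sum_{k=0}^{\infty} x^k"]
model = self_train(X_lab, y_lab, X_unlab, rounds=2, threshold=0.6)
print(model.predict([r"\int e^x \, dx"]))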
format article
author Hua Cheng
Renjie Yu
Yixin Tang
Yiquan Fang
Tao Cheng
author_sort Hua Cheng
title Text Classification Model Enhanced by Unlabeled Data for LaTeX Formula
publisher MDPI AG
publishDate 2021
url https://doaj.org/article/2b40f93513304981a835803b16c74883
_version_ 1718413126195478528