Efficient Estimate of Low-Frequency Words’ Embeddings Based on the Dictionary: A Case Study on Chinese

Obtaining high-quality embeddings of out-of-vocabulary words (OOVs) and low-frequency words is a challenge in natural language processing (NLP). To efficiently estimate the embeddings of OOVs and low-frequency words, we propose a new method that uses the dictionary to estimate the embeddings of OOVs and...

Bibliographic Details
Main Authors: Xianwen Liao, Yongzhong Huang, Changfu Wei, Chenhao Zhang, Yongqing Deng, Ke Yi
Format: article
Language: EN
Published: MDPI AG 2021
Subjects:
T
Online Access: https://doaj.org/article/e07fd774836c4f9d9dc1720ba49603f4
id oai:doaj.org-article:e07fd774836c4f9d9dc1720ba49603f4
record_format dspace
spelling oai:doaj.org-article:e07fd774836c4f9d9dc1720ba49603f4
2021-11-25T16:42:58Z
10.3390/app112211018
2076-3417
2021-11-01T00:00:00Z
https://www.mdpi.com/2076-3417/11/22/11018
https://doaj.org/toc/2076-3417
Applied Sciences, Vol 11, Iss 11018, p 11018 (2021)
institution DOAJ
collection DOAJ
language EN
topic natural language processing
word embedding
BERT
dictionary
Technology
T
Engineering (General). Civil engineering (General)
TA1-2040
Biology (General)
QH301-705.5
Physics
QC1-999
Chemistry
QD1-999
description Obtaining high-quality embeddings of out-of-vocabulary words (OOVs) and low-frequency words is a challenge in natural language processing (NLP). We propose a new method that uses a dictionary to estimate these embeddings efficiently. Specifically, the explanatory note of a dictionary entry accurately describes the semantics of the corresponding word, so we use a sentence representation model to extract the semantics of the explanatory note and treat the result as the embedding of that word. We design a new sentence representation model that encodes the explanatory notes of entries more efficiently. Based on the assumption that higher-quality word embeddings lead to better performance, we design an extrinsic experiment to evaluate the quality of low-frequency words’ embeddings. The experimental results show that the embeddings of low-frequency words estimated by our proposed method have higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model represents the semantics of sentences well.
format article
author Xianwen Liao
Yongzhong Huang
Changfu Wei
Chenhao Zhang
Yongqing Deng
Ke Yi
title Efficient Estimate of Low-Frequency Words’ Embeddings Based on the Dictionary: A Case Study on Chinese
publisher MDPI AG
publishDate 2021
url https://doaj.org/article/e07fd774836c4f9d9dc1720ba49603f4
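
The idea in the description, encoding a dictionary entry's explanatory note and using the result as the word's embedding, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vectors, the English example entries, and the names `KNOWN`, `DICTIONARY`, `encode_gloss`, and `estimate_embedding` are hypothetical, and a plain average of word vectors stands in for the sentence representation model that the article proposes, which this record does not specify.

```python
import math

# Hypothetical toy vectors for frequent words (stand-ins for pretrained embeddings).
KNOWN = {
    "large": [0.9, 0.1, 0.0],
    "wild": [0.7, 0.3, 0.1],
    "cat": [0.8, 0.2, 0.1],
    "small": [0.1, 0.9, 0.0],
    "garden": [0.0, 0.8, 0.3],
    "flower": [0.1, 0.9, 0.2],
}

# Hypothetical dictionary: rare word -> explanatory note (gloss).
DICTIONARY = {
    "ocelot": "a wild cat",
    "pansy": "a small garden flower",
}


def encode_gloss(gloss):
    """Stand-in sentence encoder: average the vectors of known words in the gloss.

    This is NOT the article's model; it only plays the same role in the pipeline.
    """
    vectors = [KNOWN[w] for w in gloss.lower().split() if w in KNOWN]
    if not vectors:
        raise ValueError("gloss contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def estimate_embedding(word, dictionary):
    """Estimate a rare word's embedding from its dictionary gloss."""
    return encode_gloss(dictionary[word])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


ocelot = estimate_embedding("ocelot", DICTIONARY)
pansy = estimate_embedding("pansy", DICTIONARY)

# The estimated vector for each rare word should sit closer to related frequent words.
print(cosine(ocelot, KNOWN["cat"]) > cosine(ocelot, KNOWN["flower"]))
print(cosine(pansy, KNOWN["flower"]) > cosine(pansy, KNOWN["cat"]))
```

In this sketch the quality of the estimated vector depends entirely on the encoder, which is why a stronger sentence representation model would be expected to matter in practice.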