Efficient Estimate of Low-Frequency Words’ Embeddings Based on the Dictionary: A Case Study on Chinese

Obtaining high-quality embeddings of out-of-vocabularies (OOVs) and low-frequency words is a challenge in natural language processing (NLP). To efficiently estimate the embeddings of OOVs and low-frequency words, we propose a new method that uses the dictionary to estimate the embeddings of OOVs and...

Descripción completa

Guardado en:
Detalles Bibliográficos
Autores principales: Xianwen Liao, Yongzhong Huang, Changfu Wei, Chenhao Zhang, Yongqing Deng, Ke Yi
Formato: article
Lenguaje:EN
Publicado: MDPI AG 2021
Materias:
T
Acceso en línea:https://doaj.org/article/e07fd774836c4f9d9dc1720ba49603f4
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
Descripción
Sumario:Obtaining high-quality embeddings of out-of-vocabularies (OOVs) and low-frequency words is a challenge in natural language processing (NLP). To efficiently estimate the embeddings of OOVs and low-frequency words, we propose a new method that uses the dictionary to estimate the embeddings of OOVs and low-frequency words. More specifically, the explanatory note of an entry in dictionaries accurately describes the semantics of the corresponding word. Naturally, we adopt the sentence representation model to extract the semantics of the explanatory note and regard the semantics as the embedding of the corresponding word. We design a new sentence representation model to encode sentences to extract the semantics from the explanatory notes of entries more efficiently. Based on the assumption that the higher quality of word embeddings will lead to better performance, we design an extrinsic experiment to evaluate the quality of low-frequency words’ embeddings. The experimental results show that the embeddings of low-frequency words estimated by our proposed method have higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model can represent the semantics of sentences well.