Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies
Tokenization is a crucial first step in training a Pre-trained Language Model (PLM), as it alleviates the challenging out-of-vocabulary problem in Natural Language Processing. Because tokenization strategies change how a model understands language, the composition of input features must be considered in light of the characteristics of the target language. This study answers the question "Which tokenization strategy best reflects the characteristics of the Korean language for the Named Entity Recognition (NER) task with a language model?", focusing on tokenization because it strongly affects the quality of input features. We present two major challenges that the agglutinative characteristics of Korean pose for the NER task, and then quantitatively and qualitatively analyze how each tokenization strategy copes with them. By adopting various linguistic segmentations such as morpheme, syllable, and subcharacter, we demonstrate their effectiveness and compare the performance of PLMs built on each tokenization strategy. We find that the most consistent strategy for the challenges of the Korean language is syllable-level tokenization based on SentencePiece.
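The abstract contrasts morpheme-, syllable-, and subcharacter-level segmentation of Korean. As a minimal illustrative sketch (not code from the article), the snippet below shows the latter two granularities using only standard Unicode Hangul decomposition; the helper names `syllables` and `subcharacters` are hypothetical, morpheme segmentation would additionally require an external analyzer (e.g., MeCab-ko), and the paper's subword vocabularies are assumed to be learned with SentencePiece on top of such units.

```python
# Sketch of two segmentation granularities for Korean text.
# Hangul syllable blocks occupy U+AC00..U+D7A3 and decompose algorithmically
# into a leading consonant, a vowel, and an optional trailing consonant.
CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # leading consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # trailing consonants ("" = none)


def syllables(text: str) -> list[str]:
    """Syllable-level tokens: each Hangul syllable block is one character."""
    return list(text)


def subcharacters(text: str) -> list[str]:
    """Subcharacter (jamo) tokens: decompose each syllable block arithmetically."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            offset = code - 0xAC00
            lead, rem = divmod(offset, 21 * 28)
            vowel, tail = divmod(rem, 28)
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if JONGSEONG[tail]:
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out


if __name__ == "__main__":
    word = "한국어"  # "Korean language"
    print(syllables(word))      # ['한', '국', '어']
    print(subcharacters(word))  # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']
```

Either token stream could then be fed to a subword learner such as SentencePiece to build the PLM vocabulary; the article's finding is that the syllable-level granularity combined with SentencePiece is the most consistent choice for Korean NER.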
Saved in:

| | |
| --- | --- |
| Main Authors: | Gyeongmin Kim, Junyoung Son, Jinsung Kim, Hyunhee Lee, Heuiseok Lim |
| Format: | article |
| Language: | EN |
| Published: | IEEE, 2021 |
| Published in: | IEEE Access, Vol. 9, pp. 151814-151823 (2021) |
| ISSN: | 2169-3536 |
| DOI: | 10.1109/ACCESS.2021.3126882 |
| Subjects: | Named entity recognition; Korean pre-trained language model; natural language processing; tokenization; linguistic segmentation; agglutinative language; Electrical engineering. Electronics. Nuclear engineering (TK1-9971) |
| Online Access: | https://doaj.org/article/d665db0beed8491ba11d763eda19afbd ; https://ieeexplore.ieee.org/document/9610031/ |