Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies

Tokenization is a critical first step in training a pre-trained language model (PLM), as it alleviates the challenging out-of-vocabulary problem in natural language processing. Because the tokenization strategy shapes how a model captures linguistic structure, the composition of input features must be chosen according to the characteristics of the target language. This study addresses the question "Which tokenization strategy best reflects the characteristics of Korean for language-model-based named entity recognition (NER)?", focusing on tokenization because it strongly determines the quality of the input features. We first present two major challenges that the agglutinative nature of Korean poses for NER. We then analyze, quantitatively and qualitatively, how each tokenization strategy copes with these challenges. By adopting several linguistic segmentation units, namely morpheme, syllable, and subcharacter, we compare the effectiveness and performance of PLMs built on each tokenization strategy. We validate that the strategy most consistent with the challenges of Korean is syllable-level tokenization based on SentencePiece.
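
As a purely illustrative aside (this sketch is not taken from the article): the abstract contrasts morpheme-, syllable-, and subcharacter-level segmentation. The short Python sketch below shows what syllable- and subcharacter (jamo)-level splitting of a Korean string look like, using the standard Unicode arithmetic for precomposed Hangul syllables. Morpheme-level segmentation would additionally require a Korean morphological analyzer, and the subword vocabulary discussed in the paper would be learned with SentencePiece over a training corpus; neither is shown here.

# Minimal sketch, not the authors' code: syllable vs. subcharacter (jamo) segmentation.
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 and decompose arithmetically:
# each syllable = leading consonant + vowel + optional tail consonant.

def syllable_tokens(text):
    """Syllable-level segmentation: each precomposed Hangul syllable is one token."""
    return [ch for ch in text if not ch.isspace()]

def jamo_tokens(ch):
    """Subcharacter segmentation: decompose one syllable into conjoining jamo."""
    code = ord(ch)
    if not 0xAC00 <= code <= 0xD7A3:      # pass non-Hangul characters through unchanged
        return [ch]
    index = code - 0xAC00
    lead, vowel, tail = index // 588, (index % 588) // 28, index % 28
    parts = [chr(0x1100 + lead), chr(0x1161 + vowel)]
    if tail:                              # the tail consonant (jongseong) is optional
        parts.append(chr(0x11A7 + tail))
    return parts

text = "한국어"                                        # "the Korean language"
print(syllable_tokens(text))                          # ['한', '국', '어']
print([j for ch in text for j in jamo_tokens(ch)])    # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']

Per the abstract, the paper finds that the syllable-level units combined with SentencePiece subword learning cope best with Korean's agglutinative morphology among the granularities compared.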


Bibliographic Details
Main Authors: Gyeongmin Kim, Junyoung Son, Jinsung Kim, Hyunhee Lee, Heuiseok Lim
Format: article
Language: EN
Published: IEEE 2021
Subjects: Named entity recognition; Korean pre-trained language model; natural language processing; tokenization; linguistic segmentation; agglutinative language
Online Access: https://doaj.org/article/d665db0beed8491ba11d763eda19afbd
id oai:doaj.org-article:d665db0beed8491ba11d763eda19afbd
record_format dspace
spelling oai:doaj.org-article:d665db0beed8491ba11d763eda19afbd
last_indexed 2021-11-17T00:01:11Z
issn 2169-3536
doi 10.1109/ACCESS.2021.3126882
date 2021-01-01T00:00:00Z
url https://ieeexplore.ieee.org/document/9610031/
url https://doaj.org/toc/2169-3536
source IEEE Access, Vol 9, Pp 151814-151823 (2021)
institution DOAJ
collection DOAJ
language EN
topic Named entity recognition
Korean pre-trained language model
natural language processing
tokenization
linguistic segmentation
agglutinative language
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
description Tokenization is a critical first step in training a pre-trained language model (PLM), as it alleviates the challenging out-of-vocabulary problem in natural language processing. Because the tokenization strategy shapes how a model captures linguistic structure, the composition of input features must be chosen according to the characteristics of the target language. This study addresses the question "Which tokenization strategy best reflects the characteristics of Korean for language-model-based named entity recognition (NER)?", focusing on tokenization because it strongly determines the quality of the input features. We first present two major challenges that the agglutinative nature of Korean poses for NER. We then analyze, quantitatively and qualitatively, how each tokenization strategy copes with these challenges. By adopting several linguistic segmentation units, namely morpheme, syllable, and subcharacter, we compare the effectiveness and performance of PLMs built on each tokenization strategy. We validate that the strategy most consistent with the challenges of Korean is syllable-level tokenization based on SentencePiece.
format article
author Gyeongmin Kim
Junyoung Son
Jinsung Kim
Hyunhee Lee
Heuiseok Lim
author_sort Gyeongmin Kim
title Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies
publisher IEEE
publishDate 2021
url https://doaj.org/article/d665db0beed8491ba11d763eda19afbd