Morpheme Embedding for Bahasa Indonesia Using Modified Byte Pair Encoding

Word embedding is an efficient feature representation that carries semantic and syntactic information. Word-level embedding treats words as the smallest independent units and cannot handle words that are not in the training corpus. One solution is to generate embeddings from more...

Bibliographic Details
Main Authors:	Amalia Amalia, Opim Salim Sitompul, Teddy Mantoro, Erna Budhiarti Nababan
Format:	article
Language:	EN
Published:	IEEE 2021
Subjects:	Morphological embedding; Bahasa Indonesia linguistic; word segmentation; Electrical engineering. Electronics. Nuclear engineering; TK1-9971
Online Access:	https://doaj.org/article/eb8bfc9bfc5644308d925a818fa0dfc3

id	oai:doaj.org-article:eb8bfc9bfc5644308d925a818fa0dfc3
record_format	dspace
spelling	oai:doaj.org-article:eb8bfc9bfc5644308d925a818fa0dfc3 | 2021-12-01T00:00:32Z | Morpheme Embedding for Bahasa Indonesia Using Modified Byte Pair Encoding | ISSN 2169-3536 | DOI 10.1109/ACCESS.2021.3128439 | https://doaj.org/article/eb8bfc9bfc5644308d925a818fa0dfc3 | 2021-01-01T00:00:00Z | https://ieeexplore.ieee.org/document/9615151/ | https://doaj.org/toc/2169-3536 | Amalia Amalia; Opim Salim Sitompul; Teddy Mantoro; Erna Budhiarti Nababan | IEEE | article | EN | IEEE Access, Vol 9, Pp 155699-155710 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Morphological embedding; Bahasa Indonesia linguistic; word segmentation; Electrical engineering. Electronics. Nuclear engineering; TK1-9971
description	Word embedding is an efficient feature representation that carries semantic and syntactic information. Word-level embedding treats words as the smallest independent units and cannot handle words that are not in the training corpus. One solution is to generate embeddings from smaller parts of words, such as morphemes. A morpheme is the smallest part of a word that carries meaning as a grammatical unit of a language. This study aims to build a morpheme embedding model for Bahasa Indonesia (in English: the Indonesian language; in short: Bahasa). However, Bahasa has many morphological rules, such as those for inflectional and derivational affixes, and applying all of them during word segmentation increases computational complexity. Moreover, the rules are neither regular nor uniform across all words in Bahasa. Therefore, this study modified the Byte Pair Encoding (BPE) algorithm to generate morpheme embeddings appropriate to the morphology of Bahasa. The study implemented a simple method: filtering the BPE segmentation results against a list of Bahasa's morphemes. This filtering was shown to address a limitation of the conventional BPE algorithm, which produces intermediate junk tokens that are not meaningful. Based on three evaluation scenarios, the model can handle out-of-vocabulary (OOV) words and carries semantic and syntactic information in the embedding values of the words.
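The filtering idea in the description can be illustrated with a minimal Python sketch. This is only one possible reading of "filtering the BPE segmentation results with the list of Bahasa's morphemes": the record does not specify the authors' actual procedure. The function names, the greedy longest-match segmentation, and the toy corpus and morpheme list below are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_vocab(word_freqs, num_merges):
    """Plain BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = {ch for symbols in corpus for ch in symbols}
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Ties are broken by the pair itself so the result is deterministic.
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.add(merged)
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return vocab


def filter_vocab(vocab, morphemes):
    """Assumed filtering step: keep only BPE tokens found in the morpheme list."""
    return {token for token in vocab if token in morphemes}


def segment(word, kept):
    """Greedy longest-match segmentation; unmatched characters stay single."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in kept or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out


if __name__ == "__main__":
    # Toy data, hypothetical: not taken from the paper.
    toy_corpus = {"membaca": 5, "membawa": 4, "baca": 6, "bawa": 3}
    toy_morphemes = {"mem", "baca", "bawa"}
    kept = filter_vocab(learn_bpe_vocab(toy_corpus, 20), toy_morphemes)
    print(sorted(kept))
    print(segment("membaca", kept))
```

Intermediate merges that BPE creates on the way to longer units are dropped by `filter_vocab` unless they appear in the morpheme list, which is the junk-token problem the description mentions. A word that matches no kept token falls back to single characters in this sketch; how the paper itself handles such cases is not stated in the record.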
format	article
author	Amalia Amalia; Opim Salim Sitompul; Teddy Mantoro; Erna Budhiarti Nababan
title	Morpheme Embedding for Bahasa Indonesia Using Modified Byte Pair Encoding
publisher	IEEE
publishDate	2021
url	https://doaj.org/article/eb8bfc9bfc5644308d925a818fa0dfc3
