Language Identification of Intra-Word Code-Switching for Arabic–English

Multilingual speakers tend to mix languages in text and speech, a phenomenon linguists call “code-switching” (CS). Speakers also switch between morphemes from different languages within a single word (intra-word CS). User-generated texts on social media are informal and contain a f...

Bibliographic details
Main authors:	Caroline Sabty, Islam Mesabah, Özlem Çetinoğlu, Slim Abdennadher
Format:	article
Language:	EN
Published:	Elsevier 2021
Subjects:	Natural language processing; Automatic language identification; Deep learning; Code-switched data; Arabic language; Computer engineering. Computer hardware; TK7885-7895; Electronic computers. Computer science; QA75.5-76.95
Online access:	https://doaj.org/article/a5ec6db515d345e290c61a851789e27e

id	oai:doaj.org-article:a5ec6db515d345e290c61a851789e27e
record_format	dspace
spelling	oai:doaj.org-article:a5ec6db515d345e290c61a851789e27e; 2021-12-02T05:03:33Z; Language Identification of Intra-Word Code-Switching for Arabic–English; 2590-0056; 10.1016/j.array.2021.100104; https://doaj.org/article/a5ec6db515d345e290c61a851789e27e; 2021-12-01T00:00:00Z; http://www.sciencedirect.com/science/article/pii/S2590005621000473; https://doaj.org/toc/2590-0056; Caroline Sabty; Islam Mesabah; Özlem Çetinoğlu; Slim Abdennadher; Elsevier; article; Natural language processing; Automatic language identification; Deep learning; Code-switched data; Arabic language; Computer engineering. Computer hardware; TK7885-7895; Electronic computers. Computer science; QA75.5-76.95; EN; Array, Vol 12, Iss , Pp 100104- (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Natural language processing; Automatic language identification; Deep learning; Code-switched data; Arabic language; Computer engineering. Computer hardware; TK7885-7895; Electronic computers. Computer science; QA75.5-76.95
spellingShingle	Natural language processing; Automatic language identification; Deep learning; Code-switched data; Arabic language; Computer engineering. Computer hardware; TK7885-7895; Electronic computers. Computer science; QA75.5-76.95; Caroline Sabty; Islam Mesabah; Özlem Çetinoğlu; Slim Abdennadher; Language Identification of Intra-Word Code-Switching for Arabic–English
description	Multilingual speakers tend to mix languages in text and speech, a phenomenon linguists call “code-switching” (CS). Speakers also switch between morphemes from different languages within a single word (intra-word CS). User-generated texts on social media are informal and contain a fair amount of different types of CS data. This data needs to be investigated and analyzed for several linguistic tasks. Language Identification (LID) is one of the important tasks to tackle for intra-word CS data; it involves segmenting mixed words and tagging each part with its corresponding language ID. This work aimed to create the first annotated Arabic–English (AR–EN) corpus for the intra-word CS LID task, along with a web-based application for data annotation. We implemented two baseline models for AR–EN text, using Naïve Bayes and a character BiLSTM. Our main model was built on segmental recurrent neural networks (SegRNN), and we investigated the use of different word embeddings with SegRNN. The best LID system for tagging the entire dataset was SegRNN alone, which achieved an F1-score of 94.84% overall and recognized mixed words with an F1-score of 81.15%. SegRNN with FastText embeddings achieved the highest F1-score for tagging mixed words, 81.45%.
format	article
author	Caroline Sabty; Islam Mesabah; Özlem Çetinoğlu; Slim Abdennadher
author_facet	Caroline Sabty; Islam Mesabah; Özlem Çetinoğlu; Slim Abdennadher
author_sort	Caroline Sabty
title	Language Identification of Intra-Word Code-Switching for Arabic–English
title_short	Language Identification of Intra-Word Code-Switching for Arabic–English
title_full	Language Identification of Intra-Word Code-Switching for Arabic–English
title_fullStr	Language Identification of Intra-Word Code-Switching for Arabic–English
title_full_unstemmed	Language Identification of Intra-Word Code-Switching for Arabic–English
title_sort	language identification of intra-word code-switching for arabic–english
publisher	Elsevier
publishDate	2021
url	https://doaj.org/article/a5ec6db515d345e290c61a851789e27e
work_keys_str_mv	AT carolinesabty languageidentificationofintrawordcodeswitchingforarabicenglish AT islammesabah languageidentificationofintrawordcodeswitchingforarabicenglish AT ozlemcetinoglu languageidentificationofintrawordcodeswitchingforarabicenglish AT slimabdennadher languageidentificationofintrawordcodeswitchingforarabicenglish
_version_	1718400667374059520
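
The description field above defines intra-word LID as segmenting a mixed word and tagging each part with a language ID. A minimal sketch of what such an annotation could look like, and of one way to score a predicted segmentation, follows. It is not the authors' corpus format or evaluation script: the tag names `AR` and `EN`, the romanized example word, and the exact-span F1 computed by `segment_f1` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a mixed word is a list of (text, language tag) pairs.
Segment = Tuple[str, str]


def char_tags(segments: List[Segment]) -> List[str]:
    """Expand segment-level tags into one tag per character of the word."""
    return [tag for text, tag in segments for _ in text]


def spans(segments: List[Segment]) -> Set[Tuple[int, int, str]]:
    """Convert segments into (start, end, tag) spans over the whole word."""
    result, pos = set(), 0
    for text, tag in segments:
        result.add((pos, pos + len(text), tag))
        pos += len(text)
    return result


def segment_f1(gold: List[Segment], pred: List[Segment]) -> float:
    """F1 over exactly matching (start, end, tag) spans (an assumed metric)."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mixed word: a romanized Arabic article attached to an English noun.
gold = [("el", "AR"), ("laptop", "EN")]
over_segmented = [("el", "AR"), ("lap", "EN"), ("top", "EN")]
unsegmented = [("ellaptop", "EN")]

if __name__ == "__main__":
    print(char_tags(gold))
    print(segment_f1(gold, gold))
    print(segment_f1(gold, over_segmented))
    print(segment_f1(gold, unsegmented))
```

Scoring on exact spans penalizes both wrong boundaries and wrong tags, which is why the unsegmented prediction gets no credit even though most of its characters carry the right tag; a character-level score built on `char_tags` would be more lenient.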
