BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language

Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only the translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for dete...

Bibliographic Details
Main author:	Salim Sazzed
Format:	article
Language:	EN
Published:	PeerJ Inc. 2021
Subjects:	Sentiment lexicon; Profanity detection; Electronic computers. Computer science; QA75.5-76.95
Online access:	https://doaj.org/article/07f3536053cb42c0bb2f3fe57ae6b24a

id	oai:doaj.org-article:07f3536053cb42c0bb2f3fe57ae6b24a
record_format	dspace
spelling	oai:doaj.org-article:07f3536053cb42c0bb2f3fe57ae6b24a | 2021-11-18T15:05:09Z | BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language | 10.7717/peerj-cs.681 | 2376-5992 | https://doaj.org/article/07f3536053cb42c0bb2f3fe57ae6b24a | 2021-11-01T00:00:00Z | https://peerj.com/articles/cs-681.pdf | https://peerj.com/articles/cs-681/ | https://doaj.org/toc/2376-5992 | Salim Sazzed | PeerJ Inc. | article | Sentiment lexicon | Profanity detection | Electronic computers. Computer science | QA75.5-76.95 | EN | PeerJ Computer Science, Vol 7, p e681 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Sentiment lexicon; Profanity detection; Electronic computers. Computer science; QA75.5-76.95
description	Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. To create BengSentiLex, a cross-lingual methodology is proposed that uses a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The performance of BengSentiLex is compared with that of the translated English lexicons on three evaluation datasets. BengSentiLex achieves a 5%–50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves document-level coverage of around 85% in the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively.
format	article
author	Salim Sazzed
title	BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language
publisher	PeerJ Inc.
publishDate	2021
url	https://doaj.org/article/07f3536053cb42c0bb2f3fe57ae6b24a
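
The description above names two quantities that are easy to illustrate: a PMI-based polarity score for candidate words and document-level coverage for a swear lexicon. The following is a minimal sketch, not the author's implementation. The function names, the add-one smoothing, the "pos"/"neg" labels, and the reading of document-level coverage as "share of documents containing at least one lexicon entry" are all assumptions made for illustration. English placeholder tokens stand in for Bengali tokens.

```python
import math
from collections import Counter


def pmi_polarity(docs, labels, smoothing=1.0):
    """Return {word: PMI(word, pos) - PMI(word, neg)} from labeled documents.

    docs   -- list of token lists (hypothetical pre-tokenized reviews)
    labels -- parallel list of "pos" / "neg" strings
    The difference of the two PMI values reduces to
    log2(P(word | pos) / P(word | neg)) because P(word) cancels.
    Add-`smoothing` counts (an assumption) avoid division by zero.
    """
    word_class = Counter()        # (word, label) -> documents containing word
    class_docs = Counter(labels)  # label -> number of documents
    vocab = set()
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # count each word once per document
            word_class[(word, label)] += 1
            vocab.add(word)

    scores = {}
    for word in vocab:
        p_pos = (word_class[(word, "pos")] + smoothing) / (class_docs["pos"] + 2 * smoothing)
        p_neg = (word_class[(word, "neg")] + smoothing) / (class_docs["neg"] + 2 * smoothing)
        scores[word] = math.log2(p_pos / p_neg)
    return scores


def document_coverage(docs, lexicon):
    """Fraction of documents that contain at least one lexicon entry."""
    if not docs:
        return 0.0
    hits = sum(1 for tokens in docs if any(t in lexicon for t in tokens))
    return hits / len(docs)


if __name__ == "__main__":
    reviews = [["good", "food"], ["bad", "food"], ["good", "service"], ["bad", "service"]]
    polarity = ["pos", "neg", "pos", "neg"]
    for word, score in sorted(pmi_polarity(reviews, polarity).items()):
        print(f"{word}\t{score:+.3f}")
    print(document_coverage([["a", "b"], ["c"], ["d"]], {"a", "d"}))
```

Words with a clearly positive or negative score would be candidates for a sentiment lexicon. A threshold on the score is a design choice that this record does not specify.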
