Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model

This study aims to provide insights into the COVID-19-related communication on Twitter in the Republic of Croatia. For that purpose, we developed an NL-based framework that enables automatic analysis of a large dataset of tweets in the Croatian language. We collected and analysed 206,196 tweets rela...

Bibliographic Details
Main authors: Karlo Babić, Milan Petrović, Slobodan Beliga, Sanda Martinčić-Ipšić, Mihaela Matešić, Ana Meštrović
Format: article
Language: EN
Published: MDPI AG 2021
Subjects:
T
Online access: https://doaj.org/article/bee88f1fe5e1455ca45cbc279d714be0
id oai:doaj.org-article:bee88f1fe5e1455ca45cbc279d714be0
record_format dspace
spelling oai:doaj.org-article:bee88f1fe5e1455ca45cbc279d714be0
2021-11-11T15:24:16Z
Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model
10.3390/app112110442
2076-3417
https://doaj.org/article/bee88f1fe5e1455ca45cbc279d714be0
2021-11-01T00:00:00Z
https://www.mdpi.com/2076-3417/11/21/10442
https://doaj.org/toc/2076-3417
Applied Sciences, Vol 11, Iss 10442, p 10442 (2021)
institution DOAJ
collection DOAJ
language EN
topic sentiment analysis
clustering
BERT model
natural language processing
COVID-19
Twitter data
Technology
T
Engineering (General). Civil engineering (General)
TA1-2040
Biology (General)
QH301-705.5
Physics
QC1-999
Chemistry
QD1-999
description This study aims to provide insights into the COVID-19-related communication on Twitter in the Republic of Croatia. For that purpose, we developed an NL-based framework that enables automatic analysis of a large dataset of tweets in the Croatian language. We collected and analysed 206,196 tweets related to COVID-19 and constructed a dataset of 10,000 tweets which we manually annotated with a sentiment label. We trained the Cro-CoV-cseBERT language model for the representation and clustering of tweets. Additionally, we compared the performance of four machine learning algorithms on the task of sentiment classification. After identifying the best performing setup of NLP methods, we applied the proposed framework in the task of characterisation of COVID-19 tweets in Croatia. More precisely, we performed sentiment analysis and tracked the sentiment over time. Furthermore, we detected how tweets are grouped into clusters with similar themes across three pandemic waves. Additionally, we characterised the tweets by analysing the distribution of sentiment polarity (in each thematic cluster and over time) and the number of retweets (in each thematic cluster and sentiment class). These results could be useful for additional research and interpretation in the domains of sociology, psychology or other sciences, as well as for the authorities, who could use them to address crisis communication problems.
format article
author Karlo Babić
Milan Petrović
Slobodan Beliga
Sanda Martinčić-Ipšić
Mihaela Matešić
Ana Meštrović
author_sort Karlo Babić
title Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model
publisher MDPI AG
publishDate 2021
url https://doaj.org/article/bee88f1fe5e1455ca45cbc279d714be0