Stopwords in technical language processing.

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While rese...

Bibliographic Details
Main authors: Serhad Sarica, Jianxi Luo
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2021
Subjects:
R
Q
Online access: https://doaj.org/article/f8627466a7514c45b19875785fdc0289
id oai:doaj.org-article:f8627466a7514c45b19875785fdc0289
record_format dspace
spelling oai:doaj.org-article:f8627466a7514c45b19875785fdc0289 2021-12-02T20:15:15Z Stopwords in technical language processing. 1932-6203 10.1371/journal.pone.0254937 https://doaj.org/article/f8627466a7514c45b19875785fdc0289 2021-01-01T00:00:00Z https://doi.org/10.1371/journal.pone.0254937 https://doaj.org/toc/1932-6203 There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications. Serhad Sarica Jianxi Luo Public Library of Science (PLoS) article Medicine R Science Q EN PLoS ONE, Vol 16, Iss 8, p e0254937 (2021)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
spellingShingle Medicine
R
Science
Q
Serhad Sarica
Jianxi Luo
Stopwords in technical language processing.
description There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications.
format article
author Serhad Sarica
Jianxi Luo
author_facet Serhad Sarica
Jianxi Luo
author_sort Serhad Sarica
title Stopwords in technical language processing.
title_short Stopwords in technical language processing.
title_full Stopwords in technical language processing.
title_fullStr Stopwords in technical language processing.
title_full_unstemmed Stopwords in technical language processing.
title_sort stopwords in technical language processing.
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doaj.org/article/f8627466a7514c45b19875785fdc0289
work_keys_str_mv AT serhadsarica stopwordsintechnicallanguageprocessing
AT jianxiluo stopwordsintechnicallanguageprocessing
_version_ 1718374621857710080
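
The description field above names three statistical measures (term frequency, inverse document frequency, and entropy) without showing how they are computed. The following is a minimal Python sketch of those measures on a toy corpus. It is not the authors' procedure or dataset: the function name `term_statistics`, the example sentences, the natural-log IDF formula, and the choice to measure entropy over a term's distribution across documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents.

    Hypothetical helper for illustration only:
      tf      = total count of the term over the whole corpus
      idf     = ln(N / df), where df is the number of documents containing the term
      entropy = Shannon entropy (in bits) of the term's distribution over documents
    """
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]

    total = Counter()
    for counts in per_doc:
        total.update(counts)

    stats = {}
    for term, tf in total.items():
        df = sum(1 for counts in per_doc if term in counts)
        idf = math.log(n_docs / df)
        entropy = 0.0
        for counts in per_doc:
            if term in counts:
                p = counts[term] / tf
                entropy -= p * math.log2(p)
        stats[term] = (tf, idf, entropy)
    return stats


# Toy corpus (invented): three short patent-like sentences.
docs = [
    "the method comprises a valve".split(),
    "the method comprises a rotor".split(),
    "the method comprises a sensor".split(),
]
stats = term_statistics(docs)

# List terms from most to least evenly spread across documents.
for term, (tf, idf, entropy) in sorted(stats.items(), key=lambda kv: -kv[1][2]):
    print(f"{term:10s} tf={tf} idf={idf:.3f} entropy={entropy:.3f}")
```

In this sketch, a term such as "method" that occurs in every document gets an IDF of zero and the maximum entropy, which marks it as a stopword candidate, while a term such as "valve" that occurs in only one document gets a high IDF and zero entropy. How the measures are combined and thresholded is left open here.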