Stopwords in technical language processing.
There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While rese...
Main authors: | Serhad Sarica, Jianxi Luo |
---|---|
Format: | article |
Language: | EN |
Published: | Public Library of Science (PLoS), 2021 |
Subjects: | Medicine; R; Science; Q |
Online access: | https://doaj.org/article/f8627466a7514c45b19875785fdc0289 |
id |
oai:doaj.org-article:f8627466a7514c45b19875785fdc0289 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:f8627466a7514c45b19875785fdc0289; 2021-12-02T20:15:15Z; Stopwords in technical language processing.; 1932-6203; 10.1371/journal.pone.0254937; https://doaj.org/article/f8627466a7514c45b19875785fdc0289; 2021-01-01T00:00:00Z; https://doi.org/10.1371/journal.pone.0254937; https://doaj.org/toc/1932-6203; Serhad Sarica; Jianxi Luo; Public Library of Science (PLoS); article; Medicine; R; Science; Q; EN; PLoS ONE, Vol 16, Iss 8, p e0254937 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Medicine R Science Q |
spellingShingle |
Medicine R Science Q Serhad Sarica Jianxi Luo Stopwords in technical language processing. |
description |
There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains its own highly frequent and uninformative words, and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications. |
format |
article |
author |
Serhad Sarica; Jianxi Luo |
author_facet |
Serhad Sarica; Jianxi Luo |
author_sort |
Serhad Sarica |
title |
Stopwords in technical language processing. |
title_short |
Stopwords in technical language processing. |
title_full |
Stopwords in technical language processing. |
title_fullStr |
Stopwords in technical language processing. |
title_full_unstemmed |
Stopwords in technical language processing. |
title_sort |
stopwords in technical language processing. |
publisher |
Public Library of Science (PLoS) |
publishDate |
2021 |
url |
https://doaj.org/article/f8627466a7514c45b19875785fdc0289 |
work_keys_str_mv |
AT serhadsarica stopwordsintechnicallanguageprocessing AT jianxiluo stopwordsintechnicallanguageprocessing |
_version_ |
1718374621857710080 |
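
The description field names three statistical measures: term frequency, inverse document frequency, and entropy. The Python sketch below illustrates how such measures can flag candidate stopwords. It is not the authors' procedure: the toy corpus, the function names `term_statistics` and `candidate_stopwords`, and the thresholds `max_idf` and `min_entropy_ratio` are all assumptions made for this example.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents."""
    n_docs = len(docs)
    counts_per_doc = [Counter(doc) for doc in docs]
    total = Counter()     # corpus-wide term frequency
    doc_freq = Counter()  # number of documents containing each term
    for counts in counts_per_doc:
        total.update(counts)
        doc_freq.update(counts.keys())

    stats = {}
    for term, tf in total.items():
        idf = math.log(n_docs / doc_freq[term])
        # Shannon entropy (bits) of the term's distribution over documents:
        # high when the term is spread evenly, zero when it sits in one document.
        entropy = 0.0
        for counts in counts_per_doc:
            if counts[term]:
                p = counts[term] / tf
                entropy -= p * math.log2(p)
        stats[term] = (tf, idf, entropy)
    return stats


def candidate_stopwords(docs, max_idf=0.1, min_entropy_ratio=0.9):
    """Flag terms with low IDF and near-maximal entropy (hypothetical thresholds)."""
    stats = term_statistics(docs)
    max_entropy = math.log2(len(docs))
    return sorted(
        term
        for term, (tf, idf, entropy) in stats.items()
        if idf <= max_idf and entropy >= min_entropy_ratio * max_entropy
    )


# Invented toy corpus, for illustration only.
docs = [
    "the method comprises a valve and a pump".split(),
    "the method comprises a sensor and a circuit".split(),
    "the method comprises a gear and a shaft".split(),
]
print(candidate_stopwords(docs))
# ['a', 'and', 'comprises', 'method', 'the']
```

In this toy corpus, "method" and "comprises" are flagged alongside general-language stopwords because they occur evenly in every document, while "valve" or "gear" appear in a single document and are kept. A real application would need a large corpus and carefully chosen thresholds.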