Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of tr...
Main authors: | Nona Naderi, Julien Knafou, Jenny Copara, Patrick Ruch, Douglas Teodoro |
---|---|
Format: | article |
Language: | EN |
Published: | Frontiers Media S.A., 2021 |
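Subjects: | named entity recognition; deep learning; patent text mining; transformers; clinical text mining; chemical patents |
Online access: | https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a |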
id |
oai:doaj.org-article:6ac0fcb24105417990a9ab076b52233a |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:6ac0fcb24105417990a9ab076b52233a; 2021-11-19T06:39:57Z; Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora; 2504-0537; 10.3389/frma.2021.689803; https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a; 2021-11-01T00:00:00Z; https://www.frontiersin.org/articles/10.3389/frma.2021.689803/full; https://doaj.org/toc/2504-0537; The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.; Nona Naderi; Julien Knafou; Jenny Copara; Patrick Ruch; Douglas Teodoro; Frontiers Media S.A.; article; named entity recognition; deep learning; patent text mining; transformers; clinical text mining; chemical patents; Bibliography. Library science. Information resources; Z; EN; Frontiers in Research Metrics and Analytics, Vol 6 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
named entity recognition; deep learning; patent text mining; transformers; clinical text mining; chemical patents; Bibliography. Library science. Information resources; Z |
spellingShingle |
named entity recognition; deep learning; patent text mining; transformers; clinical text mining; chemical patents; Bibliography. Library science. Information resources; Z; Nona Naderi; Julien Knafou; Jenny Copara; Patrick Ruch; Douglas Teodoro; Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
description |
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains. |
format |
article |
author |
Nona Naderi; Julien Knafou; Jenny Copara; Patrick Ruch; Douglas Teodoro |
author_facet |
Nona Naderi; Julien Knafou; Jenny Copara; Patrick Ruch; Douglas Teodoro |
author_sort |
Nona Naderi |
title |
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
title_short |
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
title_full |
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
title_fullStr |
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
title_full_unstemmed |
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora |
title_sort |
ensemble of deep masked language models for effective named entity recognition in health and life science corpora |
publisher |
Frontiers Media S.A. |
publishDate |
2021 |
url |
https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a |
work_keys_str_mv |
AT nonanaderi ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora AT julienknafou ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora AT jennycopara ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora AT patrickruch ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora AT douglasteodoro ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora |
_version_ |
1718420317892771840 |
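
The abstract states that the fine-tuned models are "ensembled using classical majority voting strategies" but this record does not say how the vote is taken. The following is a minimal sketch of one common token-level variant, not the authors' implementation. The function name `majority_vote`, the strict-majority rule, the fallback to the `O` (outside) label, and the example labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token label sequences from several models by majority vote.

    predictions[m][t] is the label that model m assigns to token t.
    A label wins only if more than half of the models agree on it;
    otherwise the token gets `fallback` (an assumption of this sketch).
    """
    if not predictions:
        return []
    n_models = len(predictions)
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for t in range(n_tokens):
        # Count how many models chose each label for token t.
        counts = Counter(p[t] for p in predictions)
        label, votes = counts.most_common(1)[0]
        voted.append(label if votes > n_models / 2 else fallback)
    return voted


# Hypothetical outputs of three fine-tuned models for a four-token sentence.
model_outputs = [
    ["B-CHEM", "I-CHEM", "O", "O"],
    ["B-CHEM", "O", "O", "B-DISEASE"],
    ["B-CHEM", "I-CHEM", "O", "O"],
]
ensemble_labels = majority_vote(model_outputs)
print(ensemble_labels)  # ['B-CHEM', 'I-CHEM', 'O', 'O']
```

Requiring a strict majority means a label predicted by a single model, such as `B-DISEASE` on the last token, is dropped. Other voting thresholds or entity-level voting would behave differently.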