Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of tr...

Bibliographic Details
Main authors: Nona Naderi, Julien Knafou, Jenny Copara, Patrick Ruch, Douglas Teodoro
Format: article
Language: EN
Published: Frontiers Media S.A. 2021
Subjects:
Z
Online access: https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a
id oai:doaj.org-article:6ac0fcb24105417990a9ab076b52233a
record_format dspace
spelling oai:doaj.org-article:6ac0fcb24105417990a9ab076b52233a
2021-11-19T06:39:57Z
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
2504-0537
10.3389/frma.2021.689803
https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a
2021-11-01T00:00:00Z
https://www.frontiersin.org/articles/10.3389/frma.2021.689803/full
https://doaj.org/toc/2504-0537
Nona Naderi
Julien Knafou
Jenny Copara
Patrick Ruch
Douglas Teodoro
Frontiers Media S.A.
article
named entity recognition
deep learning
patent text mining
transformers
clinical text mining
chemical patents
Bibliography. Library science. Information resources
Z
EN
Frontiers in Research Metrics and Analytics, Vol 6 (2021)
institution DOAJ
collection DOAJ
language EN
topic named entity recognition
deep learning
patent text mining
transformers
clinical text mining
chemical patents
Bibliography. Library science. Information resources
Z
spellingShingle named entity recognition
deep learning
patent text mining
transformers
clinical text mining
chemical patents
Bibliography. Library science. Information resources
Z
Nona Naderi
Julien Knafou
Jenny Copara
Patrick Ruch
Douglas Teodoro
Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
description The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora from different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
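
The abstract says the fine-tuned models are combined with classical majority voting but does not spell out the exact scheme. The following is only a minimal sketch of one common variant, token-level strict majority voting, and not the authors' implementation. The function name `majority_vote`, the `fallback` parameter, and the BIO labels in the usage example are all hypothetical, chosen for illustration.

```python
from collections import Counter


def majority_vote(predictions, fallback="O"):
    """Combine per-token label sequences from several models.

    predictions: list of label sequences, one per model, all the same length.
    fallback: label emitted when no label wins a strict majority
              (assumption: the 'outside' tag of a BIO scheme).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all label sequences must have the same length")

    n_models = len(predictions)
    voted = []
    # zip(*predictions) yields, for each token, the labels from every model.
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Strict majority: more than half of the models must agree.
        voted.append(label if 2 * count > n_models else fallback)
    return voted


# Hypothetical predictions from three models for a four-token sentence.
model_a = ["B-CHEM", "I-CHEM", "O", "O"]
model_b = ["B-CHEM", "O", "O", "B-DIS"]
model_c = ["B-CHEM", "I-CHEM", "O", "O"]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Voting per token is the simplest choice; it can yield label sequences that are not valid BIO spans, so an entity-level vote or a repair step may be preferable in practice.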
format article
author Nona Naderi
Julien Knafou
Jenny Copara
Patrick Ruch
Douglas Teodoro
author_facet Nona Naderi
Julien Knafou
Jenny Copara
Patrick Ruch
Douglas Teodoro
author_sort Nona Naderi
title Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
title_short Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
title_full Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
title_fullStr Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
title_full_unstemmed Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
title_sort ensemble of deep masked language models for effective named entity recognition in health and life science corpora
publisher Frontiers Media S.A.
publishDate 2021
url https://doaj.org/article/6ac0fcb24105417990a9ab076b52233a
work_keys_str_mv AT nonanaderi ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora
AT julienknafou ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora
AT jennycopara ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora
AT patrickruch ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora
AT douglasteodoro ensembleofdeepmaskedlanguagemodelsforeffectivenamedentityrecognitioninhealthandlifesciencecorpora
_version_ 1718420317892771840