A clinical specific BERT developed using a huge Japanese clinical text corpus.

Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed.
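The Masked-LM accuracy reported above refers to BERT's standard pre-training objective, in which a fraction of input tokens is corrupted and the model must recover the originals. As a rough illustration (not the authors' code), here is a minimal sketch of the usual 15% selection with 80/10/10 replacement; the token ids and special-token set are hypothetical, and only the 25,000-token vocabulary size is taken from the abstract.

```python
import random

MASK_ID = 4                     # hypothetical [MASK] token id
VOCAB_SIZE = 25000              # vocabulary size reported in the abstract
SPECIAL_IDS = {0, 1, 2, 3, 4}   # hypothetical [PAD]/[CLS]/[SEP]/[UNK]/[MASK]

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for Masked-LM training.

    Returns (corrupted_ids, labels). Labels hold the original id at
    predicted positions and -100 elsewhere (a common "ignore" value).
    """
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tid in token_ids:
        if tid not in SPECIAL_IDS and rng.random() < mask_prob:
            labels.append(tid)                # model must predict this id
            r = rng.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                     # 10%: replace with random token
                corrupted.append(rng.randrange(5, VOCAB_SIZE))
            else:                             # 10%: keep the original token
                corrupted.append(tid)
        else:
            corrupted.append(tid)
            labels.append(-100)               # position ignored by the loss
    return corrupted, labels
```

Masked-LM accuracy is then simply the fraction of non-ignored positions where the model's predicted id equals the label; the 0.773 figure above is that quantity at convergence.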

Bibliographic Details
Main Authors: Yoshimasa Kawazoe, Daisaku Shibata, Emiko Shinohara, Eiji Aramaki, Kazuhiko Ohe
Format: article
Language: EN
Published: Public Library of Science (PLoS), 2021
Subjects: Medicine (R), Science (Q)
Online Access: https://doaj.org/article/2e0cce2d87e84b0e9269723053006110
id: oai:doaj.org-article:2e0cce2d87e84b0e9269723053006110
ISSN: 1932-6203
DOI: https://doi.org/10.1371/journal.pone.0259763
Citation: PLoS ONE, Vol 16, Iss 11, p e0259763 (2021)