A clinical specific BERT developed using a huge Japanese clinical text corpus

Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the...

Bibliographic Details
Main Authors:	Yoshimasa Kawazoe, Daisaku Shibata, Emiko Shinohara, Eiji Aramaki, Kazuhiko Ohe
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2021
Subjects:	Medicine R Science Q
Online Access:	https://doaj.org/article/d91d1c1105f045dc8aaa84db58182b7f

id	oai:doaj.org-article:d91d1c1105f045dc8aaa84db58182b7f
record_format	dspace
spelling	oai:doaj.org-article:d91d1c1105f045dc8aaa84db58182b7f | 2021-11-18T06:34:24Z | A clinical specific BERT developed using a huge Japanese clinical text corpus | 1932-6203 | https://doaj.org/article/d91d1c1105f045dc8aaa84db58182b7f | 2021-01-01T00:00:00Z | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8577751/?tool=EBI | https://doaj.org/toc/1932-6203 | Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed. | Yoshimasa Kawazoe | Daisaku Shibata | Emiko Shinohara | Eiji Aramaki | Kazuhiko Ohe | Public Library of Science (PLoS) | article | Medicine | R | Science | Q | EN | PLoS ONE, Vol 16, Iss 11 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
spellingShingle	Medicine R Science Q Yoshimasa Kawazoe Daisaku Shibata Emiko Shinohara Eiji Aramaki Kazuhiko Ohe A clinical specific BERT developed using a huge Japanese clinical text corpus
description	Generalized language models that are pre-trained with a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English are published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical specific BERT model with a huge amount of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb that has fake Twitter messages regarding medical concerns with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as our dataset. The BERT-base was pre-trained using the entire dataset and a vocabulary including 25,000 tokens. The pre-training was almost saturated at about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than the other BERT models that were pre-trained with Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed.
format	article
author	Yoshimasa Kawazoe Daisaku Shibata Emiko Shinohara Eiji Aramaki Kazuhiko Ohe
author_facet	Yoshimasa Kawazoe Daisaku Shibata Emiko Shinohara Eiji Aramaki Kazuhiko Ohe
author_sort	Yoshimasa Kawazoe
title	A clinical specific BERT developed using a huge Japanese clinical text corpus
title_short	A clinical specific BERT developed using a huge Japanese clinical text corpus
title_full	A clinical specific BERT developed using a huge Japanese clinical text corpus
title_fullStr	A clinical specific BERT developed using a huge Japanese clinical text corpus
title_full_unstemmed	A clinical specific BERT developed using a huge Japanese clinical text corpus
title_sort	clinical specific bert developed using a huge japanese clinical text corpus
publisher	Public Library of Science (PLoS)
publishDate	2021
url	https://doaj.org/article/d91d1c1105f045dc8aaa84db58182b7f
work_keys_str_mv	AT yoshimasakawazoe aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT daisakushibata aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT emikoshinohara aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT eijiaramaki aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT kazuhikoohe aclinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT yoshimasakawazoe clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT daisakushibata clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT emikoshinohara clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT eijiaramaki clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus AT kazuhikoohe clinicalspecificbertdevelopedusingahugejapaneseclinicaltextcorpus
_version_	1718424509418045440
