The OpenDeID corpus for patient de-identification

Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured ele...

Bibliographic Details
Main Authors:	Jitendra Jonnagaddala, Aipeng Chen, Sean Batongbacal, Chandini Nekkantti
Format:	article
Language:	EN
Published:	Nature Portfolio 2021
Subjects:	Medicine R Science Q
Online Access:	https://doaj.org/article/bf43a13da7444d1aa4e7e7642ca37eee

id	oai:doaj.org-article:bf43a13da7444d1aa4e7e7642ca37eee
record_format	dspace
spelling	oai:doaj.org-article:bf43a13da7444d1aa4e7e7642ca37eee 2021-12-02T18:01:49Z The OpenDeID corpus for patient de-identification 10.1038/s41598-021-99554-9 2045-2322 https://doaj.org/article/bf43a13da7444d1aa4e7e7642ca37eee 2021-10-01T00:00:00Z https://doi.org/10.1038/s41598-021-99554-9 https://doaj.org/toc/2045-2322 Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers. Jitendra Jonnagaddala Aipeng Chen Sean Batongbacal Chandini Nekkantti Nature Portfolio article Medicine R Science Q EN Scientific Reports, Vol 11, Iss 1, Pp 1-8 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
spellingShingle	Medicine R Science Q Jitendra Jonnagaddala Aipeng Chen Sean Batongbacal Chandini Nekkantti The OpenDeID corpus for patient de-identification
description	Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
format	article
author	Jitendra Jonnagaddala Aipeng Chen Sean Batongbacal Chandini Nekkantti
author_facet	Jitendra Jonnagaddala Aipeng Chen Sean Batongbacal Chandini Nekkantti
author_sort	Jitendra Jonnagaddala
title	The OpenDeID corpus for patient de-identification
title_short	The OpenDeID corpus for patient de-identification
title_full	The OpenDeID corpus for patient de-identification
title_fullStr	The OpenDeID corpus for patient de-identification
title_full_unstemmed	The OpenDeID corpus for patient de-identification
title_sort	opendeid corpus for patient de-identification
publisher	Nature Portfolio
publishDate	2021
url	https://doaj.org/article/bf43a13da7444d1aa4e7e7642ca37eee
work_keys_str_mv	AT jitendrajonnagaddala theopendeidcorpusforpatientdeidentification AT aipengchen theopendeidcorpusforpatientdeidentification AT seanbatongbacal theopendeidcorpusforpatientdeidentification AT chandininekkantti theopendeidcorpusforpatientdeidentification AT jitendrajonnagaddala opendeidcorpusforpatientdeidentification AT aipengchen opendeidcorpusforpatientdeidentification AT seanbatongbacal opendeidcorpusforpatientdeidentification AT chandininekkantti opendeidcorpusforpatientdeidentification
_version_	1718378951260241920
