Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning.

Background: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text, which is time-consuming to review manually. Aim: To develop an algorithm to identify relevant free texts automatically bas...

Bibliographic details
Main authors:	Zhuoran Wang, Anoop D Shah, A Rosemary Tate, Spiros Denaxas, John Shawe-Taylor, Harry Hemingway
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2012
Subjects:	Medicine R Science Q
Online access:	https://doaj.org/article/d17e9e6b6da44b479e895c355a40dd42

id	oai:doaj.org-article:d17e9e6b6da44b479e895c355a40dd42
record_format	dspace
spelling	oai:doaj.org-article:d17e9e6b6da44b479e895c355a40dd42 2021-11-18T07:29:48Z Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning. 1932-6203 10.1371/journal.pone.0030412 https://doaj.org/article/d17e9e6b6da44b479e895c355a40dd42 2012-01-01T00:00:00Z https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/22276193/?tool=EBI https://doaj.org/toc/1932-6203 Zhuoran Wang Anoop D Shah A Rosemary Tate Spiros Denaxas John Shawe-Taylor Harry Hemingway Public Library of Science (PLoS) article Medicine R Science Q EN PLoS ONE, Vol 7, Iss 1, p e30412 (2012)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
description	Background: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text, which is time-consuming to review manually. Aim: To develop an algorithm to identify relevant free texts automatically based on labelled examples. Methods: We developed a novel machine learning algorithm, the 'Semi-supervised Set Covering Machine' (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully-supervised Set Covering Machine (SCM) and our 'Freetext Matching Algorithm' natural language processor. Results: Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision on detecting coronary angiogram results, outperforming the fully-supervised SCM (recall 78%, precision 60%) and TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall than the other algorithms tested (86%). The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%). Conclusions: Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on pre-classified test sets. It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but needs further testing in other electronic health record datasets.
format	article
author	Zhuoran Wang Anoop D Shah A Rosemary Tate Spiros Denaxas John Shawe-Taylor Harry Hemingway
author_sort	Zhuoran Wang
title	Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning.
publisher	Public Library of Science (PLoS)
publishDate	2012
url	https://doaj.org/article/d17e9e6b6da44b479e895c355a40dd42
_version_	1718423378002444288