Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis

Abstract. Background: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts...

Bibliographic Details
Main authors: Fengbo Zheng, Rashmie Abeysinghe, Licong Cui
Format: article
Language: EN
Published: BMC 2021
Subjects:
Online access: https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2
id oai:doaj.org-article:0f14a2016017465d8ef2be4ed1715ec2
record_format dspace
spelling oai:doaj.org-article:0f14a2016017465d8ef2be4ed1715ec2
2021-11-14T12:29:09Z
10.1186/s12911-021-01592-w
1472-6947
https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2
2021-11-01T00:00:00Z
https://doi.org/10.1186/s12911-021-01592-w
https://doaj.org/toc/1472-6947
BMC Medical Informatics and Decision Making, Vol 21, Iss S7, Pp 1-14 (2021)
institution DOAJ
collection DOAJ
language EN
topic Quality assurance
Concept enrichment
Formal concept analysis
SNOMED CT
NCI Thesaurus
Computer applications to medicine. Medical informatics
R858-859.7
description Abstract
Background: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach, in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning.
Methods: We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes among the original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach.
Results: We applied our sequence-based FCA approach to all the sub-hierarchies under Disease or Disorder in the NCI Thesaurus (19.08d version) and five sub-hierarchies under Clinical Finding and Procedure in SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in SNOMED CT. For the NCI Thesaurus, 85 potentially missing concepts were found in external terminologies, and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies, and 1159 of the remaining 6647 were found in biomedical literature.
Conclusion: Our sequence-based FCA approach shows promise for identifying potentially missing concepts in biomedical terminologies.
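The concept-forming step described in the Methods can be illustrated with a small sketch. This is not the authors' implementation: it assumes concept names are treated as word sequences, that a candidate is the longest contiguous run of words shared by a pair of names, and that a candidate counts as "potentially missing" when it is not already a concept name. The function names, the `min_words` threshold, and the example names are all hypothetical, and the containment check at the end is only a naive stand-in for the positioning step.

```python
from itertools import combinations


def longest_common_word_run(a, b):
    """Return the longest contiguous run of words shared by sequences a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return tuple(a[best_end - best_len:best_end])


def candidate_concepts(names, min_words=2):
    """Collect shared word runs that are not already concept names.

    min_words is an assumed filter to skip single-word overlaps.
    """
    sequences = {tuple(name.lower().split()) for name in names}
    candidates = set()
    for a, b in combinations(sorted(sequences), 2):
        run = longest_common_word_run(a, b)
        if len(run) >= min_words and run not in sequences:
            candidates.add(" ".join(run))
    return sorted(candidates)


def names_containing(names, candidate):
    """Naive positioning hint: existing names that contain the candidate run."""
    run = candidate.lower().split()
    n = len(run)
    hits = []
    for name in names:
        words = name.lower().split()
        if any(words[i:i + n] == run for i in range(len(words) - n + 1)):
            hits.append(name)
    return hits


# Hypothetical concept names, not taken from the NCI Thesaurus or SNOMED CT.
names = [
    "recurrent lung carcinoma",
    "metastatic lung carcinoma",
    "lung adenoma",
]
candidates = candidate_concepts(names)
possible_subtypes = names_containing(names, "lung carcinoma")
```

In this toy input, the two carcinoma names share the run "lung carcinoma", which is not itself a concept name, so it is proposed as a candidate; the names that contain it are the ones a positioning step would examine as possible subtypes.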
format article
author Fengbo Zheng
Rashmie Abeysinghe
Licong Cui
title Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis
publisher BMC
publishDate 2021
url https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2