Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis
Abstract. Background: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts...
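Main authors: | Fengbo Zheng, Rashmie Abeysinghe, Licong Cui |
---|---|
Format: | article |
Language: | EN |
Published: | BMC, 2021 |
Subjects: | Quality assurance; Concept enrichment; Formal concept analysis; SNOMED CT; NCI Thesaurus; Computer applications to medicine. Medical informatics; R858-859.7 |
Online access: | https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2 |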
id |
oai:doaj.org-article:0f14a2016017465d8ef2be4ed1715ec2 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:0f14a2016017465d8ef2be4ed1715ec2; 2021-11-14T12:29:09Z; Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis; DOI 10.1186/s12911-021-01592-w; ISSN 1472-6947; https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2; 2021-11-01T00:00:00Z; https://doi.org/10.1186/s12911-021-01592-w; https://doaj.org/toc/1472-6947; Fengbo Zheng; Rashmie Abeysinghe; Licong Cui; BMC; article; Quality assurance; Concept enrichment; Formal concept analysis; SNOMED CT; NCI Thesaurus; Computer applications to medicine. Medical informatics; R858-859.7; EN; BMC Medical Informatics and Decision Making, Vol 21, Iss S7, Pp 1-14 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Quality assurance; Concept enrichment; Formal concept analysis; SNOMED CT; NCI Thesaurus; Computer applications to medicine. Medical informatics; R858-859.7 |
description |
Abstract. Background: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach, in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning.
Methods: We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes among the original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach.
Results: We applied our sequence-based FCA approach to all the sub-hierarchies under Disease or Disorder in the NCI Thesaurus (19.08d version) and five sub-hierarchies under Clinical Finding and Procedure in SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in SNOMED CT. For the NCI Thesaurus, 85 potentially missing concepts were found in external terminologies, and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies, and 1159 of the remaining 6647 were found in biomedical literature.
Conclusion: Our sequence-based FCA approach shows promise for identifying potentially missing concepts in biomedical terminologies. |
format |
article |
author |
Fengbo Zheng; Rashmie Abeysinghe; Licong Cui |
author_sort |
Fengbo Zheng |
title |
Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis |
publisher |
BMC |
publishDate |
2021 |
url |
https://doaj.org/article/0f14a2016017465d8ef2be4ed1715ec2 |
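The concept-forming step described in the Methods part of the abstract (computing the longest common substrings of concept name sequences) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that names are tokenized into lowercase words, that "substring" means a contiguous run of words, and that a shared run becomes a candidate only if it is not already a concept name. The function names, the `min_words` threshold, and the toy concept names are all hypothetical.

```python
from itertools import combinations


def longest_common_word_run(a, b):
    """Return the longest contiguous run of words shared by word lists a and b.

    Classic longest-common-substring dynamic programming, applied to words
    instead of characters. On ties, the first longest run found in `a` wins.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def candidate_missing_concepts(names, min_words=2):
    """Map each candidate name to the existing concept names that share it.

    Assumption: a shared word run of at least `min_words` words that is not
    already a concept name is treated as a potentially missing concept.
    """
    existing = {n.lower() for n in names}
    tokens = {n: n.lower().split() for n in names}
    candidates = {}
    for x, y in combinations(names, 2):
        run = longest_common_word_run(tokens[x], tokens[y])
        label = " ".join(run)
        if len(run) >= min_words and label not in existing:
            candidates.setdefault(label, set()).update({x, y})
    return candidates


# Toy, made-up concept names (not taken from the NCI Thesaurus or SNOMED CT).
names = [
    "Recurrent Skin Squamous Cell Carcinoma",
    "Metastatic Skin Squamous Cell Carcinoma",
    "Squamous Cell Carcinoma",
]
result = candidate_missing_concepts(names)
print(result)
```

In this toy input, the two longer names share the run "skin squamous cell carcinoma", which is not an existing name, so it is reported as a candidate. The existing names that contributed to a candidate are natural starting points for its subtypes when positioning it in the hierarchy, although the positioning and validation steps named in the abstract are not modeled here.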