An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Fraud detection has received considerable attention from academia and industry worldwide owing to its growing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. The skewed class distribution and the data volume are considered significant problems...

Bibliographic Details
Main Authors: Shamitha S. Kotekani, Ilango Velchamy
Format: article
Language: EN
Published: University of Zagreb Faculty of Electrical Engineering and Computing 2020
Subjects: health insurance; fraud detection; class imbalance; k-means; SMOTE; classification algorithms; Electronic computers. Computer science; QA75.5-76.95
Online Access: https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5
id oai:doaj.org-article:ba5f73725a294efba99f3d1a452d98c5
record_format dspace
spelling oai:doaj.org-article:ba5f73725a294efba99f3d1a452d98c5
2021-12-02T18:10:33Z
An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
1330-1136
1846-3908
https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5
2020-01-01T00:00:00Z
https://hrcak.srce.hr/file/384993
https://doaj.org/toc/1330-1136
https://doaj.org/toc/1846-3908
Shamitha S. Kotekani
Ilango Velchamy
University of Zagreb Faculty of Electrical Engineering and Computing
article
health insurance
fraud detection
class imbalance
k-means
SMOTE
classification algorithms
Electronic computers. Computer science
QA75.5-76.95
EN
Journal of Computing and Information Technology, Vol 28, Iss 4, Pp 269-285 (2020)
institution DOAJ
collection DOAJ
language EN
topic health insurance
fraud detection
class imbalance
k-means
SMOTE
classification algorithms
Electronic computers. Computer science
QA75.5-76.95
description Fraud detection has received considerable attention from academia and industry worldwide owing to its growing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. The skewed class distribution and the data volume are considered significant problems when analyzing insurance datasets, because both increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance, so more sophisticated techniques are needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, the third step applies the Tomek Link algorithm to the balanced data to remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, leaving the data more precise and less noisy. The resulting dataset is used to train four classification algorithms, Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks, evaluated with repeated 5-fold cross-validation. Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, 98.9%. The results were also measured using F1 score, Precision, Recall, and AUC. The results show that the proposed method performed significantly better than any other fraud detection approach in health insurance, identifying more fraudulent records with greater accuracy and a 3x increase in training speed.
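
The oversampling and cleaning steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' FusedRCE implementation: the PCA and k-means stages are omitted, plain SMOTE-style interpolation stands in for the k-means/SMOTE hybrid, and removing both members of each Tomek link is an assumption (the abstract does not say which member is dropped). All function names and the toy data are hypothetical.

```python
import math
import random


def nearest(idx, points):
    """Index of the point closest (Euclidean) to points[idx], excluding idx."""
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (an assumption of this sketch)."""
    drop = {i for pair in tomek_links(points, labels) for i in pair}
    kept = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in kept], [labels[i] for i in kept]


# Toy data: label 0 = majority (legitimate), label 1 = minority (fraud).
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.9, 3.0)]
minority = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0), (4.0, 4.0)]
points = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

links = tomek_links(points, labels)
cleaned_points, cleaned_labels = remove_tomek_links(points, labels)
synthetic = smote_like(minority, n_new=2)

print("Tomek links:", links)
print("Samples after cleaning:", len(cleaned_points))
print("Synthetic minority samples:", synthetic)
```

In this toy set the majority point (2.9, 3.0) and the minority point (3.0, 3.0) are each other's nearest neighbours, so they form the only Tomek link. In the order given in the abstract, cleaning would run on the data after oversampling; the two functions are shown separately here to keep each step easy to follow.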
format article
author Shamitha S. Kotekani
Ilango Velchamy
title An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
publisher University of Zagreb Faculty of Electrical Engineering and Computing
publishDate 2020
url https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5