An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Fraud detection has received considerable attention from academic researchers and industries worldwide due to its increasing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. The skewed class distribution and the volume of the data are considered significant problems...

Bibliographic Details
Main Authors:	Shamitha S. Kotekani, Ilango Velchamy
Format:	article
Language:	EN
Published:	University of Zagreb Faculty of Electrical Engineering and Computing 2020
Subjects:	health insurance; fraud detection; class imbalance; k-means; SMOTE; classification algorithms; Electronic computers. Computer science; QA75.5-76.95
Online Access:	https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5

id	oai:doaj.org-article:ba5f73725a294efba99f3d1a452d98c5
record_format	dspace
spelling	oai:doaj.org-article:ba5f73725a294efba99f3d1a452d98c5 | 2021-12-02T18:10:33Z | An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection | 1330-1136 | 1846-3908 | https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5 | 2020-01-01T00:00:00Z | https://hrcak.srce.hr/file/384993 | https://doaj.org/toc/1330-1136 | https://doaj.org/toc/1846-3908 | Shamitha S. Kotekani | Ilango Velchamy | University of Zagreb Faculty of Electrical Engineering and Computing | article | health insurance | fraud detection | class imbalance | k-means | SMOTE | classification algorithms | Electronic computers. Computer science | QA75.5-76.95 | EN | Journal of Computing and Information Technology, Vol 28, Iss 4, Pp 269-285 (2020)
institution	DOAJ
collection	DOAJ
language	EN
topic	health insurance; fraud detection; class imbalance; k-means; SMOTE; classification algorithms; Electronic computers. Computer science; QA75.5-76.95
spellingShingle	health insurance; fraud detection; class imbalance; k-means; SMOTE; classification algorithms; Electronic computers. Computer science; QA75.5-76.95; Shamitha S. Kotekani; Ilango Velchamy; An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
description	Fraud detection has received considerable attention from academic researchers and industries worldwide due to its increasing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. The skewed class distribution and the volume of the data are considered significant problems when analyzing insurance datasets, as these issues increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance. More sophisticated techniques are therefore needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve the prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces a lot of noise into the data, the third step uses the Tomek Link algorithm to thoroughly clean the balanced data and remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, making the data more precise and freer from noise. The resulting dataset is used by four classification algorithms (Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks), evaluated with repeated 5-fold cross-validation. Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, 98.9%. The results were also measured using metrics such as F1 score, Precision, Recall, and AUC. The results show that the proposed method performed significantly better than any other fraud detection approach in health insurance, predicting more fraudulent data with greater accuracy and a 3x increase in training speed.
format	article
author	Shamitha S. Kotekani Ilango Velchamy
author_facet	Shamitha S. Kotekani Ilango Velchamy
author_sort	Shamitha S. Kotekani
title	An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
title_short	An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
title_full	An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
title_fullStr	An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
title_full_unstemmed	An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection
title_sort	effective data sampling procedure for imbalanced data learning on health insurance fraud detection
publisher	University of Zagreb Faculty of Electrical Engineering and Computing
publishDate	2020
url	https://doaj.org/article/ba5f73725a294efba99f3d1a452d98c5
work_keys_str_mv	AT shamithaskotekani aneffectivedatasamplingprocedureforimbalanceddatalearningonhealthinsurancefrauddetection AT ilangovelchamy aneffectivedatasamplingprocedureforimbalanceddatalearningonhealthinsurancefrauddetection AT shamithaskotekani effectivedatasamplingprocedureforimbalanceddatalearningonhealthinsurancefrauddetection AT ilangovelchamy effectivedatasamplingprocedureforimbalanceddatalearningonhealthinsurancefrauddetection
_version_	1718378594447654912
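
The oversampling and cleaning steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' FusedRCE implementation: the PCA and k-means steps are omitted, the function names (`smote`, `tomek_links`, `remove_tomek_links`) and the toy data are hypothetical, and removing both members of each Tomek link is an assumption, since the abstract does not say which samples are dropped.

```python
import math
import random


def nearest_neighbors(point, candidates, k):
    """Return up to k candidates closest to point (Euclidean), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling: each synthetic sample is a random point on the
    segment between a minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(base, minority, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, whose samples have different labels
    and are each other's nearest neighbor."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (an assumption; some variants
    drop only the majority-class member)."""
    drop = {i for pair in tomek_links(X, y) for i in pair}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


if __name__ == "__main__":
    # Hypothetical toy data: label 0 = legitimate (majority), 1 = fraud (minority).
    majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1)]
    minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.2)]

    synthetic = smote(minority, n_new=len(majority) - len(minority))
    X = majority + minority + synthetic
    y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

    X_clean, y_clean = remove_tomek_links(X, y)
    print(len(X), "samples before cleaning,", len(X_clean), "after")
```

For real datasets, the imbalanced-learn library provides `KMeansSMOTE` and `TomekLinks` classes that cover similar ground; whether they match the configuration used in the article is not stated in this record.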