When Two are Better Than One: Synthesizing Heavily Unbalanced Data

Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, whe...

Bibliographic Details
Main authors: Francisco Ferreira, Nuno Lourenco, Bruno Cabral, Joao Paulo Fernandes
Format: article
Language: EN
Published: IEEE 2021
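Subjects: Fraud detection; generative adversarial networks; privacy; machine learning; synthetic data generation; tabular data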
Online access: https://doaj.org/article/baae542a3fa84f2aa6f952e2cef0f249
id oai:doaj.org-article:baae542a3fa84f2aa6f952e2cef0f249
record_format dspace
spelling oai:doaj.org-article:baae542a3fa84f2aa6f952e2cef0f249
2021-11-18T00:09:10Z
When Two are Better Than One: Synthesizing Heavily Unbalanced Data
2169-3536
10.1109/ACCESS.2021.3126656
https://doaj.org/article/baae542a3fa84f2aa6f952e2cef0f249
2021-01-01T00:00:00Z
https://ieeexplore.ieee.org/document/9606863/
https://doaj.org/toc/2169-3536
Francisco Ferreira
Nuno Lourenco
Bruno Cabral
Joao Paulo Fernandes
IEEE
article
Fraud detection
generative adversarial networks
privacy
machine learning
synthetic data generation
tabular data
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
EN
IEEE Access, Vol 9, Pp 150459-150469 (2021)
institution DOAJ
collection DOAJ
language EN
topic Fraud detection
generative adversarial networks
privacy
machine learning
synthetic data generation
tabular data
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
description Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data.
format article
author Francisco Ferreira
Nuno Lourenco
Bruno Cabral
Joao Paulo Fernandes
title When Two are Better Than One: Synthesizing Heavily Unbalanced Data
publisher IEEE
publishDate 2021
url https://doaj.org/article/baae542a3fa84f2aa6f952e2cef0f249
work_keys_str_mv AT franciscoferreira whentwoarebetterthanonesynthesizingheavilyunbalanceddata
AT nunolourenco whentwoarebetterthanonesynthesizingheavilyunbalanceddata
AT brunocabral whentwoarebetterthanonesynthesizingheavilyunbalanceddata
AT joaopaulofernandes whentwoarebetterthanonesynthesizingheavilyunbalanceddata
_version_ 1718425247696289792
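
Illustrative sketch (not part of the catalog record). The abstract describes two ideas that a short example may make more concrete: fitting one generator per class (fraudulent and legitimate) and comparing F1 scores of a classifier trained on actual data against one trained on synthetic data, both tested on actual data. The Python below is a minimal sketch under loud assumptions: GaussianSampler is a hypothetical stand-in for a GAN generator, the nearest-centroid classifier and the toy two-feature data are invented for illustration, and none of the names come from the paper.

```python
import random
import statistics


class GaussianSampler:
    """Hypothetical stand-in for one GAN generator (NOT the paper's model):
    fits an independent Gaussian to each feature and samples from it."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [statistics.fmean(c) for c in cols]
        self.stds = [statistics.pstdev(c) for c in cols]
        return self

    def sample(self, n, rng):
        return [[rng.gauss(m, s) for m, s in zip(self.means, self.stds)]
                for _ in range(n)]


def synthesize_per_class(X, y, n_legit, n_fraud, rng):
    """Fit one generator per class (0 = legitimate, 1 = fraud) and merge samples."""
    synth_X, synth_y = [], []
    for label, n in ((0, n_legit), (1, n_fraud)):
        rows = [x for x, t in zip(X, y) if t == label]
        generator = GaussianSampler().fit(rows)
        synth_X.extend(generator.sample(n, rng))
        synth_y.extend([label] * n)
    return synth_X, synth_y


def centroid_classifier(X, y):
    """Toy classifier (assumption): predict the class with the nearest centroid."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(x, centroids[l])))

    return predict


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (fraud) class; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    rng = random.Random(0)

    def make_data(n_legit, n_fraud):
        # Toy unbalanced data: legitimate rows near (0, 0), fraud rows near (3, 3).
        X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_legit)]
        X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(n_fraud)]
        return X, [0] * n_legit + [1] * n_fraud

    X_train, y_train = make_data(950, 50)
    X_test, y_test = make_data(950, 50)

    # Train on synthetic data, test on actual data, and compare with the
    # classifier trained on actual data.
    X_syn, y_syn = synthesize_per_class(X_train, y_train, 950, 50, rng)
    real_model = centroid_classifier(X_train, y_train)
    syn_model = centroid_classifier(X_syn, y_syn)
    f1_real = f1_score(y_test, [real_model(x) for x in X_test])
    f1_syn = f1_score(y_test, [syn_model(x) for x in X_test])
    print(f"F1 (actual-trained): {f1_real:.3f}")
    print(f"F1 (synthetic-trained): {f1_syn:.3f}")
    print(f"Disparity: {abs(f1_real - f1_syn):.3f}")
```

Fitting the generators separately lets the rare fraud class be modelled without being swamped by the legitimate majority, and it lets the number of synthetic rows per class be chosen freely. Any disparity printed by this toy run says nothing about the 5% figure reported in the abstract.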