Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification

The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis...

Bibliographic Details
Main authors: Byunghyun Baek, Seoungyul Euh, Dongheon Baek, Donghoon Kim, Doosung Hwang
Format: article
Language: EN
Published: IEEE 2021
Subjects:
Online access: https://doaj.org/article/deaead13448e41498741bd0ea8718439
id oai:doaj.org-article:deaead13448e41498741bd0ea8718439
record_format dspace
spelling oai:doaj.org-article:deaead13448e41498741bd0ea8718439
2021-11-20T00:01:39Z
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification
2169-3536
10.1109/ACCESS.2021.3127195
https://doaj.org/article/deaead13448e41498741bd0ea8718439
2021-01-01T00:00:00Z
https://ieeexplore.ieee.org/document/9611252/
https://doaj.org/toc/2169-3536
IEEE Access, Vol 9, Pp 152098-152114 (2021)
institution DOAJ
collection DOAJ
language EN
topic Malware family classification
histogram entropy
low-dimensional feature
hyperrectangle
prototype selection
ensemble model
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
description The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis through learning models has both high temporal and spatial complexity. In order to address these problems, this work proposes a low-dimensional feature using histogram entropy and a prototype selection algorithm using hyperrectangles. The low-dimensional feature forms an $L \times 256$ map according to the preselected parameter $L$. The prototype selection algorithm divides the input space into overlapping subspaces where each subspace is decided by its hyperrectangle that becomes a prototype in the same class. A set cover optimization algorithm is employed to select a small number of prototypes that construct a new training dataset. A set of prototypes selected by the prototype selection algorithm is used to classify malware families. The experiment compares the performance of machine learning models for the histogram entropy feature using both the BIG 2015 dataset and the collected dataset. The integrated approach is evaluated using learning algorithms, such as Decision Tree, Random Forest, XGBoost, and CNN. The experimental results indicate that learning models perform competitively when compared to the entire dataset, while the proposed selection approach benefits from smaller datasets and lower time complexity.
format article
author Byunghyun Baek
Seoungyul Euh
Dongheon Baek
Donghoon Kim
Doosung Hwang
author_sort Byunghyun Baek
title Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification
publisher IEEE
publishDate 2021
url https://doaj.org/article/deaead13448e41498741bd0ea8718439
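
The description above states only that a set cover optimization algorithm selects a small number of hyperrectangle prototypes; it does not give the algorithm itself. As an illustration, the following is a minimal sketch of the standard greedy set cover heuristic, assuming each candidate prototype has already been mapped to the set of same-class training sample indices that its hyperrectangle contains. The function name `greedy_set_cover`, the `candidates` mapping, and the toy data are hypothetical and are not taken from the article.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(candidates: Dict[Hashable, Set[int]]) -> List[Hashable]:
    """Greedy set cover: repeatedly pick the candidate covering the most
    still-uncovered samples until every coverable sample is covered.

    `candidates` maps a (hypothetical) prototype id to the set of sample
    indices assumed to fall inside that prototype's hyperrectangle.
    """
    # Every sample that at least one candidate can cover.
    uncovered: Set[int] = set().union(*candidates.values())
    chosen: List[Hashable] = []
    while uncovered:
        # Candidate with the largest gain; ties go to the first one seen.
        best = max(candidates, key=lambda k: len(candidates[k] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:  # defensive guard; should not trigger
            break
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy data (assumed): four candidate prototypes over samples 0..4.
candidates = {
    "p0": {0, 1, 2},
    "p1": {2, 3},
    "p2": {3, 4},
    "p3": {4},
}
selected = greedy_set_cover(candidates)
print(selected)
```

In this toy case two of the four candidates cover all five samples, which is the kind of training-set reduction the description refers to. The greedy heuristic is an approximation and is not guaranteed to return a minimum cover.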