Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification
The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis...
Main Authors: | Byunghyun Baek, Seoungyul Euh, Dongheon Baek, Donghoon Kim, Doosung Hwang |
---|---|
Format: | article |
Language: | EN |
Published: | IEEE 2021 |
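Subjects: | Malware family classification; histogram entropy; low-dimensional feature; hyperrectangle; prototype selection; ensemble model |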
Online Access: | https://doaj.org/article/deaead13448e41498741bd0ea8718439 |
id |
oai:doaj.org-article:deaead13448e41498741bd0ea8718439 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:deaead13448e41498741bd0ea8718439; 2021-11-20T00:01:39Z; Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification; 2169-3536; 10.1109/ACCESS.2021.3127195; https://doaj.org/article/deaead13448e41498741bd0ea8718439; 2021-01-01T00:00:00Z; https://ieeexplore.ieee.org/document/9611252/; https://doaj.org/toc/2169-3536; The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis through learning models has both high temporal and spatial complexity. In order to address these problems, this work proposes a low-dimensional feature using histogram entropy and a prototype selection algorithm using hyperrectangles. The low-dimensional feature forms an $L \times 256$ map according to the preselected parameter $L$. The prototype selection algorithm divides the input space into overlapping subspaces where each subspace is decided by its hyperrectangle that becomes a prototype in the same class. A set cover optimization algorithm is employed to select a small number of prototypes that construct a new training dataset. A set of prototypes selected by the prototype selection algorithm is used to classify malware families. The experiment compares the performance of machine learning models for the histogram entropy feature using both the BIG 2015 dataset and the collected dataset. The integrated approach is evaluated using learning algorithms, such as Decision Tree, Random Forest, XGBoost, and CNN. The experimental results indicate that learning models perform competitively when compared to the entire dataset, while the proposed selection approach benefits from smaller datasets and lower time complexity.; Byunghyun Baek; Seoungyul Euh; Dongheon Baek; Donghoon Kim; Doosung Hwang; IEEE; article; Malware family classification; histogram entropy; low-dimensional feature; hyperrectangle; prototype selection; ensemble model; Electrical engineering. Electronics. Nuclear engineering; TK1-9971; EN; IEEE Access, Vol 9, Pp 152098-152114 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Malware family classification; histogram entropy; low-dimensional feature; hyperrectangle; prototype selection; ensemble model; Electrical engineering. Electronics. Nuclear engineering; TK1-9971 |
spellingShingle |
Malware family classification; histogram entropy; low-dimensional feature; hyperrectangle; prototype selection; ensemble model; Electrical engineering. Electronics. Nuclear engineering; TK1-9971; Byunghyun Baek; Seoungyul Euh; Dongheon Baek; Donghoon Kim; Doosung Hwang; Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
description |
The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis through learning models has both high temporal and spatial complexity. In order to address these problems, this work proposes a low-dimensional feature using histogram entropy and a prototype selection algorithm using hyperrectangles. The low-dimensional feature forms an $L \times 256$ map according to the preselected parameter $L$. The prototype selection algorithm divides the input space into overlapping subspaces where each subspace is decided by its hyperrectangle that becomes a prototype in the same class. A set cover optimization algorithm is employed to select a small number of prototypes that construct a new training dataset. A set of prototypes selected by the prototype selection algorithm is used to classify malware families. The experiment compares the performance of machine learning models for the histogram entropy feature using both the BIG 2015 dataset and the collected dataset. The integrated approach is evaluated using learning algorithms, such as Decision Tree, Random Forest, XGBoost, and CNN. The experimental results indicate that learning models perform competitively when compared to the entire dataset, while the proposed selection approach benefits from smaller datasets and lower time complexity. |
format |
article |
author |
Byunghyun Baek; Seoungyul Euh; Dongheon Baek; Donghoon Kim; Doosung Hwang |
author_facet |
Byunghyun Baek; Seoungyul Euh; Dongheon Baek; Donghoon Kim; Doosung Hwang |
author_sort |
Byunghyun Baek |
title |
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
title_short |
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
title_full |
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
title_fullStr |
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
title_full_unstemmed |
Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification |
title_sort |
histogram entropy representation and prototype based machine learning approach for malware family classification |
publisher |
IEEE |
publishDate |
2021 |
url |
https://doaj.org/article/deaead13448e41498741bd0ea8718439 |
work_keys_str_mv |
AT byunghyunbaek histogramentropyrepresentationandprototypebasedmachinelearningapproachformalwarefamilyclassification AT seoungyuleuh histogramentropyrepresentationandprototypebasedmachinelearningapproachformalwarefamilyclassification AT dongheonbaek histogramentropyrepresentationandprototypebasedmachinelearningapproachformalwarefamilyclassification AT donghoonkim histogramentropyrepresentationandprototypebasedmachinelearningapproachformalwarefamilyclassification AT doosunghwang histogramentropyrepresentationandprototypebasedmachinelearningapproachformalwarefamilyclassification |
_version_ |
1718419868896722944 |
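
Illustrative note (not part of the catalogued record): the description above mentions an $L \times 256$ histogram entropy map and a set cover step for choosing prototypes, but the record does not say how either is computed. The Python sketch below is one possible reading, offered only as an assumption. The function names `entropy_histogram_map` and `greedy_set_cover`, the equal-length chunking of the file into $L$ rows, the per-byte term $-p \log_2 p$, and the greedy heuristic are all hypothetical choices made for illustration; the article itself may define the feature and the optimization differently, and the hyperrectangle construction is not modeled here.

```python
import math
from typing import Dict, List, Set


def entropy_histogram_map(data: bytes, num_rows: int) -> List[List[float]]:
    """Return a num_rows x 256 map (assumed construction, not the paper's).

    The input is cut into num_rows equal-length chunks. Row r holds, for each
    byte value v, the entropy term -p*log2(p), where p is the relative
    frequency of v inside chunk r.
    """
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    chunk_len = max(1, math.ceil(len(data) / num_rows))
    rows: List[List[float]] = []
    for r in range(num_rows):
        chunk = data[r * chunk_len:(r + 1) * chunk_len]
        counts = [0] * 256
        for byte in chunk:
            counts[byte] += 1
        row = [0.0] * 256
        total = len(chunk)
        if total:  # an empty trailing chunk stays all zeros
            for value, count in enumerate(counts):
                if count:
                    p = count / total
                    row[value] = -p * math.log2(p)
        rows.append(row)
    return rows


def greedy_set_cover(coverage: Dict[str, Set[int]]) -> List[str]:
    """Greedy set cover heuristic over candidate prototypes.

    coverage maps a candidate id to the set of sample indices it covers.
    At each step the candidate covering the most still-uncovered samples
    is selected, until every coverable sample is covered.
    """
    uncovered: Set[int] = set().union(*coverage.values())
    chosen: List[str] = []
    while uncovered:
        best = max(coverage, key=lambda k: len(coverage[k] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


if __name__ == "__main__":
    feature = entropy_histogram_map(bytes([0, 1] * 4), num_rows=2)
    print(len(feature), len(feature[0]))
    print(greedy_set_cover({"a": {0, 1, 2}, "b": {2, 3}, "c": {3}}))
```

In this sketch the selected candidate ids would stand in for the reduced training set; a classifier such as those named in the description would then be trained on the corresponding samples only.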