Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification

Bibliographic Details
Main authors: Byunghyun Baek, Seoungyul Euh, Dongheon Baek, Donghoon Kim, Doosung Hwang
Format: article
Language: EN
Published: IEEE 2021
Online access: https://doaj.org/article/deaead13448e41498741bd0ea8718439
Description
Summary: The volume of malware has steadily increased as malware propagation and evasion techniques have advanced. Machine learning has made malware analysis more efficient by detecting various behavioral and evasion patterns. However, on large-scale malware datasets, analysis with learning models incurs high time and space complexity. To address these problems, this work proposes a low-dimensional feature based on histogram entropy and a prototype selection algorithm based on hyperrectangles. The low-dimensional feature forms an $L \times 256$ map, where $L$ is a preselected parameter. The prototype selection algorithm divides the input space into overlapping subspaces, each defined by a hyperrectangle that serves as a prototype for samples of the same class. A set cover optimization algorithm then selects a small number of these prototypes, which form a new, reduced training dataset used to classify malware families. The experiments compare the performance of machine learning models on the histogram entropy feature using both the BIG 2015 dataset and a separately collected dataset. The integrated approach is evaluated with Decision Tree, Random Forest, XGBoost, and CNN models. The results indicate that models trained on the selected prototypes perform competitively with models trained on the entire dataset, while benefiting from a smaller training set and lower time complexity.
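
The prototype selection step can be illustrated with a minimal sketch. The summary does not say how the hyperrectangles are constructed or which set cover solver is used, so everything here is an assumption: the boxes are taken as given axis-aligned bounds, the solver is the standard greedy approximation for set cover, and the names `box_coverage` and `greedy_set_cover` are hypothetical. Class labels are omitted; under the "same class" condition in the summary, a sketch like this would presumably be run once per class.

```python
import numpy as np

def box_coverage(lows, highs, X):
    """Return cover[i, j] = True when sample X[j] lies inside axis-aligned box i.

    lows, highs: (n_boxes, n_features) bounds (hypothetical inputs).
    X: (n_samples, n_features) feature vectors.
    """
    inside = (X[None, :, :] >= lows[:, None, :]) & (X[None, :, :] <= highs[:, None, :])
    return inside.all(axis=2)

def greedy_set_cover(cover):
    """Greedy set cover: repeatedly pick the box that covers the most uncovered samples."""
    # Samples that no box contains can never be covered, so they are ignored.
    uncovered = cover.any(axis=0)
    chosen = []
    while uncovered.any():
        gains = (cover & uncovered).sum(axis=1)  # new samples each box would cover
        best = int(gains.argmax())               # ties go to the lowest index
        chosen.append(best)
        uncovered &= ~cover[best]
    return chosen

# Toy 1-D data: five samples and four candidate boxes (intervals).
X = np.array([[0.0], [1.0], [2.0], [5.0], [6.0]])
lows = np.array([[0.0], [1.0], [5.0], [0.0]])
highs = np.array([[2.0], [5.0], [6.0], [0.0]])

chosen = greedy_set_cover(box_coverage(lows, highs, X))
print(chosen)  # [0, 2]: boxes [0, 2] and [5, 6] cover all five samples
```

The greedy rule is chosen only because it is simple and widely used; it gives an approximate, not necessarily minimal, cover, and the solver in the article may differ.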