Robust Ensemble Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient Stacked Voting Classifier (ERG-SVC)

As cyber-attacks grow faster and more complex, the cybersecurity industry faces challenges in applying state-of-the-art technology and strategies to combat ever-present malicious threats. Phishing is a type of social engineering attack that is carried out by technical means and classified as identity theft an...

Bibliographic Details
Main authors: Pubudu L. Indrasiri, Malka N. Halgamuge, Azeem Mohammad
Format: article
Language: EN
Published: IEEE 2021
Subjects:
NLP
Online access: https://doaj.org/article/823e60f9eeac44cea8f7163caca7b006
id oai:doaj.org-article:823e60f9eeac44cea8f7163caca7b006
record_format dspace
spelling oai:doaj.org-article:823e60f9eeac44cea8f7163caca7b006
2021-11-18T00:02:45Z
Robust Ensemble Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient Stacked Voting Classifier (ERG-SVC)
2169-3536
10.1109/ACCESS.2021.3124628
https://doaj.org/article/823e60f9eeac44cea8f7163caca7b006
2021-01-01T00:00:00Z
https://ieeexplore.ieee.org/document/9597532/
https://doaj.org/toc/2169-3536
IEEE Access, Vol 9, Pp 150142-150161 (2021)
institution DOAJ
collection DOAJ
language EN
topic Phishing URLs
cybersecurity
machine learning
NLP
supervised
unsupervised
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
description As cyber-attacks grow faster and more complex, the cybersecurity industry faces challenges in applying state-of-the-art technology and strategies to combat ever-present malicious threats. Phishing is a type of social engineering attack that is carried out by technical means and classified as identity theft, using complicated attack vectors to steal the information of internet users. The main objective of this study is to propose a unique, robust ensemble machine learning model architecture that provides the highest prediction accuracy with a low error rate, while also proposing a few other robust machine learning models. Both supervised and unsupervised techniques were used for the detection process. For our experiments, seven classification algorithms, one clustering algorithm, two ensemble techniques, and two large standard legitimate datasets with 73,575 URLs and 100,000 URLs were used. Two test modes (percentage split, K-Fold cross-validation) were utilized for conducting experiments and final predictions.
Mechanisms were developed to (I) identify the best $N$, which is the optimal heuristic-based threshold value for splitting words into subwords for each classifier, (II) tune hyperparameters for each classifier to specify the best parameter combination, (III) select prominent features using various feature selection techniques, (IV) propose a robust ensemble model (classifier) called the Expandable Random Gradient Stacked Voting Classifier (ERG-SVC) utilizing a voting classifier along with a model architecture, (V) analyze possible clusters of the dataset using k-means clustering, (VI) thoroughly analyze the gradient boost classifier (GB) with respect to utilizing the "criterion" parameter with the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Friedman_MSE, and (VII) propose a lightweight preprocessor to reduce computational cost and preprocessing time. Initial experiments were carried out with 46 features; the number of features was reduced to 22 after the experiments. The results show that the GB classifier outperformed the other classifiers while using the fewest NLP-based features, achieving a prediction accuracy of 98.118%. Furthermore, our stacking ensemble model and proposed voting ensemble model (ERG-SVC) outperformed other tested approaches and yielded reliable prediction accuracy results in detecting malicious URLs at rates of 98.23% and 98.27%, respectively.
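The voting step mentioned in item (IV) can be illustrated with a minimal, generic sketch. This is not the ERG-SVC architecture, which the abstract does not specify; the function names `hard_vote` and `soft_vote`, the 0.5 threshold, and the example scores are all assumptions made for illustration. Label 1 stands for "phishing" and 0 for "legitimate".

```python
from collections import Counter
from typing import Sequence


def hard_vote(labels: Sequence[int]) -> int:
    """Majority vote over the class labels predicted by the base models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> int:
    """Average the per-model phishing probabilities and threshold the mean.

    The threshold of 0.5 is an illustrative assumption, not a value
    taken from the paper.
    """
    mean = sum(probabilities) / len(probabilities)
    return 1 if mean >= threshold else 0


if __name__ == "__main__":
    # Hypothetical phishing probabilities from three base models for one URL.
    scores = [0.45, 0.45, 0.95]
    # Each model's own label, obtained by thresholding its score at 0.5.
    labels = [1 if s >= 0.5 else 0 for s in scores]

    print(hard_vote(labels))   # 0: two of three models say "legitimate"
    print(soft_vote(scores))   # 1: the mean score is about 0.617
```

The example input is chosen so that the two schemes disagree: hard voting counts only the labels, while soft voting lets one very confident model outweigh two mildly uncertain ones.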
format article
author Pubudu L. Indrasiri
Malka N. Halgamuge
Azeem Mohammad
title Robust Ensemble Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient Stacked Voting Classifier (ERG-SVC)
publisher IEEE
publishDate 2021
url https://doaj.org/article/823e60f9eeac44cea8f7163caca7b006