Machine Learning-Based Effective Malicious Web Page Detection

Anıl Utku; Ümit Can

Research Article

Year 2022, Volume: 11, Issue: 4, Pages: 28 - 39, 31.12.2022

Abstract

The use of the Internet is becoming more widespread every day, putting millions of users at risk of cyberattacks.
During the Covid-19 pandemic in particular, internet usage increased significantly, and various cyberattacks were
carried out through malicious websites. Such attacks can capture private personal information, banking details,
and social information. Many methods have been developed to prevent cyberattacks. In particular, methods based on
machine learning give more successful results than traditional methods. This study attempts to detect malicious
websites automatically by using their URL properties. For this purpose, popular machine learning methods were used:
decision tree (DT), k-nearest neighbors (kNN), LightGBM, logistic regression (LR), multilayer perceptron (MLP),
random forest (RF), support vector machine (SVM), and XGBoost. According to the experimental results,
the RF algorithm achieved 96% accuracy.
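
As a rough illustration of what "URL properties" might look like as classifier input, the sketch below derives a few lexical features from a URL string and computes accuracy as the fraction of correct predictions. The feature set and the function names `url_features` and `accuracy` are assumptions made for this example; the abstract does not list the features the study used, and this is not the authors' implementation.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    The feature set is hypothetical: the abstract does not list the
    URL properties that the study actually used.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A host given as a raw IP address instead of a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
    }


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Feature dictionaries of this kind could be converted to numeric vectors and passed to any of the classifiers named in the abstract; an accuracy of 0.96 would then mean that 96 of every 100 test URLs were labeled correctly.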

Keywords

Malicious websites, cyber attacks, machine learning.

Details

Primary Language	English
Subjects	Computer Software
Section	Research Article
Authors	Anıl Utku 0000-0002-7240-8713 Ümit Can 0000-0002-8832-6317
Publication Date	December 31, 2022
Submission Date	July 25, 2022
Published in Issue	Year 2022 Volume: 11 Issue: 4

Cite

IEEE	A. Utku and Ü. Can, “Machine Learning-Based Effective Malicious Web Page Detection”, IJISS, vol. 11, no. 4, pp. 28–39, 2022.
