Phishing Email Classification Approach Using Machine Learning Algorithms - A Literature Review

Authors

  • Firman, Universitas Pamulang
  • Tukiyat, Universitas Pamulang
  • Sudarno Wiharjo, Universitas Pamulang

DOI:

https://doi.org/10.61978/data.v3i3.692

Keywords:

Phishing, Email, Machine Learning, Email Classification, K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM), Random Forest, NLP Feature Extraction

Abstract

Email phishing is a growing cybersecurity threat that uses social engineering to obtain sensitive data. Various machine learning approaches have been studied for detecting and classifying phishing emails. This article presents a literature review of phishing email classification methods, including K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM), Random Forest, and deep learning-based approaches. The discussion covers feature extraction techniques (TF-IDF, Word2Vec, BERT), the handling of imbalanced data, and model performance evaluation. The review identifies current research trends, challenges, and gaps for further research.


Published

2025-07-31

How to Cite

Firman, Tukiyat, & Wiharjo, S. (2025). Phishing Email Classification Approach Using Machine Learning Algorithms - A Literature Review. Data : Journal of Information Systems and Management, 3(3), 135–145. https://doi.org/10.61978/data.v3i3.692

Section

Articles