Phishing Email Classification Approach Using Machine Learning Algorithms - A Literature Review
DOI:
https://doi.org/10.61978/data.v3i3.692
Keywords:
Phishing, Email, Machine Learning, Email Classification, K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM), Random Forest, NLP Feature Extraction
Abstract
Email phishing is a growing cybersecurity threat that uses social engineering to obtain sensitive data. Various machine learning-based approaches have been studied to detect and classify phishing emails. This article presents a literature review of phishing email classification methods, including K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM), Random Forest, and deep learning-based approaches. The discussion covers feature extraction techniques (TF-IDF, Word2Vec, BERT), the handling of class imbalance, and model performance evaluation. The review identifies current research trends, open challenges, and gaps for further research.
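As a rough illustration of the kind of pipeline the reviewed studies describe, the sketch below pairs TF-IDF feature extraction with a K-Nearest Neighbor vote over cosine similarity. It is a minimal example under stated assumptions, not a method taken from any of the reviewed papers: the six-email corpus, the labels, the smoothed IDF formula, and the helper names (tokenize, fit_idf, tfidf, cosine, knn_predict) are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from any reviewed study): (text, label)
TRAIN = [
    ("verify your account password immediately click this link", "phishing"),
    ("urgent your bank account is suspended confirm password", "phishing"),
    ("click here to claim your prize and confirm bank details", "phishing"),
    ("meeting agenda for monday attached see you at ten", "legitimate"),
    ("please review the attached project report before the meeting", "legitimate"),
    ("lunch on friday with the project team", "legitimate"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for each training term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: idf = log((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    """Build an L2-normalized TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vecs, idf, k=3):
    """Label a new email by majority vote among its k most similar training emails."""
    query = tfidf(tokenize(text), idf)
    ranked = sorted(train_vecs, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


docs = [tokenize(text) for text, _ in TRAIN]
IDF = fit_idf(docs)
TRAIN_VECS = [(tfidf(doc, IDF), label) for doc, (_, label) in zip(docs, TRAIN)]

print(knn_predict("confirm your bank password now click link", TRAIN_VECS, IDF))
print(knn_predict("project meeting report attached", TRAIN_VECS, IDF))
```

A corpus this small says nothing about real-world accuracy. In practice, the class imbalance handling and performance evaluation steps mentioned above would be applied to a much larger labeled dataset before any comparison between classifiers is meaningful.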