Evaluating the Impact of Data Balancing Techniques on the k-Nearest Neighbors Algorithm for Microarray Data Classification

  • Febi Nur Salisah Universitas Islam Negeri Sultan Syarif Kasim
  • Inggih Permana Universitas Islam Negeri Sultan Syarif Kasim
  • Sanusi Universitas Teuku Umar
  • Shir Li Wang Universiti Pendidikan Sultan Idris
Keywords: Microarray, kNN, RUS, ROS, SMOTE

Abstract

Microarray data classification poses significant challenges in bioinformatics because the data have a very large number of features, a limited number of samples, and an imbalanced class distribution. These conditions can degrade the performance of classification models, including k-Nearest Neighbors (kNN). This study evaluates the performance of the kNN algorithm on imbalanced and balanced data. The balancing techniques used are Random Undersampling (RUS), Random Oversampling (ROS), and the Synthetic Minority Over-sampling Technique (SMOTE). The experiments use three leukemia datasets with different class structures: two, three, and four classes. The results show that ROS and SMOTE consistently improve kNN performance, with the best accuracy exceeding 97%. On the two-class dataset, ROS gave the best performance (99.4%), while on the three-class dataset, SMOTE produced the best results (98.5%). On the four-class dataset, the improvement from balancing was substantial: SMOTE and ROS raised accuracy from 89.7% (without balancing) to 99.0% and 98.8%, respectively. Although RUS recorded a perfect accuracy of 100%, that result was anomalous and inconsistent; RUS was generally less stable and often performed worse than no balancing, especially on the four-class dataset. Overall, SMOTE proved the most stable and effective across class structures. These findings underline the importance of balancing strategies when classifying complex, imbalanced microarray data.
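The three balancing strategies compared in the abstract can be sketched in a few lines. The following is a minimal illustration using scikit-learn only: the synthetic dataset, the manual RUS/ROS resamplers, and the simplified SMOTE interpolation are assumptions for demonstration, not the study's leukemia datasets or its exact implementation.

```python
# Sketch of RUS, ROS, and a simplified SMOTE, each paired with a kNN
# classifier. Illustrative only; parameters and data are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def random_undersample(X, y):
    """RUS: shrink every class to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def random_oversample(X, y):
    """ROS: grow every class to the largest class size, sampling with replacement."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

def smote(X, y, k=5):
    """Simplified SMOTE: synthesise minority samples by interpolating
    between a sample and one of its k nearest same-class neighbours."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_out, y_out = [X], [y]
    for c in classes:
        Xc = X[y == c]
        need = n_max - len(Xc)
        if need == 0:
            continue
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(Xc))).fit(Xc)
        _, nbrs = nn.kneighbors(Xc)
        base = rng.integers(0, len(Xc), need)
        # pick a random neighbour (column 0 is the sample itself, so skip it)
        pick = nbrs[base, rng.integers(1, nbrs.shape[1], need)]
        gap = rng.random((need, 1))
        X_out.append(Xc[base] + gap * (Xc[pick] - Xc[base]))
        y_out.append(np.full(need, c))
    return np.vstack(X_out), np.concatenate(y_out)

# Imbalanced toy stand-in for a microarray task: many features, few samples.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, balance in [("none", lambda X, y: (X, y)),
                      ("RUS", random_undersample),
                      ("ROS", random_oversample),
                      ("SMOTE", smote)]:
    Xb, yb = balance(X_tr, y_tr)           # balance the training split only
    clf = KNeighborsClassifier(n_neighbors=5).fit(Xb, yb)
    print(name, round(accuracy_score(y_te, clf.predict(X_te)), 3))
```

Note that resampling is applied to the training split only, so the test set still reflects the original class distribution; the imbalanced-learn library provides production-quality versions of all three samplers.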

Published
2025-07-14
How to Cite
[1]
Febi Nur Salisah, Inggih Permana, Sanusi, and Shir Li Wang, “Evaluating the Impact of Data Balancing Techniques on the k-Nearest Neighbors Algorithm for Microarray Data Classification”, JI, vol. 10, no. 2, pp. 261-271, Jul. 2025.