This study compares the classification accuracy of text mining algorithms on foreign language proficiency exam items. The dataset comprised 2,868 items from ÜDS English tests (2006–2012), drawn equally from Natural and Applied Sciences (n=956), Health Sciences (n=956), and Social Sciences (n=956). The algorithms tested were k-Nearest Neighbors (kNN), Naïve Bayes (NB), Naïve Bayes-Kernel (NB-K), Random Forest (RF), and Support Vector Machines (SVM). Binary classification accuracies ranged from 83.08% (NB) to 92.48% (SVM), while multiclass accuracies ranged from 71.93% (NB) to 84.96% (kNN). Expert analysis and cross-validation identified class-inconsistent items that negatively affected accuracy. Removing these items improved binary classification accuracy by 7.39%–9.83% and multiclass classification accuracy by 10.58%–17.89%. Among the algorithms, kNN was least affected by class-inconsistent data. These findings highlight the importance of addressing class inconsistencies to improve algorithmic performance, with kNN showing robust results across scenarios.
Keywords: Text mining, Document classification, Class-inconsistent data, Robustness of classification algorithms
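A minimal sketch of the kind of comparison the abstract describes, not the authors' actual pipeline: it evaluates kNN, NB, RF, and SVM classifiers on a stand-in text corpus with 10-fold cross-validation, then flags items whose cross-validated prediction disagrees with their assigned class as candidate class-inconsistent items. The corpus (20 Newsgroups), TF-IDF features, MultinomialNB in place of NB-Kernel, and all parameter values are illustrative assumptions.

```python
# Illustrative sketch only; dataset, features, and parameters are assumptions,
# not the study's actual ÜDS item corpus or preprocessing.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Stand-in corpus: three topical classes, loosely analogous to the three ÜDS domains.
corpus = fetch_20newsgroups(subset="train",
                            categories=["sci.med", "sci.space", "talk.politics.misc"],
                            remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(corpus.data)
y = corpus.target

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "NB": MultinomialNB(),          # stands in for both NB and NB-Kernel here
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": LinearSVC(),
}

# Multiclass accuracy under 10-fold cross-validation for each algorithm.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Flag items whose cross-validated prediction disagrees with the assigned label;
# in the study, such items were additionally reviewed by experts before removal.
pred = cross_val_predict(LinearSVC(), X, y, cv=10)
suspect_idx = [i for i, (p, t) in enumerate(zip(pred, y)) if p != t]
print(f"Candidate class-inconsistent items: {len(suspect_idx)}")
```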
| Primary Language | English |
|---|---|
| Subjects | Software Engineering (Other) |
| Journal Section | Research Article |
| Authors | |
| Early Pub Date | September 24, 2025 |
| Publication Date | September 30, 2025 |
| Submission Date | January 24, 2025 |
| Acceptance Date | July 16, 2025 |
| Published in Issue | Year 2025 Volume: 8 Issue: 3 |
The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License