Research Article

Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data

Volume: 8 Number: 3 September 30, 2025

Abstract

This study compares the classification accuracy of text mining algorithms on foreign language proficiency exam items. The dataset comprised 2,868 items from ÜDS English tests (2006–2012), divided equally among Natural and Applied Sciences (n=956), Health Sciences (n=956), and Social Sciences (n=956). The algorithms tested were k-Nearest Neighbors (kNN), Naïve Bayes (NB), Naïve Bayes-Kernel (NB-K), Random Forest (RF), and Support Vector Machines (SVM). Binary classification accuracies ranged from 83.08% (NB) to 92.48% (SVM), while multiclass accuracies ranged from 71.93% (NB) to 84.96% (kNN). Expert analysis and cross-validation identified class-inconsistent items that lowered accuracy. Removing these items improved binary classification accuracy by 7.39%–9.83% and multiclass classification accuracy by 10.58%–17.89%. Of the five algorithms, kNN was the least affected by class-inconsistent data. These findings highlight the importance of resolving class inconsistencies to improve algorithmic performance, with kNN giving robust results across scenarios.
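
The idea of using cross-validation to surface class-inconsistent items can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy items, the whitespace tokenizer, the cosine-similarity kNN, the choice of k=3, and the leave-one-out scheme are all assumptions made for illustration, and the abstract does not state which validation scheme or features were used. The sketch flags items whose held-out prediction disagrees with their assigned class, then recomputes accuracy without them.

```python
from collections import Counter
import math

def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two token-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=3):
    # Majority vote among the k training vectors most similar to the query.
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_predictions(items, k=3):
    # Leave-one-out: predict each item from all the other items.
    vectors = [(Counter(tokenize(text)), label) for text, label in items]
    preds = []
    for i, (vec, _) in enumerate(vectors):
        rest = vectors[:i] + vectors[i + 1:]
        preds.append(knn_predict(vec, rest, k))
    return preds

def accuracy(items, preds):
    return sum(p == label for (_, label), p in zip(items, preds)) / len(items)

# Hypothetical toy items, not taken from the ÜDS dataset.
items = [
    ("patient vaccine dose clinic", "health"),
    ("vaccine infection patient", "health"),
    ("infection dose clinic vaccine", "health"),
    ("voters election policy trust", "social"),
    ("election policy debate", "social"),
    ("voters debate trust policy", "social"),
    ("patient infection vaccine dose", "social"),  # deliberately mislabeled
]

preds = loo_predictions(items)
acc_before = accuracy(items, preds)

# Candidate class-inconsistent items: held-out prediction disagrees with the label.
flagged = [i for i, ((_, label), p) in enumerate(zip(items, preds)) if p != label]

cleaned = [item for i, item in enumerate(items) if i not in flagged]
acc_after = accuracy(cleaned, loo_predictions(cleaned))

print(flagged, round(acc_before, 4), round(acc_after, 4))
```

On this toy data only the deliberately mislabeled item is flagged. Removing every misclassified item raises measured accuracy almost by construction, which is one reason a disagreement flag is better treated as a candidate for expert review than as proof of inconsistency; the study combined cross-validation with expert analysis.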


Details

Primary Language

English

Subjects

Software Engineering (Other)

Journal Section

Research Article

Early Pub Date

September 24, 2025

Publication Date

September 30, 2025

Submission Date

January 24, 2025

Acceptance Date

July 16, 2025

Published in Issue

Year 2025 Volume: 8 Number: 3

APA
Ataseven, H., & Çokluk-bökeoglu, Ö. (2025). Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. Sakarya University Journal of Computer and Information Sciences, 8(3), 422-440. https://doi.org/10.35377/saucis...1626239
AMA
1. Ataseven H, Çokluk-bökeoglu Ö. Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. SAUCIS. 2025;8(3):422-440. doi:10.35377/saucis.1626239
Chicago
Ataseven, Hüseyin, and Ömay Çokluk-bökeoglu. 2025. “Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data”. Sakarya University Journal of Computer and Information Sciences 8 (3): 422-40. https://doi.org/10.35377/saucis.1626239.
EndNote
Ataseven H, Çokluk-bökeoglu Ö (September 1, 2025) Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. Sakarya University Journal of Computer and Information Sciences 8 3 422–440.
IEEE
[1] H. Ataseven and Ö. Çokluk-bökeoglu, “Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data”, SAUCIS, vol. 8, no. 3, pp. 422–440, Sept. 2025, doi: 10.35377/saucis...1626239.
ISNAD
Ataseven, Hüseyin - Çokluk-bökeoglu, Ömay. “Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data”. Sakarya University Journal of Computer and Information Sciences 8/3 (September 1, 2025): 422-440. https://doi.org/10.35377/saucis.1626239.
JAMA
1. Ataseven H, Çokluk-bökeoglu Ö. Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. SAUCIS. 2025;8:422–440.
MLA
Ataseven, Hüseyin, and Ömay Çokluk-bökeoglu. “Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data”. Sakarya University Journal of Computer and Information Sciences, vol. 8, no. 3, Sept. 2025, pp. 422-40, doi:10.35377/saucis.1626239.
Vancouver
1. Hüseyin Ataseven, Ömay Çokluk-bökeoglu. Investigating the Robustness of Text Mining Classification Algorithms: A Study of Algorithm and Expert Performance on Class-Inconsistent Data. SAUCIS. 2025 Sep. 1;8(3):422-40. doi:10.35377/saucis.1626239

 

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.