Research Article

The Effects of Preprocessing on Turkish and English News Data

Volume: 6 Number: 1 April 30, 2023

Abstract

In a standard text classification (TC) study, preprocessing is one of the key components for improving performance. This study examines how preprocessing affects TC with respect to news text, text language, and feature selection. To this end, all possible combinations of commonly used preprocessing techniques are compared on one domain, namely news data, using two different news datasets. In this way, the contribution of each preprocessing technique to classification performance at multiple feature sizes, possible interactions among the techniques, and the dependency of the techniques on the corresponding language are all evaluated. Experimental studies on public datasets reveal that choosing the best combination of preprocessing techniques, rather than applying all of them or none of them, can improve classification accuracy significantly.
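As a rough illustration of what comparing all combinations of preprocessing techniques can look like in practice, the sketch below enumerates every on/off setting of a small set of steps and applies each to a sample sentence. The abstract does not list the techniques that were studied, so the three technique names, the tiny stop-word list, and the helper functions `preprocess` and `all_combinations` are assumptions chosen for illustration only, not the author's setup.

```python
from itertools import product

# Hypothetical technique names; the abstract does not say which ones were used.
TECHNIQUES = ("lowercase", "remove_stopwords", "strip_punctuation")

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "of", "and"}


def preprocess(text, lowercase, remove_stopwords, strip_punctuation):
    """Apply the enabled preprocessing steps and return a list of tokens."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Keep only letters, digits, and whitespace.
        text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    tokens = text.split()
    if remove_stopwords:
        # Compare case-insensitively so the step works with or without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def all_combinations():
    """Yield every on/off setting of the techniques (2**n settings in total)."""
    for flags in product((False, True), repeat=len(TECHNIQUES)):
        yield dict(zip(TECHNIQUES, flags))


if __name__ == "__main__":
    sample = "The Effects of Preprocessing, and More!"
    for combo in all_combinations():
        print(combo, preprocess(sample, **combo))
```

In a full experiment, each setting would feed a feature selection step and a classifier, and the resulting accuracies would be compared per language and per feature size. That part is omitted here because the abstract does not specify those components.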

Keywords

Details

Primary Language

English

Subjects

Computer Software, Software Engineering (Other)

Journal Section

Research Article

Early Pub Date

April 28, 2023

Publication Date

April 30, 2023

Submission Date

November 21, 2022

Acceptance Date

March 30, 2023

Published in Issue

Year 2023 Volume: 6 Number: 1

APA
Parlak, B. (2023). The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences, 6(1), 59-66. https://doi.org/10.35377/saucis.1207742
AMA
1.Parlak B. The Effects of Preprocessing on Turkish and English News Data. SAUCIS. 2023;6(1):59-66. doi:10.35377/saucis.1207742
Chicago
Parlak, Bekir. 2023. “The Effects of Preprocessing on Turkish and English News Data”. Sakarya University Journal of Computer and Information Sciences 6 (1): 59-66. https://doi.org/10.35377/saucis.1207742.
EndNote
Parlak B (April 1, 2023) The Effects of Preprocessing on Turkish and English News Data. Sakarya University Journal of Computer and Information Sciences 6 1 59–66.
IEEE
[1] B. Parlak, “The Effects of Preprocessing on Turkish and English News Data”, SAUCIS, vol. 6, no. 1, pp. 59–66, Apr. 2023, doi: 10.35377/saucis.1207742.
ISNAD
Parlak, Bekir. “The Effects of Preprocessing on Turkish and English News Data”. Sakarya University Journal of Computer and Information Sciences 6/1 (April 1, 2023): 59-66. https://doi.org/10.35377/saucis.1207742.
JAMA
1.Parlak B. The Effects of Preprocessing on Turkish and English News Data. SAUCIS. 2023;6:59–66.
MLA
Parlak, Bekir. “The Effects of Preprocessing on Turkish and English News Data”. Sakarya University Journal of Computer and Information Sciences, vol. 6, no. 1, Apr. 2023, pp. 59-66, doi:10.35377/saucis.1207742.
Vancouver
1.Bekir Parlak. The Effects of Preprocessing on Turkish and English News Data. SAUCIS. 2023 Apr. 1;6(1):59-66. doi:10.35377/saucis.1207742

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.