In a standard text classification (TC) study, preprocessing is one of the key components for improving performance. This study examines how preprocessing affects TC with respect to news text, text language, and feature selection. To this end, all possible combinations of commonly used preprocessing techniques are compared on one domain, news data, using two different news datasets. In this way, the contribution of each preprocessing technique to classification performance at multiple feature sizes, possible interactions among the techniques, and their dependency on the language of the text are all evaluated. Experimental studies on public datasets reveal that choosing the best combination of preprocessing techniques, rather than applying all or none of them, can improve classification accuracy significantly.
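As a rough illustration of what comparing every combination of preprocessing techniques involves, the following minimal Python sketch enumerates all on/off combinations of a small set of steps and applies one combination to a text. The abstract does not name the techniques, so the three steps here (`lowercase`, `remove_stop_words`, `strip_suffix`), the toy `STOP_WORDS` list, and all function names are assumptions made purely for illustration; they are not the study's actual pipeline.

```python
from itertools import product
import re

# Toy stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "and"}

def lowercase(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens that appear in the toy stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def strip_suffix(tokens):
    """Crude stand-in for stemming: strip a trailing 'ing' or 's' from longer tokens."""
    return [re.sub(r"(ing|s)$", "", t) if len(t) > 4 else t for t in tokens]

# Hypothetical set of preprocessing steps (not taken from the paper).
STEPS = [
    ("lowercase", lowercase),
    ("stop_words", remove_stop_words),
    ("stem", strip_suffix),
]

def all_combinations(steps):
    """Yield every on/off combination of the steps as a tuple of (name, function) pairs."""
    for mask in product([False, True], repeat=len(steps)):
        yield tuple(step for step, on in zip(steps, mask) if on)

def preprocess(text, combo):
    """Tokenize the text and apply the selected steps in order."""
    tokens = re.findall(r"\w+", text)
    for _, fn in combo:
        tokens = fn(tokens)
    return tokens

# With n steps there are 2**n combinations, including "apply none" and "apply all".
combos = list(all_combinations(STEPS))
```

In an experiment of this kind, each combination would be applied to the training and test documents before feature selection and classification, so that accuracy can be compared across combinations, feature sizes, and languages.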
Primary Language | English |
---|---|
Subjects | Computer Software, Software Engineering (Other) |
Journal Section | Articles |
Authors | |
Early Pub Date | April 28, 2023 |
Publication Date | April 30, 2023 |
Submission Date | November 21, 2022 |
Acceptance Date | March 30, 2023 |
Published in Issue | Year 2023 |
The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License