Research Article

Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM

Volume: 5 Number: 1 April 30, 2022

Abstract

Document classification is a long-studied problem that continues to attract research. As social media has become part of daily life, and as it is increasingly misused, text classification has grown in importance. This paper investigates how data augmentation through sentence generation affects classification performance on an imbalanced dataset. We propose an LSTM-based sentence generation method for the minority class, represent the texts with Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec, and apply Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Multilayer Perceptron (MLP), Extremely Randomized Trees (Extra Trees), Random Forest, eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), and Bagging classifiers. Our experimental results on the imbalanced Offensive Language Identification Dataset (OLID) show that machine learning with sentence generation significantly outperforms machine learning without it.
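The augmentation idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: to stay self-contained it swaps the LSTM generator for a simple bigram sampler, and the toy sentences, the `OFF`/`NOT` labels, and the function names (`build_bigram_model`, `generate_sentence`, `augment_minority`) are all assumptions made for illustration. It only shows the general shape of the step: learn a word-level model from minority-class text, generate new sentences, and add them until the classes are balanced.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to the words that follow it in the minority-class text."""
    model = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, following in zip(tokens, tokens[1:]):
            model[current].append(following)
    return model


def generate_sentence(model, rng, max_len=20):
    """Sample one sentence word by word (stand-in for an LSTM generator)."""
    word, words = "<s>", []
    while len(words) < max_len:
        word = rng.choice(model[word])
        if word == "</s>":
            break
        words.append(word)
    return " ".join(words)


def augment_minority(texts, labels, minority_label, seed=0):
    """Add generated minority-class sentences until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    model = build_bigram_model(minority)
    deficit = max(counts.values()) - counts[minority_label]
    generated = [generate_sentence(model, rng) for _ in range(deficit)]
    return texts + generated, labels + [minority_label] * deficit


# Hypothetical toy data: four majority-class and two minority-class sentences.
texts = [
    "have a nice day",
    "see you at the game",
    "great talk today",
    "thanks for the help",
    "you are a fool",
    "you are so annoying",
]
labels = ["NOT", "NOT", "NOT", "NOT", "OFF", "OFF"]

aug_texts, aug_labels = augment_minority(texts, labels, minority_label="OFF")
print(Counter(aug_labels))
```

The balanced `aug_texts` and `aug_labels` would then be vectorized (for example with TF-IDF or Word2vec) and passed to the classifiers listed above.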

Details

Primary Language

English

Subjects

Artificial Intelligence

Journal Section

Research Article

Publication Date

April 30, 2022

Submission Date

February 10, 2022

Acceptance Date

April 18, 2022

Published in Issue

Year 2022 Volume: 5 Number: 1

APA
Ekinci, E. (2022). Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM. Sakarya University Journal of Computer and Information Sciences, 5(1), 121-133. https://doi.org/10.35377/saucis.1070822
AMA
1. Ekinci E. Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM. SAUCIS. 2022;5(1):121-133. doi:10.35377/saucis.1070822
Chicago
Ekinci, Ekin. 2022. “Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class With LSTM”. Sakarya University Journal of Computer and Information Sciences 5 (1): 121-33. https://doi.org/10.35377/saucis.1070822.
EndNote
Ekinci E (April 1, 2022) Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM. Sakarya University Journal of Computer and Information Sciences 5 1 121–133.
IEEE
[1] E. Ekinci, “Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM”, SAUCIS, vol. 5, no. 1, pp. 121–133, Apr. 2022, doi: 10.35377/saucis.1070822.
ISNAD
Ekinci, Ekin. “Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class With LSTM”. Sakarya University Journal of Computer and Information Sciences 5/1 (April 1, 2022): 121-133. https://doi.org/10.35377/saucis.1070822.
JAMA
1. Ekinci E. Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM. SAUCIS. 2022;5:121–133.
MLA
Ekinci, Ekin. “Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class With LSTM”. Sakarya University Journal of Computer and Information Sciences, vol. 5, no. 1, Apr. 2022, pp. 121-33, doi:10.35377/saucis.1070822.
Vancouver
1. Ekin Ekinci. Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM. SAUCIS. 2022 Apr. 1;5(1):121-33. doi:10.35377/saucis.1070822

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.