Research Article

Offensive Language Detection from Turkish Tweets with Deep and Shallow Machine Learning Methods

Year 2023, Volume: 16 Issue: 1, 1 - 10, 29.06.2023
https://doi.org/10.54525/tbbmd.1169009

Abstract

Hate speech is the general term for speech that expresses hatred towards a person or a group, or that encourages violence. Such speech has recently increased uncontrollably in digital environments. Written hate speech, especially on social media platforms such as Twitter, has reached dangerous levels for both individuals and communities. Preventing hate speech from spreading easily and quickly in digital environments requires systems that can detect it automatically. This study addresses artificial intelligence models that automatically detect 'offensive' speech, one of the most common forms of hate speech. Deep and shallow machine learning methods are compared on the task of classifying Turkish tweets into two categories: offensive or not offensive. On a dataset with a class imbalance of approximately 75%-25%, the developed models reach an accuracy of 0.85 and an F-score of 0.74. The study compares the classification results of shallow models trained on term frequency-inverse document frequency (tf-idf) vectors of the tweets with those of deep models trained on word embeddings. The experiments show that the offensive speech detection model built with the Bidirectional Long Short-Term Memory (BiLSTM) technique produces better results than the shallow methods and some of the other deep learning methods.
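
The abstract reports both accuracy and F-score on a dataset with an imbalance of roughly 75%-25%. The sketch below illustrates why both figures matter on such data. It is not the authors' evaluation code: the labels are hypothetical toy data, the helper names `accuracy` and `f1_for_class` are invented for illustration, and the F-score is assumed to be computed for the offensive class (the abstract does not state which averaging was used).

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so the score is reported as 0.0 here by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 75 non-offensive (0) and 25 offensive (1) tweets,
# mirroring only the approximate class ratio mentioned in the abstract.
y_true = [0] * 75 + [1] * 25

# A trivial baseline that always predicts the majority class.
majority_pred = [0] * 100

baseline_accuracy = accuracy(y_true, majority_pred)  # 0.75
baseline_f1 = f1_for_class(y_true, majority_pred)    # 0.0
```

Under these assumptions, a classifier that never flags a tweet as offensive still scores 0.75 on accuracy while its F-score for the offensive class is 0.0, which is why an accuracy figure on imbalanced data is usually read alongside an F-score.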

Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti

There are 45 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Articles (Research)
Authors

Pelin Canbay 0000-0002-8067-3365

Ekin Ekinci 0000-0003-0658-592X

Early Pub Date June 29, 2023
Publication Date June 29, 2023
Published in Issue Year 2023 Volume: 16 Issue: 1

Cite

APA Canbay, P., & Ekinci, E. (2023). Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, 16(1), 1-10. https://doi.org/10.54525/tbbmd.1169009
AMA Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. June 2023;16(1):1-10. doi:10.54525/tbbmd.1169009
Chicago Canbay, Pelin, and Ekin Ekinci. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi 16, no. 1 (June 2023): 1-10. https://doi.org/10.54525/tbbmd.1169009.
EndNote Canbay P, Ekinci E (June 1, 2023) Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16 1 1–10.
IEEE P. Canbay and E. Ekinci, “Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti”, TBV-BBMD, vol. 16, no. 1, pp. 1–10, 2023, doi: 10.54525/tbbmd.1169009.
ISNAD Canbay, Pelin - Ekinci, Ekin. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16/1 (June 2023), 1-10. https://doi.org/10.54525/tbbmd.1169009.
JAMA Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. 2023;16:1–10.
MLA Canbay, Pelin and Ekin Ekinci. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, vol. 16, no. 1, 2023, pp. 1-10, doi:10.54525/tbbmd.1169009.
Vancouver Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. 2023;16(1):1-10.

Article Acceptance

Register or log in as a user to submit articles online.

The acceptance process of the articles sent to the journal consists of the following stages:

1. Each submitted article is sent to at least two referees at the first stage.

2. Referee appointments are made by the journal editors. The journal's referee pool contains approximately 200 referees, who are classified by area of interest. Each referee is sent articles on subjects in their area of interest. Referees are selected so that no conflict of interest arises.

3. In the articles sent to the referees, the authors' names are concealed.

4. Referees are told how to evaluate an article and are asked to complete the evaluation form.

5. Articles that receive a positive opinion from two referees undergo a similarity review by the editors. The similarity in an article is expected to be less than 25%.

6. A paper that has passed all stages is reviewed by the editor for language and presentation, and any necessary corrections and improvements are made. If necessary, the authors are notified.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.