Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti

Pelin Canbay; Ekin Ekinci

doi:10.54525/tbbmd.1169009

Research Article

Offensive Language Detection from Turkish Tweets with Deep and Shallow Machine Learning Methods

Year 2023, Volume: 16 Issue: 1, 1 - 10, 29.06.2023


Abstract

Hate speech is the general term for speech that expresses hatred towards a person or a group, or that encourages violence. Such speech has recently increased uncontrollably in digital environments. Written hate speech, especially on social media platforms such as Twitter, has reached dangerous levels for both individuals and communities. Preventing hate speech from spreading easily and quickly in digital environments requires systems that can detect it automatically. This study addresses artificial intelligence models that automatically detect 'offensive' speech, one of the most common forms of hate speech. Deep and shallow machine learning methods are compared on the task of classifying Turkish tweets into two categories: offensive or not offensive. The models, developed on a dataset with a class imbalance of approximately 75%-25%, achieve an accuracy of 0.85 and an F-score of 0.74. The study compares the classification results of shallow models trained on term frequency-inverse document frequency (tf-idf) vectors of the tweets with those of deep models trained on word embeddings. The experiments show that the offensive speech detection model built with the Bidirectional Long Short-Term Memory (BiLSTM) technique produces better results than the shallow methods and some of the other deep learning methods.
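The abstract reports both accuracy and F-score because the dataset is imbalanced. The minimal Python sketch below illustrates why both are useful on an approximately 75%-25% split. It assumes the F-score is computed for the offensive class (the abstract does not state the averaging scheme), and the labels and predictions are toy values, not the study's data or results. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy labels with a 75%-25% split: 0 = not offensive, 1 = offensive.
y_true = [0] * 75 + [1] * 25

# Baseline that always predicts the majority class (not offensive).
majority_pred = [0] * 100
baseline = binary_metrics(y_true, majority_pred)
print(baseline)  # accuracy 0.75, F1 0.0 for the offensive class

# Hypothetical classifier (toy numbers): 5 false alarms, 5 missed tweets.
toy_pred = [0] * 70 + [1] * 5 + [1] * 20 + [0] * 5
toy = binary_metrics(y_true, toy_pred)
print(toy)
```

The majority-class baseline already reaches 0.75 accuracy on such a split while detecting no offensive tweets at all, which is why an F-score is reported alongside accuracy.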

Keywords

Deep learning, Machine learning, Hate speech, Offensive speech, BiLSTM

References

Statista, Number of social network users in selected countries in 2017 and 2022 (in millions), Statista, 2017
Fortuna P., Nunes S., A survey on automatic detection of hate speech in text, ACM Comput Surv, 2018, 51
T.D.K., Türk Dil Kurumu, Türk Tarih Kurumu Basımevi, 1954
Evans M., Weber A., Council of Europe Manuals - Human Rights in Culturally Diverse Societies (2 vols.), 2010
Burnap P., Williams M.L., Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making, Policy Internet, 2015, 7
Sahi H., Kilic Y., Saglam R.B., Automated Detection of Hate Speech towards Woman on Twitter, In: UBMK 2018 - 3rd International Conference on Computer Science and Engineering, 2018
Dağaşan T., Automatic hate speech detection on social media: Turkish tweets as an example, 2019
Hüsünbeyi Z.M., Detecting hate speech in Turkish texts, 2020
Mayda İ., Diri B., Yıldız T., Türkçe Tweetler üzerinde Makine Öğrenmesi ile Nefret Söylemi Tespiti, European Journal of Science and Technology, 2021
Zampieri M., Nakov P., Rosenthal S., Atanasova P., Karadzhov G., Mubarak H., et al., SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020), ArXiv, 2020
Charitidis P., Doropoulos S., Vologiannidis S., Papastergiou I., Karakeva S., Towards countering hate speech against journalists on social media, Online Soc Netw Media, 2020, 17
Guellil I., Adeel A., Azouaou F., Chennoufi S., Maafi H., Hamitouche T., Detecting hate speech against politicians in Arabic community on social media, International Journal of Web Information Systems, 2020, 16
Pitropakis N., Kokot K., Gkatzia D., Ludwiniak R., Mylonas A., Kandias M., Monitoring Users’ Behavior: Anti-Immigration Speech Detection on Twitter, Mach Learn Knowl Extr, 2020, 2
Pronoza E., Panicheva P., Koltsova O., Rosso P., Detecting ethnicity-targeted hate speech in Russian social media texts, Inf Process Manag, 2021, 58
Jiang A., Yang X., Liu Y., Zubiaga A., SWSR: A Chinese dataset and lexicon for online sexism detection, Online Soc Netw Media, 2022, 27
Chiril P., Moriceau V., Benamara F., Mari A., Origgi G., Coulomb-Gully M., An annotated corpus for sexism detection in French tweets, In: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings, 2020
Parikh P., Abburi H., Badjatiya P., Krishnan R., Chhaya N., Gupta M., et al., Multi-label categorization of accounts of sexism using a neural framework, In: EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019
Wullach T., Adler A., Minkov E., Character-level HyperNetworks for Hate Speech Detection, Expert Syst Appl, 2022, 205, 117571
Wu X.-K., Zhao T.-F., Lu L., Chen W.-N., Predicting the Hate: A GSTM Model based on COVID-19 Hate Speech Datasets, Inf Process Manag, 2022, 59, 102998
Plaza-del-Arco F.M., Molina-González M.D., Ureña-López L.A., Martín-Valdivia M.T., Comparing pre-trained language models for Spanish hate speech detection, Expert Syst Appl, 2021, 166
García-Díaz J.A., Jiménez-Zafra S.M., García-Cumbreras M.A., Valencia-García R., Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic features and transformers, Complex & Intelligent Systems, 2022
Duwairi R., Hayajneh A., Quwaider M., A Deep Learning Framework for Automatic Detection of Hate Speech Embedded in Arabic Tweets, Arab J Sci Eng, 2021, 46
Al-Hassan A., Al-Dossari H., Detection of hate speech in Arabic tweets using deep learning, In: Multimedia Systems, 2021
Kalra S., Agrawal M., Sharma Y., Detection of Threat Records by Analyzing the Tweets in Urdu Language Exploring Deep Learning Transformer, In: Forum for Information Retrieval Evaluation, 2021
Ali R., Farooq U., Arshad U., Shahzad W., Beg M.O., Hate speech detection on Twitter using transfer learning, Comput Speech Lang, 2022, 74
Karayiğit H., Akdagli A., Aci Ç.İ., Homophobic and Hate Speech Detection Using Multilingual-BERT Model on Turkish Social Media, Information Technology and Control, 2022, 51, 356–375
Toraman C., Şahinuç F., Yilmaz E.H., Large-Scale Hate Speech Detection with Cross-Domain Transfer, ArXiv, 2022
Aizawa A., An information-theoretic perspective of tf–idf measures, Inf Process Manag, 2003, 39, 45–65
Canbay P., Sezer E.A., Detection of Stylometric Writeprint from the Turkish Texts, In: 2020 28th Signal Processing and Communications Applications Conference, SIU 2020 - Proceedings, 2020
Wang S., Zhou W., Jiang C., A survey of word embeddings based on deep learning, Computing, 2020, 102
Mikolov T., Chen K., Corrado G., Dean J., Efficient estimation of word representations in vector space, In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, International Conference on Learning Representations, ICLR, 2013
Pennington J., Socher R., Manning C.D., GloVe: Global vectors for word representation, In: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2014, 1532–1543
Bojanowski P., Grave E., Joulin A., Mikolov T., Enriching Word Vectors with Subword Information, Trans Assoc Comput Linguist, 2017, 5
Ekinci E., Classification of Imbalanced Offensive Dataset – Sentence Generation for Minority Class with LSTM, Sakarya University Journal of Computer and Information Sciences, 2022
Küçüksille E.U., Ateş N., Destek Vektör Makineleri ile Yaramaz Elektronik Postaların Filtrelenmesi, Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 2013
Soygazi F., Mostafapour V., Inan E., TurkiS: A Turkish Sentiment Analyzer Using Domain-specific Automatic Labelled Dataset, International Journal of Intelligent Systems and Applications in Engineering, 2019, 7
Ganaie M.A., Tanveer M., Suganthan P.N., Snasel V., Oblique and rotation double random forest, Neural Networks, 2022, 153, 496–517
Yakowitz S., Nearest-neighbour methods for time series analysis, J Time Ser Anal, 1987, 8
Ekinci E., Takcı H., Alagöz S., Poet Classification Using ANN and DNN, Electronic Letters on Science and Engineering, 2022
Albawi S., Mohammed T.A., Al-Zawi S., Understanding of a convolutional neural network, In: Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017, 2018
Siami-Namini S., Tavakoli N., Namin A.S., The Performance of LSTM and BiLSTM in Forecasting Time Series, In: Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019, 2019
Ekinci E., İlhan Omurca S., Özbay B., Comparative assessment of modeling deep learning networks for modeling ground-level ozone concentrations of pandemic lock-down period, Ecol Modell, 2021, 457
Graves A., Schmidhuber J., Framewise phoneme classification with bidirectional LSTM and other neural network architectures, In: Neural Networks, 2005
Zhang X., Li R., Dai H., Liu Y., Zhou B., Wang Z., Localization of myocardial infarction with multi-lead bidirectional gated recurrent unit neural network, IEEE Access, 2019, 7
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., ... Duchesnay E., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 2011, 12, 2825–2830

There are 45 citations in total.

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Articles (Research)
Authors	Pelin Canbay 0000-0002-8067-3365 Ekin Ekinci 0000-0003-0658-592X
Early Pub Date	June 29, 2023
Publication Date	June 29, 2023
Published in Issue	Year 2023 Volume: 16 Issue: 1

Cite

APA	Canbay, P., & Ekinci, E. (2023). Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, 16(1), 1-10. https://doi.org/10.54525/tbbmd.1169009
AMA	Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. June 2023;16(1):1-10. doi:10.54525/tbbmd.1169009
Chicago	Canbay, Pelin, and Ekin Ekinci. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi 16, no. 1 (June 2023): 1-10. https://doi.org/10.54525/tbbmd.1169009.
EndNote	Canbay P, Ekinci E (June 1, 2023) Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16 1 1–10.
IEEE	P. Canbay and E. Ekinci, “Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti”, TBV-BBMD, vol. 16, no. 1, pp. 1–10, 2023, doi: 10.54525/tbbmd.1169009.
ISNAD	Canbay, Pelin - Ekinci, Ekin. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi 16/1 (June 2023), 1-10. https://doi.org/10.54525/tbbmd.1169009.
JAMA	Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. 2023;16:1–10.
MLA	Canbay, Pelin and Ekin Ekinci. “Derin Ve Sığ Makine Öğrenmesi Yöntemleri Ile Türkçe Tweetlerden Saldırgan Dil Tespiti”. Türkiye Bilişim Vakfı Bilgisayar Bilimleri Ve Mühendisliği Dergisi, vol. 16, no. 1, 2023, pp. 1-10, doi:10.54525/tbbmd.1169009.
Vancouver	Canbay P, Ekinci E. Derin ve Sığ Makine Öğrenmesi Yöntemleri ile Türkçe Tweetlerden Saldırgan Dil Tespiti. TBV-BBMD. 2023;16(1):1-10.

Article Acceptance

To upload an article online, register or log in as a user.

The acceptance process for articles submitted to the journal consists of the following stages:

1. Each submitted article is first sent to at least two referees.

2. Referees are appointed by the journal editors. The journal's referee pool contains approximately 200 referees, classified by area of interest. Each referee is sent articles in their own area of interest. Referees are selected so that no conflict of interest arises.

3. The authors' names are concealed in the articles sent to the referees.

4. Referees are told how to evaluate an article and are asked to fill in the evaluation form.

5. Articles that receive a positive opinion from two referees are checked for similarity by the editors. The similarity rate is expected to be below 25%.

6. An article that has passed all stages is reviewed by the editor for language and presentation, and the necessary corrections and improvements are made. The authors are notified where necessary.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.