Research Article

An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology

Volume: 5 Number: 2 August 31, 2022

Abstract

This study presents an approach for recognizing the sound environment class of a video and for understanding its spoken content together with its sentimental context, based on processing audio-visual content with a multimodal deep learning methodology. The approach begins by using deep learning to cut out the parts of a given video in which the most action occurs; these parts are concatenated into a new video clip. A deep learning network model previously trained for sound recognition then predicts the sound class. The model was trained on sound clips from ten different categories, which were selected according to where action is most likely to occur. Next, to strengthen the sound recognition result, any speech present in the new video is extracted. Using Natural Language Processing (NLP) and Named Entity Recognition (NER), this speech is categorized according to whether its words carry a connotation of any of the ten categories. Sentiment analysis and the Apriori algorithm from Association Rule Mining (ARM) are then applied to identify the frequent categories in the concatenated video and to help define the relationships among them. According to the highest performance evaluation values from our experiments, the accuracy of sound environment recognition for a given video's processed scene is 70%, and the average Bilingual Evaluation Understudy (BLEU) score for speech-to-text with the VOSK speech recognition toolkit is 90% for the English language model and 81% for the Turkish language model. A discussion and conclusion based on the scientific findings are included in our study.
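As an illustration of the frequent-category step, the following is a minimal sketch of Apriori frequent-itemset mining over per-segment category labels. It is not the authors' implementation: the category names (`traffic`, `crowd`, `siren`), the segment data, and the 0.5 support threshold are hypothetical, since the abstract does not list the ten categories or the parameters used.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every distinct item as a one-element set.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # Keep only the candidates that occur in enough transactions.
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join step: merge frequent size-k sets into size-(k+1) candidates.
        k += 1
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # Prune step: every size-(k-1) subset must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return result


# Hypothetical category labels detected in four video segments.
segments = [
    {"traffic", "crowd"},
    {"traffic", "crowd", "siren"},
    {"traffic", "siren"},
    {"crowd"},
]
result = frequent_itemsets(segments, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Association rules between categories could then be derived from these itemsets by comparing the support of an itemset with the support of its subsets.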


Details

Primary Language

English

Subjects

Artificial Intelligence

Journal Section

Research Article

Publication Date

August 31, 2022

Submission Date

July 2, 2022

Acceptance Date

July 6, 2022

Published in Issue

Year 2022 Volume: 5 Number: 2

APA
Boztepe, E. B., Karakaya, B., Karasulu, B., & Ünlü, İ. (2022). An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology. Sakarya University Journal of Computer and Information Sciences, 5(2), 181-207. https://doi.org/10.35377/saucis.1139765
AMA
1. Boztepe EB, Karakaya B, Karasulu B, Ünlü İ. An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology. SAUCIS. 2022;5(2):181-207. doi:10.35377/saucis.1139765
Chicago
Boztepe, Emre Beray, Bedirhan Karakaya, Bahadir Karasulu, and İsmet Ünlü. 2022. “An Approach for Audio-Visual Content Understanding of Video Using Multimodal Deep Learning Methodology”. Sakarya University Journal of Computer and Information Sciences 5 (2): 181-207. https://doi.org/10.35377/saucis.1139765.
EndNote
Boztepe EB, Karakaya B, Karasulu B, Ünlü İ (August 1, 2022) An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology. Sakarya University Journal of Computer and Information Sciences 5 2 181–207.
IEEE
[1] E. B. Boztepe, B. Karakaya, B. Karasulu, and İ. Ünlü, “An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology”, SAUCIS, vol. 5, no. 2, pp. 181–207, Aug. 2022, doi: 10.35377/saucis.1139765.
ISNAD
Boztepe, Emre Beray - Karakaya, Bedirhan - Karasulu, Bahadir - Ünlü, İsmet. “An Approach for Audio-Visual Content Understanding of Video Using Multimodal Deep Learning Methodology”. Sakarya University Journal of Computer and Information Sciences 5/2 (August 1, 2022): 181-207. https://doi.org/10.35377/saucis.1139765.
JAMA
1. Boztepe EB, Karakaya B, Karasulu B, Ünlü İ. An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology. SAUCIS. 2022;5:181–207.
MLA
Boztepe, Emre Beray, et al. “An Approach for Audio-Visual Content Understanding of Video Using Multimodal Deep Learning Methodology”. Sakarya University Journal of Computer and Information Sciences, vol. 5, no. 2, Aug. 2022, pp. 181-207, doi:10.35377/saucis.1139765.
Vancouver
1. Emre Beray Boztepe, Bedirhan Karakaya, Bahadir Karasulu, İsmet Ünlü. An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology. SAUCIS. 2022 Aug. 1;5(2):181-207. doi:10.35377/saucis.1139765

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.