Research Article

Deep Gated Recurrent Unit for Smartphone-Based Image Captioning

Volume: 4 Number: 2 August 31, 2021
Abstract

Expressing the visual content of an image in natural language has gained relevance due to technological and algorithmic advances together with improved computational processing capacity. Many smartphone applications for image captioning have been developed recently, as built-in cameras offer easy operation and portability, allowing an image to be captured whenever and wherever needed. Here, a new image captioning approach based on an encoder-decoder framework with a multi-layer gated recurrent unit is proposed. The Inception-v3 convolutional neural network is employed in the encoder due to its ability to extract more features from small regions. The proposed recurrent neural network-based decoder feeds these features into the multi-layer gated recurrent unit to produce a natural language expression word-by-word. Experimental evaluations on the MSCOCO dataset demonstrate that the proposed approach consistently outperforms existing approaches across different evaluation metrics. Integrated into our custom-designed Android application, named “VirtualEye+”, the approach has great potential to bring image captioning into daily use.
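The decoder described above stacks gated recurrent unit (GRU) layers and emits one word per time step, with the CNN image feature seeding the recurrent state. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the layer sizes, random weights, greedy word selection, and the use of the image feature as the initial hidden state are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRULayer:
    """Single GRU layer following the standard update/reset-gate equations."""
    def __init__(self, input_dim, hidden_dim):
        s = 1.0 / np.sqrt(hidden_dim)
        # Stacked weights for the update (z), reset (r), and candidate gates.
        self.W = rng.uniform(-s, s, (3, hidden_dim, input_dim))
        self.U = rng.uniform(-s, s, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])        # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])        # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * h + z * h_tilde  # interpolate old and candidate state

class MultiLayerGRUDecoder:
    """Stack of GRU layers that emits one word index per time step."""
    def __init__(self, embed_dim, hidden_dim, vocab_size, num_layers=2):
        dims = [embed_dim] + [hidden_dim] * num_layers
        self.layers = [GRULayer(dims[i], dims[i + 1]) for i in range(num_layers)]
        self.embed = rng.normal(0, 0.1, (vocab_size, embed_dim))  # word embeddings
        self.out = rng.normal(0, 0.1, (vocab_size, hidden_dim))   # output projection

    def decode(self, image_feature, max_len=5, start_token=0):
        # Image feature (e.g. projected Inception-v3 output) seeds every hidden state.
        states = [image_feature.copy() for _ in self.layers]
        word, caption = start_token, []
        for _ in range(max_len):
            x = self.embed[word]
            for i, layer in enumerate(self.layers):
                states[i] = layer.step(x, states[i])
                x = states[i]                        # layer i feeds layer i + 1
            word = int(np.argmax(self.out @ x))      # greedy word-by-word choice
            caption.append(word)
        return caption

decoder = MultiLayerGRUDecoder(embed_dim=16, hidden_dim=32, vocab_size=10)
feature = rng.normal(0, 1, 32)  # stand-in for a 32-d projected CNN feature
caption = decoder.decode(feature)
print(caption)  # five word indices from the toy 10-word vocabulary
```

A trained system would replace the random weights with learned parameters and typically use beam search rather than the greedy argmax shown here.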


Details

Primary Language

English

Subjects

Artificial Intelligence

Journal Section

Research Article

Publication Date

August 31, 2021

Submission Date

January 22, 2021

Acceptance Date

May 13, 2021

Published in Issue

Year 2021 Volume: 4 Number: 2

APA
Kılıç, V. (2021). Deep Gated Recurrent Unit for Smartphone-Based Image Captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191. https://doi.org/10.35377/saucis.04.02.866409

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.