Research Article

Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes

Volume: 9 Number: 1 March 16, 2026

Abstract

This study presents the first comprehensive benchmark of seven open-source multimodal vision-language models with Turkish language support—namely, Aya Vision 32B, Gemma 3 27B, InternVL3 38B, Qwen2-VL 72B-AWQ, Qwen2.5-VL 72B-AWQ, Cosmos-LLaVA, and Phi-4 Multimodal—on two image datasets of Turkish cuisine, TurkishFoods-15 and TurkishFoods-25. All models were evaluated zero-shot, without additional training or fine-tuning, using fully standardized Turkish system and user prompts. We report macro and weighted averages of accuracy, precision, recall, and F1-score, along with end-to-end inference time. Aya Vision 32B obtained the best weighted F1-score (85.9%) on TurkishFoods-15, whereas Gemma 3 27B led on TurkishFoods-25 (76.7%). Across metrics and datasets, Aya Vision 32B, Gemma 3 27B, Qwen2-VL 72B-AWQ, and InternVL3 38B emerged as the most reliable models. These results establish a solid reference for future work on culturally aware multimodal AI and demonstrate, for the first time, that vision-language models can categorize Turkish dishes without task-specific training.
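The macro and weighted averages reported above differ only in how per-class scores are combined: macro averaging weights every class equally, while weighted averaging weights each class by its support (number of true samples), which matters on imbalanced datasets. A minimal pure-Python sketch of the distinction, using illustrative dish labels rather than the paper's actual data:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro (unweighted) and weighted (support-weighted) averages."""
    classes = sorted(set(y_true))
    support = Counter(y_true)  # true-sample count per class
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted

# Illustrative, imbalanced toy split (not from the benchmark datasets)
y_true = ["lahmacun", "lahmacun", "lahmacun", "baklava", "menemen"]
y_pred = ["lahmacun", "lahmacun", "baklava", "baklava", "menemen"]
per_class, macro, weighted = f1_scores(y_true, y_pred)
```

Because "lahmacun" has three times the support of the other classes, its F1 pulls the weighted average away from the macro average, illustrating why the paper reports both.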

Details

Primary Language

English

Subjects

Software Testing, Verification and Validation; Software Engineering (Other)

Journal Section

Research Article

Early Pub Date

March 16, 2026

Publication Date

March 16, 2026

Submission Date

June 26, 2025

Acceptance Date

November 19, 2025

Published in Issue

Year 2026 Volume: 9 Number: 1

APA
Bıçakçı, Y. S. (2026). Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes. Sakarya University Journal of Computer and Information Sciences, 9(1), 119-133. https://doi.org/10.35377/saucis.1727583
AMA
1. Bıçakçı YS. Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes. SAUCIS. 2026;9(1):119-133. doi:10.35377/saucis.1727583
Chicago
Bıçakçı, Yunus Serhat. 2026. “Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes”. Sakarya University Journal of Computer and Information Sciences 9 (1): 119-33. https://doi.org/10.35377/saucis.1727583.
EndNote
Bıçakçı YS (March 1, 2026) Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes. Sakarya University Journal of Computer and Information Sciences 9 1 119–133.
IEEE
[1] Y. S. Bıçakçı, “Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes”, SAUCIS, vol. 9, no. 1, pp. 119–133, Mar. 2026, doi: 10.35377/saucis.1727583.
ISNAD
Bıçakçı, Yunus Serhat. “Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes”. Sakarya University Journal of Computer and Information Sciences 9/1 (March 1, 2026): 119-133. https://doi.org/10.35377/saucis.1727583.
JAMA
1. Bıçakçı YS. Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes. SAUCIS. 2026;9:119–133.
MLA
Bıçakçı, Yunus Serhat. “Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes”. Sakarya University Journal of Computer and Information Sciences, vol. 9, no. 1, Mar. 2026, pp. 119-33, doi:10.35377/saucis.1727583.
Vancouver
1. Bıçakçı YS. Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes. SAUCIS. 2026 Mar. 1;9(1):119-33. doi:10.35377/saucis.1727583

The papers in this journal are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.