Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes
Abstract
This study presents the first comprehensive benchmark of seven open-source multimodal vision-language models with Turkish language support (Aya Vision 32B, Gemma 3 27B, InternVL3 38B, Qwen2-VL 72B-AWQ, Qwen2.5-VL 72B-AWQ, Cosmos-LLaVA, and Phi-4 Multimodal) on two image datasets of Turkish cuisine, TurkishFoods-15 and TurkishFoods-25. All models were evaluated zero-shot, without any additional training or fine-tuning, using fully standardized Turkish system and user prompts. We report macro- and weighted-averaged accuracy, precision, recall, and F1-score, together with end-to-end inference time. Aya Vision 32B achieved the highest weighted F1-score on TurkishFoods-15 (85.9%), whereas Gemma 3 27B led on TurkishFoods-25 (76.7%). Across metrics and datasets, Aya Vision 32B, Gemma 3 27B, Qwen2-VL 72B-AWQ, and InternVL3 38B consistently emerged as the most reliable models. These results establish a solid reference point for future work on culturally aware multimodal AI and demonstrate, for the first time, that vision-language models can classify Turkish dishes without task-specific training.
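As an illustration only (not the authors' released code), the sketch below shows how the macro- and weighted-averaged metrics reported above could be computed with scikit-learn once each model's zero-shot predictions have been collected; the dish labels are hypothetical stand-ins for TurkishFoods-15 classes, and the prediction step itself is assumed to have already run.

```python
# Illustrative sketch: macro- and weighted-averaged precision/recall/F1
# plus overall accuracy from ground-truth labels and model predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels standing in for TurkishFoods-15 classes and for
# one model's zero-shot outputs (placeholders, not real results).
y_true = ["baklava", "lahmacun", "menemen", "baklava", "lahmacun"]
y_pred = ["baklava", "lahmacun", "lahmacun", "baklava", "menemen"]

print(f"accuracy={accuracy_score(y_true, y_pred):.3f}")
for avg in ("macro", "weighted"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```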
Keywords
Details
Primary Language
English
Subjects
Software Testing, Verification and Validation; Software Engineering (Other)
Journal Section
Research Article
Authors
Early Pub Date
March 16, 2026
Publication Date
March 16, 2026
Submission Date
June 26, 2025
Acceptance Date
November 19, 2025
Published in Issue
Year 2026 Volume: 9 Number: 1
