Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes
Abstract
This study presents the first comprehensive benchmark of seven open-source multimodal vision-language models with Turkish language support (Aya Vision 32B, Gemma 3 27B, InternVL3 38B, Qwen2-VL 72B-AWQ, Qwen2.5-VL 72B-AWQ, Cosmos-LLaVA, and Phi-4 Multimodal) on two image datasets of Turkish cuisine, TurkishFoods-15 and TurkishFoods-25. All models were evaluated zero-shot, without any additional training or fine-tuning, using fully standardized Turkish system and user prompts. We report macro- and weighted-averaged accuracy, precision, recall, and F1-score, together with end-to-end inference time. Aya Vision 32B achieved the highest weighted F1-score on TurkishFoods-15 (85.9%), whereas Gemma 3 27B led on TurkishFoods-25 (76.7%). Across metrics and datasets, Aya Vision 32B, Gemma 3 27B, Qwen2-VL 72B-AWQ, and InternVL3 38B consistently emerged as the most reliable models. These results establish a solid reference point for future work on culturally aware multimodal AI and demonstrate, for the first time, that vision-language models can classify Turkish dishes without task-specific training.
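As an illustration only (not the authors' released code), the sketch below shows how the macro- and weighted-averaged metrics reported above could be computed with scikit-learn once each model's zero-shot predictions have been collected; the dish labels are hypothetical stand-ins for TurkishFoods-15 classes, and the prediction step itself is assumed to have already run.

```python
# Illustrative sketch: macro- and weighted-averaged precision/recall/F1
# plus overall accuracy from ground-truth labels and model predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels standing in for TurkishFoods-15 classes and for
# one model's zero-shot outputs (placeholders, not real results).
y_true = ["baklava", "lahmacun", "menemen", "baklava", "lahmacun"]
y_pred = ["baklava", "lahmacun", "lahmacun", "baklava", "menemen"]

print(f"accuracy={accuracy_score(y_true, y_pred):.3f}")
for avg in ("macro", "weighted"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```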
Keywords
Details
Primary Language
English
Subjects
Software Testing, Verification and Validation; Software Engineering (Other)
Journal Section
Research Article
Authors
Early Pub Date
March 16, 2026
Publication Date
March 16, 2026
Submission Date
June 26, 2025
Acceptance Date
November 19, 2025
Published in Issue
Year 2026 Volume: 9 Number: 1
