Evaluating AI-Generated Meal Plans for Simulated Diabetes Profiles: A Guideline-Based Comparison of Three Language Models


BAYRAM H. M., Arslan S., ÖZTÜRKCAN S. A.

Journal of Evaluation in Clinical Practice, vol.31, no.7, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 31 Issue: 7
  • Publication Date: 2025
  • DOI: 10.1111/jep.70295
  • Journal Name: Journal of Evaluation in Clinical Practice
  • Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, CAB Abstracts, CINAHL, MEDLINE, PsycINFO
  • Keywords: artificial intelligence, large language models, medical nutrition therapy, retrieval-augmented generation, type 2 diabetes mellitus
  • Affiliated with İstanbul Gelişim Üniversitesi: Yes

Abstract

Aims: This synthetic simulation study, which used no real patient data, aimed to evaluate and compare the performance of three prominent large language models (LLMs), ChatGPT-4.1, Grok-3 and DeepSeek, in generating dietary plans aligned with medical nutrition therapy for adults with type 2 diabetes mellitus (T2DM). Methods: A simulation-based design was employed using 24 standardized virtual patient profiles differentiated by gender and body mass index (BMI) category. Each LLM was prompted in Turkish to generate 3-day meal plans. Outputs were assessed for energy and macro-/micronutrient accuracy, adherence to national and international T2DM guidelines and alignment with the nutrition care process (NCP). Results: ChatGPT-4.1 showed the highest alignment with energy requirements (70.9%) but overestimated fat intake. Grok-3 demonstrated superior energy accuracy (83.1%) but failed to meet several micronutrient targets. DeepSeek adjusted protein intake according to BMI but underdelivered carbohydrates. None of the models demonstrated full concordance with the NCP framework, particularly in the diagnosis and monitoring components. Frequent hallucinations and a lack of clinical contextualization were noted. Integration of retrieval-augmented generation (RAG) was identified as a potential improvement strategy. Conclusion: While LLMs showed promise in generating baseline dietary guidance in a simulated context, these results reflected concordance with guideline documents only and should not be interpreted as evidence of equivalence to dietitian-led care. These findings reflected model behaviour in synthetic scenarios only and highlighted the need for RAG integration and expert supervision before any clinical application.