BMC Oral Health, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)
Objective: This study critically evaluated the performance, accuracy, and clinical relevance of three large language models (ChatGPT-4o, Claude 3.5, and Gemini 1.5 Pro) in answering expert-generated questions on zygomatic implantology. The goal was to determine the extent to which such tools may serve as educational or clinical decision supports in maxillofacial surgery.

Methods: Thirty-eight standardized questions were developed by four oral and maxillofacial surgeons with advanced expertise in zygomatic implantology. Each model's responses were independently assessed by five calibrated clinical raters using validated metrics (DISCERN, the Global Quality Scale [GQS], and a 5-point accuracy rubric) to judge reliability, quality, and factual correctness. Non-parametric statistics were applied (Kruskal–Wallis tests with Bonferroni post hoc correction; Spearman correlation), and inter-rater reliability was high, with ICC(2,1) = 0.86–0.91 (p < 0.001).

Results: Gemini 1.5 Pro achieved slightly higher mean scores for response quality and accuracy, whereas Claude 3.5 and ChatGPT-4o performed comparably. Absolute differences were modest (≤ 0.5 points on 5-point scales), however, indicating relative trends rather than decisive superiority. All models produced readable, clinically relevant content, though variability persisted in the depth and specificity of clinical guidance.

Conclusion: Current AI language models exhibit moderate but inconsistent competency when addressing complex implantology scenarios. While Gemini 1.5 Pro scored marginally higher, these differences are unlikely to be of major practical consequence. Continuous validation, transparent reporting of model versions, and expert supervision remain essential before such systems are integrated into routine dental education or clinical decision-making.
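The non-parametric workflow named in Methods (Kruskal–Wallis omnibus test, Bonferroni-corrected pairwise comparisons, Spearman correlation) can be sketched with SciPy. All score values below are fabricated placeholders, and the choice of Mann–Whitney U as the pairwise post hoc test is an assumption; the abstract does not state which pairwise test the authors used.

```python
# Hedged sketch of the abstract's statistical workflow using SciPy.
# Score data are fabricated placeholders; only the named procedures
# (Kruskal-Wallis, Bonferroni correction, Spearman) come from the abstract.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-question mean rater scores (38 questions, 5-point rubric)
scores = {
    "ChatGPT-4o":     rng.normal(4.0, 0.4, 38).clip(1, 5),
    "Claude 3.5":     rng.normal(4.0, 0.4, 38).clip(1, 5),
    "Gemini 1.5 Pro": rng.normal(4.3, 0.4, 38).clip(1, 5),
}

# Omnibus Kruskal-Wallis test across the three models
h_stat, p_kw = stats.kruskal(*scores.values())

# Pairwise comparisons (assumed Mann-Whitney U) with Bonferroni correction
pairs = list(combinations(scores, 2))
p_adj = {}
for a, b in pairs:
    _, p = stats.mannwhitneyu(scores[a], scores[b])
    p_adj[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni: multiply by m, cap at 1

# Spearman correlation between two rating dimensions (simulated here as
# one model's quality scores plus small noise standing in for accuracy)
quality = scores["Gemini 1.5 Pro"]
accuracy = quality + rng.normal(0, 0.2, 38)
rho, p_rho = stats.spearmanr(quality, accuracy)
```

A significant omnibus p_kw would justify inspecting the Bonferroni-adjusted pairwise p-values; with differences as small as those the abstract reports (≤ 0.5 points), such comparisons may well remain non-significant.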