BMC Oral Health, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)
Objective: This study critically evaluated the performance, accuracy, and clinical relevance of three large language models (ChatGPT-4o, Claude 3.5, and Gemini 1.5 Pro) in answering expert-generated questions on zygomatic implantology. The goal was to determine the extent to which such tools may serve as educational or clinical decision supports in maxillofacial surgery.

Methods: Thirty-eight standardized questions were developed by four oral and maxillofacial surgeons with advanced expertise in zygomatic implantology. Each model's responses were independently assessed by five calibrated clinical raters using validated metrics (DISCERN, the Global Quality Scale [GQS], and a 5-point accuracy rubric) to judge reliability, quality, and factual correctness. Non-parametric statistics were applied (Kruskal–Wallis tests with Bonferroni post hoc correction; Spearman correlation), and inter-rater reliability was high, with ICC(2,1) = 0.86–0.91 (p < 0.001).

Results: Gemini 1.5 Pro achieved slightly higher mean scores for response quality and accuracy, whereas Claude 3.5 and ChatGPT-4o performed comparably. Absolute differences were modest (≤ 0.5 points on 5-point scales), however, indicating relative trends rather than decisive superiority. All models produced readable, clinically relevant content, though variability persisted in the depth and specificity of clinical guidance.

Conclusion: Current AI language models exhibit moderate but inconsistent competency when addressing complex implantology scenarios. While Gemini 1.5 Pro scored marginally higher, these differences are unlikely to be of major practical consequence. Continuous validation, transparent reporting of model versions, and expert supervision remain essential before such systems are integrated into routine dental education or clinical decision-making.
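The non-parametric workflow named in Methods (Kruskal–Wallis omnibus test, Bonferroni-corrected pairwise comparisons, Spearman correlation) can be sketched with SciPy. All score values below are fabricated placeholders, and the choice of Mann–Whitney U as the pairwise post hoc test is an assumption; the abstract does not state which pairwise test the authors used.

```python
# Hedged sketch of the abstract's statistical workflow using SciPy.
# Score data are fabricated placeholders; only the named procedures
# (Kruskal-Wallis, Bonferroni correction, Spearman) come from the abstract.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-question mean rater scores (38 questions, 5-point rubric)
scores = {
    "ChatGPT-4o":     rng.normal(4.0, 0.4, 38).clip(1, 5),
    "Claude 3.5":     rng.normal(4.0, 0.4, 38).clip(1, 5),
    "Gemini 1.5 Pro": rng.normal(4.3, 0.4, 38).clip(1, 5),
}

# Omnibus Kruskal-Wallis test across the three models
h_stat, p_kw = stats.kruskal(*scores.values())

# Pairwise comparisons (assumed Mann-Whitney U) with Bonferroni correction
pairs = list(combinations(scores, 2))
p_adj = {}
for a, b in pairs:
    _, p = stats.mannwhitneyu(scores[a], scores[b])
    p_adj[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni: multiply by m, cap at 1

# Spearman correlation between two rating dimensions (simulated here as
# one model's quality scores plus small noise standing in for accuracy)
quality = scores["Gemini 1.5 Pro"]
accuracy = quality + rng.normal(0, 0.2, 38)
rho, p_rho = stats.spearmanr(quality, accuracy)
```

A significant omnibus p_kw would justify inspecting the Bonferroni-adjusted pairwise p-values; with differences as small as those the abstract reports (≤ 0.5 points), such comparisons may well remain non-significant.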