Sethi, M. I. S. et al. (2026).
Indian Journal of Psychological Medicine,
02537176261435658.
Background: Artificial intelligence (AI) models demonstrate remarkable capabilities in healthcare applications, yet their performance compared to medical trainees in psychiatric education remains unexplored. This study evaluated the comparative performance of large language models (LLMs) against first-year psychiatry residents in standardized assessments at a premier Indian medical educational institute.
Methods: For this study, the already-scored answer sheets for Theory Papers I and II, as well as unmanned, non-interactive Objective Structured Clinical Examinations (OSCEs) with image-based tasks, from all 25 first-year psychiatry residents (March 2024 exam) were obtained from the examination section of the institute. The same question papers were then uploaded into three AI models (ChatGPT-3.5, Gemini Advanced, and Claude Sonnet). Four blinded faculty members evaluated the responses generated by the AI models. Finally, the scores of the AI models and psychiatry residents were analyzed for comparison. Statistical analysis employed Kruskal–Wallis tests with post hoc Mann–Whitney U comparisons.
Results: AI models outperformed residents in theoretical assessments. In Paper I (theory), AI models achieved mean scores (standard deviation) of Claude Sonnet 67.88 (10.63), ChatGPT-3.5 70.38 (3.95), and Gemini Advanced 71.25 (3.86), compared to residents’ 58.0 (2.58). Paper II (theory) assessments showed even larger gaps, with AI models scoring Claude Sonnet 72.88 (3.77), ChatGPT-3.5 71.0 (3.56), and Gemini Advanced 69.63 (12.86), compared to residents’ 50.96 (2.49). OSCE performance patterns differed markedly. Paper I OSCEs showed equivalent performance (AI models 13.0; residents 13.16 [1.49]), while Paper II OSCEs revealed variable results: Claude Sonnet excelled at 20.0 (1.41), but ChatGPT-3.5 underperformed at 15.0 (0.50), compared to residents at 16.6 (1.55). Inter-rater reliability coefficients remained excellent (intraclass correlation coefficients [ICC]: 0.810–0.934).
Conclusions: While AI demonstrated superior theoretical knowledge, equivalent or variable practical skills performance reveals fundamental limitations in clinical reasoning and contextual understanding. These findings necessitate reconceptualizing psychiatric education to emphasize uniquely human competencies while leveraging AI’s capabilities for knowledge synthesis.
Here are some thoughts:
This study compared three large language models (LLMs) to first-year psychiatry residents using real institutional exams in India. On the two theory papers, the LLMs' mean scores were roughly 17–43% higher than the residents' means in relative terms, but on the practical OSCEs their performance was equivalent or inconsistent, which the authors read as revealing critical gaps in clinical reasoning and cultural contextualization. The authors conclude that psychiatric education should shift focus toward uniquely human skills like empathy and judgment, while using AI as a tool for knowledge synthesis.
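For readers who want to check the size of the theory-paper gap, here is a minimal sketch that recomputes it from the mean scores quoted in the abstract. It assumes the gap is a relative difference of means, (LLM mean − resident mean) / resident mean; the abstract itself does not say how any such percentage was derived, so treat this as one plausible reading. The names `relative_gap`, `resident_means`, and `llm_means` are illustrative only.

```python
# Mean theory-paper scores as quoted in the abstract above.
resident_means = {"Paper I": 58.0, "Paper II": 50.96}
llm_means = {
    "Paper I": {
        "Claude Sonnet": 67.88,
        "ChatGPT-3.5": 70.38,
        "Gemini Advanced": 71.25,
    },
    "Paper II": {
        "Claude Sonnet": 72.88,
        "ChatGPT-3.5": 71.0,
        "Gemini Advanced": 69.63,
    },
}


def relative_gap(llm_mean, resident_mean):
    """Percentage by which an LLM mean exceeds the resident mean.

    Assumption: the gap is measured relative to the resident mean.
    """
    return 100.0 * (llm_mean - resident_mean) / resident_mean


# One gap per (paper, model) pair.
gaps = {
    (paper, model): relative_gap(score, resident_means[paper])
    for paper, models in llm_means.items()
    for model, score in models.items()
}

for (paper, model), gap in gaps.items():
    print(f"{paper:8s} {model:16s} +{gap:.1f}%")

print(f"range: {min(gaps.values()):.1f}% to {max(gaps.values()):.1f}%")
```

Under that assumption, the smallest gap is Claude Sonnet on Paper I and the largest is Claude Sonnet on Paper II. Note that this compares means only; it says nothing about statistical significance, which the study assessed with Kruskal–Wallis and Mann–Whitney U tests.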








