Publication: Evaluating the Performance of DeepSeek 3, Claude Sonnet 4, and Gemini 2.5 in the Chilean Medical Licensing Examination: Observational Study
Date
Authors
Lancellotti Guajardo, Anaís Aracelly
Jerez Yañez, Oscar
Silva Arroyo, Vicente Alberto Edgardo Jesus
Vera Cartes, Marcos Jeremías Giovanny
Herrera Alcaíno, Álvaro Andrés
Journal Title
Journal ISSN
Volume Title
Publisher
Universidad de Murcia, Servicio de Publicaciones
Abstract
Introduction: Artificial intelligence and its continuous improvement have revolutionized medical education, but its performance in specific evaluative contexts still requires further exploration. Methods: This study qualitatively evaluated and compared the performance of three state-of-the-art language models (Claude Sonnet 4, Gemini 2.5, and DeepSeek 3) in simulations of the National Medical Knowledge Examination (EUNACOM) in Chile. Three mock exams with 180 questions each were used, covering various medical areas and question types, including clinical case-based questions. Results: All AI models consistently passed the exams, with Claude Sonnet 4 achieving the highest overall performance (89% accuracy) and the greatest consistency across attempts. Clinical case-based questions were answered more accurately than theoretical knowledge questions, highlighting the models' strength in contextual clinical reasoning. Claude excelled in Internal Medicine and Psychiatry, DeepSeek in Surgery, and Gemini demonstrated balanced performance. However, specific gaps were identified in areas such as Public Health and clinical follow-up, suggesting the need for model-specific adjustments. Conclusion: The findings support the educational potential of these tools but also emphasize the importance of their ethical, supervised use as a complement to traditional medical training. This study contributes to understanding the emerging role of artificial intelligence in professional assessments, as well as its limitations and opportunities within the Chilean medical context.
Description
Keywords
Medical education, EUNACOM, Clinical reasoning, Large Language Models, Medical assessment, Artificial intelligence
Citation
Revista Española de Educación Médica, Vol. 6 Núm. 5 (2025)
This item is subject to a Creative Commons license. http://creativecommons.org/publicdomain/zero/1.0/