International Journal of Cardiovascular Sciences. June 11, 2025;38:e20240231.
Performance Benchmarking of Open-Source Large Language Models on the Brazilian Society of Cardiology’s Certification Exam
Abstract
Background:
Large language models (LLMs) have made a significant impact in medicine and show substantial promise for further development. However, most existing research has centered on English-language tasks of lower medical complexity. This underscores the importance of investigating the performance of state-of-the-art LLMs in more complex specialties, such as cardiology, and in languages other than English, such as Portuguese.
Objective:
This study aimed to evaluate and compare leading LLMs on the validated cardiology knowledge assessed by the Brazilian Society of Cardiology (SBC) Certification Exam.
Methods:
This study conducted a comparative analysis of 23 LLMs on the SBC’s Certification Exam. The exam consists of 100 multiple-choice questions, 20 of which include images, which not all of the evaluated LLMs can process. These image-based questions were therefore excluded from the analysis, leaving 80 text-only questions.
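The scoring step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Question` record, the `success_rate` helper, and the toy answer data are all hypothetical names and values introduced here, assuming only that each question carries an image flag and a single correct option.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: int          # question number (hypothetical field)
    has_image: bool   # True if the question depends on an image
    correct: str      # answer key, e.g. "A"


def success_rate(questions, answers):
    """Return the percentage of text-only questions answered correctly."""
    # Drop image-based questions, as in the Methods description.
    text_only = [q for q in questions if not q.has_image]
    if not text_only:
        raise ValueError("no text-only questions to score")
    hits = sum(1 for q in text_only if answers.get(q.qid) == q.correct)
    return 100.0 * hits / len(text_only)


# Toy data (assumed): 100 questions, the last 20 flagged as image-based.
questions = [Question(i, i > 80, "A") for i in range(1, 101)]
# Hypothetical model output: "A" on the first 40 questions, "B" on the rest.
answers = {q.qid: ("A" if q.qid <= 40 else "B") for q in questions}

rate = success_rate(questions, answers)
print(f"{rate:.2f}%")
```

Filtering before scoring keeps the denominator at the number of text-only questions, so models with and without image support are compared on the same items.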
Results:
Models were grouped by size into large (over 100 billion parameters), medium-sized (up to 100 billion parameters), and small (under 20 billion parameters) categories. Proprietary LLMs showed varying performance, with GPT-4o achieving the highest success rate at 62.25%, followed by Claude Opus at 60.25%; Claude Haiku reached 47.25% in the medium-sized category. Among open-source models, Llama3 70B Instruct scored 53.50%, also in the medium-sized category, while Llama3 8B achieved 36.25% in the small category.
Conclusions:
Both proprietary and open-source LLMs underperformed on the test, failing to meet the exam’s cutoff score. Although larger models generally achieved better results, some medium-sized models, such as Llama3 70B Instruct and Claude Haiku, showed noteworthy performance. The LLMs lacked specialized knowledge in cardiology and faced challenges in understanding Portuguese, revealing a significant gap in current AI capabilities and emphasizing the need for improvements.
Keywords: Artificial Intelligence; Cardiology; Benchmarking