Performance Benchmarking of Open-Source Large Language Models on the Brazilian Society of Cardiology’s Certification Exam

Severino, João Victor Bruneti; Berger, Matheus Nespolo; Paula, Pedro Angelo Basei de; Loures, Filipe Silveira; Todeschini, Solano Amadori; Roeder, Eduardo Augusto; Veiga, Maria Han; Knopfholz, José; Marques, Gustavo Lenci

International Journal of Cardiovascular Sciences. June 11, 2025;38:e20240231.

Original Article

DOI: 10.36660/ijcs.20240231

Abstract

Background:

Large language models (LLMs) have had a significant impact on medicine and show substantial promise for further development. However, most existing research has centered on English-language tasks of lower medical complexity. This underscores the importance of investigating how state-of-the-art LLMs perform in more complex specialties, such as cardiology, and in languages other than English, such as Portuguese.

Objective:

This study aimed to evaluate and compare leading LLMs on the validated cardiology knowledge assessed by the Brazilian Society of Cardiology’s (SBC) Certification Exam.

Methods:

This study compared 23 LLMs on the SBC’s Certification Exam. The exam consists of 100 multiple-choice questions, 20 of which include images that not all LLMs can process. These image-based questions were therefore excluded, leaving 80 text-only questions for the analysis.
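The scoring rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Question` record, the `accuracy_on_text_questions` helper, and the toy exam and answers are all hypothetical, assumed here only to show how a success rate is computed once image-based questions are dropped.

```python
from dataclasses import dataclass


@dataclass
class Question:
    number: int
    answer: str      # correct option, e.g. "A" (hypothetical field)
    has_image: bool  # True for image-based questions


def accuracy_on_text_questions(questions, model_answers):
    """Return the percentage of text-only questions answered correctly."""
    # Exclude image-based questions before scoring.
    scored = [q for q in questions if not q.has_image]
    if not scored:
        raise ValueError("no text-only questions to score")
    correct = sum(1 for q in scored if model_answers.get(q.number) == q.answer)
    return 100.0 * correct / len(scored)


# Toy exam (made-up data): 100 questions, the last 20 flagged as image-based.
exam = [Question(n, "A", has_image=(n > 80)) for n in range(1, 101)]
# Toy model output: answers "A" on the first 50 questions, "B" elsewhere.
answers = {n: ("A" if n <= 50 else "B") for n in range(1, 101)}

print(accuracy_on_text_questions(exam, answers))  # 62.5 (50 correct out of 80)
```

In this toy run the denominator is the 80 text-only questions rather than the full 100, which is the only point the sketch is meant to convey.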

Results:

Proprietary LLMs varied in performance, with GPT-4o achieving the highest success rate at 62.25%, followed by Claude Opus at 60.25%. In the medium-sized model category (up to 100 billion parameters), Claude Haiku reached 47.25%. Among open-source models, Llama3 70B Instruct scored 53.50%, also in the medium-sized category, while Llama3 8B achieved 36.25% in the small model category (under 20 billion parameters).

Conclusions:

Both proprietary and open-source LLMs underperformed on the test, failing to meet the exam’s cutoff score. Although larger models generally achieved better results, some medium-sized models, such as Llama3 70B Instruct and Claude Haiku, delivered noteworthy performance. The LLMs lacked specialized knowledge in cardiology and faced challenges in understanding Portuguese, revealing a significant gap in current AI capabilities and emphasizing the need for improvements.

Keywords: Artificial Intelligence; Cardiology; Benchmarking
