
Abstract

Objective: To evaluate large language models (LLMs) for automating the assessment of clinician adherence to ESC heart failure pharmacotherapy guidelines.
Materials and Methods: We used electronic health record (EHR) data from hospitalized heart failure patients. The task was to assess whether discharge medications followed the guidelines. We labeled each record as: (1) all recommended medications are present and target doses achieved; (2) all recommended medications are present, but target doses not achieved; (3) one or more recommended medications are missing. We evaluated three general-domain (GLM-4-9B-Chat, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2) and three medical-specific (Med42-v2-8B, Llama-3-8B-UltraMedical, OpenBioLLM-8B) open-source LLMs under different prompt settings (zero-shot, few-shot, and chain-of-thought). We fine-tuned the models on preference data synthesized from our EHR data, using Odds Ratio Preference Optimization (ORPO), a monolithic preference optimization method that requires no reference model. We performed a learning curve analysis to determine how much training data each model needed to reach peak performance. We assessed LLM performance using the macro F1 score.
Results: We included data from 969 patients. Full adherence to both recommended medications and target doses was 4%. All LLMs scored F1<0.40 across most prompt settings (baseline F1=0.333). After fine-tuning, five LLMs scored F1≥0.89; one (Llama3-8B-Instruct) scored lower (F1=0.764). GLM-4-9B-Chat reached peak performance with 40% of the training data, while Mistral-7B-Instruct-v0.2 required 50%. Other models needed more data.
Conclusion: Task-specific fine-tuning of LLMs is necessary for optimal performance on this task, and the choice of base LLM matters. Without fine-tuning, both general-domain and medical-specific LLMs performed close to random guessing, revealing key limitations in their adaptability to specialized tasks. Medical-specific LLMs showed no clear advantage over general-domain LLMs.
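The three-way labeling scheme and the macro F1 metric described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the input representation (a mapping from medication class to whether its target dose was reached), and the placeholder class names `class_a` and `class_b` are all assumptions made for the example. Macro F1 is computed as the unweighted mean of per-class F1 scores, with a class that has no true or predicted instances scored as 0, which is a common convention.

```python
def adherence_label(recommended, prescribed):
    """Return the label 1, 2, or 3 for one discharge record.

    recommended: iterable of medication class names the guideline calls for
                 (hypothetical representation).
    prescribed:  dict mapping each prescribed class name to True if the
                 target dose was reached, else False.
    """
    recommended = list(recommended)
    # Label 3: one or more recommended medications are missing.
    if any(name not in prescribed for name in recommended):
        return 3
    # Label 1: all recommended medications present and at target dose.
    if all(prescribed[name] for name in recommended):
        return 1
    # Label 2: all present, but at least one is below target dose.
    return 2


def macro_f1(y_true, y_pred, labels=(1, 2, 3)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    needed = ["class_a", "class_b"]  # placeholder names, not real drug classes
    print(adherence_label(needed, {"class_a": True, "class_b": False}))
    print(round(macro_f1([1, 1, 2, 3], [1, 2, 2, 3]), 3))
```

Because macro F1 weights every class equally, a model that rarely predicts the scarce fully adherent class is penalized as heavily as one that misses a common class, which makes the metric a natural fit when only 4% of records carry label 1.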
Original language: English
Journal: Journal of the American Medical Informatics Association
Publication status: Submitted - 2025

Title: Exploring the Potential of Large Language Models for Assessing Medication Adherence to the ESC Heart Failure Guidelines