TY - JOUR
T1 - Exploring the potential of large language models for assessing medication adherence to the ESC heart failure guidelines
AU - Dormosh, Noman
AU - Boonstra, Machteld
AU - Abu-Hanna, Ameen
AU - Asselbergs, Folkert W.
AU - Calixto, Iacer
N1 - Publisher Copyright:
© 2025 The Author(s). Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2025/12/1
Y1 - 2025/12/1
AB - Objective: To evaluate large language models (LLMs) for automating the assessment of clinician adherence to ESC heart failure pharmacotherapy guidelines. Materials and Methods: We used electronic health record (EHR) data pertaining to hospitalized heart failure patients. The task was to assess whether discharge medications followed the guidelines. We labeled each record as: (1) all recommended medications present and target doses achieved; (2) all recommended medications present, but target doses not achieved; or (3) one or more recommended medications missing. We evaluated three general-domain (GLM-4-9B-Chat, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2) and three medical-specific (Med42-v2-8B, Llama-3-8B-UltraMedical, OpenBioLLM-8B) open-source LLMs under different prompt settings (zero-shot, few-shot, and chain-of-thought). We fine-tuned the models on preference data synthesized from our EHR data using the Monolithic Preference Optimization without Reference Model (ORPO) method. We performed a learning curve analysis to determine the training data size needed for optimal performance. We assessed LLM performance using the macro F1 score. Results: We included data from 1,141 patients. Adherence to both recommended medications and target doses was 5.3%. All LLMs scored F1 < 0.40 across most prompt settings (baseline F1 = 0.333). After fine-tuning, four LLMs scored F1 ≥ 0.90; the remaining two, Llama3-8B-Instruct and OpenBioLLM-8B, scored F1 = 0.794 and 0.787, respectively. GLM-4-9B-Chat reached peak performance with 40% of the training data, while Mistral-7B-Instruct-v0.2 required 50%; the other models needed more data. Conclusion: Task-specific fine-tuning of LLMs is necessary for optimal performance, and selecting the appropriate LLM matters. Without fine-tuning, both general-domain and medical-specific LLMs performed close to random guessing, revealing key limitations in their adaptability to specialized tasks. Medical-specific LLMs showed no clear advantage over general-domain LLMs.
KW - clinical guidelines
KW - electronic health records
KW - large language models
KW - natural language processing
KW - preference optimization
UR - https://www.scopus.com/pages/publications/105021863787
U2 - 10.1093/jamiaopen/ooaf155
DO - 10.1093/jamiaopen/ooaf155
M3 - Article
C2 - 41250772
SN - 2574-2531
VL - 8
JO - JAMIA Open
JF - JAMIA Open
IS - 6
M1 - ooaf155
ER -