Exploring the potential of large language models for assessing medication adherence to the ESC heart failure guidelines

Research output: Contribution to journal › Article › Academic › peer review


Abstract

Objective: To evaluate large language models (LLMs) for automating the assessment of clinician adherence to ESC heart failure pharmacotherapy guidelines.

Materials and Methods: We used electronic health record (EHR) data on hospitalized heart failure patients. The task was to assess whether discharge medications followed the guidelines. We labeled each record as: (1) all recommended medications present and target doses achieved; (2) all recommended medications present, but target doses not achieved; or (3) one or more recommended medications missing. We evaluated three general-domain (GLM-4-9B-Chat, Llama3-8B-Instruct, Mistral-7B-Instruct-v0.2) and three medical-specific (Med42-v2-8B, Llama-3-8B-UltraMedical, OpenBioLLM-8B) open-source LLMs under different prompt settings (zero-shot, few-shot, and Chain-of-Thought). We fine-tuned the models on preference data synthesized from our EHR data using the Monolithic Preference Optimization without Reference Model (ORPO) method, and performed a learning-curve analysis to determine the training data size needed for optimal performance. We assessed LLM performance with the macro F1 score.

Results: We included data from 1,141 patients. Full adherence to both recommended medications and target doses was 5.3%. All LLMs scored F1 < 0.40 across most prompt settings (baseline F1 = 0.333). After fine-tuning, four LLMs scored F1 ≥ 0.90; the remaining two, Llama3-8B-Instruct and OpenBioLLM-8B, scored F1 = 0.794 and 0.787, respectively. GLM-4-9B-Chat reached peak performance with 40% of the training data and Mistral-7B-Instruct-v0.2 with 50%; the other models needed more data.

Conclusion: Task-specific fine-tuning of LLMs is necessary for optimal performance, and selecting the appropriate LLM for the task matters. Without fine-tuning, both general-domain and medical-specific LLMs performed close to random guessing, revealing key limitations in their adaptability to specialized tasks. Medical-specific LLMs showed no clear advantage over general-domain LLMs.
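The evaluation metric above, the macro F1 score, averages the per-class F1 scores so that each of the three adherence labels counts equally despite the strong class imbalance (only 5.3% full adherence). A minimal pure-Python sketch of the metric, using the paper's three-class labeling scheme (the example label sequences are illustrative, not study data):

```python
# Macro F1 for the three-class adherence labels:
#   1 = all recommended medications present, target doses achieved
#   2 = all present, but target doses not achieved
#   3 = one or more recommended medications missing
# Labels below are made-up illustrations, not values from the study.

def macro_f1(y_true, y_pred, classes=(1, 2, 3)):
    """Average per-class F1 scores, weighting each class equally."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
    return sum(f1_scores) / len(f1_scores)

y_true = [1, 2, 3, 3, 2, 1]  # hypothetical gold labels
y_pred = [1, 2, 3, 2, 2, 3]  # hypothetical model predictions
print(round(macro_f1(y_true, y_pred), 3))  # → 0.656
```

On a balanced three-class task, a uniformly random classifier attains macro F1 of about 1/3, which is consistent with the paper's reported baseline of F1 = 0.333.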

Original language: English
Article number: ooaf155
Journal: JAMIA Open
Volume: 8
Issue number: 6
DOIs
Publication status: Published - 1 Dec 2025

Keywords

  • clinical guidelines
  • electronic health records
  • large language models
  • natural language processing
  • preference optimization

