Reframing Clinical AI Evaluation in the Era of Generative Models: Toward Multidimensional, Stakeholder-Informed, and Safety-Centric Frameworks for Real-World Health Care Deployment

  • Matthew Abikenari*
  • M. Hassan Awad
  • Sammy Korouri
  • Kimia Mohseni
  • Derek Abikenari
  • René Freichel
  • Yaseen Mukadam
  • Ubaid Tanzim
  • Amin Habib
  • Ahmed Kerwan
  • *Corresponding author for this work
  • Stanford University
  • California State University Los Angeles
  • University of California at Irvine
  • University of California at Los Angeles
  • California State Polytechnic University Pomona
  • Department of Psychology, University of Amsterdam, Amsterdam, Netherlands
  • Royal Brompton and Harefield NHS Foundation Trust
  • University College London Hospitals NHS Foundation Trust
  • Mid and South Essex NHS Foundation Trust
  • Harvard University

Research output: Contribution to journal › Review article › Academic › peer-review

Abstract

The integration of artificial intelligence (AI) in the form of large language models (LLMs) and generative models into clinical practice has progressed ahead of metrics available to measure their performance in real-world settings. Traditional benchmarks such as area under the receiver operating characteristic curve or bilingual evaluation understudy (BLEU) scores are inadequate to meet clinical nuance, patient safety, explainability, and workflow integration. This scoping review maps the evolving landscape of clinical AI evaluation, combining academic and industry architectures, including clinical risk evaluation of LLMs for hallucination and omission (CREOLA), hospital deployments, and radiological tool reviews. We explore stakeholder tensions between academia, business viability, regulation, and frontline usability, and reveal how these perceptions build competing evaluation imperatives. In particular, we highlight the novel challenges created by generative models: hallucination, omission, narrative incoherence, and epistemic misalignment. The current paper elucidates that a strategy of layered, stakeholder-engaged design needs to integrate risk stratification, contextual awareness, and continuous postdeployment surveillance. Equity, interpretability, and clinician trust are not thought of as footnotes, but as central columns upon which evaluation is built. This review offers a synthesizing overview of how health systems, developers, and regulators can coconstruct adaptive and ethically grounded evaluation frameworks, ensuring that AI tools enhance, rather than erode, clinical judgment, patient safety, and health equity in real-world care.

Original language: English
Article number: 100089
Journal: Premier Journal of Science
Volume: 11
Publication status: Published - Aug 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 10 - Reduced Inequalities

Keywords

  • Clinical AI evaluation
  • Generative models
  • Healthcare equity
  • Safety-centric deployment
  • Stakeholder-informed frameworks
