AI in Healthcare: Why Is Evaluation Not Optional?

Artificial intelligence is gradually becoming established in the healthcare sector. Its applications are multiplying: diagnostic assistance, help with writing medical reports, medical imaging analysis, optimization of care pathways… But healthcare is not an application domain like any other. The stakes are critical: patient safety, and ultimately patients’ lives, depend on it.

In this healthcare context, as in all critical sectors, the use of AI requires high standards of reliability. Every AI-assisted decision must be understandable, verifiable, and properly supervised to ensure safe and ethical care.

So yes, it may seem obvious: evaluating AI systems is no longer optional. But it’s always better to say it… and even better to do it.

Medical Liability + AI: Without Evaluation, the Equation Doesn’t Hold

With AI, medical liability is entering uncharted territory.

On paper, nothing changes: the doctor remains responsible for the final decision. The ethical guidelines and position statements from the National Academy of Medicine and the High Authority for Health are clear: AI is a decision-support tool, not the decision-maker. In practice, however, the situation is becoming tense, because following AI-generated recommendations offers no protection: if the AI makes a mistake, the doctor is liable. Conversely, failing to follow its recommendations can also be held against the practitioner, especially if the tool is recognized as effective. A catch-22? The practitioner becomes the arbiter of a system they may not fully control… while remaining legally alone on the front line.

It is precisely this paradox that changes the game: the evaluation of AI systems is no longer just a technical issue; it is a matter of trust. Without it, doctors are being asked to make critical decisions… based on tools whose reliability is not always proven or understood.

European regulations taking shape…

In Europe, the regulation of medical software, including solutions incorporating AI, is based on the Medical Device Regulation (MDR, (EU) 2017/745) and the In Vitro Diagnostic Regulation (IVDR, (EU) 2017/746). These regulations already impose evaluation requirements, but with the adoption of Regulation (EU) 2024/1689 on AI (AI Act), the regulatory framework has become more comprehensive. The AI Act introduces a risk-based approach. In this context, AI systems integrated into medical devices governed by the MDR or IVDR are, in most cases, classified as high-risk systems. They are therefore subject to additional obligations, particularly regarding data governance, transparency, human oversight, and continuous monitoring. This results in a dual regulatory framework: stakeholders will now need to align the requirements specific to medical devices with those specific to AI.

And then there are harmonized standards… Behind this term lies a key element of the regulations: these are standards (developed by CEN/CENELEC) that translate regulatory requirements into technical requirements. In practice, it’s important to understand that the regulation specifies what must be achieved, and the harmonized standards explain how to achieve it. When a manufacturer applies these standards, they benefit from a presumption of compliance with regulatory requirements. The problem: for the AI Act, many of the standards have not yet been harmonized. But don’t panic—standards already exist and provide you with initial guidance.

Although the AI Act was adopted only recently, its implementation timeline is already evolving. According to the latest discussions at the European level, particularly within the framework of the Digital Omnibus, the European Parliament is proposing to postpone the application of the regulation’s most fundamental provisions. Specifically, the requirements for high-risk AI systems initially scheduled for August 2026 could be pushed back to December 2027. For systems integrated into regulated products (including medical devices), the deadline could be extended to August 2028.

In the short term, this delay may be seen as good news, with less immediate pressure and more time to prepare. But in the medium term, it raises the concern that the lack of a stable regulatory framework could slow down decision-making and foster a wait-and-see attitude. Above all, it does not change anything fundamentally: the requirements remain; the regulatory trajectory is confirmed, and the pressure on system reliability continues to increase.

A Different Approach by the FDA

The United States, through the Food and Drug Administration (FDA), is taking a significantly different approach. Unlike the AI Act, the FDA does not propose a cross-cutting “high-risk AI” classification and does not impose a horizontal framework covering all sectors. It favors a case-by-case approach, centered on the product and its use. The result: greater flexibility for manufacturers but also less harmonization for integrators/users.

Furthermore, the FDA identified a key characteristic of AI very early on: its ability to evolve over time. It has therefore introduced the concept of a “Predetermined Change Control Plan” (PCCP), which makes it possible to plan for changes to algorithms after they have been placed on the market. In practice, the manufacturer describes in advance the types of planned modifications as well as the associated control methods, and these changes can then be implemented without a new full revalidation, provided they comply with the defined framework.
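To make the logic of a change control plan more concrete, here is a minimal sketch in Python. It is not the FDA's format: the `PlannedChange` structure, the change type names, and the sensitivity and specificity thresholds are all hypothetical, chosen only to illustrate the idea of checking a proposed update against limits declared in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedChange:
    """One pre-declared modification type with its acceptance criteria (hypothetical fields)."""
    change_type: str
    min_sensitivity: float
    min_specificity: float


def within_plan(plan, change_type, sensitivity, specificity):
    """Return True only if the change type was declared in advance
    and the re-evaluated performance meets the declared criteria."""
    for planned in plan:
        if planned.change_type == change_type:
            return (sensitivity >= planned.min_sensitivity
                    and specificity >= planned.min_specificity)
    # Change types that were never declared fall outside the plan.
    return False


# Illustrative plan: retraining on data from a new site is anticipated.
plan = [PlannedChange("retrain_on_new_site_data", 0.90, 0.85)]

ok = within_plan(plan, "retrain_on_new_site_data", 0.93, 0.88)
undeclared = within_plan(plan, "new_input_modality", 0.99, 0.99)
below_criteria = within_plan(plan, "retrain_on_new_site_data", 0.88, 0.90)
print(ok, undeclared, below_criteria)
```

In this sketch, a change that was not anticipated is rejected even when its performance is excellent, which reflects the principle described above: only modifications that stay within the predefined framework avoid a new full revalidation.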

What steps can you take right now?

The European regulation requires the assessment of a set of trust properties, including:

Data quality (relevance, representativeness, examination and mitigation of bias) 👉 Article 10 – Data and data governance
Robustness and accuracy 👉 Article 15 – Accuracy, robustness, and cybersecurity
Risk management 👉 Article 9 – Risk management system
Transparency and explainability 👉 Article 13 – Transparency and provision of information to deployers
Human oversight 👉 Article 14 – Human oversight
Cybersecurity 👉 Article 15 – Accuracy, robustness, and cybersecurity (same article as robustness and accuracy)
Post-deployment monitoring 👉 Article 72 – Post-market monitoring

And most importantly, these requirements apply throughout the system’s entire lifecycle, not just at the time of its market launch.

One might think that a unified framework already exists to guide the evaluation of AI systems in healthcare.

Spoiler: not yet!

The scientific community is nevertheless actively organizing itself, with initiatives such as TRIPOD-AI, PROBAST-AI, CONSORT-AI, FUTURE-AI, and TEF-Health, which are helping to gradually structure practices. Should we be concerned about the lack of a universal “checklist”? Probably not, especially since its feasibility is debatable. The diversity of use cases, risk levels, and technologies makes the idea of a single framework applicable to all situations unrealistic. What is emerging today is instead a structured yet adaptable approach: a foundation of principles supported by the AI Act, supplemented by context-specific methodologies. The challenge, therefore, is not to wait for a perfect framework, but to intelligently leverage the methods, tools, and expertise already available.

Building on this framework, the idea is not to seek a single metric, but to multiply the measurement points.

First, regarding the data, it is a matter of verifying that it is consistent with the intended use. Is it representative of real patients? Are the training, test, and validation datasets constructed consistently? Are there any obvious biases? Is the quality consistent? At this stage, we are not yet “testing” the AI, but rather ensuring that the data foundation is solid. Particular vigilance is also required regarding the use of synthetic data, which can introduce or amplify certain biases.
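As an illustration of these data checks, the sketch below compares the composition of a training set with a reference population and looks for patients who appear in both the training and test sets. The record fields (`patient_id`, `sex`), the reference shares, and the 10-point tolerance are assumptions made for the example, not values prescribed by any regulation.

```python
from collections import Counter


def subgroup_shares(records, key):
    """Share of each subgroup value (e.g. sex, age band) in a dataset."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representativeness_gaps(records, key, reference_shares, tolerance=0.10):
    """Subgroups whose share differs from the reference population
    by more than `tolerance` (absolute difference)."""
    observed = subgroup_shares(records, key)
    gaps = {}
    for group, expected in reference_shares.items():
        diff = observed.get(group, 0.0) - expected
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


def leaked_patients(train, test, id_key="patient_id"):
    """Patients present in both splits, a common cause of over-optimistic test scores."""
    return {r[id_key] for r in train} & {r[id_key] for r in test}


# Toy data, for illustration only.
train = [
    {"patient_id": "p1", "sex": "F"},
    {"patient_id": "p2", "sex": "M"},
    {"patient_id": "p3", "sex": "M"},
    {"patient_id": "p4", "sex": "M"},
]
test = [
    {"patient_id": "p4", "sex": "M"},
    {"patient_id": "p5", "sex": "F"},
]
reference = {"F": 0.5, "M": 0.5}  # assumed target population

gaps = representativeness_gaps(train, "sex", reference)
leaks = leaked_patients(train, test)
print(gaps)   # {'F': -0.25, 'M': 0.25}
print(leaks)  # {'p4'}
```

Here the training set under-represents women by 25 points and shares one patient with the test set. Neither finding says anything yet about the model itself, which is exactly the point of this first stage.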

Next, when it comes to the model, we naturally look at its performance, but that’s not all. The question becomes: does the system remain reliable in degraded or unusual situations? For example, when the data differs slightly from what was used during training, does the behavior remain consistent? The goal is to understand under what conditions the model works… and, more importantly, under what conditions it no longer works.
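One simple way to probe this is to apply a controlled shift to the inputs and watch several metrics at once. The sketch below uses a deliberately trivial stand-in model (a threshold on a single measurement) and a systematic offset such as a device calibration difference might produce. The threshold, the values, and the shifts are all invented for the example.

```python
def predict(value, threshold=2.0):
    """Toy stand-in for a model: positive when the measurement reaches the threshold."""
    return 1 if value >= threshold else 0


def sensitivity_specificity(values, labels, shift=0.0):
    """Sensitivity and specificity when every input is offset by `shift`."""
    tp = fn = tn = fp = 0
    for value, label in zip(values, labels):
        pred = predict(value + shift)
        if label == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy measurements: four negative cases, then four positive cases.
values = [1.0, 1.4, 1.7, 1.9, 2.1, 2.4, 2.8, 3.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

report = {
    shift: sensitivity_specificity(values, labels, shift)
    for shift in (-0.2, 0.0, 0.2, 0.5)
}
for shift, (sens, spec) in report.items():
    print(f"shift={shift:+.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On the unshifted data both metrics are perfect, yet an offset of 0.2 in either direction is enough to degrade one of them, and an offset of 0.5 halves the specificity. A single headline score on the original test set would hide this sensitivity to small shifts.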

Finally, on the system side, we move closer to real-world usage conditions. How does AI integrate into the care pathway? Is it understandable and usable by professionals? Does it influence decision-making appropriately? It is often at this level that the gaps between AI that performs well “in the lab” and AI that is truly useful in practice become apparent. This directly raises the question of human oversight, but also the ability to monitor the system over time: what updates are being made, how performance evolves, and what management framework (such as MLOps) is in place to ensure controlled operation.
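Monitoring over time can start very simply. The following sketch tracks sensitivity per period together with the model version in use, and flags the periods in which performance falls too far below a baseline. The `PeriodResult` structure, the periods, the counts, and the 5-point margin are hypothetical, and a real monitoring plan would track more than one metric.

```python
from dataclasses import dataclass


@dataclass
class PeriodResult:
    """Confirmed outcomes for one monitoring period (hypothetical structure)."""
    period: str
    model_version: str
    true_positives: int
    false_negatives: int

    @property
    def sensitivity(self):
        return self.true_positives / (self.true_positives + self.false_negatives)


def periods_to_review(history, baseline, max_drop=0.05):
    """Periods whose sensitivity fell more than `max_drop` below the baseline."""
    return [
        (r.period, r.model_version, round(r.sensitivity, 3))
        for r in history
        if r.sensitivity < baseline - max_drop
    ]


# Illustrative history: the model is updated to version 1.1 in the third period.
history = [
    PeriodResult("2025-01", "1.0", 92, 8),
    PeriodResult("2025-02", "1.0", 90, 10),
    PeriodResult("2025-03", "1.1", 81, 19),
]

alerts = periods_to_review(history, baseline=0.92)
print(alerts)  # [('2025-03', '1.1', 0.81)]
```

Recording the model version next to each measurement makes it possible to link a drop in performance to a specific update, which is the kind of traceability a management framework such as MLOps is meant to provide.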

Evaluation is not (merely) clinical validation

When discussing medical devices that incorporate AI, it is important to distinguish between technical evaluation and clinical validation from a regulatory standpoint, as they address different requirements.

Under the MDR (EU) 2017/745, clinical validation is a core requirement: it aims to demonstrate that the medical device is safe and effective in a real-world clinical setting, and that it provides a benefit to patients. This involves clinical data, studies, and a demonstration of the benefit-risk ratio.

The AI Act, on the other hand, introduces additional requirements, more focused on the functioning of the AI system itself: data quality, robustness, bias management, transparency, human oversight, and post-deployment monitoring.

These two approaches are therefore complementary but not interchangeable in the context of an AI-enabled medical device.

And where does Kereval fit into all this?

AI in healthcare is becoming an operational reality. But in a field where every decision can have life-or-death consequences, its integration cannot rely solely on technical performance. What is at stake today is the shift from a one-off validation approach to a culture of continuous evaluation. The AI Act, despite its schedule adjustments, confirms this trajectory. It is no longer just a matter of proving that a system works, but of demonstrating that it remains reliable over time.

In this context, evaluation becomes a driver of trust:

trust for healthcare professionals, who must be able to rely on these tools to make decisions;
trust for patients, whose safety depends directly on these systems.

Waiting for a perfect framework or a universal method would be a mistake. The tools, standards, and best practices already exist. The challenge now is to mobilize them in a structured way. And that is where Kereval and its AI team come in. We position ourselves as a trusted third party in the evaluation of complex systems incorporating AI.

Specifically, Kereval’s approach consists of:

Structuring the evaluation according to regulatory requirements (MDR, IVDR, AI Act) and emerging standards;
Expanding the scope of analysis: data, models, systems, real-world use cases;
Implementing methodologies tailored to each use case;
Providing ongoing support, particularly regarding post-deployment monitoring challenges (MLOps, update management).

The goal is not merely to “validate” an AI system, but to ensure its safe use by providing tangible assurance to all stakeholders.
