What healthcare regulators actually want from your AI

There is a common assumption among engineering teams entering healthcare AI that the regulatory environment is hostile, that the requirements are vague, and that compliance will be a quarter-long retrofit. The first two are wrong. The third is a self-inflicted wound.

I want to write about what regulators are actually asking for, based on the guidance documents and on my work with clients who have shipped clinical AI in regulated environments. Most of the guidance is more pragmatic than teams assume.

The four things they consistently ask for

Across the FDA, ONC, the Joint Commission, and most state regulators, the asks converge on four themes. These are not exhaustive, but they are stable.

Documentation of the model's intended use, including the patient population, the clinical setting, and the specific decision the model is intended to inform. This is not boilerplate. The intended use statement constrains the evaluation, the deployment, and the post-market monitoring. In practice, usage drifts beyond the intended use, and the documentation is the baseline against which that deviation is detected.

Evaluation of the model on a representative dataset, with performance reported on subgroups. The subgroup reporting is non-negotiable. A model that performs well on average and poorly on a subpopulation is a model that creates harm. Regulators want to see the breakdown.

A monitoring plan for production performance, with thresholds that trigger review. The model in production is not the model that was evaluated. Performance drifts. The plan has to name what is being monitored, who is monitoring it, and what happens when the threshold is breached.

A change management process for model updates, including re-evaluation criteria and a path back to the prior version if the new one underperforms. Regulators are aware that models will be updated. They want the updates to be governed.
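Of the four asks, the monitoring plan is the one I most often see left as prose. One way to make it concrete is to write it down as data, so that the metric, the threshold, the owner, and the response are each named. The sketch below is illustrative only: the class, the field names, the thresholds, and the owner labels are all hypothetical, not anything a regulator prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringRule:
    metric: str      # what is being monitored
    minimum: float   # threshold that triggers review
    owner: str       # who is monitoring it
    on_breach: str   # what happens when the threshold is breached


def breached_rules(rules, observed):
    """Return the rules whose observed metric is below its minimum.

    A metric that is missing from `observed` counts as a breach, because
    a metric nobody measured should not pass silently.
    """
    breached = []
    for rule in rules:
        value = observed.get(rule.metric)
        if value is None or value < rule.minimum:
            breached.append(rule)
    return breached


# Hypothetical plan and hypothetical production numbers.
plan = [
    MonitoringRule("sensitivity", 0.85, "clinical-safety-lead",
                   "open a review within 5 working days"),
    MonitoringRule("ppv", 0.30, "ml-on-call",
                   "page the owner and freeze model updates"),
]
breaches = breached_rules(plan, {"sensitivity": 0.81, "ppv": 0.42})
for rule in breaches:
    print(f"{rule.metric} below {rule.minimum}: {rule.owner} -> {rule.on_breach}")
```

Treating a missing metric as a breach is a design choice, not a requirement. It reflects the point above that the plan has to name what is being monitored.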

What is not in the guidance

A few things are worth naming because they consume engineering discussion but are not what regulators are asking for.

Explainability of every prediction. Some regulators ask for this in narrow contexts. Most do not. What they ask for is that the clinical user understands what the model is and what it is not, that the model's role in the decision is documented, and that the user has the information needed to override.

A particular model architecture. Regulators do not have a view on whether you use a gradient-boosted ensemble or a transformer. They have a view on whether the model is evaluated, monitored, and governed.

Real-time auditability of every inference. Audit logs are expected, but the bar is forensic reconstruction over a reasonable retention period, not real-time replay. The operational burden is lower than teams assume.

The trap teams fall into

The most common trap is to defer regulatory engagement until late in the build, treat compliance as a paperwork exercise, and discover late that the architecture cannot produce the artefacts the guidance asks for.

The specific failure mode I have seen repeatedly goes like this. The team builds the model and the inference pipeline. Compliance asks for subgroup performance. The team realises the training data does not have the subgroup labels needed. Then it realises the inference pipeline does not log the inputs needed to recover the subgroup at inference time. The retrofit takes a quarter.

The fix is to design the artefacts the guidance asks for into the system from week one. The artefacts are not exotic. They are subgroup performance breakdowns, inference logs with the right metadata, intended-use documentation, and a monitoring dashboard. None of these is technically difficult, but all of them have to be in the system from the start.

The artefact list I use

This is the short list of artefacts I have my clients commit to building before they ship a clinical AI system, regardless of the specific regulator they will engage with.

A model card that documents the intended use, the training population, the evaluation methodology, and the known limitations.
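A model card does not need special tooling. As a rough sketch, it can live next to the model as a small typed record that refuses to pass review while a section is empty. The class name, the fields, and the example values below are hypothetical and only mirror the four items named above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelCard:
    model_version: str
    intended_use: str
    training_population: str
    evaluation_methodology: str
    known_limitations: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields that are still empty (empty string or empty list)."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical example values, for illustration only.
card = ModelCard(
    model_version="1.3.0",
    intended_use="Inform a clinician's decision to order a follow-up test "
                 "for adult inpatients",
    training_population="Adult inpatients at two hospitals",
    evaluation_methodology="Held-out retrospective cohort with subgroup breakdown",
    known_limitations=["Not evaluated on paediatric patients"],
)

# The card serialises to plain JSON, so it can be versioned with the model.
card_json = json.dumps(asdict(card), indent=2)
print(card.missing_fields())  # an empty list means every section is filled in
```

A check like `missing_fields()` could run in CI so that a release cannot be tagged with an incomplete card.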

A subgroup performance report that breaks down the primary performance metric by relevant demographic and clinical subgroups, including age band, sex, race and ethnicity where available, and key comorbidity status. The report is updated quarterly against new production data.
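The breakdown itself is a few lines of code once the subgroup labels exist. The sketch below assumes binary labels and uses sensitivity as the primary metric. The record layout, the `age_band` field, and the toy data are hypothetical.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records, group_key):
    """Return {subgroup: sensitivity}, plus an 'overall' entry.

    Sensitivity is true positives divided by all actual positives.
    Subgroups with no positive cases are omitted rather than reported as 0.
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        if record["label"] != 1:
            continue
        for group in (record[group_key], "overall"):
            positives[group] += 1
            if record["pred"] == 1:
                true_positives[group] += 1
    return {group: true_positives[group] / positives[group] for group in positives}


# Toy data: ten positive cases aged 18-64, two positive cases aged 65+.
records = (
    [{"age_band": "18-64", "label": 1, "pred": 1}] * 9
    + [{"age_band": "18-64", "label": 1, "pred": 0}] * 1
    + [{"age_band": "65+", "label": 1, "pred": 1}] * 1
    + [{"age_band": "65+", "label": 1, "pred": 0}] * 1
)
report = sensitivity_by_subgroup(records, "age_band")
for group, value in sorted(report.items()):
    print(f"{group}: {value:.2f}")
```

On this toy data the overall sensitivity is about 0.83, while the 65+ group sits at 0.50. The average hides exactly the gap the report exists to surface. A real report would also carry sample sizes, since a subgroup with two cases supports very little.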

An inference log that captures the inputs, the prediction, the prediction confidence, the model version, and the user who acted on it. The log is retained for the period the regulator specifies, and it is queryable for forensic reconstruction.
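As a minimal sketch of such a log, each inference can be written as one JSON line carrying the five fields above plus a timestamp. The function names, field names, and example values are hypothetical, and a production system would write to durable, access-controlled storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone


def build_log_entry(inputs, prediction, confidence, model_version, user_id,
                    timestamp=None):
    """Serialise one inference as a JSON line."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": timestamp.isoformat(),
            "model_version": model_version,
            "inputs": inputs,  # include whatever is needed to recover subgroup later
            "prediction": prediction,
            "confidence": confidence,
            "user_id": user_id,
        },
        sort_keys=True,
    )


def entries_for_version(lines, model_version):
    """Forensic query: every logged inference made by one model version."""
    return [e for e in map(json.loads, lines) if e["model_version"] == model_version]


# Hypothetical entries with a fixed timestamp so the example is reproducible.
fixed = datetime(2024, 1, 1, tzinfo=timezone.utc)
lines = [
    build_log_entry({"age_band": "65+", "lactate": 2.1}, 1, 0.87,
                    "1.3.0", "clinician-042", fixed),
    build_log_entry({"age_band": "18-64", "lactate": 1.0}, 0, 0.93,
                    "1.2.0", "clinician-007", fixed),
]
matches = entries_for_version(lines, "1.3.0")
print(len(matches), matches[0]["user_id"])
```

Logging the subgroup-relevant inputs at inference time is what makes the quarterly subgroup report possible without a retrofit.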

A monitoring dashboard that tracks performance metrics on a rolling window, with alerts when performance drifts beyond a threshold. The dashboard is reviewed at a stated cadence by a named owner.
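The rolling-window logic behind such a dashboard can be sketched in a few lines. This version assumes ground-truth labels eventually arrive for each prediction and uses plain accuracy as the tracked metric. The class name, the window size, and the threshold are hypothetical.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.threshold = threshold

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def value(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def alert(self):
        # Stay quiet until the window is full, to avoid alerts on tiny samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.value() < self.threshold


monitor = RollingAccuracy(window=4, threshold=0.75)
for prediction, label in [(1, 1), (1, 1), (0, 0), (1, 1)]:
    monitor.record(prediction, label)
healthy = monitor.alert()    # four correct out of four

monitor.record(1, 0)         # window is now three correct, one wrong
monitor.record(0, 1)         # window is now two correct, two wrong
drifted = monitor.alert()
print(healthy, monitor.value(), drifted)
```

A window of four is only for illustration. In practice the window has to be large enough that a breach reflects drift rather than noise, and the same tracker can be kept per subgroup.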

A change management runbook that documents how the model is re-evaluated when updated, what the rollback path is, and who is authorised to approve a rollout.
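Parts of the runbook can be encoded as a gate that a rollout must pass. The sketch below assumes the re-evaluation produces one metric per subgroup for both the current and the candidate version. The function name, the regression tolerance, the approver list, and the numbers are hypothetical.

```python
def rollout_decision(current, candidate, approver, authorised, max_regression=0.02):
    """Decide whether a candidate model version may replace the current one.

    `current` and `candidate` map subgroup name to the primary metric.
    The candidate is blocked if any subgroup regresses by more than
    `max_regression`, or if a subgroup was not evaluated at all.
    """
    if approver not in authorised:
        return "rejected: approver not authorised"
    regressions = sorted(
        group
        for group, baseline in current.items()
        if candidate.get(group, 0.0) < baseline - max_regression
    )
    if regressions:
        return "keep current version: regression in " + ", ".join(regressions)
    return "promote candidate"


# Hypothetical evaluation results.
current = {"overall": 0.90, "65+": 0.84}
better_on_average = {"overall": 0.92, "65+": 0.78}
authorised = {"clinical-safety-lead"}

decision = rollout_decision(current, better_on_average,
                            "clinical-safety-lead", authorised)
print(decision)
```

In this example the candidate improves the overall metric and is still blocked, because the 65+ subgroup drops by more than the tolerance. Keeping the current version deployable is the rollback path; the gate only decides whether it is needed.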

What this means for the schedule

A team that builds these artefacts as it builds the system ships compliant. A team that defers them ships a model and then takes a quarter to ship a compliant version of the model.

The first path costs a few extra weeks across the build. The second path costs a quarter at the end, plus the credibility cost of having a delayed launch. The math is not close.

Healthcare regulators are not the obstacle most teams treat them as. They are asking for things that any well-run production AI system should produce anyway. Reading the guidance and building to it is cheaper than the alternative.