Are generalized AI models the best thing for healthcare?
Forcing machine learning models to work everywhere may just be guiding them to mediocrity
Say you are a physician planning a surgical procedure and you want to be prepared for any potential post-operative issues the patient may have. The hospital has access to a new machine learning tool that predicts what downstream complications may arise based on the patient's health record, current vitals, the type of procedure they will be undergoing and related information. You plug in this information and it gives you an answer.
How much confidence do you have in that prediction?
It seems like an “it depends” sort of answer. For instance, does your confidence change if you are working at a smaller rural hospital? Will the model work well on the specific patient populations you commonly see?
Trust and confidence in model predictions are essential, so these are important questions that will become more common as predictive models are incorporated into healthcare.
At the core of all of these questions is the “generalizability” of the model.
Generalizable models
One of the assumptions going into the creation of an AI model is that, when finished and ready to be used by the outside world, it should be generalizable. Generalization can be defined in different ways, but perhaps the easiest way to imagine it here is the ability of a model to work across different hospitals (geographic regions). A model that generalizes well will make just as good a prediction at a hospital in New York as it will at one in San Francisco or in rural Pennsylvania. This extends to patient populations. We assume that a good model will make good predictions for almost any patient, regardless of where it is being used.
This desire to show that a model is generalizable inherently comes from the way these models are created and validated. You have data that is initially used to train the model. Then, to show others that it is really working the way you say it does, you show the accuracy of its predictions on a new set of data (the test data set). This test data was never seen by the model during training. In medical AI applications, this desire to show generalizability usually leads to the requirement (in a publication or clinical trial) of showing that the model works the same across institutions with external patient cohorts.
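To make that train/test/external-validation workflow concrete, here is a minimal sketch using scikit-learn on synthetic data. The cohort sizes, features and the distribution shift between sites are illustrative assumptions, not values from any real study.

```python
# Minimal sketch of internal vs. external validation on synthetic data.
# Feature meanings, cohort sizes and the site "shift" are all illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_cohort(n, shift=0.0):
    """Simulate a patient cohort; `shift` mimics site-to-site differences."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))      # e.g., vitals, labs
    logits = X @ np.array([1.0, -0.5, 0.8, 0.0, 0.3])
    y = (logits + rng.normal(size=n) > 0).astype(int)      # complication yes/no
    return X, y

# "Internal" data from the developing institution: train and held-out test split
X_int, y_int = make_cohort(2000)
X_train, X_test, y_train, y_test = train_test_split(
    X_int, y_int, test_size=0.3, random_state=0)

# "External" cohort from another institution, with a different patient mix
X_ext, y_ext = make_cohort(1000, shift=0.7)

model = LogisticRegression().fit(X_train, y_train)
print("internal test AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("external AUROC:     ", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```

The second number is the external validation result that journals and reviewers typically want to see reported.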
This goal generally makes a lot of sense, as you are normally trying to show that the model you built works not just for your data, but for any data set you throw at it.
So why can generalizing be a problem?
It essentially boils down to a problem of averaging. By requiring broad generalizability, we now need to find predictive features in the data that give information in all situations and contexts. Features that are highly predictive, but only in certain settings, will get “washed out” as more and more data is used in training to make the model work well in a variety of environments. This has been shown by a number of groups (e.g., [1]), and my own research group has seen a version of it: we had a model that worked well nationally but made poor predictions for our own institution.
In the end, you turn a model that works great in one setting into a model that works “just ok” across many.
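To see this averaging effect in a toy setting, here is a sketch (synthetic data, illustrative weights and cohort sizes) in which one feature carries real signal at only a single site. A model trained on pooled multi-site data dilutes that feature’s weight, while a model trained only at that site keeps it.

```python
# Toy illustration of site-specific signal getting "washed out" by pooling.
# Synthetic data; all weights and cohort sizes are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_site(n, local_weight):
    """Feature 0 is predictive everywhere; feature 1 only where local_weight != 0."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + local_weight * X[:, 1] + rng.normal(size=n) > 0).astype(int)
    return X, y

# Site A: feature 1 carries real signal. Sites B and C: it does not.
X_a, y_a = make_site(1000, local_weight=2.0)
X_b, y_b = make_site(1000, local_weight=0.0)
X_c, y_c = make_site(1000, local_weight=0.0)

pooled = LogisticRegression().fit(np.vstack([X_a, X_b, X_c]),
                                  np.concatenate([y_a, y_b, y_c]))
local = LogisticRegression().fit(X_a, y_a)

# Evaluate both models on fresh patients from site A
X_new, y_new = make_site(1000, local_weight=2.0)
print("pooled model AUROC at site A:", roc_auc_score(y_new, pooled.predict_proba(X_new)[:, 1]))
print("local model AUROC at site A: ", roc_auc_score(y_new, local.predict_proba(X_new)[:, 1]))
print("weight on the site-specific feature (pooled vs. local):",
      pooled.coef_[0, 1], "vs.", local.coef_[0, 1])
```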
What adds to this problem is that publishing these models in most medical or ML journals requires demonstrating that your model generalizes. The same goes for clinical trials of such systems. So the review system meant to boost our confidence in a model is itself helping to maintain a process that inhibits higher performance at any specific site.
This problem has been slow to gain attention, but that is changing, in part due to general dissatisfaction with the real-world performance of some of the higher-profile models in the clinic. A really nice 2020 article by Futoma and colleagues describes some of the issues and is worth a look.[2] Generalization is also very much a part of the issues behind bias in AI models, so these questions aren’t going away any time soon.
What can be done?
Ideally, it would be great if we could make models that generalize really well, making highly accurate predictions no matter where they are used. This may be achievable in cases where the input data itself is limited and highly controlled. Diagnosis from mammography images comes to mind as a possibility. But for the broader range of instances where predictive models could aid clinical decision making, this may not be readily achievable.
Until ways to address this are figured out, designing models from their inception to be tailored to the settings where they will be used would seem to be a good path forward. Using transfer learning to adapt or update models to specific clinical settings is one approach.[1]
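As a rough illustration (not a description of the method in [1]), here is a sketch of one simple form of the idea: pretrain a model on pooled multi-site data, then continue training it on a modest amount of local data. The synthetic data, the SGDClassifier stand-in and all parameter choices are assumptions made for the example.

```python
# Sketch of adapting a pooled ("global") model to a local site by continued
# training on local data. Synthetic data; all parameters are illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def make_cohort(n, coefs):
    X = rng.normal(size=(n, 3))
    y = (X @ np.asarray(coefs) + rng.normal(size=n) > 0).astype(int)
    return X, y

# Pretrain on pooled multi-site data, where the third feature matters little
X_pool, y_pool = make_cohort(5000, [1.0, 0.5, 0.2])
model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X_pool, y_pool, classes=[0, 1])

# At the local site the third feature matters a lot; evaluate before adapting
X_local, y_local = make_cohort(500, [1.0, 0.5, 1.5])
X_eval, y_eval = make_cohort(1000, [1.0, 0.5, 1.5])
print("local AUROC before adaptation:", roc_auc_score(y_eval, model.decision_function(X_eval)))

# Adapt with a few passes over the small local cohort instead of retraining from scratch
for _ in range(20):
    model.partial_fit(X_local, y_local)
print("local AUROC after adaptation: ", roc_auc_score(y_eval, model.decision_function(X_eval)))
```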
Processes at the healthcare system level will also have to be put in place for maintenance: patient populations, hospital processes and even clinical staff change over time, and those changes will cause the accuracy of these models to drift downward. A simple sketch of what ongoing monitoring might look like is below.
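As a sketch only (assuming outcome labels eventually arrive so performance can be measured, and with an arbitrary alert threshold), monitoring could be as simple as recomputing AUROC over successive windows of cases and flagging when it falls below an agreed floor.

```python
# Sketch of periodic performance monitoring; the windows, floor and fake
# data below are illustrative assumptions, not a recommendation.
import numpy as np
from sklearn.metrics import roc_auc_score

def check_windows(windows, auroc_floor=0.75):
    """Each window is (true_outcomes, model_scores) for, e.g., one month of cases."""
    for month, (y_true, y_score) in enumerate(windows, start=1):
        auroc = roc_auc_score(y_true, y_score)
        status = "ok" if auroc >= auroc_floor else "flag for review / retraining"
        print(f"month {month}: AUROC = {auroc:.2f} -> {status}")

# Fake monitoring data: discrimination degrades as outcomes drift from the scores
rng = np.random.default_rng(3)
windows = []
for noise in (0.5, 1.0, 2.0):
    risk = rng.normal(size=500)                              # model's risk scores
    y = (risk + noise * rng.normal(size=500) > 0).astype(int)  # observed outcomes
    windows.append((y, risk))

check_windows(windows)
```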
Most immediately, allowing more flexibility in how these models are reviewed and validated, with a focus on greater transparency about the conditions under which a model is expected to perform well (or poorly), would have huge benefits.
As a self-interested patient, I want the best model. I don’t want it to make “just ok” predictions or diagnoses for me and a patient in a hospital across the country. I want the model that will help the clinician who is taking care of me right now make the best decision possible. Backing off the presumptive need for generalization will help AI transition into a true tool for precision medicine and deliver on the promise of improved patient outcomes.
Notes & references
[1] Yang J, Soltan AAS, Clifton DA. 2022. Machine learning generalizability across healthcare settings: insights from multi-site COVID-19 screening. npj Digital Medicine 5:69. (Open Access - free to the public)
[2] Futoma J, Simons M, Panch T, Doshi-Velez F, Celi LA. 2020. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health 2:e489–e492. (Open Access - free to the public)