In a recent review article published in The New England Journal of Medicine, researchers examine how human values can be incorporated into emerging artificial intelligence (AI)-based large language models (LLMs) and how these values can affect clinical decision-making, much as they have shaped traditional clinical equations.
Study: Medical Artificial Intelligence and Human Values.
The ethics of AI in medicine
LLMs are sophisticated AI tools that perform a wide range of tasks, from writing compelling essays to passing professional examinations. Despite the growing utilization of LLMs, many healthcare professionals continue to express concerns about their application within the medical field due to confabulation, factual inaccuracy, and fragility.
It remains unclear how "human values," which reflect human goals and behaviors, are incorporated into the creation and use of LLMs. How human values differ from and resemble the values embedded in LLMs must also be elucidated.
To this end, the authors investigated the influence of human values on the creation of large language models and other AI models in the healthcare sector.
How do human values affect AI performance in medicine?
Human and societal values have inevitably affected the data used to train AI models. Some recent examples of AI models used in medicine include the automated interpretation of chest radiographs, the diagnosis of skin diseases, and the development of algorithms for optimizing the allocation of healthcare resources.
Generative Pretrained Transformer 4 (GPT-4) is an LLM that has been developed to consider the values of the various parties engaged in a clinical scenario, such as the clinician, the patient themselves or their parents/guardians, as well as health insurance companies. This “tunability” raises concerns about the values that a particular AI model should embody, whether it aids rational decision-making, and how financial forces influence its development and application in medicine.
Although the exact training details of GPT-4 have not been made public, details for predecessor models like GPT-3 have been published. GPT-3 comprises 175 billion parameters, vastly more than the handful of predictor variables used in historical clinical equations such as the estimated glomerular filtration rate (eGFR), which, like LLMs, have been used to predict patient risks and guide treatment strategies.
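The scale gap is easiest to see side by side. As an illustration, the 2021 race-free CKD-EPI creatinine equation behind modern eGFR reporting uses only three predictors: serum creatinine, age, and sex. A minimal sketch (coefficients as published in the 2021 CKD-EPI refit; verify against the original publication before any clinical use):

```python
import math

def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from the 2021 CKD-EPI creatinine
    equation. Three predictors, versus the 175 billion parameters of GPT-3."""
    kappa = 0.7 if female else 0.9     # creatinine normalization constant
    alpha = -0.241 if female else -0.302
    egfr = (142.0
            * min(scr_mg_dl / kappa, 1.0) ** alpha
            * max(scr_mg_dl / kappa, 1.0) ** -1.200
            * 0.9938 ** age)
    if female:
        egfr *= 1.012
    return egfr

# A 60-year-old man with serum creatinine of 1.0 mg/dL:
print(round(egfr_ckd_epi_2021(1.0, 60, female=False), 1))
```

Every term in such an equation is inspectable, which is precisely what makes the value judgments baked into it (for example, the now-removed race coefficient of earlier eGFR formulas) auditable in a way that billions of LLM parameters are not.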
The influence of human values during LLM training
One of the first stages of developing an LLM is a 'pre-training phase,' during which the model's parameters are learned from vast volumes of text. Thereafter, a 'fine-tuning phase' uses human feedback, in which people rank candidate model outputs, to improve the model through reinforcement learning.
For example, during the development of the InstructGPT model, 40 human contractors representing different demographic groups were recruited to fine-tune this LLM. Since these contractors were hired and instructed by the model developers, it is unclear whether the values of the contractors or of the developers directing them are ultimately incorporated into the final version of the LLM.
Taken together, these examples demonstrate that human values are deeply integrated into every stage of the LLM development process, from selecting what data is used to initially train the model to fine-tuning these models before they are made available to the public.
Changes in the statistical properties of data over time, known as dataset shift, can jeopardize the accuracy and reliability of AI models and complicate the incorporation of human values. Potential consequences include inappropriate treatment recommendations, poor alignment with societal expectations, and an eventual loss of confidence in AI tools among patients and clinicians.
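Dataset shift can be monitored quantitatively. One common heuristic, sketched below with hypothetical chest-radiograph label mixes (the numbers are illustrative, not from the review), is the population stability index, which compares the distribution a model was trained on with the distribution it sees after deployment:

```python
import math
from collections import Counter

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between a training-time ('expected') and a
    deployment-time ('actual') categorical distribution. Larger values mean
    greater dataset shift; a common rule of thumb flags PSI > 0.2 for review."""
    bins = sorted(set(expected) | set(actual))
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for b in bins:
        p_e = max(e_counts[b] / len(expected), 1e-6)  # floor to avoid log(0)
        p_a = max(a_counts[b] / len(actual), 1e-6)
        score += (p_a - p_e) * math.log(p_a / p_e)
    return score

train = ["normal"] * 80 + ["abnormal"] * 20  # case mix at training time
live = ["normal"] * 50 + ["abnormal"] * 50   # shifted case mix after deployment
print(round(psi(train, live), 3))
```

A statistic like this only detects that the data have drifted; deciding how much drift is acceptable, and what to do about it, remains a value judgment.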
Future directions
To overcome these challenges, future studies that evaluate how AI affects human decision-making and the development of specific skills are warranted. Investigating the ‘psychology of LLMs’ may also lead to important discoveries on human cognitive biases and how they impact decision-making processes.
Regular retraining and monitoring of LLMs are also essential to ensure the safe and successful use of AI in medicine. AI governance bodies can support these efforts by overseeing these processes; however, establishing uniform rules is complicated by the diversity of foundation models and data types.
Utility elicitation approaches are valuable for determining human values; however, they may overlook real-world factors that affect healthcare decision-making. Thus, decision-curve analysis, which evaluates diagnostic tests and prediction models without requiring explicit utility inputs, along with other data-driven methods, can support the continual learning of LLMs as data and values change over time.
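Decision-curve analysis rests on a single quantity, net benefit: true positives per patient, minus false positives weighted by the odds of the chosen risk threshold. The threshold itself stands in for the utility trade-off, so no explicit utilities need to be elicited. A minimal sketch with hypothetical counts (the figures below are invented for illustration):

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit of a prediction strategy at a given risk threshold,
    the core quantity of decision-curve analysis. The threshold encodes
    how many false positives a clinician would accept per true positive."""
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))

# Hypothetical cohort of 1000 patients, treatment threshold of 10% risk:
model = net_benefit(tp=80, fp=150, n=1000, threshold=0.10)
treat_all = net_benefit(tp=100, fp=900, n=1000, threshold=0.10)  # 10% prevalence
print(round(model, 3), round(treat_all, 3))
```

Comparing a model's net benefit against the "treat all" and "treat none" strategies across a range of thresholds shows whether using the model improves decisions for clinicians with different risk tolerances.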