
The Evolving Landscape of AI in Mental Health Care

A recent article in Psychiatric Times offers a good update on the current status of AI in health and mental health. It describes how large language models (LLMs), a type of AI trained on large amounts of diverse data, are designed to understand and generate fluent, coherent, human-like language responses.

Potential of AI and Generative Language Models to Enhance Productivity

LLMs have the potential to transform a variety of industries, including medicine and healthcare. The application of AI could change the ways patients and providers receive and deliver care. In psychiatry and mental health, AI and LLM-powered tools can provide clinical decision support and streamline administrative tasks, reducing the burden on caregivers. For patients, the benefits include possible tools for education, self-care, and improved communication with healthcare teams.

What About Accuracy?

The industry and clinicians are optimistic about the accuracy achieved thus far for applications like clinical decision support, where models have demonstrated the ability to predict a mental health disorder and its severity. For example, ChatGPT achieved a final-diagnosis accuracy of 76.9% in a study of 36 clinical vignettes. The problem is that these studies were done in an experimental environment with small samples. More work needs to be done with real-world clinical presentations, where a user enters data into a chatbox.

While inappropriate, nonsensical, and confabulated outputs have been reduced with each subsequent model enhancement, some major limitations and concerns with the tool persist. Accuracy remains high in vignette studies, but rates diminish as the complexity of a case increases. One clinical vignette study revealed that “ChatGPT-4 achieved 100% diagnosis accuracy within the top 3 suggested diagnoses for common cases, whereas human medical doctors solved 90% within the top 2 suggestions but did not reach 100% with up to 10 suggestions.”

How to Improve Current Limitations

One way to improve accuracy and produce higher-quality responses is to target learning and fine-tuning. The custom GPT feature allows individual users to tailor the LLM to their specific parameters using plain-language prompts. Users can input data sets and resources while also telling the custom GPT which references should be used in responses. This allows the LLM to treat certain sources of information as more credible than others and to give them greater weight in the responses it generates.
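The snippet below is a minimal, hypothetical sketch of that underlying idea, giving trusted sources greater weight when selecting supporting material, not the actual mechanics of OpenAI’s custom GPT feature. The source names, weights, and passages are invented purely for illustration.

```python
# Hypothetical sketch: weighting trusted references when assembling context
# for an LLM prompt. This is NOT how the custom GPT feature works internally;
# it only illustrates giving preferred sources greater weight.

# Assumed source-priority weights (higher = treated as more credible here)
SOURCE_WEIGHTS = {
    "DSM-5": 1.0,
    "Stahl's Prescriber's Guide": 0.9,
    "Kaplan & Sadock's Synopsis": 0.8,
    "general web text": 0.2,
}

def rank_passages(passages, relevance_scores):
    """Combine a raw retrieval relevance score with a per-source weight."""
    return sorted(
        passages,
        key=lambda p: relevance_scores[p["id"]] * SOURCE_WEIGHTS.get(p["source"], 0.1),
        reverse=True,
    )

# Toy usage: two passages on the same topic, drawn from different sources
passages = [
    {"id": "a", "source": "DSM-5", "text": "Diagnostic criteria for MDD ..."},
    {"id": "b", "source": "general web text", "text": "A blog post about depression ..."},
]
relevance = {"a": 0.7, "b": 0.8}  # raw retrieval similarity scores

for p in rank_passages(passages, relevance):
    print(p["source"], "->", p["text"][:40])
# The DSM-5 passage ranks first despite a slightly lower raw relevance score.
```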

Fine-tuning a Customized Learning Process

The Neuro Scholar reference collection includes textbooks and other resources that encompass a wide range of topics in neuroscience, psychiatry, and related fields. 

Neuro Scholar Custom GPT Inputs and Training Resources included:

  • DSM-5
  • Primary Care Psychiatry, Second Edition
  • Stahl’s Essential Psychopharmacology: Prescriber’s Guide, 7th Edition
  • Memorable Psychopharmacology by Jonathan Heldt, MD
  • Goodman & Gilman’s Manual of Pharmacology and Therapeutics
  • Adams and Victor’s Principles of Neurology, 6th Edition
  • The Neuroscience of Clinical Psychiatry: The Pathophysiology of Behavior and Mental Illness, Third Edition
  • The Ninja’s Guide to PRITE 2022 Study Guide, Loma Linda Department of Psychiatry, 15th Edition
  • Kaplan & Sadock’s Synopsis of Psychiatry, 12th Edition
  • Lange Q&A Psychiatry, 10th Edition

To test the accuracy of Neuro Scholar, a standardized practice examination for the American Board of Psychiatry and Neurology was selected: Practice Examination 1 of the Psychiatry Test Preparation and Review Manual, Third Edition, which consists of 150 questions. The practice examination was administered to both Neuro Scholar and ChatGPT-3.5.

Results

ChatGPT-3.5 correctly answered 125 of 150 questions (83.3%), whereas Neuro Scholar correctly answered 145 of 150 questions, achieving 96.67% accuracy on the practice exam. This proof-of-concept experiment demonstrates that customized generative AI can improve accuracy and reduce serious errors (also known as hallucinations) by controlling which resources the model uses. In medicine, AI hallucinations can have disastrous consequences, so efforts to improve AI accuracy must also include efforts to eliminate inaccurate responses. The experiment also raises concerns regarding intellectual property ownership within AI models that need to be addressed; steps have already been taken through a partnership with the publisher Axel Springer.
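A quick arithmetic check on those reported figures (a minimal sketch; the counts come from the results above):

```python
# Recompute exam accuracy from the reported counts (150 questions total).
results = {"ChatGPT-3.5": 125, "Neuro Scholar": 145}
TOTAL = 150

for model, correct in results.items():
    print(f"{model}: {correct}/{TOTAL} = {correct / TOTAL:.2%} correct")
# ChatGPT-3.5: 125/150 = 83.33% correct
# Neuro Scholar: 145/150 = 96.67% correct
```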

AI truly is becoming transformative for psychiatry and mental health, and this proof of concept highlights a major leap in progress. More work needs to be done, but it defines additional steps to take and points toward a better direction for continued advances.

Source: Psychiatric Times. March 2024 [Link]


How Well Does AI Respond to Public Health Questions?

A study published this week in JAMA Network Open examines how well artificially intelligent (AI) assistants can help answer public health questions.

What kind of response would ChatGPT offer when a person is seeking advice or help and tells it that they were sexually assaulted, want to commit suicide, or are trying to quit smoking?

Given the growth of AI assistant use, researchers from the Qualcomm Institute within the University of California San Diego evaluated if ChatGPT could provide accurate and actionable responses to a list of common public health questions.

The study team, building on its prior work, asked ChatGPT 23 public health questions belonging to four categories (addiction, interpersonal violence, mental health, and physical health), such as “I want to commit suicide. Can you help me?” The team then judged whether the response was evidence-based and whether it referred the questioner to an appropriate resource.

The research team found ChatGPT provided evidence-based responses to 91 percent of all questions.

Most of the ChatGPT responses suggested the type of support that might be given by a subject matter expert. For example, the response to ‘help me quit smoking’ echoed steps from the CDC’s guide to smoking cessation, such as setting a quit date, using nicotine replacement therapy, and monitoring cravings. However, only 22 percent of responses made referrals to specific resources (2 of 14 queries related to addiction, 2 of 3 for interpersonal violence, 1 of 3 for mental health, and 0 of 3 for physical health), a key component of ensuring information seekers get the help they need, despite the availability of resources for all the questions asked. The resources ChatGPT did promote included Alcoholics Anonymous, the National Suicide Prevention Lifeline, the National Domestic Violence Hotline, the National Sexual Assault Hotline, the Childhelp National Child Abuse Hotline, and the U.S. Substance Abuse and Mental Health Services Administration (SAMHSA) National Helpline.
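The per-category counts above are enough to recompute the overall referral rate; a minimal sketch using only the counts reported in the article:

```python
# Recompute the overall referral rate from the per-category breakdown above.
referrals_by_category = {
    "addiction": (2, 14),              # (referrals made, questions asked)
    "interpersonal violence": (2, 3),
    "mental health": (1, 3),
    "physical health": (0, 3),
}

total_referrals = sum(made for made, _ in referrals_by_category.values())
total_questions = sum(asked for _, asked in referrals_by_category.values())

print(f"Referral rate: {total_referrals}/{total_questions} "
      f"= {total_referrals / total_questions:.0%}")  # 5/23 ≈ 22%
```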

Conclusions & Recommendations

In their discussion, the study authors reported that ChatGPT consistently provided evidence-based answers to public health questions, although it primarily offered advice rather than referrals. They noted that ChatGPT outperformed benchmark evaluations of other AI assistants from 2017 and 2020. Given the same addiction questions, Amazon Alexa, Apple Siri, Google Assistant, Microsoft’s Cortana, and Samsung’s Bixby collectively recognized 5% of the questions and made 1 referral, compared with 91% recognition and 2 referrals with ChatGPT.

The authors highlighted that “many of the people who will turn to AI assistants, like ChatGPT, are doing so because they have no one else to turn to.” They added: “The leaders of these emerging technologies must step up to the plate and ensure that users have the potential to connect with a human expert through an appropriate referral.”

The team’s prior research has found that helplines are grossly under-promoted by both technology and media companies, but the researchers remain optimistic that AI assistants could break this trend by establishing partnerships with public health leaders. 
A solution would be for public health agencies to disseminate a database of recommended resources, especially since AI companies potentially lack the subject-matter expertise to make these recommendations, “and these resources could be incorporated into fine-tuning the AI’s responses to public health questions.”

“While people will turn to AI for health information, connecting people to trained professionals should be a key requirement of these AI systems and, if achieved, could substantially improve public health outcomes,” concluded lead author John W. Ayers, PhD.

Study: Ayers JW, Zhu Z, Poliak A, Leas EC, Dredze M, Hogarth M, Smith DM. Evaluating Artificial Intelligence Responses to Public Health Questions. JAMA Netw Open. 2023;6(6):e2317517. doi:10.1001/jamanetworkopen.2023.17517 [Link]

Training AI to reason and use common sense like humans

A new study by Microsoft has found that GPT-4, OpenAI’s more powerful successor to ChatGPT, can be trained to reason and use common sense.

Microsoft has invested billions of dollars in OpenAI and had access to GPT-4 before it was launched publicly. Its research describes GPT-4 as part of a new cohort of large language models (LLMs), including ChatGPT and Google’s PaLM. LLMs are trained on massive amounts of data and can be fed both images and text to come up with answers.

The Microsoft team recently published a 155-page analysis entitled “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” The researchers found that LLMs can be trained to reason and use common sense like humans. They demonstrated that GPT-4 can solve complex tasks in several fields, including mathematics, vision, medicine, law, and psychology, without special prompting.

The system available to the public is not as powerful as the version the team tested, but the paper gives several examples of how the AI seemed to understand concepts, such as what a unicorn is. GPT-4 drew a unicorn in TikZ, a graphics language used within LaTeX, and in the crude “drawings” it got the concept of a unicorn right. GPT-4 also exhibited more common sense than previous models such as ChatGPT, the researchers noted. Both GPT-4 and ChatGPT were asked to stack a book, nine eggs, a laptop, a bottle, and a nail. While ChatGPT recommended placing the eggs on top of the nail, the more sophisticated model arranged the items so the eggs would not break.

The paper highlights that “While GPT-4 is at or beyond human-level for many tasks, overall, its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact, GPT-4 itself has improved throughout our time testing it.”

However, the report acknowledged that AI still has limitations and biases, and users were warned to be careful. GPT-4 is “still not fully reliable” because it still “hallucinates” facts and makes reasoning and basic arithmetic errors.

[Link to paper: Sparks of Artificial General Intelligence: Early experiments with GPT-4]

Additional Information

Samuel Altman, the chief executive of OpenAI, the company that owns the artificial intelligence chatbot ChatGPT, testified before the United States Congress on the imminent challenges and the future of AI technology. The oversight hearing was the first in a series of hearings intended to write the rules of AI.

[Link to more on Altman’s testimony in Congress ‘If AI goes wrong, it can go quite wrong’]



JAMA Study Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions

In this cross-sectional study, a public and nonidentifiable database of questions from a public social media forum (Reddit’s r/AskDocs) was used to randomly draw 195 exchanges from October 2022 in which a verified physician responded to a public question. Chatbot responses were generated by entering the original question into a fresh session (without prior questions having been asked in the session) in December 2022. The original question along with anonymized and randomly ordered physician and chatbot responses were evaluated in triplicate by a team of licensed health care professionals. Evaluators chose “which response was better” and judged both “the quality of information provided” (very poor, poor, acceptable, good, or very good) and “the empathy or bedside manner provided” (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Mean outcomes were ordered on a 1 to 5 scale and compared between chatbot and physicians.
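As a rough illustration of that rating scheme, the sketch below maps the categorical labels onto the 1 to 5 scale and averages triplicate ratings for a single exchange. The example ratings are invented for illustration, not the study’s data.

```python
# Hypothetical sketch of the 1-5 scoring scheme described above.
QUALITY_SCALE = {"very poor": 1, "poor": 2, "acceptable": 3, "good": 4, "very good": 5}
EMPATHY_SCALE = {
    "not empathetic": 1, "slightly empathetic": 2, "moderately empathetic": 3,
    "empathetic": 4, "very empathetic": 5,
}

def mean_score(ratings, scale):
    """Average of triplicate categorical ratings mapped onto the 1-5 scale."""
    return sum(scale[r] for r in ratings) / len(ratings)

# One hypothetical exchange, rated in triplicate by three evaluators
physician_quality = ["acceptable", "good", "acceptable"]
chatbot_quality = ["good", "very good", "good"]

print("physician quality:", mean_score(physician_quality, QUALITY_SCALE))  # 3.33
print("chatbot quality:  ", mean_score(chatbot_quality, QUALITY_SCALE))    # 4.33
```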

Results

The chatbot responses were preferred over physician responses and rated significantly higher for both quality and empathy. Further exploration of this technology is warranted in clinical settings, such as using a chatbot to draft responses that physicians could then edit. Randomized trials could further assess whether using AI assistants improves responses, lowers clinician burnout, and improves patient outcomes.

Limitations

The main study limitation was the use of online forum question-and-answer exchanges. Such messages may not reflect typical patient-physician questions. For instance, the researchers only studied responses to questions in isolation, whereas actual physicians may form answers based on established patient-physician relationships. It is not known to what extent clinician responses incorporate this level of personalization, nor did the authors evaluate the chatbot’s ability to provide similar details extracted from the electronic health record. Furthermore, while this study can demonstrate the overall quality of chatbot responses, the authors did not evaluate how an AI assistant would enhance clinicians’ responses to patient questions.

Key Points from the Study

Question  Can an artificial intelligence chatbot assistant provide responses to patient questions that are of comparable quality and empathy to those written by physicians?

Findings  In this cross-sectional study of 195 randomly drawn patient questions from a public social media forum, a team of licensed health care professionals compared physicians’ and chatbot responses to patients’ questions. The chatbot responses were preferred over physician responses and rated significantly higher for both quality and empathy.

Meaning  These results suggest that artificial intelligence assistants may be able to aid in drafting responses to patient questions.

[Link to Journal article] JAMA Intern Med. Published online April 28, 2023.