AI is everywhere and has real advantages... BUT... is it real?

Hallucinations – it’s the technical term in the Artificial Intelligence (AI) community for “making s*** up”.  It’s a big problem in the healthcare space when you’re trying to get AI to legitimately move a process forward.  There have now been studies showing that AI can be more empathic and more helpful than a doctor.  That’s great news, unless, of course, you’re the doctor.  What these studies don’t share is the downside risk – what’s being made up?  It turns out that one study put the MSUQ (my term – Making S*** Up Quotient, or the percentage of material presented that’s made up) for ChatGPT at around 20%.  The more specific the “question”, the higher the MSU percentage.  Yikes!

AI is the future – unfortunately, it’s still in the future.  With careful consideration, AI can help now, but CAREFUL CONSIDERATION had better be the guiding principle.  What does that mean?  Dr. Google still has nothing on me!

FROM JAMA NETWORK / BY ANJUN CHEN AND DRAKE O. CHEN

Accuracy of Chatbots in Citing Journal Articles

Introduction

The recently released generative pretrained transformer chatbot ChatGPT from OpenAI has shown unprecedented capabilities, ranging from answering questions to composing new content. Its potential applications in health care and education are being explored and debated. Researchers and students may use it as a copilot in research. It excels at creating new content but falls short in providing scientific references. Journals such as Science have banned chatbot-generated text in their published reports. However, the accuracy of reference citing by ChatGPT is unclear; therefore, this investigation aimed to quantify ChatGPT’s citation error rate.

Methods

This study tested the value of the ChatGPT copilot in creating content for training of learning health systems (LHS).5 A large range of LHS topics were discussed with the latest GPT-4 model from OpenAI from April 20 to May 6, 2023. We used prompts for broad topics, such as LHS and data, as well as specific topics, such as building a stroke risk prediction model using the XGBoost library. Since chatbot responses depended on the prompts, we first asked questions about specific LHS topics, then requested journal articles as references. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology reporting guideline.
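
As background for the example prompt above, the following is a minimal, hypothetical sketch of the kind of task it names: a binary stroke risk classifier built with the XGBoost library. The features, data, and parameters are synthetic stand-ins chosen for illustration; none of this comes from the study or from any chatbot output.

```python
# Hypothetical sketch of the prompt's example task (stroke risk prediction
# with XGBoost). All features and data are synthetic; nothing here is from the study.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 2000
# Assumed illustrative risk factors: age, systolic BP, atrial fibrillation, diabetes.
X = np.column_stack([
    rng.normal(65, 12, n),
    rng.normal(135, 18, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
])
# Synthetic outcome whose risk rises with age, blood pressure, and comorbidities.
logit = 0.04 * (X[:, 0] - 65) + 0.03 * (X[:, 1] - 135) + 1.0 * X[:, 2] + 0.7 * X[:, 3] - 2.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```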

We verified each cited journal article by checking its existence in the cited journal and by searching its title using Google Scholar. The article’s title, authors, publication year, volume, issue, and pages were compared. Any article that failed this verification was considered fake. To determine a reliable error rate, over 300 article references were produced on the LHS topics. For comparison, we chatted with OpenAI’s default GPT-3.5 model for the same LHS topics. Exact 95% CIs for error rate were constructed. The error rate between the GPT-4 and GPT-3.5 models was compared using the Fisher exact test, with 2-sided P < .05 indicating statistical significance.
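
The verification described above was done manually. Purely as an illustration of what an automated first pass might look like, the sketch below queries the public Crossref API for each cited title and compares it with the returned record; this workflow is an assumption on our part and was not used in the study, and author, year, volume, and page checks would still be done by hand.

```python
# Illustrative only: the study verified references by hand. This sketch shows
# one way a cited title could be screened against the public Crossref API.
import requests

def crossref_best_match(title):
    """Return Crossref's top match for a cited article title, or None."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0] if items else None

def reference_exists(cited_title):
    """Crude existence check: does Crossref return an essentially identical title?"""
    match = crossref_best_match(cited_title)
    if match is None:
        return False
    found_title = (match.get("title") or [""])[0]
    return found_title.strip().lower() == cited_title.strip().lower()

# Example call with a placeholder title; a chatbot-cited reference would be screened the same way.
print(reference_exists("Attention is all you need"))
```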

Results

From the default GPT-3.5 model, 162 reference journal articles were fact-checked, 159 (98.1% [95% CI, 94.7%-99.6%]) of which were verified as fake articles. From the GPT-4 model, 257 articles were fact-checked, 53 (20.6% [95% CI, 15.8%-26.1%]) of which were verified as fake articles. The error rate of reference citing for GPT-4 was significantly lower than that for GPT-3.5 (P < .001) but remains non-negligible. Narrower topics tended to have more fake articles than broader topics.
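
The reported rates, exact confidence intervals, and the Fisher exact test comparison can be reproduced from the counts above. The following is a minimal sketch assuming SciPy is available; it is not part of the original analysis.

```python
# Sanity check of the reported figures, using the counts from the Results above.
from scipy.stats import binomtest, fisher_exact

gpt35_fake, gpt35_total = 159, 162   # GPT-3.5: fake references / references checked
gpt4_fake, gpt4_total = 53, 257      # GPT-4: fake references / references checked

# Exact (Clopper-Pearson) 95% confidence intervals for each error rate
ci35 = binomtest(gpt35_fake, gpt35_total).proportion_ci(confidence_level=0.95, method="exact")
ci4 = binomtest(gpt4_fake, gpt4_total).proportion_ci(confidence_level=0.95, method="exact")
print(f"GPT-3.5: {gpt35_fake / gpt35_total:.1%} (95% CI {ci35.low:.1%}-{ci35.high:.1%})")
print(f"GPT-4:   {gpt4_fake / gpt4_total:.1%} (95% CI {ci4.low:.1%}-{ci4.high:.1%})")

# Two-sided Fisher exact test comparing the two error rates
table = [[gpt35_fake, gpt35_total - gpt35_fake],
         [gpt4_fake, gpt4_total - gpt4_fake]]
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact test: P = {p_value:.1e}")
```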

GPT-4 provided answers that could be used as supplementary materials for LHS training after fact-checking and editing. However, it failed to provide information about the latest LHS developments.

Discussion

Our findings suggest that GPT-4 can be a helpful copilot in preparing new LHS education and training materials, although it may lack the latest information. Because GPT-4 still cites some fake journal articles, its cited references must be verified manually by humans; GPT-3.5–cited references should not be used.

When asked why it returned fake references, ChatGPT explained that the training data may be unreliable, or the model may not be able to distinguish between reliable and unreliable sources. As generative chatbots are deployed as copilots in health care education and training, understanding their unique abilities (eg, the ability to answer any questions) and inherent defects (eg, the inability to fact-check responses) will help make more effective use of the new GPT technology for improving health care education and training. Additionally, potential ethical issues such as misinformation and data bias should be considered for GPT applications.

This study has some limitations, such as the chat topics not representing all subject areas. However, since the LHS topics covered many subject areas of health care, the findings should be applicable in the health care domain. Furthermore, the findings should be more applicable to deeper discussions with ChatGPT as opposed to superficial discussions.

Source: https://jamanetwork.com/journals/jamanetwo...