Questions spanned urologic conditions such as benign prostatic hyperplasia, overactive bladder, erectile dysfunction, kidney stones, Peyronie disease, and recurrent urinary tract infections.
In response to general guideline-based questions regarding urologic conditions, ChatGPT was found to provide responses that misinterpreted the guidelines and failed to include contextual information or appropriate references, according to findings from a recent study conducted by investigators at the University of Florida College of Medicine.1,2
The data were published in Urology.
For the study, the investigators posed 13 urologic guideline-based questions to the chatbot 3 times each, because the chatbot can generate a new response each time the same question is asked. The questions spanned topics such as benign prostatic hyperplasia, overactive bladder, erectile dysfunction, kidney stones, Peyronie disease, and recurrent urinary tract infections (UTIs).
Each response was assessed for appropriateness and given a score based on the Brief DISCERN (BD) questionnaire, where a BD score of at least 16 indicated good-quality content. The BD score is measured across 6 domains: the content’s aims, whether the aims were achieved, relevance, the sources of the information, the date of sources, and bias. The appropriateness of each response was designated based on accordance with guidelines by the American Urological Association, the Canadian Urological Association, and/or the European Association of Urology.
In total, 59% of the responses provided by the chatbot were deemed appropriate. However, responses to the same questions varied in appropriateness. Overall, 25% of the 13 question sets had discordant appropriateness scores among the 3 responses. Responses that were determined to be appropriate tended to have higher BD scores overall and in the relevance domain (both P < .01).
The average BD score among all responses was 16.8, though only 53.8% (7 of 13) of topics and 53.8% (21 of 39) of responses met the 16-or-greater threshold for a good-quality response. Scores were highest for the questions regarding hypogonadism (average = 19.5) and erectile dysfunction (19.3), and lowest for the questions regarding Peyronie disease (15.1) and recurrent UTIs in women (14.0).
Among all 6 domains measured with the BD tool, the chatbot scored lowest on sources because it did not provide citations by default. When prompted to provide sources, 92.3% of responses from ChatGPT contained at least 1 citation that was determined to be incorrect, misinterpreted, or nonfunctional.
“It provided sources that were either completely made up or completely irrelevant. Transparency is important so patients can assess what they’re being told,” said senior author Russel S. Terry, MD, in a news release on the findings.2 Terry is an assistant professor of urology at the University of Florida College of Medicine in Gainesville, Florida.
Further, only 1 response provided by ChatGPT indicated that it “cannot give medical advice.” However, the chatbot suggested discussing or consulting with a doctor or medical provider in 24 of the responses.
The authors concluded, “Additional training and modifications are needed before these AI models will be ready for reliable use by patients and providers.”
1. Whiles BB, Bird VG, Canales BK, DiBianco JM, Terry RS. Caution! AI bot has entered the patient chat: ChatGPT has limitations in providing accurate urologic healthcare advice. Urology. 2023;S0090-4295(23)00597-6. doi:10.1016/j.urology.2023.07.010
2. UF College of Medicine research shows AI chatbot flawed when giving urology advice. News release. University of Florida College of Medicine. August 25, 2023. Accessed September 8, 2023. https://ufhealth.org/news/2023/uf-college-of-medicine-research-shows-ai-chatbot-flawed-when-giving-urology-advice