"ChatGPT not only has a low rate of correct answers regarding clinical questions in urologic practice, but also makes certain types of errors that pose a risk of spreading medical misinformation," says Christopher M. Deibert, MD, MPH.
A recent study published in Urology Practice found that ChatGPT answered fewer than 30% of questions correctly on the American Urological Association’s 2022 Self-assessment Study Program (SASP) examination,1 which is a “valuable test of clinical knowledge for urologists in training and practicing specialists preparing for Board certification,” according to a news release on the findings.2
"ChatGPT not only has a low rate of correct answers regarding clinical questions in urologic practice, but also makes certain types of errors that pose a risk of spreading medical misinformation," said Christopher M. Deibert, MD, MPH, in the news release.2 Deibert is a urologist at the University of Nebraska Medical Center in Omaha.
Questions in the SASP assessment were coded as open-ended or multiple-choice. In total, 15 questions that included visual components were excluded from the exam. The responses were graded by 3 independent researchers and reviewed by 2 physician adjudicators.
Overall, the large language model (LLM) provided correct responses on 36 of 135 (26.7%) open-ended questions on the exam. Indeterminate responses were given on 40 (29.6%) questions in this section.
The authors indicate that the responses given on this portion of the test were long, and the chatbot tended to repeat itself, even when given feedback.
“Overall, ChatGPT often gave vague justifications with broad statements and rarely commented on specifics,” they write. “ChatGPT continuously reiterated the original explanation despite it being inaccurate.”2
On the multiple-choice section, the chatbot scored slightly better, giving correct responses on 38 of 135 (28.2%) questions. Indeterminate responses were given on 4 (3.0%) questions in this section.
ChatGPT was given the opportunity to regenerate answers that were coded as indeterminate, although this did not increase the proportion of correct responses. The investigators found that for both portions of the exam, the chatbot “provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.”1
Overall, 66.7% of the correct open-ended responses were given on the first output, along with 94.7% of correct responses in the multiple-choice section. The second output generated 22.2% of correct responses to open-ended questions and 2.6% of correct responses to multiple-choice questions. The final output generated 11.1% of correct responses to open-ended questions and 2.6% of correct responses to multiple-choice questions.
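For readers who want to see what these shares mean in absolute terms, the short Python sketch below converts the reported percentages back into approximate question counts, using the totals of 36 correct open-ended and 38 correct multiple-choice responses given above. The per-output counts it prints are inferred by rounding and are not figures reported in the study or the news release; the function name `implied_counts` is purely illustrative.

```python
def implied_counts(total_correct, shares_pct):
    """Convert percentage shares of a total into the nearest whole-number counts.

    Assumption: each reported percentage is a share of the total number of
    correct responses in that section, rounded to one decimal place.
    """
    return [round(total_correct * pct / 100) for pct in shares_pct]


# Shares of correct responses by output (first, second, final), as reported.
open_ended = implied_counts(36, [66.7, 22.2, 11.1])
multiple_choice = implied_counts(38, [94.7, 2.6, 2.6])

for label, counts in [("open-ended", open_ended), ("multiple-choice", multiple_choice)]:
    # The counts should add back up to the section's total correct responses.
    print(label, counts, "total:", sum(counts))
```

Under these assumptions, the inferred counts in each section add back up to the reported totals of 36 and 38, which suggests the percentages are internally consistent.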
The authors concluded, “Given that LLMs are limited by their human training, further research is needed to understand their limitations and capabilities across multiple disciplines before it is made available for general use. As is, utilization of ChatGPT in urology has a high likelihood of facilitating medical misinformation for the untrained user.”
1. Huynh LM, Bonebrake BT, Schultis K, Quach A, Deibert CM. New artificial intelligence ChatGPT performs poorly on the 2022 Self-assessment Study Program for urologists. Urol Pract. Published online June 5, 2023. Accessed June 7, 2023. doi: 10.1097/UPJ.0000000000000406
2. ChatGPT flunks self-assessment test for urologists. News release. Wolters Kluwer Health: Lippincott. June 6, 2023. Accessed June 7, 2023. https://www.newswise.com/articles/chatgpt-flunks-self-assessment-test-for-urologists