AI as a Bridge: Can ChatGPT Help Patients Understand Their Breast Radiology Reports?
Large language models such as ChatGPT are increasingly used by patients to understand radiology reports. Studies have demonstrated that most patients read at approximately an 8th-grade level, and readability metrics can assess how closely radiology reports match a patient's level of understanding. Our study aimed to evaluate whether ChatGPT (model 4) could improve the readability of Breast Imaging Reporting and Data System (BI-RADS) category 3, 4, and 5 reports while maintaining accuracy.
Fifteen consecutive source reports from each of the BI-RADS 3, BI-RADS 4, and BI-RADS 5 categories were retrieved from the radiology database. ChatGPT was prompted as follows: “please put the report into layperson terms” and “please provide recommendations for the patient.” These prompts generated a layperson report and recommendations for each of the 45 source reports.
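The abstract does not specify how the prompts were submitted; purely as an illustration, the minimal Python sketch below shows how the same two prompts could be issued programmatically through the OpenAI Python SDK. The client setup, the "gpt-4" model identifier, and the simplify_report helper are assumptions, not the authors' workflow.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def simplify_report(report_text: str) -> dict:
    """Send one source report to the model with the study's two prompts (illustrative only)."""
    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",  # assumed model identifier; the study refers to ChatGPT model 4
            messages=[{"role": "user", "content": f"{report_text}\n\n{prompt}"}],
        )
        return response.choices[0].message.content

    return {
        "layperson_report": ask("please put the report into layperson terms"),
        "recommendations": ask("please provide recommendations for the patient"),
    }
```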
Readability of source reports, layperson reports, and layperson recommendations was analyzed for each BI-RADS category using the Flesch-Kincaid grade level score and word count. The Flesch-Kincaid score is an established tool used to assess health literacy; its output corresponds to the reading comprehension grade level. A higher word count is associated with poorer readability. A radiologist evaluated the accuracy of ChatGPT-generated outputs by identifying inaccurate or misleading information and by noting whether the layperson reports contained beneficial supplementary information not found in the original report. Statistical analysis was performed using 2-sided unpaired, equal-variance t-tests.
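For reference, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below implements that formula with a crude vowel-group syllable heuristic, a rough approximation of dedicated readability tools; the sample sentence is invented for illustration and is not taken from the study.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups, with a minimum of one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Grade level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = "There is an oval circumscribed mass in the right breast. Short-interval follow-up imaging is recommended."
print(f"Grade level: {flesch_kincaid_grade(sample):.1f}; word count: {len(sample.split())}")
```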
A statistically significant decrease in mean grade level was identified for BI-RADS 3 source vs layperson reports (mean grade level, 10.9 [SD = 1.8] vs 8.8 [SD = 1.0]; P = .003) and for BI-RADS 5 source vs layperson reports (mean grade level, 12.2 [SD = 3.0] vs 9.6 [SD = 0.8]; P = .003). No statistically significant difference in grade level was seen between source reports and ChatGPT-generated recommendations for any of the 3 BI-RADS categories (Table 1).
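As an illustration of the comparison method only (not a reanalysis of the study's data), the sketch below shows how a 2-sided unpaired, equal-variance t-test could be applied to per-report grade level scores using scipy; the two lists are placeholder values.

```python
from scipy import stats

# Placeholder grade-level scores for 15 source and 15 layperson reports (illustrative only).
source_grades    = [10.2, 11.5, 9.8, 12.0, 10.7, 11.1, 9.5, 10.9, 12.3, 10.4, 11.8, 9.9, 10.6, 11.2, 10.8]
layperson_grades = [8.9, 9.1, 8.4, 9.6, 8.7, 9.0, 8.2, 8.8, 9.4, 8.6, 9.2, 8.3, 8.9, 9.0, 8.7]

# equal_var=True gives the classic 2-sided unpaired, equal-variance (Student) t-test.
t_stat, p_value = stats.ttest_ind(source_grades, layperson_grades, equal_var=True)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```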
Table 1. Mean Grade Level Scores With Accompanying Standard Deviations in Parentheses of Source Reports, ChatGPT-Generated Layperson Reports, and Recommendations for Each BI-RADS Category
There was a statistically significant decrease in word count for BI-RADS 3 and BI-RADS 4 source reports vs ChatGPT-generated recommendations (P = .001 and P = .02, respectively). No statistically significant difference in word count was observed between source and layperson reports for any of the 3 BI-RADS categories.
ChatGPT-generated reports did not contain inaccurate or misleading information; they did, however, offer beneficial supplementary information compared with source reports, particularly regarding biopsy recommendations.
ChatGPT improved readability (grade level) for BI-RADS 3 and 5 reports, and produced less wordy recommendations for BI-RADS 3 and 4 reports. ChatGPT-generated reports did not contain inaccurate information that would affect patient management. ChatGPT could be used to generate patient-friendly reports, which could potentially lessen patient anxiety surrounding breast imaging.