Key Points
Question
Can artificial intelligence chatbots accurately simplify pathology reports so that patients can easily understand them?
Findings
In this cross-sectional study of 1134 pathology reports, 2 chatbots significantly decreased the reading grade level of pathology reports while interpreting most reports correctly. However, some reports contained significant errors or hallucinations.
Meaning
These findings suggest that chatbots have the potential to explain pathology reports to patients and extrapolate pertinent details; however, they are not flawless and should not be used without a review by a health care professional.
Importance
Anatomic pathology reports are an essential part of health care, containing vital diagnostic and prognostic information. Currently, most patients have access to their test results online. However, the reports are complex and are generally incomprehensible to laypeople. Artificial intelligence chatbots could potentially simplify pathology reports.
Objective
To evaluate the ability of large language model chatbots to accurately explain pathology reports to patients.
Design, Setting, and Participants
This cross-sectional study used 1134 pathology reports from January 1, 2018, to May 31, 2023, from a multispecialty hospital in Brooklyn, New York. A new chat was started for each report, and both chatbots (Bard [Google Inc], hereinafter chatbot 1; GPT-4 [OpenAI], hereinafter chatbot 2) were asked in sequential prompts to explain the reports in simple terms and identify key information. Chatbot responses were generated between June 1 and August 31, 2023. The mean readability scores of the original and simplified reports were compared. Two reviewers independently screened and flagged reports with potential errors. Three pathologists reviewed the flagged reports and categorized them as medically correct, partially medically correct, or medically incorrect; they also recorded any instances of hallucinations.
Main Outcomes and Measures
Outcomes included improved mean readability scores and a medically accurate interpretation.
Results
For the 1134 reports included, the Flesch-Kincaid grade level decreased from a mean of 13.19 (95% CI, 12.98-13.41) to 8.17 (95% CI, 8.08-8.25; t = 45.29; P < .001) with chatbot 1 and 7.45 (95% CI, 7.35-7.54; t = 49.69; P < .001) with chatbot 2. The Flesch Reading Ease score increased from a mean of 10.32 (95% CI, 8.69-11.96) to 61.32 (95% CI, 60.80-61.84; t = −63.19; P < .001) with chatbot 1 and 70.80 (95% CI, 70.32-71.28; t = −74.61; P < .001) with chatbot 2. Chatbot 1 interpreted 993 reports (87.57%) correctly, 102 (8.99%) partially correctly, and 39 (3.44%) incorrectly; chatbot 2 interpreted 1105 reports (97.44%) correctly, 24 (2.12%) partially correctly, and 5 (0.44%) incorrectly. Chatbot 1 had 32 instances of hallucinations (2.82%), while chatbot 2 had 3 (0.26%).
Conclusions and Relevance
The findings of this cross-sectional study suggest that artificial intelligence chatbots were able to simplify pathology reports. However, some inaccuracies and hallucinations occurred. Simplified reports should be reviewed by clinicians before distribution to patients.
Every day, thousands of anatomic pathology specimens are processed across the US. On reviewing the slides and, potentially, ancillary test results, the pathologist prepares a report with the diagnosis. Depending on the procedure and complexity of the case, the report often contains additional information or comments and, in cancer cases, prognostic information and molecular therapeutic targets that directly affect patient management. Despite their ubiquity and importance, the reports are written in difficult language that is generally beyond the comprehension of laypeople. Moreover, the reports have grown increasingly complex and lengthy in recent years.1 In contrast to laboratory values, anatomic pathology reports are not standardized and often contain nuanced phrasing that even experienced clinicians may interpret incorrectly.2 With the advent of electronic portals, patients have unfettered access to their reports. However, the complexity of the reports presents a major barrier to large-scale adoption of patient portals.3
Recent advancements in artificial intelligence (AI) have given rise to large language models (LLMs), which are probabilistic natural language processing systems trained on copious quantities of data. Large language model chatbots are generative AI applications that produce output in response to input in a conversational manner.4 An initial hypothesis was that the technology could be used in clinical decision support applications. However, several studies4-6 have shown that, in its current form, the technology is too error prone and limited to be efficacious in the clinical setting. Nonetheless, previous studies7,8 have shown that chatbots can communicate health information effectively. In fact, 1 study7 found the responses given by chatbots to patient questions to be of better quality and to appear to exhibit more empathy than some responses given by physicians. Thus far, the applicability of chatbots in the diagnostic setting remains largely unexplored. In this study, we investigate the ability of chatbots to accurately simplify anatomic pathology reports for patients and identify key elements of the reports that are pertinent to patient care.
This cross-sectional study was deemed exempt from review and informed consent by the SUNY (State University of New York) Downstate Health Sciences University Institutional Review Board. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
We retrieved 1134 pathology reports from specimens processed between January 1, 2018, and May 31, 2023, in a teaching hospital in Brooklyn, New York. The reports were of varying length and complexity, written by different pathologists, and addressed a variety of procedures, organ systems, and conditions (Table 1). The reports included many specimens from ordinary procedures such as appendectomies, Papanicolaou smears, and skin, breast, prostate, and colon biopsies. The diagnoses covered many conditions, such as lipoma, appendicitis, prostatic adenocarcinoma, invasive mammary carcinoma, colonic adenocarcinoma, and melanoma. Comments, notes, addenda, and synoptic reports (where applicable) were often part of the report. Some reports consisted of multipart resections, while others consisted of a single biopsy. The text of the reports was not edited for clarity. Potentially identifying information, such as dates, references to specimen accession numbers, and the names of the pathologists and clinicians, was anonymized. Care was taken to avoid including multiple reports with identical text, such as cytology reports, which have standardized sign-out language.
Two reviewers (E.S. and F.O.) independently categorized findings of each report as normal, benign, atypical and/or suspicious, precancerous, or malignant. In case of disagreement, a third pathologist (R.G.) functioned as a tiebreaker. During the review process, several cases were found to be nondiagnostic and were classified as such (Table 1).
Two chatbots were used in this study: Bard (Google Inc; versions June 7, 2023, and July 13, 2023), referred to hereinafter as chatbot 1, and GPT-4 (OpenAI; version May 24, 2023), referred to hereinafter as chatbot 2. The models were queried between June 1 and August 31, 2023.
The pathology reports were input to the chatbots, which were asked, sequentially, to simplify the report; to classify findings on the spectrum of normal to malignant; and to denote the pathologic stage of the tumor (Table 2). To minimize bias, a new chat thread was started for each report, and the last question was asked in all cases, regardless of the diagnosis. The chatbot response to each prompt was recorded; eFigures 1 and 2 in Supplement 1 depict a sample exchange.
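The querying protocol above can be sketched in a few lines. This is a minimal illustration only: the `Chat` class is a placeholder for a real chatbot client, and the prompt strings are paraphrases (the study's exact prompts appear in Table 2).

```python
# Sketch of the study's querying protocol: a fresh chat thread per report,
# with the same fixed sequence of prompts asked regardless of diagnosis.
# `Chat` is a stand-in for a real chatbot client; prompts are paraphrased.

PROMPTS = [
    "Please explain this pathology report in simple terms: {report}",
    "Classify the findings as normal, benign, atypical/suspicious, "
    "precancerous, or malignant.",
    "What is the pathologic stage of the tumor, if any?",
]

class Chat:
    """Minimal stand-in for a chatbot session that records its prompts."""
    def __init__(self):
        self.history = []

    def ask(self, prompt: str) -> str:
        self.history.append(prompt)
        return f"[model response to: {prompt[:40]}...]"

def query_report(report_text: str) -> list[str]:
    chat = Chat()  # new thread per report, to avoid cross-report bias
    responses = [chat.ask(PROMPTS[0].format(report=report_text))]
    responses += [chat.ask(p) for p in PROMPTS[1:]]  # asked in all cases
    return responses
```

Starting a new session per report ensures that earlier reports cannot leak into later responses, mirroring the bias-minimization step described above.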
A commercial web-based tool (readable.com) was used to assess the readability metrics of the simplified (ie, the response to question 1) and original reports. The word count and findings of 2 widely used readability formulas, the Flesch Reading Ease (FRE) and Flesch-Kincaid grade level (FKGL), were recorded.9
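The two readability formulas are simple functions of sentence length and syllable count. The sketch below implements the published Flesch formulas with a crude vowel-group syllable heuristic, so its scores will differ slightly from those of the commercial tool used in the study.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels; min 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade level, Flesch Reading Ease)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    # Standard published formulas: FKGL rises with harder text,
    # FRE falls with harder text.
    fkgl = 0.39 * n_words / sentences + 11.8 * n_syll / n_words - 15.59
    fre = 206.835 - 1.015 * n_words / sentences - 84.6 * n_syll / n_words
    return fkgl, fre
```

Note the inverse orientation of the two metrics, which explains why the results below report a decrease in FKGL but an increase in FRE for the simplified reports.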
In addition to readability, the accuracy of the simplified report was assessed. Two reviewers (including E.S., J.M., E.C.G., J.N., H.A., and F.O.) independently screened the reports for errors. Three pathologists (E.S., M.G.H., and R.G.) evaluated the flagged reports and categorized them as medically correct, partially medically correct, or medically incorrect. Medically correct indicated that the simplified report contained no errors, and the information was medically sound; partially medically correct, the simplified report contained at least 1 erroneous statement or explanation, but not one that would drastically alter the medical management of the disease or condition (eg, misstating the precise size of the tumor or miscounting the number of positive lymph nodes); and medically incorrect, the simplified report contained a significant error that would drastically change the management of the patient (eg, stating that cancer was present in a benign specimen or an incorrect hormonal status of a breast tumor). The reviewers also recorded any instances of hallucinations, which are fabricated statements or explanations.
The mean readability scores of the original and simplified reports were compared using paired-samples t tests. The mean readability scores of the simplified reports generated by both chatbots were compared using a 2-tailed independent-samples t test. Statistical analysis was conducted using the open source SciPy Python package, version 1.9.3 (SciPy Community). Two-sided P < .05 indicated statistical significance.
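The comparisons above map directly onto two SciPy calls. The sketch below uses simulated scores (the means and spreads are loosely based on the figures reported later, not the study's data); only the choice of tests mirrors the methods.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated FKGL scores: original reports vs their simplified versions,
# paired by report (same report before and after simplification).
original = rng.normal(13.2, 3.5, 200)
simplified = original - rng.normal(5.0, 1.0, 200)

# Paired-samples t test, as used for original vs simplified reports.
t_paired, p_paired = stats.ttest_rel(original, simplified)

# 2-tailed independent-samples t test, as used to compare the two
# chatbots' simplified reports against each other.
chatbot1 = rng.normal(8.2, 1.5, 200)
chatbot2 = rng.normal(7.4, 1.5, 200)
t_ind, p_ind = stats.ttest_ind(chatbot1, chatbot2)
```

The paired test is the right choice for before/after scores of the same report, while the independent test suits the two chatbots' outputs, which are separate samples.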
The mean FKGL of the original 1134 reports was 13.19 (95% CI, 12.98-13.41), the mean FRE score was 10.32 (95% CI, 8.69-11.96), and the mean word count was 80.65 (95% CI, 73.34-87.96). A total of 1068 reports (94.18%) were at or above an FKGL of 9.00. The mean FKGL of reports simplified by chatbot 1 was 8.17 (95% CI, 8.08-8.25), the mean FRE score was 61.32 (95% CI, 60.80-61.84), and the mean word count was 192.38 (95% CI, 188.44-196.32); a total of 314 reports (27.69%) were at or above an FKGL of 9.00. The mean FKGL of reports simplified by chatbot 2 was 7.45 (95% CI, 7.35-7.54), the mean FRE score was 70.80 (95% CI, 70.32-71.28), and the mean word count was 145.43 (95% CI, 141.04-149.81); a total of 197 reports (17.37%) were at or above an FKGL of 9.00.
The results of a paired t test showed that the simplified reports generated by both chatbots significantly decreased the FKGL (t = 45.29 [chatbot 1] and t = 49.69 [chatbot 2]; P < .001 for both) and increased the FRE score (t = −63.19 [chatbot 1] and t = −74.61 [chatbot 2]; P < .001 for both). An independent t test showed a significant difference in mean readability scores between the simplified reports generated by the 2 chatbots for both the FKGL and FRE score (t = −26.32 and t = 10.92, respectively; P < .001 for both).
Overall, both chatbots interpreted most reports correctly (Figure 1). For chatbot 1, 993 reports (87.57%) were medically correct, 102 (8.99%) were partially correct, and 39 (3.44%) were incorrect; 32 reports (2.82%) contained hallucinations. For chatbot 2, 1105 reports (97.44%) were medically correct, 24 (2.12%) were partially correct, and 5 (0.44%) were incorrect; 3 reports (0.26%) contained hallucinations.
The most common error made by both chatbots was assuming that a resection specimen without lymph nodes (pNx) implied that the lymph node status was negative (pN0). Common hallucinations included statements about patients, such as their well-being after the procedure, and confabulatory explanations of unfamiliar terms. Select reports are shown in Table 3.
Findings in 924 reports were classified by the reviewers as either benign (including normal) or malignant. Of those, 848 (91.77%) of chatbot 1 responses and 857 (92.75%) of chatbot 2 responses were correct (Figure 2).
Classifying the Reports on the Normal to Malignant Spectrum
Across all 1134 reports, chatbot 1 responded with a 1-word answer in 28 reports (2.47%) and with 1 of the given answer choices in 1125 reports (99.21%). Chatbot 2 responded with a 1-word answer that was 1 of the answer choices (normal, benign, atypical and/or suspicious, precancerous, or malignant) in all but 1 report (Figure 2).
Of the 1134 reports, 97 (8.55%) contained a pathologic tumor stage. Of those, chatbot 1 responded with the correct stage in 89 cases (91.75%) and chatbot 2 in 93 cases (95.88%). An inappropriate stage (ie, a stage provided for a report that contained none) was given by chatbot 1 in 122 cases (10.76% of all reports) and by chatbot 2 in 5 cases (0.44% of all reports).
The reviewers believed that the responses given by chatbot 2 were better and more comprehensive, while the responses given by chatbot 1 were wordy and less helpful. This finding is supported by the statistically significant differences in mean readability scores and word count. The difference became more pronounced when comparing the performance of the chatbots for other metrics, such as medical accuracy.
To our knowledge, this is the first cross-sectional study to investigate the use of generative AI chatbots as a tool to simplify pathology reports and make them more accessible to patients. The first important study to show the potential use of chatbots in health care was published by Ayers et al,7 who reported that chatbots responded to informal patient questions posted on an online forum. Chatbot responses were of fair quality and appeared to exhibit more empathy than some physicians' answers. Other studies have investigated the ability of chatbots to respond to physician questions and facilitate clinical decision-making, with mixed results.5,6,10 While the findings reported by Ayers et al7 suggest that chatbots may supplant physicians in some tasks, our findings suggest that they could augment physicians and serve patients by simplifying complex medical information and responding to a limited set of potential follow-up questions. Incorporating these inexpensive technology solutions in clinical practice could reduce disparities and be especially helpful to patients from socioeconomically disadvantaged backgrounds, who tend to have lower health literacy.11
Clinicians have a responsibility to convey test results to patients and ensure that they have the requisite knowledge to follow through with treatment. Studies have linked patients' involvement with their care to health outcomes.12 However, it is impossible for patients to participate in the decision-making process without fully understanding their results.13 The integration of sophisticated chatbots into the field of pathology can potentially revolutionize how pathology reports are perceived and understood and allow patients to make informed decisions. Furthermore, as health care professionals increasingly recognize the importance of optimizing patient-physician communication, these models can empower patients by offering immediate interpretation of their reports. This would eliminate the often-anxious wait for a follow-up appointment, leading to improved health care outcomes and use of resources.14 Simplified reports may also benefit midlevel practitioners and enhance the educational experience of medical students who are less familiar with the highly technical terms often found in pathology reports. Notably, the structured nature of the College of American Pathologists synoptic protocols presents a unique opportunity for these models to autogenerate explanatory notes based on the selected fields.
Simplified pathology reports have the potential to make health care more accessible to millions of patients and reduce disparities among those with low health literacy levels. Of note, the grade level of the simplified reports is markedly lower than that of most online patient educational material, which often is written for those with an 11th grade educational level or higher.15-17
Another potential use of the technology is working in the background to streamline physician workflow. The ability of chatbots to correctly classify reports as benign, premalignant, or malignant can allow for triaging of reports and determining which should be given priority by clinicians, especially in large medical centers with high patient volumes. Results of cases in the benign and normal categories could be released to patients without having to schedule a follow-up appointment to discuss the results.
Fine-tuning, which is the process of further training a model on a dataset of the correct answers to a specific question or task, has been shown to enhance the model’s performance in several studies.4 It is plausible that training a chatbot on such a dataset could improve its accuracy and reduce the instances of medically incorrect statements or hallucinations. Currently, users can fine-tune chatbot 2 through an application programming interface, but not chatbot 1. Evaluating the capabilities of models that are fine-tuned to explain and classify pathology reports should be a focus of future research.
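Fine-tuning of this kind starts from a dataset of (report, expert explanation) pairs. The sketch below builds such a dataset in the chat-format JSONL accepted by OpenAI's fine-tuning API at the time of writing; the report snippets and expert answers are illustrative placeholders, not study data.

```python
import json

# Hypothetical fine-tuning examples pairing a report with an
# expert-written simplification. Report text and answers are
# illustrative placeholders, not drawn from the study's dataset.
examples = [
    {"messages": [
        {"role": "system", "content": "Explain pathology reports in simple terms."},
        {"role": "user", "content": "Appendix, appendectomy: acute appendicitis."},
        {"role": "assistant", "content": "The appendix was inflamed (appendicitis). No cancer was found."},
    ]},
    {"messages": [
        {"role": "system", "content": "Explain pathology reports in simple terms."},
        {"role": "user", "content": "Skin, shave biopsy: compound nevus, completely excised."},
        {"role": "assistant", "content": "This is a benign (noncancerous) mole that was fully removed."},
    ]},
]

# One JSON object per line: the JSONL layout expected for training uploads.
with open("finetune_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A dataset of pathologist-vetted simplifications in this shape is the raw material a fine-tuning study would need; any real deployment would also require de-identified reports.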
Presently, the biggest barriers to wide adoption of the technology are hallucinations and seemingly true statements that are factually incorrect. The difficulty of chatbots with quantitative reasoning (eg, miscounting lymph nodes or misstating tumor size) has also been extensively documented in the literature.18,19 Although some fact-checking solutions have been proposed,20,21 they are unlikely to be useful in situations where facts are nuanced or not readily available (eg, the well-being of a patient after surgery).
A few conditions must be met to deploy such models in clinical settings. Ethical considerations, patient privacy, and regulatory compliance must all be addressed. Sharing personal health information with chatbots may violate patient confidentiality.22 Critically, the issue of accuracy and hallucinations poses a significant hurdle to the clinical deployment of chatbots that answer patient questions about their pathology report, as it is not possible to anticipate all questions and evaluate every potential response. It is possible that fine-tuned models would perform significantly better; this should be a focus of further studies. Until a proper solution is developed and tested, patients should not blindly rely on a response provided by an AI chatbot. Rather, the output should first be reviewed by a health care professional to ensure the response is medically sound.
This study has some limitations. First, the reports were sourced from a single institution. Because pathology reports are largely not standardized, varying report structure or wordings might be interpreted differently by the models, potentially limiting the generalizability of our findings. Second, chatbots are probabilistic in nature and could change their output based on different inputs. The way the questions were phrased, and their order, could have influenced the responses. Additionally, using an evaluation framework that is based on measuring accuracy and hallucinations alone may not adequately capture all the subcomponents of an ideal patient-friendly report, such as clarity, completeness, and empathy. Last, determining the proper category of a given diagnosis can be challenging in some instances. For example, cervical low-grade squamous intraepithelial lesions have a significantly lower malignant potential than colonic polyps with high-grade dysplasia, yet both are considered precancerous. There are many conditions whose proper category is a matter of scientific debate.
The findings of this cross-sectional study suggest that artificial intelligence chatbots can simplify pathology reports for patients and identify key details that are relevant for patient management. However, their interpretation should be used judiciously, as they are not without flaws. Developing fact-checking solutions is necessary before integrating these tools in the health care setting.
Accepted for Publication: March 21, 2024.
Published: May 22, 2024. doi:10.1001/jamanetworkopen.2024.12767
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2024 Steimetz E et al. JAMA Network Open.
Corresponding Author: Eric Steimetz, MD, MPH, Department of Pathology, SUNY Downstate Medical Center, 450 Clarkson Ave, MSC 25, Brooklyn, NY 11203 ([email protected]).
Author Contributions: Dr Steimetz had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Steimetz, Minkowitz, Attia, Hanna, Gupta.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: All authors.
Critical review of the manuscript for important intellectual content: Minkowitz, Ngichabe, Hershkop, Ozay, Hanna, Gupta.
Statistical analysis: Steimetz, Hershkop, Gupta.
Administrative, technical, or material support: Minkowitz, Attia, Ozay, Gupta.
Supervision: Hanna, Gupta.
Conflict of Interest Disclosures: Dr Hanna reported consulting for Paige and serving as an adviser for PathPresenter Corporation. No other disclosures were reported.
Data Sharing Statement: See Supplement 2.
3. Lyles CR, Nelson EC, Frampton S, Dykes PC, Cemballi AG, Sarkar U. Using electronic health record portals to improve patient engagement: research priorities and best practices. Ann Intern Med. 2020;172(11)(suppl):S123-S129. doi:10.7326/M19-0876
11. Stormacq C, Van den Broucke S, Wosinski J. Does health literacy mediate the relationship between socioeconomic status and health disparities? integrative review. Health Promot Int. 2019;34(5):e1-e17. doi:10.1093/heapro/day062
13. Joseph-Williams N, Elwyn G, Edwards A. Knowledge is not power for patients: a systematic review and thematic synthesis of patient-reported barriers and facilitators to shared decision making. Patient Educ Couns. 2014;94(3):291-309. doi:10.1016/j.pec.2013.10.031
14. Zhang Z, Citardi D, Xing A, Luo X, Lu Y, He Z. Patient challenges and needs in comprehending laboratory test results: mixed methods study. J Med Internet Res. 2020;22(12):e18725. doi:10.2196/18725
18. Lewkowycz A, Andreassen A, Dohan D, et al. Solving quantitative reasoning problems with language models. arXiv. Preprint posted online July 1, 2022.
19. Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv. Preprint posted online November 28, 2023. doi:10.18653/v1/2023.ijcnlp-main.45