Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

dc.contributor.author: Lang, SP
dc.contributor.author: Yoseph, ET
dc.contributor.author: Gonzalez-Suarez, AD
dc.contributor.author: Kim, R
dc.contributor.author: Fatemi, P
dc.contributor.author: Wagner, K
dc.contributor.author: Maldaner, N
dc.contributor.author: Stienen, MN
dc.contributor.author: Zygourakis, CC
dc.date.accessioned: 2025-05-23T19:10:24Z
dc.date.available: 2025-05-23T19:10:24Z
dc.date.issued: 2024-06-01

dc.description.abstract:

Objective: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating careful study of large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) for patient education.

Methods: Our study aims to assess the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.

Results: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.

Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.

dc.identifier.issn: 2586-6583
dc.identifier.issn: 2586-6591
dc.identifier.uri: https://hdl.handle.net/10161/32419
dc.language: en
dc.publisher: The Korean Spinal Neurosurgery Society
dc.relation.ispartof: Neurospine
dc.relation.isversionof: 10.14245/ns.2448098.049
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0
dc.subject: Artificial intelligence
dc.subject: Large language models
dc.subject: Patient education
dc.subject: Lumbar spine fusion
dc.subject: ChatGPT
dc.subject: Bard
dc.title: Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard
dc.type: Journal article
duke.contributor.orcid: Fatemi, P|0000-0001-8188-8440
pubs.begin-page: 633
pubs.end-page: 641
pubs.issue: 2
pubs.organisational-group: Duke
pubs.organisational-group: School of Medicine
pubs.organisational-group: Clinical Science Departments
pubs.organisational-group: Orthopaedic Surgery
pubs.organisational-group: Neurosurgery
pubs.publication-status: Published
pubs.volume: 21

Files

Original bundle

Name: ns-2448098-049.pdf
Size: 1.58 MB
Format: Adobe Portable Document Format