Affiliations

Advocate Lutheran General Hospital

Abstract

Background/Significance:

The Inflammatory Bowel Disease Questionnaire (IBDQ), a 32-item survey published in 1989, evaluates health-related quality of life (QOL) in inflammatory bowel disease (IBD) across bowel, systemic, emotional, and social domains. Higher scores indicate better QOL. It remains widely used in clinical practice and trials. However, as therapies and patient expectations evolve, legacy instruments may not capture contemporary domains such as mental health burden, cultural factors, treatment complexity, extraintestinal manifestations, and cumulative therapy impact. Large language models (LLMs), increasingly used in clinical research, offer a systematic method to critique and refine patient-reported outcome (PRO) tools. We aimed to determine whether LLMs identify meaningful gaps in the IBDQ and generate clinically relevant revisions compared with the original instrument.

Purpose:

To evaluate the IBDQ using multiple AI platforms, identify missing or underrepresented domains, generate revisions, and compare outputs through blinded expert review.

Methods:

The complete IBDQ and standardized prompts were provided to seven LLMs. Each generated a 1–10 global quality rating and a structured critique outlining strengths, limitations, redundancies, and proposed missing domains. An IBD-focused academic physician reviewed the seven coded, randomized outputs under blinded conditions, assessing overall quality, clinical appropriateness, and psychometric insight. Two 1–5 scales rated clinical applicability and comparative improvement versus the original IBDQ. Content analysis identified the domains most frequently cited as underrepresented.

Results:

All LLMs rated the IBDQ favorably (7–8/10). Open Evidence retained the original structure. Copilot and Gemini reduced redundancy by consolidating items. ChatGPT expanded social consequence domains. Claude and Perplexity emphasized dietary impact, treatment burden, body image, cognitive function, and extraintestinal manifestations. Expert review found Copilot, Gemini, and Open Evidence revisions useful but inferior to the original. ChatGPT 4o and 5.1 revisions were considered improvements. Claude received the highest evaluation, with revisions viewed as potentially practice-changing.

Conclusion:

LLMs identified domains not fully captured in the IBDQ and proposed meaningful refinements. While outputs varied, many emphasized reducing questionnaire burden and improving clarity. LLMs may serve as scalable tools to modernize legacy PRO instruments, with expert oversight to ensure clinical rigor and relevance.

Presentation Notes

Presented at Scientific Day; May 20, 2026; Milwaukee, WI.

Document Type

Poster


 

Open Access

Available to all.

Using Large Language Models to Evaluate the Inflammatory Bowel Disease Questionnaire (IBDQ)
