Recommended Citation
Foy O, Lafazanos Y, Berber L, Ehrenpreis ED. Using Large Language Models to Evaluate the Inflammatory Bowel Disease Questionnaire (IBDQ). Presented at Scientific Day; May 20, 2026; Milwaukee, WI.
Abstract
Background/Significance:
The Inflammatory Bowel Disease Questionnaire (IBDQ), a 32-item survey published in 1989, evaluates health-related quality of life (QOL) in inflammatory bowel disease (IBD) across bowel, systemic, emotional, and social domains. Higher scores indicate better QOL. The IBDQ remains widely used in clinical practice and trials. However, as therapies and patient expectations evolve, legacy instruments may not capture contemporary domains such as mental health burden, cultural factors, treatment complexity, extraintestinal manifestations, and cumulative therapy impact. Large language models (LLMs), increasingly used in clinical research, offer a systematic method to critique and refine patient-reported outcome (PRO) tools. We aimed to determine whether LLMs identify meaningful gaps in the IBDQ and generate clinically relevant revisions compared with the original instrument.
Purpose:
To evaluate the IBDQ using multiple AI platforms, identify missing or underrepresented domains, generate revisions, and compare outputs through blinded expert review.
Methods:
The complete IBDQ and standardized prompts were provided to seven LLMs. Each generated a 1–10 global quality rating and structured critiques outlining strengths, limitations, redundancies, and proposed missing domains. An IBD-focused academic physician reviewed seven coded, randomized outputs under blind conditions. Overall quality, clinical appropriateness, and psychometric insight were assessed. Two 1–5 scales rated clinical applicability and comparative improvement versus the original IBDQ. Content analysis identified domains most frequently cited as underrepresented.
Results:
All LLMs rated the IBDQ favorably (7–8/10). Open Evidence retained the original structure. Copilot and Gemini reduced redundancy by consolidating items. ChatGPT expanded social consequence domains. Claude and Perplexity emphasized dietary impact, treatment burden, body image, cognitive function, and extraintestinal manifestations. Expert review found the Copilot, Gemini, and Open Evidence revisions useful but inferior to the original. The ChatGPT 4o and 5.1 revisions were considered improvements. Claude received the highest evaluation, with its revisions viewed as potentially practice-changing.
Conclusion:
LLMs identified domains not fully captured in the IBDQ and proposed meaningful refinements. While outputs varied, many emphasized reducing questionnaire burden and improving clarity. LLMs may serve as scalable tools to modernize legacy PRO instruments, with expert oversight to ensure clinical rigor and relevance.
Presentation Notes
Presented at Scientific Day; May 20, 2026; Milwaukee, WI.
Document Type
Poster
Open Access
Available to all.
Using Large Language Models to Evaluate the Inflammatory Bowel Disease Questionnaire (IBDQ)
Affiliations
Advocate Lutheran General Hospital