Article Title

Identifying Race/Ethnicity Data via Natural Language Processing Among Women in a Uterine Fibroid Cohort Study

Publication Date



race, natural language processing


Background/Aims: Uterine fibroids are associated with morbidity including abnormal bleeding, anemia, pelvic/bladder symptoms and adverse reproductive outcomes. Symptomatic fibroids may affect 25% of women in their late 40s. Race is among the most consistent risk factors known. The Uterine Fibroid Study aims to use automated data to estimate fibroid incidence rates/trends during 2005–2014 in a retrospective cohort of women at Group Health. Race/ethnicity captured from automated structured data has improved yet remains incomplete, particularly with use of retrospective data.

Methods: The study included women 18–65 years old without hysterectomy who were continuously enrolled, with evidence of an encounter in the 3 years before study entry. Incidence estimates required absence of fibroid history. We collected fibroid diagnoses, demographics and other data from the Virtual Data Warehouse (VDW). VDW race/ethnicity data are collected from Group Practice patients at the time of an encounter and entered in the electronic health record. Group Health also collects race/ethnicity data through its breast cancer screening program and tumor registry. To complement the structured race/ethnicity data from the VDW, we added race/ethnicity extracted from free-text clinical notes via natural language processing (NLP). Our NLP system used a rule-based dictionary look-up approach to identify common terms used to describe patient race/ethnicity, along with custom rules to disambiguate race/ethnicity terms that also have other clinical meanings (e.g. the term “white” in “54-year-old white female” as opposed to “Her white blood cell count improved”). We conducted a partial validation of the NLP system in a sample of patients with known structured race/ethnicity data.
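A rule-based dictionary look-up with a context rule, as described above, might look roughly like the sketch below. This is illustrative only and is not the study's system: the term dictionary `RACE_TERMS`, the list of person nouns, and the function `extract_race` are all hypothetical, and the single rule shown (accept a term only when a person noun immediately follows it) is just one simple way to separate "white female" from "white blood cell count".

```python
import re

# Hypothetical dictionary mapping surface terms to race/ethnicity categories.
# The study's actual term list is not given in the abstract.
RACE_TERMS = {
    "white": "White",
    "caucasian": "White",
    "black": "African-American",
    "african-american": "African-American",
    "african american": "African-American",
    "hispanic": "Hispanic",
    "latina": "Hispanic",
    "asian": "Asian/Pacific Islander",
    "pacific islander": "Asian/Pacific Islander",
}

# Hypothetical context rule: a term counts only if a person noun follows it.
PERSON_NOUNS = r"(?:female|woman|lady|patient|male|man)"

# Longest terms first so multi-word terms win over their substrings.
_alternation = "|".join(
    re.escape(term) for term in sorted(RACE_TERMS, key=len, reverse=True)
)
_PATTERN = re.compile(
    r"\b(" + _alternation + r")\s+" + PERSON_NOUNS + r"\b",
    re.IGNORECASE,
)


def extract_race(note: str) -> set[str]:
    """Return the set of race/ethnicity categories mentioned in a note."""
    return {RACE_TERMS[m.group(1).lower()] for m in _PATTERN.finditer(note)}


if __name__ == "__main__":
    print(extract_race("54-year-old white female presents with pelvic pain."))
    print(extract_race("Her white blood cell count improved."))
```

The first note yields the category `White`; the second yields an empty set because "white" is followed by "blood" rather than a person noun. A production system would need a much larger dictionary, more context rules, and handling for conflicting mentions across notes.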

Results: Before race/ethnicity data were supplemented with NLP, 37.4% of the 277,821 women in the cohort had unknown race/ethnicity. Fibroid incidence rates (per 10,000 person-years) were 156 for Hispanics, 133 for whites, 265 for African-Americans, 152 for Asian/Pacific Islanders and 108 for women of unknown race. NLP work to identify race/ethnicity in the unknown-race group is ongoing, and results are pending.
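For readers unfamiliar with the unit, a rate per 10,000 person-years is the number of new cases divided by the total person-time at risk, scaled by 10,000. The sketch below shows only that arithmetic; the function name and the input numbers are made up for illustration and are not the study's counts, which the abstract does not report.

```python
def incidence_rate_per_10k(new_cases: int, person_years: float) -> float:
    """Incidence rate expressed per 10,000 person-years at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return new_cases * 10_000 / person_years


if __name__ == "__main__":
    # Hypothetical inputs: 53 new diagnoses over 2,000 person-years.
    print(incidence_rate_per_10k(53, 2_000))
```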

Conclusion: Race/ethnicity is an important risk factor for a number of conditions, including uterine fibroids. Improving capture of race/ethnicity from available automated data sources could improve the accuracy of research findings and enhance patient care by providing a better understanding of the burden of disease in subgroups of affected patients.




July 6th, 2016


August 12th, 2016