Article Title

Rapid Case Ascertainment –– Two Algorithms to Identify Newly Diagnosed Cancer

Publication Date



Keywords

algorithm, cancer


Background/Aims: The Kaiser Permanente Research Bank is a biobanking effort that includes a general cohort, a pregnancy cohort and a cancer cohort. A priority of the cancer cohort is rapid case ascertainment (RCA) of newly diagnosed members. Because of the large volume of cancer cases diagnosed each year across Kaiser Permanente, the method must be automated and rely on electronic files; manual review of records is not feasible. We developed two algorithms that use pathology reports to identify most newly diagnosed cancers within days of diagnosis.

Methods: We developed two RCA algorithms: one used Systematized Nomenclature of Medicine (SNOMED) codes, and the other used natural language processing (NLP) to read directly from the pathology report. All available 2013 pathology reports with SNOMED coding were analyzed with SAS software at Kaiser Permanente Northwest (KPNW), Colorado (KPCO) and Northern California (KPNC). For the NLP algorithm, all available 2013 pathology reports from KPCO were processed using Linguamatics I2E. Cancer cases flagged by SNOMED or NLP were compared to the 2013 Virtual Tumor Registry (VTR, the gold standard) at each participating site to obtain the sensitivity and specificity of each algorithm. We evaluated the algorithms against all reportable cancers (full VTR) and against only those cancers diagnosed within a Kaiser Permanente facility and through pathology (limited VTR).
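The comparison of flagged cases against the registry can be illustrated with a minimal Python sketch. It is not the study's SAS code. It assumes the unit of analysis is a member ID and that a defined population of evaluated members is available. The function name `compare_to_registry`, the `Accuracy` container and the toy IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accuracy:
    sensitivity: float
    specificity: float


def compare_to_registry(flagged, registry_cases, population):
    """Compare algorithm-flagged IDs against a gold-standard registry.

    All three arguments are iterables of member IDs (an assumption made
    for this sketch; the abstract does not state the unit of analysis).
    """
    population = set(population)
    flagged = set(flagged) & population
    cases = set(registry_cases) & population
    non_cases = population - cases
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    true_pos = len(flagged & cases)       # flagged and in the registry
    false_neg = len(cases - flagged)      # in the registry but missed
    false_pos = len(flagged & non_cases)  # flagged but not in the registry
    true_neg = len(non_cases - flagged)   # correctly left unflagged

    return Accuracy(
        sensitivity=true_pos / (true_pos + false_neg),
        specificity=true_neg / (true_neg + false_pos),
    )


# Toy data, invented purely for illustration.
toy_population = {f"m{i:02d}" for i in range(1, 11)}
toy_registry = {"m01", "m02", "m03", "m04"}
toy_flagged = {"m01", "m02", "m03", "m05"}

result = compare_to_registry(toy_flagged, toy_registry, toy_population)
print(f"sensitivity={result.sensitivity:.2f} specificity={result.specificity:.2f}")
# sensitivity=0.75 specificity=0.83
```

Under these assumptions, passing a narrower gold-standard set as `registry_cases` (for example, only cases diagnosed in-system through pathology) would correspond to evaluating against the limited VTR rather than the full VTR.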

Results: Using the full VTR, the sensitivity of the SNOMED algorithm was 54%, 60% and 85% at KPCO, KPNW and KPNC, respectively; specificity was 94%, 94% and 90%, respectively. Using the limited VTR, sensitivity increased to 75%, 74% and 94%, respectively. The NLP algorithm at KPCO had higher sensitivity than the SNOMED algorithm (58% full VTR, 80% limited VTR) but lower specificity (89%). Importantly, the RCA methods captured a wide range of cancer types across all stages.

Conclusion: Our two RCA algorithms can capture up to 80% of all diagnosed cancer cases while minimizing the number of people falsely identified as having cancer. These methods should help us minimize survival bias in the cancer cohort and may allow us to collect pretreatment blood samples from a portion of our cohort members.




June 24th, 2016


August 12th, 2016