Identifying oral disease variables associated with pneumonia emergence by application of machine learning to integrated medical and dental big data to inform eHealth approaches


Background: The objective of this study was to build models that identify key variables contributing to pneumonia emergence, for any pneumonia and for pneumonia subtypes, by applying supervised machine learning (ML) to integrated medical and oral disease data.

Methods: Retrospective medical and dental data were retrieved from Marshfield Clinic Health System's data warehouse and integrated electronic medical-dental health records (iEHR). Retrieved data were pre-processed prior to analysis, which included matching of cases to controls by race/ethnicity at a 1:1 case:control ratio. Variables with >30% missing data were excluded from analysis. Data were divided into four subsets: (1) All Pneumonia (all cases and controls); (2) community-acquired (CAP)/healthcare-associated (HCAP) pneumonias; (3) ventilator-associated (VAP)/hospital-acquired (HAP) pneumonias; and (4) aspiration pneumonia (AP). The performance of five algorithms was compared across the four subsets: Naïve Bayes, Logistic Regression, Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) and Random Forests. Feature (input variable) selection and ten-fold cross-validation were performed on all the datasets. An evaluation set (10%) was extracted from the subsets for further validation. Model performance was evaluated in terms of total accuracy, sensitivity, specificity, F-measure, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC).
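
As an illustration of the threshold-based evaluation measures named above, the following minimal Python sketch computes total accuracy, sensitivity, specificity, F-measure and the Matthews correlation coefficient from confusion-matrix counts. It is not the study's code: the function name `binary_metrics` and the example counts are hypothetical, pneumonia cases are assumed to be the positive class, and the F-measure is assumed to be the balanced F1 score. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute classification metrics from confusion-matrix counts.

    tp/fn: cases predicted as case/control; tn/fp: controls predicted
    as control/case (cases are assumed to be the positive class).
    """
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    sensitivity = safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    # Balanced F-measure (F1): harmonic mean of precision and sensitivity.
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient, defined as 0 when any margin is empty.
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; these are not study results.
example = binary_metrics(tp=80, fp=30, tn=70, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With 1:1 matched cases and controls the classes are balanced, so total accuracy is not distorted by class prevalence; MCC remains a useful summary because it uses all four cells of the confusion matrix.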

Results: In total, 6,034 records (cases and controls) met eligibility for inclusion in the main dataset. After feature selection, the number of variables retained in each subset was: All Pneumonia (n = 29), CAP-HCAP (n = 26), VAP-HAP (n = 40) and AP (n = 37). Twenty-two retained variables were common to all four pneumonia subsets. Of these, number of missing teeth, periodontal status, periodontal pocket depth greater than 5 mm and number of restored teeth contributed to all subsets and were retained in the models. MLP outperformed the other predictive models for the All Pneumonia, CAP-HCAP and AP subsets, while SVM outperformed the other models in the VAP-HAP subset.

Conclusion: This study validates previously described associations between poor oral health and pneumonia. The benefits of an integrated medical-dental record and care delivery environment for modeling pneumonia risk are highlighted. Based on these findings, risk score development could inform referrals and follow-up in an integrated healthcare delivery environment and coordinated patient management.
