Predicting Inpatient and Outpatient Hospitalization Rates in U.S. Counties with Alternative Data Sources

Understanding hospital usage in each U.S. county is imperative for identifying characteristics of underserved communities and developing strategies to increase community well-being. Numerous studies have found that alcohol consumption and high air pollution can have adverse effects on a population’s overall health, and exposure to both is also shaped by socioeconomic factors such as household income. In this project, we hypothesize that alcohol consumption and air pollution rates, along with socioeconomic factors, are key predictors of hospital usage, which we measure as the number of inpatient days and outpatient visits per capita and use as regression targets. The purpose of this project is to use alternative data sources for air pollution, alcohol consumption, and socioeconomic factors to predict these health markers for each county while analyzing the correlation between the factors. By analyzing data at the county level, we aim to derive regional patterns that give insight into over- and under-burdened populations, patterns that may be lost when generalizing to the state or national level. We find that the best features are measures of demographics, healthcare facility characteristics, and alcohol consumption. Additionally, a county's latitude and longitude have predictive value, which further underscores the importance of region in predicting hospitalization rates. Our highest-performing model, an XGBoost regressor, achieved an R² of 0.719 for inpatient prediction and 0.219 for outpatient prediction. We find that alternative data sources, particularly alcohol consumption, are beneficial for predicting county-level hospitalization rates.
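For readers less familiar with the reported metric, the following is a minimal sketch of how the coefficient of determination (R²) can be computed for a per-capita regression target. It is not the project's evaluation code: the function name `r_squared` and the county values are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = fmean(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical inpatient days per capita for four counties (illustrative only).
observed = [0.2, 0.5, 0.9, 0.4]
predicted = [0.25, 0.45, 0.8, 0.5]
score = r_squared(observed, predicted)
```

A score of 1 means the predictions match the observed values exactly, while a score of 0 means the model does no better than predicting the mean for every county.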