Importance: Delays in starting cancer treatment disproportionately affect vulnerable populations and can influence patients' experience and outcomes. Machine learning algorithms incorporating electronic health record (EHR) data and neighborhood-level social determinants of health (SDOH) measures may identify at-risk patients.

Objective: To develop and validate a machine learning model for estimating the probability of a treatment delay using multilevel data sources.

Design, Setting, and Participants: This cohort study evaluated 4 different machine learning approaches for estimating the likelihood of a treatment delay greater than 60 days (group least absolute shrinkage and selection operator [LASSO], bayesian additive regression tree, gradient boosting, and random forest). Criteria for selecting between approaches were discrimination, calibration, and interpretability/simplicity. The multilevel data set included clinical, demographic, and neighborhood-level census data derived from the EHR, cancer registry, and American Community Survey. Patients with invasive breast, lung, colorectal, bladder, or kidney cancer diagnosed from 2013 to 2019 and treated at a comprehensive cancer center were included. Data analysis was performed from January 2022 to June 2023.

Exposures: Variables included demographics, cancer characteristics, comorbidities, laboratory values, imaging orders, and neighborhood variables.

Main Outcomes and Measures: The outcome estimated by machine learning models was likelihood of a delay greater than 60 days between cancer diagnosis and treatment initiation. The primary metric used to evaluate model performance was area under the receiver operating characteristic curve (AUC-ROC).

Results: A total of 6409 patients were included (mean [SD] age, 62.8 [12.5] years; 4321 [67.4%] female; 2576 [40.2%] with breast cancer, 1738 [27.1%] with lung cancer, and 1059 [16.5%] with kidney cancer). A total of 1621 (25.3%) experienced a delay greater than 60 days. The selected group LASSO model had an AUC-ROC of 0.713 (95% CI, 0.679-0.745). Lower likelihood of delay was seen with diagnosis at the treating institution; first malignant neoplasm; Asian or Pacific Islander or White race; private insurance; and lacking comorbidities. Greater likelihood of delay was seen at the extremes of neighborhood deprivation. Model performance (AUC-ROC) was lower in Black patients, patients with race and ethnicity other than non-Hispanic White, and those living in the most disadvantaged neighborhoods. Though the model selected neighborhood SDOH variables as contributing variables, performance was similar when fit with and without these variables.

Conclusions and Relevance: In this cohort study, a machine learning model incorporating EHR and SDOH data was able to estimate the likelihood of delays in starting cancer therapy. Future work should focus on additional ways to incorporate SDOH data to improve model performance, particularly in vulnerable populations.
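The outcome definition and the primary metric described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. All function names and the toy data are hypothetical. The percentile bootstrap used for the interval is an assumption, since the abstract does not say how the 95% CI was computed. The per-group breakdown only mirrors the idea of comparing AUC-ROC across patient subgroups.

```python
from datetime import date
import random

DELAY_THRESHOLD_DAYS = 60  # threshold stated in the abstract


def is_delayed(diagnosis: date, treatment_start: date) -> bool:
    """True when treatment began more than 60 days after diagnosis."""
    return (treatment_start - diagnosis).days > DELAY_THRESHOLD_DAYS


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random delayed patient gets a
    higher score than a random non-delayed one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC-ROC (an assumed method)."""
    if len(set(labels)) < 2:
        raise ValueError("need both outcome classes")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        stats.append(auc_roc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def auc_by_group(labels, scores, groups):
    """AUC-ROC computed separately within each subgroup label."""
    result = {}
    for g in set(groups):
        ys = [y for y, gg in zip(labels, groups) if gg == g]
        ss = [s for s, gg in zip(scores, groups) if gg == g]
        result[g] = auc_roc(ys, ss)
    return result


# Toy, made-up data: 1 = delay greater than 60 days, scores = model output.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
toy_groups = ["a", "a", "a", "b", "b", "b"]

overall_auc = auc_roc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores)
per_group = auc_by_group(toy_labels, toy_scores, toy_groups)
```

In practice a library routine such as scikit-learn's `roc_auc_score` would replace the hand-written `auc_roc`; the pairwise version is used here only to keep the sketch dependency-free and to make the definition of the metric explicit.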

Original language: English
Pages (from-to): e2328712
Journal: JAMA Network Open
Issue number: 8
State: Published - Aug 1 2023


  • Humans
  • Middle Aged
  • Cohort Studies
  • Risk Assessment/methods
  • Bayes Theorem
  • Carcinoma, Renal Cell
  • Kidney Neoplasms


Title: Development of a Multilevel Model to Identify Patients at Risk for Delay in Starting Cancer Treatment
