A substantial fraction of sexually transmitted infections (STIs) occur in patients who have previously been treated for an STI. We assessed whether routine electronic health record (EHR) data can predict which patients presenting with an incident STI are at greatest risk for additional STIs in the next 1-2 years.
We used structured EHR data on patients aged ≥15 years who acquired an incident STI diagnosis during 2008-2015 in eastern Massachusetts. We applied machine learning algorithms to model the risk of acquiring ≥1 or ≥2 additional STI diagnoses within 365 or 730 days following the initial diagnosis, using over 180 different EHR variables. We performed a sensitivity analysis incorporating state health department surveillance data to assess whether improving the accuracy of identifying STI cases improved algorithm performance.
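As a minimal sketch of how the four outcomes described above (≥1 or ≥2 additional diagnoses within 365 or 730 days) could be derived from diagnosis dates, consider the following. The function names and example dates are hypothetical, and the sketch assumes each follow-up date is already a distinct STI episode; the episode definition actually used in the study is not stated here.

```python
from datetime import date

def count_repeat_diagnoses(index_date, later_dates, window_days):
    """Count diagnoses dated 1..window_days days after the index diagnosis."""
    return sum(1 for d in later_dates if 0 < (d - index_date).days <= window_days)

def outcome_label(index_date, later_dates, window_days=365, min_repeats=1):
    """Return 1 if at least min_repeats additional diagnoses fall in the window, else 0."""
    n = count_repeat_diagnoses(index_date, later_dates, window_days)
    return int(n >= min_repeats)

# Hypothetical patient: index diagnosis plus two later diagnoses.
index = date(2010, 3, 1)
followups = [date(2010, 9, 15), date(2011, 11, 2)]

labels = {
    (window, k): outcome_label(index, followups, window, k)
    for window in (365, 730)
    for k in (1, 2)
}
print(labels)
```

Each (window, minimum count) pair yields a separate binary outcome, so the same patient can be positive for one outcome definition and negative for another.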
We identified 8,723 incident episodes of laboratory-confirmed gonorrhea, chlamydia, or syphilis. Bayesian Additive Regression Trees, the best-performing single algorithm, had a cross-validated area under the receiver operating characteristic curve (cv-AUC) of 0.75. Receiver operating characteristic curves for this algorithm showed a poor balance between sensitivity and positive predictive value (PPV): a predicted probability threshold with a sensitivity of 91.5% had a corresponding PPV of 3.9%, whereas a higher threshold with a PPV of 29.5% had a sensitivity of 11.7%. Attempting to improve classification of patients with and without repeat STI diagnoses by incorporating health department surveillance data had minimal impact on cv-AUC.
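The trade-off between sensitivity and PPV at different probability thresholds can be illustrated with a small sketch. The labels and scores below are invented toy values, not study data, and the function name is hypothetical; the sketch only shows how the two metrics are computed once a threshold turns predicted probabilities into binary predictions.

```python
def sensitivity_and_ppv(y_true, y_score, threshold):
    """Compute (sensitivity, PPV) when scores >= threshold are called positive."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sens, ppv

# Toy data (assumed for illustration): 2 true repeat cases among 10 patients.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.25, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05]

low = sensitivity_and_ppv(y_true, y_score, 0.2)    # permissive threshold
high = sensitivity_and_ppv(y_true, y_score, 0.85)  # strict threshold
print(low, high)
```

In this toy example the permissive threshold captures every true case but flags several non-cases, while the strict threshold flags only true cases but misses half of them, which is the same qualitative pattern as the reported thresholds.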
Machine learning algorithms using structured EHR data did not differentiate well between patients with and without repeat STI diagnoses. Alternative strategies that can account for socio-behavioral characteristics could be explored.