The Delphi Blog

News, findings, and musings from the epidemic tracking and forecasting team.

Using Electronic Health Records to Develop Real-Time COVID-19 Indicators

Aaron Rumack

In previous blog posts, we have written about indicators generated from surveys that we run through Facebook and Google. In addition to these, we have also generated indicators from electronic medical records. We received access to hospital records and insurance claims data from healthcare partners covering 10-15% of the population of the United States. One indicator that we created from this data is the “Doctor Visits” signal, which estimates the percentage of office visits due to COVID-like illness.

Electronic health records are extremely useful in providing information about epidemics in real time. A patient may interact with the healthcare system days before receiving a test (if any), and several more days before receiving a positive result and being reported as a confirmed case by a state’s Department of Health. This can make our signal an early indicator of confirmed cases, and certainly an early indicator of deaths. Early knowledge of an increase in people with COVID-like symptoms is extremely important, as early containment can prevent an outbreak.

Additionally, the reported case counts have other issues:

  • Limited testing capacity: Especially prevalent in the earlier days of the pandemic, limited testing capacity means that there were many people who saw a doctor for COVID-like symptoms but never even received a test. These people would be present in our data but absent from the confirmed counts.
  • Backlogs: A patient can be first reported as a confirmed case can be several days after they were tested. Since our data is linked to the date that the patient interacted with the healthcare system, we report a more accurate date.
  • Revisions: Some states have changed their definition of which tests indicate a confirmed COVID-19 case. This can be done retroactively, including hundreds or thousands of past cases and reporting them as “new” on the day on which they changed their definition. These cases are clearly assigned the wrong date, but we have no way of determining when the test was actually taken. Our data is subject to revisions due to lags in reporting, but the revisions are understood and are assigned the correct date.

Doctor Visits Indicator

The Doctor Visits indicator is based solely on insurance claims. For a given geographic area and a given day, we receive the count of claims that fall into each of five categories:

  1. Total: All claims
  2. Flu: ICD-10 primary code starting with J09, J10, or J11. This is an explicit diagnosis of flu.
  3. COVID-like: ICD-10 primary code in {U07.1, U07.2, B97.29, J12.81, Z03.818, B34.2, J12.89}. Some of these codes correspond to an explicit diagnosis of COVID-19, while others indicate similar symptoms.
  4. Flu-like: ICD-10 primary code of J22 (acute lower respiratory infection) or B34.9 (viral infection, unspecified).
  5. Mixed: ICD-10 primary code of Z20.828 (suspected exposure to COVID-19) or J12.9 (viral pneumonia).

We calculate the percent of COVID-like illness (%CLI) as (COVID-like + Flu-like + Mixed - Flu) / Total. We subtract the Flu counts because they are unrelated to coronavirus. The indicator is available at four geographic levels: county, MSA, HRR, and state. See here for more details.

The signal is available daily starting February 1, 2020 (although understandably, the signal is nearly zero in all locations until mid-March). To preserve privacy and data integrity, we do not report a signal for a given location and date if there are less than 500 total visits in the last seven days before that date. Following these restrictions, we are able to produce estimates for about 2000 counties each day, or about two-thirds of the counties in the country and over 90% of the US population.

Below, we plot two maps to explore the signal. On the left is a heatmap representing the average DV signal for each state over the time period April 15 to May 15. On the right is a heatmap of the average daily confirmed cases per 100,000 people using data from USAFacts during the same time period. We can see that the states with many cases tend to have high %CLI and states with fewer cases tend to have lower %CLI, providing a good sanity check.

library(covidcast)
library(dplyr)
library(gridExtra)
library(ggplot2)
library(stringr)

# Fetch DV % CLI signal and USA-Facts confirmed case incidence proportion at
# the state level
start_day = "2020-04-15"
end_day = "2020-05-15"
df_dv = covidcast_signal("doctor-visits", "smoothed_adj_cli",
                         start_day, end_day, geo_type = "state")
df_in = covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
                         start_day, end_day, geo_type = "state")

# For each state, average the signals over all available times
df_dv_avg = df_dv %>% group_by(geo_value) %>% summarize(value = mean(value))
df_in_avg = df_in %>% group_by(geo_value) %>% summarize(value = mean(value))

# Set a bunch of fields so that the data frames know how to plot themselves
df_dv_avg$time_value = df_in_avg$time_value = start_day
df_dv_avg$issue = df_in_avg$issue = start_day
attributes(df_dv_avg)$geo_type = attributes(df_in_avg)$geo_type = "state"
class(df_dv_avg) = c("covidcast_signal", class(df_dv_avg))
class(df_in_avg) = c("covidcast_signal", class(df_in_avg))

# Plot choropleth maps, using the covidcast plotting functionality
subtitle = paste("Averaged over", start_day, "to", end_day)
p1 = plot(df_dv_avg,
          title = "% COVID-like illness, DV signal",
          range=c(0, 20), choro_params = list(subtitle = subtitle))
p2 = plot(df_in_avg,
          title = "Daily new confirmed COVID-19 cases per 100k pop.",
          range=c(0, 30), choro_params = list(subtitle = subtitle))
grid.arrange(p1, p2, nrow = 1)

Backfill

Electronic medical records have a few interesting challenges not present in the survey data. First, is that the data is available at a significant and variable latency (we call this “backfill”). For claims related to a given day, we will receive only about half of them within the first seven days after that date. Due to backfill, we cannot make estimates of our indicator for a given day until four days have passed, and we will revise the value of that day’s indicator every day. Interestingly, certain types of claims have a longer latency than others. We found that for a given date, as more claims are available, %CLI usually increases. Covid-like claims tend to have a longer latency and thus our early estimates of the indicator usually underestimate the final value.

Weekday Effects

Another challenge is the influence of the day of the week on %CLI. On weekends, both total counts and COVID-like counts decrease, but proportionally, total counts decrease more. This is because people are more likely to delay seeking medical care for non-acute issues than acute issues. The total counts include many visits related to non-acute issues, but almost all COVID-like counts are due to acute issues. Without correcting for this weekday effect, the %CLI curve has a “sawtooth” pattern, spiking on weekends. We derived a method to create a corrected signal that accounts for this weekday effect (for a formal mathematical description of the method, see our API documentation). Below, we see the effect of making these corrections below. When we do not correct for the weekday effect, we see a sawtooth pattern that clearly does not represent true changes in COVID-like illness within a location. However, when we make the weekday correction, we get a smooth curve that looks reasonable. It is important to note that this correction is not time-wise smoothing! Rather, we are making a point-wise correction based on the fitted \(\alpha\) parameters.

start_day = "2020-05-01"
end_day = "2020-08-01"

data_unadjusted <- suppressMessages(covidcast_signal("doctor-visits", "smoothed_cli", start_day = start_day,
                                                     end_day = end_day,geo_type="state"))
data_adjusted <- suppressMessages(covidcast_signal("doctor-visits", "smoothed_adj_cli", start_day = start_day, 
                                                   end_day = end_day,geo_type="state"))

data_unadjusted = data_unadjusted %>%
  select(geo_value, signal, time_value, value) %>%
  tidyr::pivot_wider(names_from=signal,values_from = value)

data_adjusted = data_adjusted %>%
  select(geo_value, signal, time_value, value) %>%
  tidyr::pivot_wider(names_from=signal,values_from = value)

cmb_df = inner_join(data_unadjusted,data_adjusted)

cmb_df$geo_value = str_to_upper(cmb_df$geo_value)
states_to_plot = c("AZ","CA","GA","FL","NJ","NY","PA","TX","WI")
ggplot(cmb_df %>% filter(geo_value %in% states_to_plot)) +
  geom_line(aes(x=time_value,y=smoothed_cli,color="Unadjusted")) +
  geom_line(aes(x=time_value,y=smoothed_adj_cli,color="Adjusted")) +
  facet_wrap(vars(geo_value)) +
  labs(x="Time",y="%CLI",title="Doctor visit signal, with and without weekday adjustment, for nine states")

Evaluating the Indicator

We can evaluate the Doctor Visits indicator by measuring its correlation with cases. We hope that when the Doctor Visits indicator is high, cases are also high. We can measure this correlation across two axes: space and time Let \(c_{it}\) and \(d_{it}\) be the confirmed COVID-19 rate and DV indicator respectively, at location \(i\) and time \(t\). The space-wise correlation of an individual location \(i\) is the rank correlation of \(c_{\cdot t}\) and \(d_{\cdot t}\), and the time-wise correlation of a single day \(t\) is the rank correlation of \(c_{i \cdot}\) and \(d_{i \cdot}\). Below, we plot the distribution of time-wise correlation values for each county with at least 500 total confirmed cases. Most counties have high time-wise correlation values, indicating that when the signal is high in a certain county, the county is likely to have high case incidence.

start_day = "2020-04-15"
end_day = "2020-10-01"

data_adjusted <- suppressMessages(covidcast_signal("doctor-visits", "smoothed_adj_cli", start_day, end_day))
df_cases = suppressWarnings(covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop", start_day, end_day))

case_num = 500
geo_values = suppressWarnings(covidcast_signal("usa-facts", "confirmed_cumulative_num",
                              max(df_cases$time_value), 
                              max(df_cases$time_value))) %>%
  filter(value >= case_num) %>% pull(geo_value)

dv_cases_df = bind_rows(data_adjusted, df_cases) %>%
  filter(geo_value %in% geo_values) %>%
  select(geo_value, signal, time_value, value) %>%
  tidyr::pivot_wider(names_from=signal,values_from=value) %>%
  rename(cases=confirmed_7dav_incidence_prop, dv=smoothed_adj_cli) %>%
  filter(purrr::map_lgl(geo_value,function(fips) { substr(fips,3,5) != "000"})) %>%
  group_by(time_value) %>%
  mutate(rank_dv = rank(dv), rank_cases=rank(cases)) %>%
  ungroup()

ggplot(dv_cases_df %>% group_by(geo_value) %>% summarize(corr = cor(dv,cases,method="spearman",use="na.or.complete")) %>% ungroup()) +
  geom_density(aes(x=corr)) +
  labs(title="Time-wise Correlation between DV signal and case incidence",
       subtitle="Over all counties with at least 500 cumulative cases",
       x="Correlation",y="Density")

Below, we plot the space-wise correlation over time. The correlation remains somewhat steady into the middle of July, but starting in the end of July, the space-wise correlation starts to decrease. By late August, the space-wise correlation is essentially zero, or essentially meaningless. This means that on a given day in April or May, a county with a higher DV signal value was more likely to have a higher case incidence than a county with a lower DV signal value. By August and September, we could no longer say that a county with a higher DV signal was more likely to have a higher case incidence.

ggplot(dv_cases_df %>% group_by(time_value) %>% summarize(corr = cor(dv,cases,method="spearman",use="na.or.complete")) %>% ungroup()) +
  geom_line(aes(x=time_value,y=corr)) +
  labs(title="Space-wise Correlation between DV signal and case incidence",
       subtitle="Over all counties with at least 500 cumulative cases",
       x="Date",y="Correlation")

Correlation Deterioration

This correlation deterioration was shocking to us, so we set out to find out the cause. After testing many hypotheses, we found one cause: some areas of the country have a higher DV “baseline” than others. We illustrate this with the plots below, where we plot the average case incidence and average DV signal for each of the 10 HHS regions (for which regions contain which states, see here).

fips_to_hhs = function(fips) {
  st = as.integer(substr(fips,1,2))
  if (st %in% c(9,23,25,33,44,50)) {
    return(1)
  } else if (st %in% c(34,36,72,78)) {
    return(2)
  } else if (st %in% c(10,11,24,42,51,54)) {
    return(3)
  } else if (st %in% c(1,12,13,21,28,37,45,47)) {
    return(4)
  } else if (st %in% c(17,18,26,27,39,55)) {
    return(5)
  } else if (st %in% c(5,22,35,40,48)) {
    return(6)
  } else if (st %in% c(19,20,29,31)) {
    return(7)
  } else if (st %in% c(8,30,38,46,49,56)) {
    return(8)
  } else if (st %in% c(4,6,15,32,60,66,69)) {
    return(9)
  } else if (st %in% c(2,16,41,53)) {
    return(10)
  } else {
    return(0)
  }
}
dv_cases_df = dv_cases_df %>% mutate(hhs = purrr::map_dbl(geo_value,fips_to_hhs))
dv_cases_df = dv_cases_df %>% left_join(county_census %>% select(FIPS,POPESTIMATE2019) %>% rename(geo_value=FIPS,pop=POPESTIMATE2019))

ggplot(dv_cases_df %>% group_by(time_value,hhs) %>% summarize(dv=sum(dv*pop)/sum(pop),cases=sum(cases*pop)/sum(pop)) %>% ungroup()) +
  geom_line(aes(x=time_value,y=cases,group=hhs,color=as.factor(hhs))) +
  labs(title="Mean Case Incidence by HHS Region",
       x="Date",y="Cases",color="HHS")

ggplot(dv_cases_df %>% group_by(time_value,hhs) %>% summarize(dv=sum(dv*pop/sum(pop),na.rm=T),cases=sum(cases*pop/sum(pop),na.rm=T)) %>% ungroup()) +
  geom_line(aes(x=time_value,y=dv,group=hhs,color=as.factor(hhs))) +
  labs(title="Mean DV Signal by HHS Region",
       x="Date",y="Value",color="HHS")

Let’s look more closely at the behavior of HHS 2 (New Jersey and New York). We will compare the two curves in two ways: When the case incidence is high (or low) in HHS 2 compared to other days, is the DV signal also high (or low)? When the case incidence is high (or low) in HHS 2 compared to other HHS regions, is the DV signal also high (or low) in HHS 2? The first question asks how high the time-wise correlation is, and the second questions asks how high the space-wise correlation is.

The answer to the first question is largely yes. The case incidence is steadily decreasing throughout April and May, flattening out by mid-June. The DV signal peaks a little later than case incidence, but decreases throughout May and is also mostly flat by mid-June. This tells us that the time-wise correlation is good within HHS 2. However, the answer to the second question reveals an issue with the DV signal. The case incidence is the highest of all HHS regions in May and into June, but by July, HHS 2 and HHS 1 (New England) have the lowest case incidence. This is not true with the DV signal. HHS 2 has the highest DV signal almost throughout the entire time period, even when its case incidence is the lowest. With these plots, we can point out one reason why the space-wise correlations begin to decline starting in August: counties in HHS 2 tended to have low case incidence but a high DV signal, driving down the correlations.

We have now identified the problem: a DV signal of, say 5%, might mean a very low case incidence for a county in New York but a very high case incidence for a county in Oregon. Since we have six months worth of history for both the DV signal and case incidence, we can correct for this problem by using the DV signal to predict the case incidence in a location-specific manner. In our case, we will use a simple linear regression model to fit the DV signal to case incidence. We call this turning a signal into a sensor. In the past, the Delphi group has done extensive work in creating sensors from signals and developing “sensor fusion,” which combines multiple sensors into a single estimate. This work is out of the scope of this blog post, although it may be a topic for a future post. For a deeper look, see chapter 4 of Farrow 2016 and Jahja et al. 2019.

We make these sensors in two ways: 1) A single regression fit across the entire time series, and 2) fitting each day, using data from the past 6 weeks. The first method “cheats” by using data that was not yet available (e.g. the DV signal’s value for May 1 is corrected using data up to October 1). Therefore, its performance is not necessarily indicative of a system that makes this correction in real time. The second method is mostly real-time, with the exception of backfill of the DV signal, as discussed above.

From the graphs above, we see that correcting the DV signal by turning it into a sensor greatly improved the space-wise correlations! Furthermore, the correlation is relatively constant over time, demonstrating that the data quality did not necessarily decrease over time, since a simple linear correction stopped the deterioration.

Open Questions and Future Directions

We were happy to see that a location-specific correction was able to greatly increase the space-wise correlation. By looking at the correlations of the two corrections in the graph above, we can see that they perform quite differently early on, and perform more similarly after mid-July. This indicates that the regression coefficients of the real-time correction, which are fit using a sliding window, change significantly over time. We would like to look into this non-stationary relationship between the DV signal and case incidence to better understand why it changes, and why counties in New York and New Jersey have such a different relationship in the first place.

We would also like to see how this data compares to that of other healthcare insurance providers. Some insurance providers cover different populations with different rates, which could make their signal more or less correlated with the reported case incidence. We would expect that by combining datasets and covering more people, we can produce a more useful signal.

We believe that this dataset will be useful for forecasting Covid cases and deaths, since a doctor visit usually precedes a confirmed test and death. We can compare time-wise and space-wise correlations for cases lagged by a week or two, as well as deaths. Forecasting models can also take the current DV signal as a feature to help predict what the case incidence will be several weeks into the future.


Aaron Rumack is a Ph.D. student in the Machine Learning Department, advised by Roni Rosenfeld. He has been a member of the Delphi group since 2017.