A/B Testing in healthcare: study design and identification of bias

Miao Wang
6 min read · Jan 9, 2022

A Toy Example (but kind of realistic) …

Now, imagine the following:

You work at an insurance firm. The insurance company is highly motivated to lower "unnecessary" medical spending by its members. Your boss has observed that members who do annual check-ups regularly tend to spend much less than the rest. The cost difference remains significant after adjusting for confounders, so leadership decided to launch an A/B test to validate this hypothesis (A/B testing being the most reliable way to draw a causal conclusion).

Photo by Alexandr Podvalny on Unsplash

For the past two years, an A/B testing study has been running to measure the impact of annual check-ups on next year's medical spending. 100 members were placed in the holdout group and 1,000 members were placed in the experiment group, with members assigned to the two groups randomly. For members in the experiment group, a nurse attempts to call them and remind them to get an annual check-up.

Now, 2 years have passed. You are asked to evaluate the impact of this intervention (the annual check-up reminder call) on medical cost savings.

Step 1: Define what an "engaged" member is

The first thing to realize is that not everyone in the experiment group will do an annual check-up. There is a natural funnel from the initially identified (targeted) population to the finally affected population (which we refer to as "engaged"). We lose some members because of limited call capacity, some because they are unable to be reached (UTR), and some because they are unwilling to participate. The remaining members who picked up the call and actually agreed are our final population of interest (aka the "treated" group).

Most importantly, we need to define what "engaged" means specifically for the purpose of our study.

In this study, since we want to assess the impact of annual check-ups, we decided to rely on medical claims data (rather than the member's response at the time of the call). And since anything that happens more than 3 months after the call is unlikely to be related to our intervention, we define an "engaged" member as someone who (1) had a call with our nurse and (2) actually had a medical check-up within 3 months of the call.
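
As a minimal R sketch of this definition (the table and column names below are assumptions for illustration, not the actual claims schema):

library(dplyr)

# Toy inputs; real data would come from call logs and medical claims
calls <- data.frame(member_id = c(1, 2),
                    call_date = as.Date(c("2020-01-10", "2020-02-01")))
claims <- data.frame(member_id = c(1, 2, 2),
                     service_date = as.Date(c("2020-02-15", "2020-08-01", "2019-12-01")),
                     is_checkup = c(TRUE, TRUE, FALSE))

engaged_members <- calls %>%
  inner_join(filter(claims, is_checkup), by = "member_id") %>%
  filter(service_date >= call_date,
         service_date <= call_date + 90) %>%   # check-up within ~3 months of the call
  distinct(member_id)
# member 1 is "engaged" (check-up 36 days after the call); member 2 is not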

Once we have defined the "engaged" member, the final goal is clear: compare the "engaged" members vs the "holdout" members on their medical spending.

Image by author: Carefully read the picture and make sure you understand it!

You identify there are:

  • 800 outreached members (80% outreach rate among the 1,000 targeted)
  • 500 reached members (62.5% reach rate among those outreached)
  • 100 engaged members (20% engagement rate among those reached)

Intent to Treat

In some scenarios, you might prefer to use the entire initial experiment group as the "treatment" group for evaluation. This is called Intent to Treat (ITT). The benefit of ITT analysis is that you can draw inferences directly, without any adjustment or correction, since the initial experiment and control groups were selected randomly. However, the drawback of ITT analysis is that the "treatment" effect will be diluted, since only a subset of the experiment group actually received the treatment (a rough back-of-envelope of this dilution is sketched below).
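
As a rough back-of-envelope (assuming the reminder has essentially no effect on members who never engage): only 100 of the 1,000 experiment members end up engaged, so

ITT_effect ≈ engagement_share * effect_among_engaged ≈ 0.10 * effect_among_engaged

i.e. the ITT estimate is roughly a 10x dilution of the effect among the engaged.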

Step 2: Identify potential bias introduced during the funnel

Most studies will have some bias even when carefully designed. For that reason, your original holdout group might not be the best candidate as the baseline group. It is therefore very important to check for bias at each step of the study flow:

  • Step 1: study design bias (Targeted vs Holdout)
  • Step 2: outreach bias (Outreached vs Non-outreached within Targeted)
  • Step 3: response bias (Reached vs Unable to Reach (UTR) within Outreached)
  • Step 4: engagement bias (Engaged vs Non-engaged within Reached)

2.1 Quantifying Bias: Covariates Balance between two groups

To quantify bias, we first need a set of covariates that are available before the study (for example, age, gender, prior history with customer service, prior conditions, prior office visits, etc.).

Then, we need to calculate summary statistics that capture the differences between the two groups (e.g., targeted vs holdout), such as:

Standardized Mean Difference (SMD)

  • Usually, SMD > 0.1 implies an imbalance of the given feature
x_0: vector of a given covariate among the control group (0)
x_1: vector of the same covariate among the experiment group (1)
pooled_sd = sqrt([var(x_0) + var(x_1)]/2)
SMD = (mean(x_1) - mean(x_0))/pooled_sd
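
As a minimal R sketch of this formula (the helper name smd() and the example column are purely illustrative):

smd <- function(x_1, x_0) {
  pooled_sd <- sqrt((var(x_1) + var(x_0)) / 2)
  (mean(x_1) - mean(x_0)) / pooled_sd
}
# e.g. smd(df$age[df$group == 1], df$age[df$group == 0])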

Variance Ratio

  • Usually, Variance Ratio > 2 implies an imbalance of the given feature
x_0: vector of a given covariate among the control group (0)
x_1: vector of the same covariate among the experiment group (1)
Var_ratio = var(x_1) / var(x_0)
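
The same check as an R one-liner (reusing the naming above):

var_ratio <- function(x_1, x_0) var(x_1) / var(x_0)
# flag imbalance when the ratio is far from 1 (the > 2 rule of thumb above;
# a very small ratio, e.g. < 0.5, is arguably the mirror-image warning)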

Empirical Cumulative Distribution Function (ECDF) distance

  • Get the empirical CDF for the target and control groups
  • Calculate the average distance or the maximum distance between the two ECDF curves (see the short sketch below)
  • Usually, SMD and Variance Ratio should be enough to assess variable balance between the two groups.
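
A small R sketch using base R's ecdf() (the maximum-distance version is essentially the Kolmogorov-Smirnov statistic):

ecdf_distance <- function(x_1, x_0) {
  grid <- sort(unique(c(x_1, x_0)))
  d <- abs(ecdf(x_1)(grid) - ecdf(x_0)(grid))
  c(mean_dist = mean(d), max_dist = max(d))
}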

2.2 Summarizing balance into 1 metric

The above metrics (SMD, Variance Ratio) are all calculated per variable. What if we want a single quantity that reflects the overall balance of the cohort?

There are two ways:

Propensity Score

  • Think of the propensity score as a composite balance score over all selected confounders.
  • One can fit the group assignment (target vs control) on all variables to obtain a predicted probability of assignment (aka the propensity score) for each observation.
  • Finally, we can calculate the SMD or Variance Ratio of the propensity score as an overall quantitative measure of data imbalance (see the sketch below).
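
A minimal R sketch of this idea (a logistic regression; the covariate and column names are purely illustrative):

# df: one row per member, group = 1 for target, 0 for holdout, plus pre-study covariates
ps_model <- glm(group ~ age + gender + prior_checkup + prior_visits,
                data = df, family = binomial())
df$ps_score <- predict(ps_model, newdata = df, type = "response")
# balance of the composite score, reusing smd() from above
smd(df$ps_score[df$group == 1], df$ps_score[df$group == 0])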

Prognostic Score

  • If we have an outcome of interest (next year's spending), we can fit the outcome on the selected variables, using the control group only.
  • Then create the predicted outcome for all observations (both control and target groups).
  • Finally, we can calculate the SMD or Variance Ratio of the prognostic score as an overall quantitative measure of data imbalance (see the sketch below).
  • From the literature, the prognostic score appears to be a better measure of balance than the propensity score, even under mild misspecification of the model (e.g. a missing confounder). I would suggest looking at both if possible.
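
And a matching sketch for the prognostic score (same illustrative names; next_year_spend stands in for the outcome):

# fit the outcome model on the control (holdout) group only
prg_model <- lm(next_year_spend ~ age + gender + prior_checkup + prior_visits,
                data = df[df$group == 0, ])
# predicted outcome (prognostic score) for every observation
df$prg_score <- predict(prg_model, newdata = df)
smd(df$prg_score[df$group == 1], df$prg_score[df$group == 0])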

2.3 Illustration of the balance evaluation

I created an R function that one can use on any data frame (or even a Spark data frame) to assess balance, incorporating all of the methods mentioned above.

View full codes here: https://github.com/miaow27/causal_inference_util/blob/master/balance_eval.R

Below is a sample output that looks at variable balance between the target and control groups across age, gender, health risk, prior check-ups, prior medical utilization, and prior exposure to customer service.

ps_score and prg_score are also created. We can see that ps_score has an SMD of 0.333 (> 0.1), which would be considered imbalanced. On the other hand, the prognostic score has an SMD of 0.0168 (< 0.1), which suggests the data is balanced overall. In this case, the prognostic score is probably the more accurate read, since the split between target and control was created randomly.

Image by author: sample output from my function

Step 3: Conclusion on the sources of bias

Once you have evaluated variable balance across the 4 comparisons above, you should be able to answer:

  1. Are there any particularly imbalanced variables?
  2. Which step in the study introduced the largest bias?

Based on the answers to 1) and 2), you should have an informed discussion with your business partners and make them aware of these biases before presenting the evaluation results.

The entire exercise should also give you a good idea of whether you need to perform any adjustment for the evaluation, and if so, which covariates to adjust for.

4 Further Reading
