Introduction to TMLE

Miao Wang
4 min read · Jun 9, 2021

Resources:

- Paper
- R library
- TMLE

1. Level Set

1a. Final Goal: Create a counterfactual dataset

Ideally, we want counterfactual outcomes for each observation:

for i = 1, …, n:
- Outcome if treated: Yi(A=1)
- Outcome if control: Yi(A=0)

However, in reality, we only observe one of the two outcomes for each observation.
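This "missing half" can be made concrete with a small simulation (all numbers and variable names here are hypothetical, just to illustrate the setup):

```python
import random

random.seed(0)

# Each unit has two potential outcomes, but we only ever observe
# the one that matches its actual treatment A.
n = 5
rows = []
for i in range(n):
    y1 = 1.0 + random.random()    # Yi(A=1), outcome if treated
    y0 = random.random()          # Yi(A=0), outcome if control
    a = random.randint(0, 1)      # actual treatment assignment
    y_obs = y1 if a == 1 else y0  # the only outcome we get to see
    rows.append({"A": a, "Y_obs": y_obs, "Y1": y1, "Y0": y0})

# In real data, the Y1/Y0 column for the "road not taken" is missing;
# TMLE's goal is to fill it in.
for r in rows:
    print(r["A"], round(r["Y_obs"], 2))
```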

1b. Where does TMLE come from?

P(Y, A, W) = P(Y|A, W) * P(A|W) * P(W)

If we just run a regular regression Y ~ A + W, then in order to get a consistent estimate for Y, we have to make sure that:
- (1) all confounding factors W are included
- (2) the functional form from A, W to Y is correctly specified

However, there is a chance to "correct" the estimate even if we mis-specify P(Y|A, W), by utilizing what we saw in P(A|W).

The theory is complex, but TMLE can provide the "correct answer" as long as one of the two models is correctly specified: P(A|W) or P(Y|A, W). Moreover, if both models are correct, it is a more efficient estimator than regular regression.
This property is referred to as "double robustness".

2. Detailed Steps

2.1 Get an initial estimate of the counterfactual data

Use regression/other ML models: fit Y ~ A + W

Then we can get the initial counterfactual data

The outcome model is referred to as the "Q" function, denoted Q0_i(A, W)

for i = 1, …, n:
- Estimated Outcome when A = 1: Q0_i(1, W)
- Estimated Outcome when A = 0: Q0_i(0, W)
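As a sketch of this step on a toy dataset (a saturated stratum-mean model stands in for a generic regression/ML fit; all names and numbers are hypothetical):

```python
import random

random.seed(1)

# Toy data: binary confounder W, binary treatment A, continuous outcome Y.
n = 200
data = []
for _ in range(n):
    w = random.randint(0, 1)
    a = 1 if random.random() < 0.3 + 0.4 * w else 0  # A depends on W
    y = 1.0 * a + 0.5 * w + random.gauss(0, 0.1)     # true effect of A is 1.0
    data.append((w, a, y))

# Initial Q estimate: with a single binary W, a saturated model is just
# the mean outcome within each (A, W) stratum.
sums, counts = {}, {}
for w, a, y in data:
    sums[(a, w)] = sums.get((a, w), 0.0) + y
    counts[(a, w)] = counts.get((a, w), 0) + 1
q0 = {k: sums[k] / counts[k] for k in sums}

# Initial counterfactual predictions Q0_i(1, W) and Q0_i(0, W) for everyone.
q1w = [q0[(1, w)] for w, a, y in data]
q0w = [q0[(0, w)] for w, a, y in data]
print(round(sum(q1w) / n - sum(q0w) / n, 2))  # crude plug-in effect estimate
```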

2.2 Get the probability of A = 1 given W (the propensity score)

Use regression/other ML models: fit A ~ W

Then we know the probability of assignment

The probability of assignment is referred to as the "g" function, denoted g0_i(W)

for i = 1, …, n:
- Estimated probability that A = 1: g0_i(W)
- Estimated probability that A = 0: 1 - g0_i(W)
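A minimal sketch of the g step on toy data (a stratum proportion stands in for a fitted model; everything here is hypothetical):

```python
import random

random.seed(2)

# Toy data: binary confounder W; treatment A is more likely when W = 1.
n = 1000
data = []
for _ in range(n):
    w = random.randint(0, 1)
    a = 1 if random.random() < 0.3 + 0.4 * w else 0  # true P(A=1|W) = 0.3 + 0.4W
    data.append((w, a))

# g estimate: with a single binary W, P(A=1 | W) is just the treated
# fraction within each W stratum.
treated = {0: 0, 1: 0}
counts = {0: 0, 1: 0}
for w, a in data:
    treated[w] += a
    counts[w] += 1
g = {w: treated[w] / counts[w] for w in (0, 1)}

# Per-observation assignment probabilities g0_i(W) and 1 - g0_i(W).
p_treat = [g[w] for w, a in data]
p_control = [1 - g[w] for w, a in data]
print(round(g[0], 2), round(g[1], 2))
```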

2.3 Calculate Adjustment and Estimate Fluctuation (step-size)

The adjustment is the inverse of the assignment probability.

Assume we are given a step-size e1, e0:

Q1_i(A, W) = Q0_i(A, W) + e1 * H1(A, W) + e0 * H0(A, W),

where
H1(A, W) = I(A = 1) / g0_i(W)
H0(A, W) = I(A = 0) / (1 - g0_i(W))

e1, e0 are the "fluctuations" (step sizes) for A = 1 and A = 0 respectively.
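Plugging hypothetical numbers into the update formula above (this mirrors the additive update shown in this article; for a binary outcome the update is usually done on the logit scale):

```python
# Given initial predictions and propensity scores, compute the clever
# covariates H1, H0 and apply the update for a given (e1, e0).
# All numbers here are hypothetical.
g = [0.4, 0.7, 0.5]       # g0_i(W) = P(A=1 | W_i)
q1w = [0.60, 0.80, 0.55]  # Q0_i(1, W)
q0w = [0.30, 0.50, 0.25]  # Q0_i(0, W)
e1, e0 = 0.02, -0.01      # pretend step sizes

h1 = [1.0 / gi for gi in g]          # H1(1, W) = 1 / g
h0 = [1.0 / (1.0 - gi) for gi in g]  # H0(0, W) = 1 / (1 - g)

# Updated counterfactual predictions for each arm.
q1w_new = [q + e1 * h for q, h in zip(q1w, h1)]
q0w_new = [q + e0 * h for q, h in zip(q0w, h0)]
print([round(x, 3) for x in q1w_new])
print([round(x, 3) for x in q0w_new])
```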

Estimation of fluctuations

We usually use a logistic regression (MLE) to get the best estimates for e1 and e0. Imagine the simplest scenario, where the outcome Y is binary:

After step 1, we have a predicted probability pi-hat for each outcome (ranging from 0 to 1). We then fit the following logistic model, using logit(pi-hat) as an offset: logit(E[Y]) ~ offset(logit(pi-hat)) + e0 * H0(0, W) + e1 * H1(1, W), where H0 and H1 are the values from step 2.

The MLE estimates from the above model are the best e0 and e1.

Note: for a continuous outcome Y, we could fit a linear model instead to get e0 and e1.
However, because a linear model is unbounded, this might give extreme values that are out of range.
If we know min(Y) = a and max(Y) = b, then we can transform Y to Y* = (Y - a)/(b - a), which is bounded between 0 and 1.
We can fit the logistic regression on Y* instead.
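A sketch of estimating one fluctuation parameter by Newton-Raphson MLE, for the treated arm with a binary outcome (the data and numbers are made up; this fits eps in logit(p_i) = logit(Q0_i) + eps * H1_i):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def fit_epsilon(y, q0, h, iters=25):
    """Fit eps in logit(p_i) = logit(q0_i) + eps * h_i by Newton-Raphson MLE."""
    eps = 0.0
    for _ in range(iters):
        p = [expit(logit(q) + eps * hi) for q, hi in zip(q0, h)]
        # Score and Fisher information of the one-parameter logistic model.
        score = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, p))
        info = sum(hi * hi * pi * (1 - pi) for hi, pi in zip(h, p))
        if info == 0:
            break
        eps += score / info
    return eps

# Hypothetical treated-arm data: binary outcomes, initial Q predictions,
# and clever covariate H1 = 1 / g(W).
y = [1, 0, 1, 1, 0, 1]
q0 = [0.7, 0.6, 0.8, 0.5, 0.4, 0.9]
h = [1 / 0.5, 1 / 0.4, 1 / 0.6, 1 / 0.5, 1 / 0.3, 1 / 0.7]
e1_hat = fit_epsilon(y, q0, h)
print(round(e1_hat, 3))
```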

2.4 Update the next iteration outcome

for i = 1, …, n:

- Next iteration outcome when A = 1: Q1_i(1, W) = Q0_i(1, W) + e1-hat * H1(1, W)
- Next iteration outcome when A = 0: Q1_i(0, W) = Q0_i(0, W) + e0-hat * H0(0, W)

Once we have the new outcomes, we re-estimate e1 and e0 and update the next iteration of Q, repeating until convergence.

At the end, we will have counterfactual data for each observation, as well as the P-value, point estimate, and CI for the ATE, ATT, RR, and odds ratio of the outcome.
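Given the final counterfactual matrix, the plug-in ATE is just the mean difference of the two columns (the numbers below are hypothetical):

```python
# Hypothetical final counterfactual predictions, one (Q1W, Q0W) pair per unit.
qstar = [(0.93, 0.11), (0.88, 0.15), (0.95, 0.20)]

# ATE = mean(Q1W - Q0W) over all observations.
ate = sum(q1 - q0 for q1, q0 in qstar) / len(qstar)
print(round(ate, 3))
```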

3. Frequent Questions

1. How good should the Q initial estimation be (in terms of R-square)?

It is mentioned in the paper that we should not overfit the initial Q estimation step, because that will minimize the residual signal (which makes it harder to do the bias reduction). If we have two models for the initial Q estimation, one with an R-square of 20% and one with 30%, which one should we use?

-> If both models have low R-square, as in this case, picking the one with the higher R-square should be fine. We want to avoid a very high R-square, because that leaves little signal for the g estimation.

2. Can we calculate estimate of ATT and ATC from Qstar?

After TMLE, we get Qstar (n x 2), the final counterfactual dataset. It is easy to verify that ATE = mean(Q1W - Q0W). How about the ATT and ATC?

-> No! Qstar is the counterfactual dataset when the target is the ATE. For the ATT/ATC, there is a different corresponding IC (influence curve), which is not shown in the final output.
We have to use the reported ATT and ATC from the output.

3. What is the relationship between TMLE outcome and Matching outcome?

-> I think the ATT result from TMLE will be closest to the 1:1 nearest-neighbor matching result.

To recall, 1:1 matching tries to find, for every treated member, someone in the control group who is very similar in all aspects.

In a perfect world, where we did both the matching and TMLE perfectly, I would expect the ATT to be similar to the matching result. However, in reality, the ATT from TMLE tends to be lower (more on the conservative side) than the matching result.
