# sherlock: Causal Machine Learning for Segment Discovery

`sherlock` allows us to use data from A/B tests, observational studies, or quasi-experiments to discover which subgroups or segments of users would benefit from (or be harmed by) a given treatment of interest. The criterion for benefit or harm can be either absolute or subject to cost constraints.
In these Causal Segment Discovery problems, we are given a set of segment dimensions/covariates that are not affected by the treatment (e.g. {country, device}). These are often a subset of the baseline covariates used to adjust for treatment effect confounding (in the case of non-A/B data) and/or to reduce estimation variance. A segment or subgroup is a particular realization of the segment dimensions (e.g. {country=US, device=iphone12}). Our task is to learn the treatment effect heterogeneity across the segments, and which segments would satisfy a given treatment benefit criterion. The causal aspect lies in the fact that the criterion is a function of the causal treatment effects on the segments, such as the Conditional Average Treatment Effect (CATE) on a segment. This task results in a segmentation (or partition) of the population into segments that should receive treatment vs. those that should not.
`sherlock` implements a modular, doubly robust machine learning framework for Causal Segment Discovery (first proposed by VanderWeele et al. (2019), van der Laan and Luedtke (2015), Luedtke and van der Laan (2016b), and Luedtke and van der Laan (2016a)). `sherlock` also provides doubly robust effect measures to assess how well a learned segmentation achieves the intended criterion.

Before installing `sherlock`, please make sure that you have the latest version of `sl3`, which is the workhorse of our model estimations. You can do so via
remotes::install_github("tlverse/sl3@devel", dependencies = FALSE)
`sl3` implements an ensemble machine learning framework called the Super Learner. At the current version, `sl3` has ~40 of the most popular machine learning methods to choose from as base learners.
Once the `sherlock` package is available on CRAN, it can be installed via:
install.packages("sherlock")
Alternatively, it can be installed from the public `sherlock` repo on Netflix GitHub:
remotes::install_github("Netflix/sherlock")
Another option is to download the package from GitHub and install the package tarball locally.
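For the tarball route, a minimal sketch (the file name and version here are hypothetical):

# assuming the tarball was downloaded into the current working directory
install.packages("sherlock_0.1.0.tar.gz", repos = NULL, type = "source")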
Once installed, `sherlock` and `sl3` can be loaded via

library(sherlock)
library(sl3)
Overall, there are 3 main modules in this Causal Segment Discovery framework:

1. Estimating the CATE on the segments (`sherlock_calculate` function)
2. Learning the segmentation, i.e., which segments to treat (`watson_segment` function)
3. Evaluating the effectiveness of the resulting segmentation (`mycroft_assess` function)

For the classes of problems addressed in this framework, step 1 is the same regardless of the segmentation criterion, since it learns the effect of the treatment on each segment. Steps 2 and 3 depend on the segmentation criterion and on what constitutes an appropriate effectiveness evaluation. Hence, step 1 only needs to be run once; after that, we can run different segmentation criteria and corresponding effectiveness evaluations.
We illustrate the steps with a synthetic data example. Suppose we want to learn whether a new UI treatment would have a differential impact on segments of our users, where segments are defined by the number of devices they have and their tenure.
data(data_example)
head(data_example)
#> num_devices is_p2plus is_newmarket baseline_ltv baseline_viewing treatment
#> 1: 2 1 0 0.3500025 0.01014789 0
#> 2: 4 0 0 0.8144417 1.56213812 1
#> 3: 3 0 1 0.0000000 0.41284605 0
#> 4: 5 1 0 0.0000000 0.00000000 0
#> 5: 5 1 0 0.0000000 0.71454993 1
#> 6: 1 1 0 0.0000000 0.58507617 0
#> outcome_viewing
#> 1: 0.3500025
#> 2: 5.0144417
#> 3: 1.0000000
#> 4: 0.0000000
#> 5: -3.0000000
#> 6: 0.0000000
Here the column `treatment` represents the treatment of interest \(A\), and the column `outcome_viewing` represents the outcome of interest \(Y\). All the other columns make up the baseline covariates \(W\), which we shall use to control for treatment effect confounding and to reduce variance. Let \(V\) = {`num_devices`, `is_p2plus`} be our segment dimensions of interest. Note that we always have \(V\subseteq W\). If the data has repeated measures on the independent units, we would have an additional `id` column, where rows pertaining to the same unit should have the same `id` value.
At this point, we make no assumption on whether the data comes from an A/B test. That assumption only comes in later, when we either provide the treatment allocation rate (if an A/B test) or an estimation method to fit the treatment propensity (if non-A/B).
For a segment \(V=v\), \(CATE(v) = E[E(Y|A=1,W)-E(Y|A=0,W)|V=v]\) is the Conditional Average Treatment Effect (CATE) on this segment. It captures the average difference in outcome if users in this segment had received treatment \(A=1\) vs. control \(A=0\). We want to estimate \(CATE(v)\) across all the segments \(v\).
This task can be modularized in terms of estimating the nuisance parameters:

- the propensity score \(p(A|W)\),
- the outcome regression \(E(Y|A,W)\), and
- the final CATE regression onto the segment dimensions \(V\).

Each nuisance parameter can be estimated using ML methods specified by the user, or selected via a cross-validation procedure. These are implemented via `sl3`.
As mentioned above, we use `sl3` to implement the machine learning methods that estimate the propensity score, the outcome model, and the CATE. When given one base learner (one machine learning method spec), `sl3` will fit this learner as if using the algorithm’s package directly. When given multiple base learners (different methods and different tuning specs), `sl3` will fit the best linear combination of these, where “best” is chosen by cross-validated risk. Currently, there are ~30 machine learning methods available for binary dependent variables and ~40 available for continuous dependent variables.
##### list all the supported ML algos in sl3
sl3_list_learners("binomial") ## ML algos for binary dependent variables
#> [1] "Lrnr_bartMachine" "Lrnr_bayesglm"
#> [3] "Lrnr_bound" "Lrnr_caret"
#> [5] "Lrnr_cv_selector" "Lrnr_dbarts"
#> [7] "Lrnr_earth" "Lrnr_gam"
#> [9] "Lrnr_gbm" "Lrnr_glm"
#> [11] "Lrnr_glm_fast" "Lrnr_glmnet"
#> [13] "Lrnr_grf" "Lrnr_gru_keras"
#> [15] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
#> [17] "Lrnr_hal9001" "Lrnr_lightgbm"
#> [19] "Lrnr_lstm_keras" "Lrnr_mean"
#> [21] "Lrnr_nnet" "Lrnr_optim"
#> [23] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
#> [25] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
#> [27] "Lrnr_randomForest" "Lrnr_ranger"
#> [29] "Lrnr_rpart" "Lrnr_screener_correlation"
#> [31] "Lrnr_solnp" "Lrnr_stratified"
#> [33] "Lrnr_svm" "Lrnr_xgboost"
sl3_list_learners("continuous") ## ML algos for continuous dependent variables
#> [1] "Lrnr_arima" "Lrnr_bartMachine"
#> [3] "Lrnr_bayesglm" "Lrnr_bilstm"
#> [5] "Lrnr_bound" "Lrnr_caret"
#> [7] "Lrnr_cv_selector" "Lrnr_dbarts"
#> [9] "Lrnr_earth" "Lrnr_expSmooth"
#> [11] "Lrnr_gam" "Lrnr_gbm"
#> [13] "Lrnr_glm" "Lrnr_glm_fast"
#> [15] "Lrnr_glmnet" "Lrnr_grf"
#> [17] "Lrnr_gru_keras" "Lrnr_gts"
#> [19] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
#> [21] "Lrnr_hal9001" "Lrnr_HarmonicReg"
#> [23] "Lrnr_hts" "Lrnr_lightgbm"
#> [25] "Lrnr_lstm_keras" "Lrnr_mean"
#> [27] "Lrnr_multiple_ts" "Lrnr_nnet"
#> [29] "Lrnr_nnls" "Lrnr_optim"
#> [31] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
#> [33] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
#> [35] "Lrnr_randomForest" "Lrnr_ranger"
#> [37] "Lrnr_rpart" "Lrnr_rugarch"
#> [39] "Lrnr_screener_correlation" "Lrnr_solnp"
#> [41] "Lrnr_stratified" "Lrnr_svm"
#> [43] "Lrnr_tsDyn" "Lrnr_xgboost"
Documentation for the tuning parameters of each learner can be found in the `sl3` Learners Reference. Use `?Lrnr...` to see which model defaults are used. For now, we show a few examples of base learner specifications:
# a random forest base learner implemented by `ranger` package
lrn_ranger100 <- Lrnr_ranger$new(num.trees = 100)
# xgboost base learners
xgb_fast <- Lrnr_xgboost$new() # defaults from `xgboost` package (nrounds=20)
xgb_50 <- Lrnr_xgboost$new(nrounds = 50, max_depth = 6, eta = 0.001)
# For learners that require a formula specification, the instantiated learner
# uses main terms derived from the input data; additional interaction
# terms need to be specified separately when calling sherlock:
lrn_glm <- Lrnr_glm$new()
lrn_lasso <- Lrnr_glmnet$new() # default is alpha=1, which is the LASSO
lrn_ridge_interaction <- Lrnr_glmnet$new(alpha = 0)
lrn_enet.5_interaction <- Lrnr_glmnet$new(alpha = 0.5)
### example: additional interaction terms of treatment with segment dimensions
or_interactions <- list(c("treatment", "num_devices"),
                        c("treatment", "is_p2plus"),
                        c("treatment", "num_devices", "is_p2plus"))
Next, we specify which learners we want to use to estimate the outcome model, the propensity score, and the final CATE parameter. At this point, we do not need to specify the dependent and independent variables, as that is handled inside the `sherlock` methods. For example, we can choose:
##### for each nuisance parameter, we illustrate one way of using sl3
# to estimate the propensity score p(A|W):
# for example, we use only one learner. Here sl3 will just directly call
# xgboost to fit according to tuning params
ps_learner_spec <- xgb_50
# if data is from an A/B test, where treatment is allocated at a fixed rate p,
# say, p = 0.5, then we can also specify:
# ps_learner_spec <- 0.5
# to estimate the outcome regression E(Y|A,W):
# for example, we use a linear ensemble with 3 base learners:
# {ranger, xgboost, lasso with the specified interactions}.
or_learner_spec <- list(lrn_ranger100, xgb_fast,
                        list(or_interactions, lrn_lasso))
# if we are not using models that specify the interaction terms, then we can
# instantiate the ensemble models directly. For example:
# or_learner_spec <- Lrnr_sl$new(learners = list(lrn_ranger100, xgb_fast))
# to estimate CATE(V) = E( E(Y|1,W)-E(Y|0,W) | V):
# for example, we use cross-validation to select best among ranger and xgboost
cate_learner_spec <- Lrnr_sl$new(
learners = list(lrn_ranger100, xgb_fast),
metalearner = Lrnr_cv_selector$new()
)
Now, we can call the function `sherlock_calculate` with these ML methods to calculate the CATE on the segments. The argument `baseline` specifies all the covariates \(W\) we want to use to adjust for confounding and/or to increase estimation efficiency. The `segment_by` argument lists the segment dimensions \(V\) along which we want to segment users; it must be a subset of `baseline`. The arguments `ps_learner`, `or_learner`, and `cate_learner` take the specified machine learning methods or parametric models (examples shown above) for the propensity score, the outcome regression, and the CATE estimate, respectively. If, instead of using an ensemble learner, we wish to select only the best among the list of candidate learners for each nuisance parameter, we can specify the argument `use_cv_selector = TRUE` (default is `FALSE`) in `sherlock_calculate`. The dependent and independent variables used in each learner are determined by the variable groups we specified and the nuisance parameter being estimated.
sherlock_results <- sherlock_calculate(
data_from_case = data_example,
baseline = c("num_devices", "is_p2plus", "is_newmarket", "baseline_ltv",
"baseline_viewing"),
exposure = "treatment",
outcome = "outcome_viewing",
segment_by = c("num_devices", "is_p2plus"),
cv_folds = 5L,
ps_learner = ps_learner_spec,
or_learner = or_learner_spec,
cate_learner = cate_learner_spec
)
print(sherlock_results)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci std_err
#> 1: 1 0 0.0392 0.2432 0.06118 0.4252 0.09286
#> 2: 1 1 0.0862 1.8944 1.78646 2.0024 0.05509
#> 3: 2 0 0.0718 -0.6053 -0.72942 -0.4812 0.06333
#> 4: 2 1 0.1752 2.5508 2.47506 2.6265 0.03863
#> 5: 3 0 0.0764 -1.4620 -1.58053 -1.3434 0.06050
#> 6: 3 1 0.1796 3.3903 3.30814 3.4724 0.04190
#> 7: 4 0 0.0728 4.3153 4.19349 4.4371 0.06214
#> 8: 4 1 0.1814 -2.2125 -2.28816 -2.1369 0.03859
#> 9: 5 0 0.0328 5.0961 4.87242 5.3197 0.11411
#> 10: 5 1 0.0846 -2.8779 -3.00315 -2.7528 0.06388
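As noted above, we could instead set `use_cv_selector = TRUE` so that each nuisance parameter uses the single best candidate learner, selected by cross-validation, rather than an ensemble. A minimal sketch, not run here (the object name `sherlock_results_cv` is ours; all other arguments mirror the call above):

# same call as above, but select the single cross-validated best learner
# for each nuisance parameter instead of fitting an ensemble
sherlock_results_cv <- sherlock_calculate(
  data_from_case = data_example,
  baseline = c("num_devices", "is_p2plus", "is_newmarket", "baseline_ltv",
               "baseline_viewing"),
  exposure = "treatment",
  outcome = "outcome_viewing",
  segment_by = c("num_devices", "is_p2plus"),
  cv_folds = 5L,
  ps_learner = ps_learner_spec,
  or_learner = or_learner_spec,
  cate_learner = cate_learner_spec,
  use_cv_selector = TRUE
)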
We can visualize the CATE estimation results from `sherlock_calculate` as follows:
plot(sherlock_results, plot_type = "cate")
`sherlock_calculate` returns an object of class `sherlock`, which is also a `data.table` representing the input data set, with the estimated treatment probabilities, outcome estimates, and CATE estimates attached to it. The default printout (or a call to `attr(sherlock_results, 'summary')`) provides, for each segment, the segment proportion in the population, its estimated CATE, and the associated confidence interval and standard error. This readily provides a view of the HTE across the segments.
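Since the returned object is a `data.table` with the summary attached as an attribute, we can inspect both directly; a small sketch:

# the returned object is both a sherlock object and a data.table
class(sherlock_results)
# the segment-level summary shown by print() is stored as an attribute
attr(sherlock_results, "summary")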
In step 1, we obtained a summary of the HTE by estimating the CATE across the segments. Now, we will consider a few examples of finding the optimal segments to treat based on a segment discovery criterion. We do not need to rerun `sherlock_calculate` when we want to try a new criterion, unless the data changes (see example 3).
Suppose we want to find all the segments that would benefit from the treatment, where “benefit” means \(CATE(v) >\) `threshold`, for some user-specified threshold. We can do so by calling the `watson_segment` function with the segmentation function `cost_threshold()` and the desired benefit threshold. In this example, `threshold = 0` means that we consider a segment with \(CATE(v)>0\) to be benefiting from treatment, and hence one that “should be treated”. The argument `type = "inferential"` specifies that the determination of \(CATE(v) >\) `threshold` is based on an upper-tail hypothesis test using the standard errors of the CATE estimates, with multiple comparison adjustments. By contrast, `type = "analytic"` means that the determination is based only on the point estimate of \(CATE(v)\), without hypothesis testing. When the sizes of some segments are very small, we suggest either using `type = "analytic"` (since the hypothesis tests will be inaccurate) or refactoring the segment dimensions to allow for bigger segments.
sherlock_results <- watson_segment(
sherlock_results,
segment_fun = cost_threshold,
threshold = 0,
type = "inferential"
)
print(sherlock_results)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci std_err
#> 1: 1 0 0.0392 0.2432 0.06118 0.4252 0.09286
#> 2: 1 1 0.0862 1.8944 1.78646 2.0024 0.05509
#> 3: 2 0 0.0718 -0.6053 -0.72942 -0.4812 0.06333
#> 4: 2 1 0.1752 2.5508 2.47506 2.6265 0.03863
#> 5: 3 0 0.0764 -1.4620 -1.58053 -1.3434 0.06050
#> 6: 3 1 0.1796 3.3903 3.30814 3.4724 0.04190
#> 7: 4 0 0.0728 4.3153 4.19349 4.4371 0.06214
#> 8: 4 1 0.1814 -2.2125 -2.28816 -2.1369 0.03859
#> 9: 5 0 0.0328 5.0961 4.87242 5.3197 0.11411
#> 10: 5 1 0.0846 -2.8779 -3.00315 -2.7528 0.06388
#> threshold test_stat pval pval_adj rule
#> 1: 0 2.619 4.412e-03 7.353e-03 1
#> 2: 0 34.386 2.039e-259 4.077e-259 1
#> 3: 0 -9.558 1.000e+00 1.000e+00 0
#> 4: 0 66.025 0.000e+00 0.000e+00 1
#> 5: 0 -24.165 1.000e+00 1.000e+00 0
#> 6: 0 80.916 0.000e+00 0.000e+00 1
#> 7: 0 69.450 0.000e+00 0.000e+00 1
#> 8: 0 -57.335 1.000e+00 1.000e+00 0
#> 9: 0 44.659 0.000e+00 0.000e+00 1
#> 10: 0 -45.054 1.000e+00 1.000e+00 0
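For comparison, here is a sketch of the analytic variant, which classifies segments on the CATE point estimates alone (not run here, since a new call to `watson_segment` would overwrite the segmentation above; the object name `sherlock_results_analytic` is ours):

# analytic variant: the rule depends only on whether the CATE point
# estimate exceeds the threshold; no test statistics or p-values are used
sherlock_results_analytic <- watson_segment(
  sherlock_results,
  segment_fun = cost_threshold,
  threshold = 0,
  type = "analytic"
)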
We can visualize the treatment decisions from `watson_segment` as follows:
plot(sherlock_results, plot_type = "treatment_decisions")
The `watson_segment` function attaches to each individual the recommended treatment rule of their segment: `rule = 1` means “should receive treatment”, and `rule = 0` means “should not receive treatment”. The default printout (or a call to `attr(sherlock_results, "summary")`) provides a summary of the segment discovery. For example, segment {num_devices=1, is_p2plus=1} should receive treatment (`rule = 1`) because we can conclude that \(CATE(v) > 0\) based on the p-values (adjusted for multiple comparisons). Similarly, {num_devices=2, is_p2plus=0} should not receive treatment.
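For instance, we can pull out just the “treat” segments from the summary, using the same `attr(..., "summary")` accessor with `dplyr` verbs (as the later examples also do):

# segments recommended for treatment under the threshold criterion
library(dplyr)
attr(sherlock_results, "summary") %>%
  filter(rule == 1)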
Step 2 used the CATE estimates to determine which segments should be treated. But how effective is such a segmentation in maximizing outcomes, compared to global one-size-fits-all strategies? And what are the treatment effects on the segments that should receive treatment vs. those that should not? We can assess these using the `mycroft_assess` function, which provides doubly robust estimation of the Optimal Treatment Effects (`param_type = "ote"`) or the Heterogeneous Average Treatment Effects (`param_type = "hte"`).
# Optimal Treatment Effects (OTE) from rule-based treatment
ote_example <- mycroft_assess(
sherlock_results,
param_type = "ote"
)
# friendly print out
print(ote_example)
#> Counterfactual Mean Outcome If All Get Treatment
#> Parameter Estimate: 2.181
#> 95% Conf Interval: (2.101, 2.261)
#> Standard Error: 0.04085
#> Counterfactual Mean Outcome If All Get Control
#> Parameter Estimate: 1.273
#> 95% Conf Interval: (1.243, 1.304)
#> Standard Error: 0.01544
#> Counterfactual Mean Outcome If Treat Based on Segment Rule
#> Parameter Estimate: 2.999
#> 95% Conf Interval: (2.943, 3.055)
#> Standard Error: 0.02868
#> Optimal Treatment Effect: Treat based on Segment Rule vs. All Get Treatment
#> Parameter Estimate: 0.8176
#> 95% Conf Interval: (0.7805, 0.8546)
#> Standard Error: 0.01889
#> Optimal Treatment Effect: Treat based on Segment Rule vs All Get Control
#> Parameter Estimate: 1.725
#> 95% Conf Interval: (1.672, 1.779)
#> Standard Error: 0.02727
# tabular print out
ote_example %>%
as.data.table()
#> param estimate lwr_ci upr_ci std_err
#> 1: tsm_trt_all 2.1812438 2.1011875 2.2613002 0.04084581
#> 2: tsm_ctl_all 1.2734885 1.2432277 1.3037493 0.01543945
#> 3: tsm_trt_rule 2.9988009 2.9425926 3.0550092 0.02867822
#> 4: ate_trt_rule_vs_trt_all 0.8175571 0.7805243 0.8545898 0.01889459
#> 5: ate_trt_rule_vs_ctl_all 1.7253124 1.6718692 1.7787555 0.02726742
The first 3 “Counterfactual Mean Outcome” parameters are the counterfactual mean outcomes under the global ‘all get treatment’ strategy, the global ‘all get control’ strategy, and the targeted ‘treat based on segment rule’ strategy, where the segment rule was learned in step 2 using `watson_segment`. The last two “Optimal Treatment Effect” parameters are the differences in outcome between the rule-based strategy and the global ‘treat all’ and ‘control all’ strategies, respectively. These allow us to evaluate the gain from implementing the segment-rule-based strategy, compared to a more convenient global strategy.
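As a quick sanity check on the output above, each “Optimal Treatment Effect” contrast is exactly the difference of the corresponding counterfactual means:

# e.g., ate_trt_rule_vs_trt_all = tsm_trt_rule - tsm_trt_all
ote_dt <- as.data.table(ote_example)
ote_dt[param == "tsm_trt_rule", estimate] -
  ote_dt[param == "tsm_trt_all", estimate]
#> [1] 0.8175571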
## "Heterogeneous Treatment Effect (HTE) Based on Segmentation"
hte_example <- mycroft_assess(
sherlock_results,
param_type = "hte"
)
print(hte_example)
#> Average Treatment Effect (All Get Treatment vs All Get Control) on Population:
#> Parameter Estimate: 0.9078
#> 95% Conf Interval: (0.8278, 0.9877)
#> Standard Error: 0.0408
#> Average Treatment Effect on Segments with Recommended Treatment (Rule = 1)
#> Parameter Estimate: 2.945
#> 95% Conf Interval: (2.885, 3.005)
#> Standard Error: 0.03066
#> Average Treatment Effect on Segments with Recommended Control (Rule = 0)
#> Parameter Estimate: -1.974
#> 95% Conf Interval: (-2.035, -1.912)
#> Standard Error: 0.03129
#> Difference in Average Treatment Effect: Segments with Recommended Treatment vs Segments with Recommended Control
#> Parameter Estimate: 4.919
#> 95% Conf Interval: (4.833, 5.005)
#> Standard Error: 0.0438
The first 3 “Average Treatment Effect” parameters evaluate the average treatment effect (treat-all vs. control-all) on the whole population, on the segments where treatment is recommended (`rule = 1`), and on the segments where treatment is not recommended (`rule = 0`). The “Difference in Average Treatment Effect” parameter is the difference between the two segment groups (ATE on the `rule = 1` group vs. ATE on the `rule = 0` group).
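As a rough consistency check with the rounded estimates above: the population ATE should be the mixture of the two group ATEs, weighted by their population shares from the step-2 summary:

# population share of the rule = 1 segments (from the step-2 summary)
p_treat <- 0.0392 + 0.0862 + 0.1752 + 0.1796 + 0.0728 + 0.0328  # 0.5858
p_treat * 2.945 + (1 - p_treat) * (-1.974)
#> [1] 0.9075502

This is in line with the whole-population estimate of 0.9078.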
Consider the use case where treating each unit incurs a fixed cost, so that we are only able to treat at most \((100\times q)\%\) of the population, for a given rate \(q\in (0,1)\). Our goal is to find the segments that should receive treatment, subject to this population-level constraint. Implicit in this use case is that the treatment would be beneficial for more than \((100\times q)\%\) of the population; otherwise the constraint does not bind and we are back in the use case of example 1.
This use case can be formulated as a knapsack problem, where the goal is to find the set of segments \(T\subseteq \{v: CATE(v) > 0\}\) that maximizes \(\sum_{v\in T} CATE(v)\), subject to the constraint \(\sum_{v\in T} p(v) \leq q\). Our package uses the `knapsack` implementation in `adagio`.
We already obtained the CATE for each segment in step 1. So now, in step 2, we use the segmentation function `cost_knapsack()` instead. This function uses the argument `budget` to specify the treatment budget constraint \(q\). Suppose we have a budget to treat only 20% of the population: `budget = 0.2`. The argument `type = "inferential"` specifies that the determination of \(CATE(v) > 0\) should be based on hypothesis testing with multiple comparison adjustments; if `type = "analytic"`, the determination is based only on the point estimates. Recall that we already ran `watson_segment` once, using `segment_fun = cost_threshold`, in example 1. This new segmentation will overwrite the previous one.
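To make the knapsack step concrete, here is a conceptual sketch of the selection, not `sherlock`’s internal code: we restrict to segments with positive estimated CATE, scale the weights and profits to integers for `adagio::knapsack`, and solve for the best subset under the budget. The object names and scaling factors here are our own choices:

# conceptual sketch only; with type = "inferential", sherlock additionally
# screens segments by hypothesis testing before the knapsack step
library(adagio)
smry <- attr(sherlock_results, "summary")
benefit <- subset(smry, cate > 0)                             # candidate segments
sol <- knapsack(w = round(benefit$segment_proportion * 1e4),  # weights = p(v)
                p = round(benefit$cate * 1e3),                # profits = CATE(v)
                cap = 2000)                                   # budget q = 0.2, scaled
benefit[sol$indices, c("num_devices", "is_p2plus")]           # segments to treat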
sherlock_results <- watson_segment(
sherlock_results,
segment_fun = cost_knapsack,
budget = 0.2,
type = "inferential"
)
print(sherlock_results)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci std_err
#> 1: 1 0 0.0392 0.2432 0.06118 0.4252 0.09286
#> 2: 1 1 0.0862 1.8944 1.78646 2.0024 0.05509
#> 3: 2 0 0.0718 -0.6053 -0.72942 -0.4812 0.06333
#> 4: 2 1 0.1752 2.5508 2.47506 2.6265 0.03863
#> 5: 3 0 0.0764 -1.4620 -1.58053 -1.3434 0.06050
#> 6: 3 1 0.1796 3.3903 3.30814 3.4724 0.04190
#> 7: 4 0 0.0728 4.3153 4.19349 4.4371 0.06214
#> 8: 4 1 0.1814 -2.2125 -2.28816 -2.1369 0.03859
#> 9: 5 0 0.0328 5.0961 4.87242 5.3197 0.11411
#> 10: 5 1 0.0846 -2.8779 -3.00315 -2.7528 0.06388
#> threshold test_stat pval pval_adj rule
#> 1: 0 2.619 4.412e-03 7.353e-03 0
#> 2: 0 34.386 2.039e-259 4.077e-259 1
#> 3: 0 -9.558 1.000e+00 1.000e+00 0
#> 4: 0 66.025 0.000e+00 0.000e+00 0
#> 5: 0 -24.165 1.000e+00 1.000e+00 0
#> 6: 0 80.916 0.000e+00 0.000e+00 0
#> 7: 0 69.450 0.000e+00 0.000e+00 1
#> 8: 0 -57.335 1.000e+00 1.000e+00 0
#> 9: 0 44.659 0.000e+00 0.000e+00 1
#> 10: 0 -45.054 1.000e+00 1.000e+00 0
As before, we can visualize the treatment decisions from `watson_segment` as follows:
plot(sherlock_results, plot_type = "treatment_decisions")
Similar to the previous example, `rule = 1` means that the segment should receive treatment, and `rule = 0` otherwise. Compared to the optimality criterion in example 1 (\(CATE(v)>0\)), we have fewer “treat” segments now. The column `segment_proportion` records \(p(v)\), the segment’s share of the population. We can check that the chosen “treat” segments make up 19.18% of the population, thus satisfying the constraint \(\sum_{v\in T} p(v) \leq 0.2\):
attr(sherlock_results, "summary") %>%
filter(rule == 1)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci
#> 1: 1 1 0.0862 1.894438 1.786457 2.002419
#> 2: 4 0 0.0728 4.315272 4.193489 4.437055
#> 3: 5 0 0.0328 5.096079 4.872423 5.319735
#> std_err threshold test_stat pval pval_adj rule
#> 1: 0.05509323 0 34.38604 2.03859e-259 4.07718e-259 1
#> 2: 0.06213531 0 69.44959 0.00000e+00 0.00000e+00 1
#> 3: 0.11411213 0 44.65852 0.00000e+00 0.00000e+00 1
attr(sherlock_results, "summary") %>%
filter(rule == 1) %>%
pull(segment_proportion) %>%
sum()
#> [1] 0.1918
Since the constraint has prevented us from choosing all the segments that would benefit from treatment, we should expect the resulting segment-rule-based strategy to perform less optimally than the one learned in example 1. Indeed:
mycroft_assess(
sherlock_results,
param_type = "ote"
) %>% print
#> Counterfactual Mean Outcome If All Get Treatment
#> Parameter Estimate: 2.181
#> 95% Conf Interval: (2.101, 2.261)
#> Standard Error: 0.04085
#> Counterfactual Mean Outcome If All Get Control
#> Parameter Estimate: 1.273
#> 95% Conf Interval: (1.243, 1.304)
#> Standard Error: 0.01544
#> Counterfactual Mean Outcome If Treat Based on Segment Rule
#> Parameter Estimate: 1.924
#> 95% Conf Interval: (1.873, 1.975)
#> Standard Error: 0.02615
#> Optimal Treatment Effect: Treat based on Segment Rule vs. All Get Treatment
#> Parameter Estimate: -0.2572
#> 95% Conf Interval: (-0.3264, -0.188)
#> Standard Error: 0.0353
#> Optimal Treatment Effect: Treat based on Segment Rule vs All Get Control
#> Parameter Estimate: 0.6505
#> 95% Conf Interval: (0.6073, 0.6937)
#> Standard Error: 0.02204
The “Counterfactual Mean Outcome If Treat Based on Segment Rule” in this example is less than the corresponding one in example 1.
Now, we consider a more general use case where treating each unit incurs a cost that depends on its segment and other unit-level factors. In finding which segments should receive treatment and which should receive control, we want to ensure that the resulting average cost per unit across the population is capped. Let \(Cost(v)\) be the average cost of treating a unit in segment \(V=v\). Then \(\sum_{v} Cost(v)p(v)\) is the average cost per population unit if we treat everyone (all segments). Similarly, \(\sum_{v \in T} Cost(v)p(v)\) is the average cost per population unit if we treat only a subset of segments \(T\).
This use case can again be formulated as a knapsack problem, where the goal is to find the set of segments \(T\subseteq \{v: CATE(v) > 0\}\) that maximizes \(\sum_{v\in T} CATE(v)\), subject to the constraint \(\sum_{v\in T} Cost(v) p(v) \leq \text{budget}\). The previous example is a special case with a fixed cost \(Cost(v)=c\) for all segments.
The data structure and the call to `sherlock_calculate` are slightly different in this example: the data should have a column recording the treatment cost for each unit.
data(data_example_with_cost)
head(data_example_with_cost) %>%
as.data.frame() %>%
print()
#> num_devices is_p2plus is_newmarket baseline_ltv baseline_viewing treatment
#> 1 2 1 0 0.3500025 0.01014789 0
#> 2 4 0 0 0.8144417 1.56213812 1
#> 3 3 0 1 0.0000000 0.41284605 0
#> 4 5 1 0 0.0000000 0.00000000 0
#> 5 5 1 0 0.0000000 0.71454993 1
#> 6 1 1 0 0.0000000 0.58507617 0
#> outcome_viewing cost
#> 1 0.3500025 1.505074
#> 2 5.0144417 2.781069
#> 3 1.0000000 2.786524
#> 4 0.0000000 3.518043
#> 5 -3.0000000 3.564960
#> 6 0.0000000 1.292538
When calling the `sherlock_calculate` function, we need to add the argument `treatment_cost` to specify the column holding the treatment cost in the data.
sherlock_results <- sherlock_calculate(
data_from_case = data_example_with_cost,
baseline = c("num_devices", "is_p2plus", "is_newmarket", "baselin_ltv",
"baseline_viewing"),
exposure = "treatment",
outcome = "outcome_viewing",
segment_by = c("num_devices", "is_p2plus"),
treatment_cost = "cost",
cv_folds = 5L,
ps_learner = ps_learner_spec,
or_learner = or_learner_spec,
cate_learner = cate_learner_spec
)
print(sherlock_results)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci std_err
#> 1: 1 0 0.0392 0.4187 0.1870 0.6504 0.11822
#> 2: 1 1 0.0862 1.9987 1.8533 2.1440 0.07416
#> 3: 2 0 0.0718 -0.4629 -0.6305 -0.2954 0.08548
#> 4: 2 1 0.1752 2.8191 2.7154 2.9228 0.05290
#> 5: 3 0 0.0764 -1.2363 -1.4031 -1.0695 0.08510
#> 6: 3 1 0.1796 3.5541 3.4435 3.6648 0.05645
#> 7: 4 0 0.0728 4.4860 4.3083 4.6636 0.09064
#> 8: 4 1 0.1814 -2.0599 -2.1661 -1.9537 0.05418
#> 9: 5 0 0.0328 5.4936 5.2327 5.7545 0.13312
#> 10: 5 1 0.0846 -2.7343 -2.8902 -2.5784 0.07953
#> avg_treatment_cost
#> 1: 1.384
#> 2: 1.846
#> 3: 1.840
#> 4: 2.347
#> 5: 2.340
#> 6: 2.835
#> 7: 2.821
#> 8: 3.329
#> 9: 3.296
#> 10: 3.845
In addition to the segment proportion in the population, the estimated CATE, and the associated confidence intervals and standard error, the default printout (or a call to `attr(sherlock_results, "summary")`) now also provides `avg_treatment_cost`, the average cost \(Cost(v)\) of treating a unit in each segment.

Note that the cost constraint on the “average cost per population unit” is expressed in terms of `avg_treatment_cost * segment_proportion` \(= Cost(v)p(v)\):
## if treating all units, then the average cost per population unit is 2.687:
attr(sherlock_results, "summary") %>%
summarize(sum(avg_treatment_cost * segment_proportion))
#> sum(avg_treatment_cost * segment_proportion)
#> 1 2.687224
# if treating only those with cate > 0, then the average cost is 1.44:
attr(sherlock_results, "summary") %>%
filter(cate > 0) %>%
summarize(sum(avg_treatment_cost * segment_proportion))
#> sum(avg_treatment_cost * segment_proportion)
#> 1 1.447183
Whenever we choose not to treat certain segments, those units incur 0 cost. Thus, the average cost per population unit is lowered.
To find the segments to treat, we again use the segmentation function `cost_knapsack()`. The argument `use_segment_treatment_cost = TRUE` specifies that the `avg_treatment_cost` per segment (\(Cost(v)\)) must be used in conjunction with `segment_proportion` (\(p(v)\)) to evaluate the cost constraint. The argument `budget` specifies the cap on the average treatment cost per population unit: \(\sum_{v\in T} Cost(v) p(v) \leq \text{budget}\). As before, `type = "inferential"` means we use hypothesis testing to determine \(CATE(v) > 0\).
sherlock_results <- watson_segment(
sherlock_results,
segment_fun = cost_knapsack,
budget = 1,
use_segment_treatment_cost = TRUE,
type = "inferential"
)
print(sherlock_results)
#> num_devices is_p2plus segment_proportion cate lwr_ci upr_ci std_err
#> 1: 1 0 0.0392 0.4187 0.1870 0.6504 0.11822
#> 2: 1 1 0.0862 1.9987 1.8533 2.1440 0.07416
#> 3: 2 0 0.0718 -0.4629 -0.6305 -0.2954 0.08548
#> 4: 2 1 0.1752 2.8191 2.7154 2.9228 0.05290
#> 5: 3 0 0.0764 -1.2363 -1.4031 -1.0695 0.08510
#> 6: 3 1 0.1796 3.5541 3.4435 3.6648 0.05645
#> 7: 4 0 0.0728 4.4860 4.3083 4.6636 0.09064
#> 8: 4 1 0.1814 -2.0599 -2.1661 -1.9537 0.05418
#> 9: 5 0 0.0328 5.4936 5.2327 5.7545 0.13312
#> 10: 5 1 0.0846 -2.7343 -2.8902 -2.5784 0.07953
#> threshold avg_treatment_cost test_stat pval pval_adj rule
#> 1: 0 1.384 3.542 1.984e-04 3.307e-04 0
#> 2: 0 1.846 26.951 2.785e-160 5.569e-160 1
#> 3: 0 1.840 -5.415 1.000e+00 1.000e+00 0
#> 4: 0 2.347 53.292 0.000e+00 0.000e+00 0
#> 5: 0 2.340 -14.528 1.000e+00 1.000e+00 0
#> 6: 0 2.835 62.960 0.000e+00 0.000e+00 1
#> 7: 0 2.821 49.494 0.000e+00 0.000e+00 1
#> 8: 0 3.329 -38.019 1.000e+00 1.000e+00 0
#> 9: 0 3.296 41.270 0.000e+00 0.000e+00 1
#> 10: 0 3.845 -34.379 1.000e+00 1.000e+00 0
## constraint satisfied? resulting average cost per population unit:
attr(sherlock_results,'summary') %>% filter(rule==1) %>%
summarize(sum(avg_treatment_cost*segment_proportion))
#> sum(avg_treatment_cost * segment_proportion)
#> 1 0.9818078
The cost constraint would be satisfied if we followed this segment-rule-based strategy. We also saw earlier that if we treated all units, the resulting cost per population unit would be ~2.69.
In addition to lowering cost, the segment-rule-based strategy also yields a better metric outcome compared to treating all units:
mycroft_assess(
sherlock_results,
param_type = "ote"
) %>% print
#> Counterfactual Mean Outcome If All Get Treatment
#> Parameter Estimate: 2.268
#> 95% Conf Interval: (2.185, 2.35)
#> Standard Error: 0.04209
#> Counterfactual Mean Outcome If All Get Control
#> Parameter Estimate: 1.168
#> 95% Conf Interval: (1.137, 1.2)
#> Standard Error: 0.01598
#> Counterfactual Mean Outcome If Treat Based on Segment Rule
#> Parameter Estimate: 2.493
#> 95% Conf Interval: (2.431, 2.554)
#> Standard Error: 0.03127
#> Optimal Treatment Effect: Treat based on Segment Rule vs. All Get Treatment
#> Parameter Estimate: 0.2248
#> 95% Conf Interval: (0.1657, 0.2839)
#> Standard Error: 0.03014
#> Optimal Treatment Effect: Treat based on Segment Rule vs All Get Control
#> Parameter Estimate: 1.324
#> 95% Conf Interval: (1.266, 1.382)
#> Standard Error: 0.02968
## How `sherlock` works

Let

- \(A\) = treatment: 0 for control, 1 for treatment. We consider binary treatments for now.
- \(Y\) = metric of interest. Continuous or binary.
- \(W\) = baseline covariates. These include covariates used to adjust for confounding bias (if non-A/B) and/or to reduce variance. May be continuous or discrete.
- \(V\subseteq W\) = a set of segment dimensions/covariates, with \(\mathcal{V}\) its outcome space. A realization \(V=v\) represents a segment of users.
- For a segment \(V=v\), \(CATE(v) = E[E(Y|A=1,W)-E(Y|A=0,W)|V=v]\) is the Conditional Average Treatment Effect (CATE) on this segment.
We want to know which segments of users would benefit from treatment. This benefit can be:

a. absolute: \(CATE(v)\) exceeds a user-given threshold; or
b. subject to cost constraints: among the beneficial segments, treat a subset that respects a population-level treatment budget.

Both of these questions can be framed in terms of classifying the segments based on some criterion on their CATE. Thus we largely divide the package into 3 modular tasks:

1. `sherlock_calculate`: estimating the CATE.
2. `watson_segment`: learning the segmentation (i.e., which segments to treat).
3. `mycroft_assess`: evaluating the effectiveness of the resulting segmentation.

The CATE and the effectiveness measures are implemented using sample-splitting principles in CV-TMLE (Zheng and van der Laan (2010), van der Laan and Luedtke (2015)).
In `sherlock_calculate`: to estimate the CATE, we first apply machine learning methods chosen by the user, together with sample-splitting principles from CV-TMLE, to estimate the outcome regression \(E(Y|A,W)\) and the propensity score \(p(A|W)\) (if non-A/B data). Then, we use these nuisance parameter estimates to perform a doubly robust outcome transformation \(D=\frac{2A-1}{p(A|W)}[Y-E(Y|A,W)] + E(Y|1,W)-E(Y|0,W)\). We then regress \(D\) onto the segment dimensions \(V\) using machine learning. This regression \(E(D|V)\) allows us to compute out-of-sample doubly robust estimates of \(CATE(V)\) and the associated standard errors.
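To make the transformation concrete, here is an illustrative sketch, not `sherlock`’s internal code; the input names `ps_hat`, `or1_hat`, and `or0_hat` for the cross-fitted nuisance estimates are hypothetical:

# D = (2A-1)/p(A|W) * [Y - E(Y|A,W)] + E(Y|1,W) - E(Y|0,W)
dr_pseudo_outcome <- function(A, Y, ps_hat, or1_hat, or0_hat) {
  or_hat <- ifelse(A == 1, or1_hat, or0_hat)    # E(Y|A,W) at the observed A
  p_obs  <- ifelse(A == 1, ps_hat, 1 - ps_hat)  # p(A|W) at the observed A
  (2 * A - 1) / p_obs * (Y - or_hat) + (or1_hat - or0_hat)
}
# regressing D onto the segment dimensions V then estimates CATE(V) = E(D|V)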
In `watson_segment`: both classes of segment discovery problems require determining whether a segment \(v\) satisfies \(CATE(v) > \delta\) for some constant \(\delta\). In problem a, \(\delta\) is the threshold given by the user; in problem b, \(\delta=0\). By default, we implement this segment classification as an upper-tail hypothesis testing problem (H1: \(CATE(v) > \delta\); H0: \(CATE(v) \leq \delta\)), controlling the false discovery rate across the multiple comparisons over segments. For the cost-constrained problems, the `knapsack` algorithm is applied to the segments that were determined to have \(CATE(v)>0\).
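Concretely, in example 1 the segment {num_devices=1, is_p2plus=0} has test statistic \((0.2432 - 0)/0.09286 \approx 2.619\) and one-sided p-value \(1 - \Phi(2.619) \approx 0.0044\), consistent with the `test_stat` and `pval` columns in that output.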
In `mycroft_assess`: once we learn this segmentation of the population into \(T\) and its complement, we can evaluate how effectively it achieves the segmentation goal. For example, we can estimate the population-level outcome under the optimal segment-rule treatment strategy. Specifically, this parameter is \(\psi_d = E[ E(Y|A=d(V), W)]\), where \(d(v) = I(v \in T)\) assigns treatment to segments in \(T\) and control otherwise. \(E(Y|A=d(V),W)\) is the expected outcome for an individual following their segment’s optimal treatment rule, and \(E[E(Y|A=d(V), W)]\) averages this over the population. \(\psi_d\) can be contrasted with \(\psi_1 = E[E(Y|A=1, W)]\), the population outcome if all receive treatment, and \(\psi_0 = E[E(Y|A=0, W)]\), the population outcome if all receive control. \(\psi_d-\psi_1\) and \(\psi_d-\psi_0\) are both valid Optimal Treatment Effects, depending on what we consider as the reference strategy. These are implemented in `mycroft_assess` with `param_type = "ote"`. Alternatively, we can consider treatment effects on the learned segment groups. \(\mu_{all} = E[ E(Y|A=1, W) - E(Y|A=0, W) ]\) is the average treatment effect (treat-all vs. control-all) on the whole population. Similarly, \(\mu_{1} = E[ E(Y|A=1, W) - E(Y|A=0, W) | V \in T]\) and \(\mu_{0} = E[ E(Y|A=1, W) - E(Y|A=0, W) | V \notin T]\) are the average treatment effects among segments that are recommended treatment (\(v\in T\)) and those that are not (\(v\notin T\)). We can also contrast the HTE in these two segment groups via \(\mu_{1}-\mu_{0}\). These are implemented in `mycroft_assess` with `param_type = "hte"`.
We make two standard causal inference assumptions:

1. Positivity: there is sufficient variability of treatment and control within each subgroup defined by the baseline covariates and the segment dimensions. That is, no subgroup of users rarely or never receives a given treatment category.
2. No unmeasured confounding: inferring causal effects outside of A/B experiments relies on strong causal assumptions. For non-A/B data, we make the standard implicit causal assumption that the data captures all the common causes of a unit’s treatment and the outcome of interest.
Specific to the methodology we implemented, we also require each segment to have a moderate number of observations in the data (e.g., > 50). This ensures that we can provide a standard error for the CATE on each segment. Without this requirement, we can still obtain point estimates of the CATE and learn a desired segmentation, but we would be unable to provide reliable statistical inference on the CATE.
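A quick way to check this requirement on the example data (a sketch using `data.table` syntax, since `data_example` is a `data.table`):

# count observations per segment; each count should comfortably exceed 50
library(data.table)
data(data_example, package = "sherlock")
data_example[, .N, by = .(num_devices, is_p2plus)]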
## References

Luedtke, Alex, and Mark van der Laan. 2016a. “Optimal Individualized Treatments in Resource-Limited Settings.” International Journal of Biostatistics 12 (1): 283–303. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6052884/.
———. 2016b. “Super-Learning of an Optimal Dynamic Treatment Rule.” International Journal of Biostatistics 12 (1): 305–32. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6056197/.
van der Laan, Mark, and Alex Luedtke. 2015. “Targeted Learning of the Mean Outcome Under an Optimal Dynamic Treatment Rule.” Journal of Causal Inference 3 (1): 61–95. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4517487/.
VanderWeele, Tyler, Alex Luedtke, Mark van der Laan, and Ronald Kessler. 2019. “Selecting Optimal Subgroups for Treatment Using Many Covariates.” Epidemiology 30 (3): 334–41. https://journals.lww.com/epidem/Fulltext/2019/05000/Selecting_Optimal_Subgroups_for_Treatment_Using.5.aspx.
Zheng, Wenjing, and Mark van der Laan. 2010. “Asymptotic Theory of Cross-Validated Targeted Maximum Likelihood Estimation.” https://biostats.bepress.com/ucbbiostat/paper273/.