Causal Machine Learning for Population Segment Discovery and Analysis

Authors: Nima Hejazi and Wenjing Zheng

Causal Segmentation Analysis with sherlock

The sherlock R package implements an approach to population segmentation analysis (or subgroup discovery) using recently developed techniques from causal machine learning. Using data from randomized A/B experiments or observational studies (quasi-experiments), sherlock takes as input a set of user-selected candidate segment dimensions – often a subset of the measured pre-treatment covariates – and discovers segments of the study population based on the estimated heterogeneity of their response to the treatment under consideration. To quantify this treatment response heterogeneity, the conditional average treatment effect (CATE) is estimated within a nonparametric, doubly robust framework (Vanderweele et al. 2019; van der Laan and Luedtke 2015; Luedtke and van der Laan 2016b, 2016a), with state-of-the-art ensemble machine learning (van der Laan, Polley, and Hubbard 2007; Coyle et al. 2021) incorporated into the estimation procedure.
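For orientation, the CATE and a standard doubly robust representation of it can be written as follows. This is a sketch in assumed notation, not necessarily the package's exact internals: $W$ denotes the pre-treatment covariates, $V \subseteq W$ the candidate segment dimensions, $A \in \{0, 1\}$ the treatment, and $Y$ the outcome, with $Q(a, w) = \mathbb{E}[Y \mid A = a, W = w]$ and $g(a \mid w) = \mathbb{P}(A = a \mid W = w)$.

```latex
\begin{align}
  % CATE over the segment dimensions V (under standard identification assumptions)
  \tau(V) &= \mathbb{E}\bigl[\, Q(1, W) - Q(0, W) \mid V \,\bigr], \\
  % Doubly robust (augmented inverse probability weighted) pseudo-outcome
  D(O) &= \frac{2A - 1}{g(A \mid W)} \bigl( Y - Q(A, W) \bigr) + Q(1, W) - Q(0, W), \\
  % Regressing the pseudo-outcome on V recovers the CATE
  \mathbb{E}\bigl[\, D(O) \mid V \,\bigr] &= \tau(V).
\end{align}
```

The last identity holds if either the outcome regression $Q$ or the treatment mechanism $g$ is correctly specified, which is the sense in which the framework is doubly robust. In a randomized A/B experiment, $g$ is known by design.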

For background and details on using sherlock, see the package vignette and the documentation site. An overview of the statistical methodology is available in our conference manuscript (Hejazi, Zheng, and Anand 2021) from CODE @ MIT 2021.


Install the most recent version from the master branch on GitHub via remotes:
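A minimal sketch of that installation step. The repository path `Netflix/sherlock` is an assumption inferred from the citation's author field, not something stated in this text, so adjust it if the package is hosted elsewhere.

```r
# Install remotes first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# "Netflix/sherlock" is an assumed repository path (hypothetical here);
# ref = "master" pins the install to the master branch
remotes::install_github("Netflix/sherlock", ref = "master")
```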



If you encounter any bugs or have any specific feature requests, please file an issue.


After using the sherlock R package, please cite the following:

      author={Hejazi, Nima S and Zheng, Wenjing and {Netflix, Inc.}},
      title = {{sherlock}: Causal machine learning for segment discovery
        and analysis},
      year  = {2021},
      note = {R package version 0.2.0},
      doi = {10.5281/zenodo.5652010},
      url = {}

      author = {Hejazi, Nima S and Zheng, Wenjing and Anand, Sathya},
      title = {A framework for causal segmentation analysis with machine
        learning in large-scale digital experiments},
      year = {2021},
      journal = {Conference on Digital Experimentation at {MIT}},
      volume = {(8\textsuperscript{th} annual)},
      publisher = {MIT Press},
      url = {}


The contents of this repository are distributed under the Apache 2.0 license. See the license file in the repository for details.


Coyle, Jeremy R, Nima S Hejazi, Ivana Malenica, Rachael V Phillips, and Oleg Sofrygin. 2021. sl3: Modern Pipelines for Machine Learning and Super Learning.

Hejazi, Nima S, Wenjing Zheng, and Sathya Anand. 2021. “A Framework for Causal Segmentation Analysis with Machine Learning in Large-Scale Digital Experiments.” Conference on Digital Experimentation at MIT (8th annual).

Luedtke, Alex, and Mark van der Laan. 2016a. “Optimal Individualized Treatments in Resource-Limited Settings.” International Journal of Biostatistics 12 (1): 283–303.

———. 2016b. “Super-Learning of an Optimal Dynamic Treatment Rule.” International Journal of Biostatistics 12 (1): 305–32.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

van der Laan, Mark, and Alex Luedtke. 2015. “Targeted Learning of the Mean Outcome Under an Optimal Dynamic Treatment Rule.” Journal of Causal Inference 3 (1): 61–95.

Vanderweele, Tyler, Alex Luedtke, Mark van der Laan, and Ronald Kessler. 2019. “Selecting Optimal Subgroups for Treatment Using Many Covariates.” Epidemiology 30 (3): 334–41.