Sep 8, 2025

Sensitivity Analyses Useful for Probing the Validity of the MAR Assumption in COA Data

The majority of models used for efficacy outcomes either are valid under the assumption of MAR (e.g., conditional linear mixed models, marginal linear mixed models [mixed models for repeated measures]) or can be modified to be valid under the assumption of MAR (e.g., ANOVA/ANCOVA, by estimating them via likelihood methods in mixed model software).[1]

Broadly, contemporary approaches to probing the validity of the MAR assumption can be decomposed into three general forms: controlled imputation, tipping point analysis, and models that explicitly assume MNAR.

It is useful to understand these approaches because they increasingly inform regulatory feedback and guidance on missing data sensitivity analyses, for both validation and endpoint efficacy.

We are increasingly seeing applicants being asked in regulatory feedback to plan, describe, or conduct such sensitivity analyses for missing data to facilitate regulatory decision making.

  1. Controlled Imputation.

Controlled imputation relies on the machinery of multiple imputation (MI). In clinical applications, single imputation (e.g., LOCF, mean imputation) is often favored over MI, even though single imputations are truly valid only under MCAR. But MI is really not that complex, and MI will always have greater stochastic precision than single imputation, meaning it will always yield the smaller standard errors of the two. And for inference we always want the smallest standard errors possible because, all else being equal, statistical significance is inversely related to the magnitude of the standard errors.

Why is multiple imputation more precise than single imputation, and what exactly is meant by “stochastic precision”?

Every time we replace a missing value with an imputed value, we are making a guess, a very well-informed statistical “guess”, about what the missing value would have been. There is only some probability that we are correct: if that probability were small, we would have a poor imputation, and vice versa. But the value isn’t observed, so we will never really know what our probability of being correct is. And if we impute only once, there is nothing we can do to smooth that probability in our favor. If we were wrong, we were wrong and that would suck, the end.

But if instead we made several (multiple) random imputations and averaged those values, accounting for the within-imputation and between-imputation variation, we would minimize the likelihood of a single poor imputation skewing the results, much in the same way that a mean of several values will always be more precise than any single value.
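To make the within/between-variance bookkeeping concrete, here is a minimal Python sketch of Rubin's rules for pooling the results of m imputations; the estimates and standard errors are made up purely for illustration.

```python
# A minimal sketch of Rubin's rules for pooling m imputed estimates.
# All numbers here are invented for illustration.
import numpy as np
from scipy import stats

# Suppose each of m = 5 imputed data sets yielded a treatment-effect
# estimate and its squared standard error (the within-imputation variance).
est = np.array([2.1, 1.8, 2.4, 2.0, 1.9])              # hypothetical estimates
var_within = np.array([0.36, 0.41, 0.38, 0.35, 0.40])  # hypothetical SE^2

m = len(est)
q_bar = est.mean()              # pooled point estimate
u_bar = var_within.mean()       # average within-imputation variance
b = est.var(ddof=1)             # between-imputation variance
t_var = u_bar + (1 + 1/m) * b   # total variance (Rubin, 1987)

se_pooled = np.sqrt(t_var)
# Degrees of freedom for the pooled t reference distribution
df = (m - 1) * (1 + u_bar / ((1 + 1/m) * b)) ** 2
ci = q_bar + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se_pooled
print(f"estimate={q_bar:.2f}, SE={se_pooled:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Note how the between-imputation variance b is exactly the “smoothing” described above: it quantifies how much the guesses disagree with each other, and it is folded into the pooled standard error rather than ignored.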

However, because multiple imputation requires more machinery and process (it really isn’t that bad), clinical trialists often favor single imputation approaches.  

In this section we will review one commonly employed controlled imputation scenario used to test how robust MAR-assuming findings are to violations of the MAR assumption. But to do that, we first have to introduce some additional useful jargon, and this jargon is worth understanding because you will encounter it more and more often. Heads up: it involves estimands.

Estimand: it is just the thing to be estimated for the purposes of characterizing an effect. It’s a recipe operationalizing all the elements required to test your outcome. This includes:

  1. The population.

  2. The measure (e.g., PROMIS Fatigue SF 7a), scoring (e.g., T-scores), and time (e.g., change from baseline to EOT) defining the endpoint.

  3. Intercurrent events: the handling of any departures from protocol that can be anticipated a priori from first principles, knowledge of the disease, and drug tolerability information.

  4. The statistical summary upon which endpoint efficacy will be based: e.g., the marginal mean difference between arms for change from baseline to EOT, estimated from a mixed model for repeated measures with Kenward-Roger denominator degrees of freedom.

Stitching it together: mean change from baseline to end of treatment (EOT) on T-scored PROMIS Fatigue 7a within the intent to treat (ITT) population estimated from an MMRM with early discontinuation handled under a treatment policy approach.
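For readers who think in code, here is a toy illustration (not any standard library’s API) of the estimand as a recipe: a plain data structure whose four fields are the four components listed above, filled in with the example just stitched together.

```python
# A toy illustration of the estimand-as-recipe idea; the class and field
# names are ours, invented purely for this example.
from dataclasses import dataclass

@dataclass
class Estimand:
    population: str           # 1. who is being characterized
    endpoint: str             # 2. measure, scoring, and timing
    intercurrent_events: str  # 3. a-priori handling of protocol departures
    summary_measure: str      # 4. the statistical summary for efficacy

primary = Estimand(
    population="ITT",
    endpoint="PROMIS Fatigue SF 7a T-score, change from baseline to EOT",
    intercurrent_events="early treatment discontinuation: treatment policy",
    summary_measure="marginal mean difference from an MMRM (Kenward-Roger df)",
)
print(primary)
```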

The thing about the treatment policy approach, however, is that it assumes that the information available up until discontinuation is all that is required for valid inferences on efficacy. But that means that the observed values are all that are required for valid inference. And that sounds a lot like the MAR assumption. Because it is.

Another way of framing this is to ask whether valid inference depends only on what we know within the arms or whether it may also depend on what we don’t know. Enter the jargon...

Carpenter, Roger, & Kenward (2013)[1] discriminated between estimands that correspond to the assumptions of MAR or MNAR using the language of de jure and de facto estimands, respectively.

De jure and De facto Estimands

From Latin, de jure and de facto mean, respectively, according to law/rule and according to fact/practice. In a randomized clinical trial, the law/rule is defined by the trial protocol, and fact/practice includes any deviation from the course the protocol defines. De jure estimands would correspond to the per-protocol population, while de facto estimands would correspond to the intent-to-treat (ITT) population. But population is only one component of the estimand, and we are talking about estimands here, not populations exclusively.

Thus, a de jure estimand would be any estimand that operates as though the protocol had not been deviated from, such as treatment policy. This corresponds to an assumption of MAR because nothing changes in the estimand and everything is based only on the observed information up until the moment of protocol deviation. For example, subjects who discontinue treatment prior to EOT are treated as if the randomized assignment and all observed data are all that is needed to explain their post-discontinuation data.

In contrast, a de facto estimand would be any estimand that explicitly accepts and/or probes the effect of protocol deviations. For example, a de facto estimand could postulate that the reality of post-discontinuation data may depart from that expected based on randomization and observed data, and that could include drawing information to explain post-discontinuation data from information that could not be observed within the treatment arm. The de facto estimand would be one that assumes that missing data do not (exclusively) depend on observable information and may depend on unobservable information. This distinction is what aligns the de facto estimand with the assumption of MNAR. And thus, under de facto estimands, we arrive at the motivation for controlled multiple imputation.

Controlled Multiple Imputation

Traditional imputation imputes under a de jure estimand framework: missing information within the treatment arm(s) is replaced by values drawn from the posterior predictive distribution of the missing data given the observed data and the model parameters, $p(y_{mis} \mid y_{obs}; \theta_y)$, estimated from the data observed within the same arm(s) (and likewise for the comparator arm(s)). In contrast, controlled imputation replaces missing data within the treatment arms from alternative sources that are not restricted to the same arm. For example, in a progressive disease, one might assume that, upon treatment discontinuation, a treated subject’s outcome trajectory is better represented by that of the comparator than by that of treatment. This form of controlled imputation is referred to in the literature as jump to reference (J2R) imputation. The imputation would be repeatedly conducted until $m$ complete data sets had been stochastically imputed; then and only then would efficacy be recomputed under this de facto post-discontinuation counterfactual. Efficacy could remain significant or it could not, and either would be useful information. But the interpretability of the finding would only be as strong as the counterfactual’s verisimilitude. If the counterfactual were not a reasonable expectation for the post-discontinuation behavior given the specific disease and any enduring (or absent) treatment effect or tolerability issues, J2R would not be a valuable sensitivity analysis.
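To make the mechanics concrete, here is a deliberately simplified Python sketch of the J2R idea. Real J2R draws post-discontinuation values from the conditional multivariate normal distribution of the reference arm fitted by an MMRM; here, a single EOT value, independent normal draws, and made-up numbers stand in for that machinery.

```python
# A deliberately simplified jump-to-reference (J2R) sketch with invented data.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical EOT change scores; NaN marks a post-discontinuation value.
trt = np.array([4.2, 5.1, np.nan, 3.8, np.nan, 4.9, 5.5, np.nan])
ref = np.array([1.2, 0.8, 2.0, 1.5, 0.4, 1.1, 1.8, 0.9])

mis = np.isnan(trt)
mu_ref, sd_ref = ref.mean(), ref.std(ddof=1)

m = 20
diffs, wvars = [], []
for _ in range(m):
    imputed = trt.copy()
    # J2R counterfactual: treated dropouts behave like the reference arm.
    imputed[mis] = rng.normal(mu_ref, sd_ref, size=mis.sum())
    diffs.append(imputed.mean() - ref.mean())
    wvars.append(imputed.var(ddof=1)/len(imputed) + ref.var(ddof=1)/len(ref))

# Pool with Rubin's rules (see the earlier sketch).
diffs, wvars = np.array(diffs), np.array(wvars)
q_bar = diffs.mean()
se = np.sqrt(wvars.mean() + (1 + 1/m) * diffs.var(ddof=1))
print(f"J2R-pooled treatment-vs-reference difference: {q_bar:.2f} (SE {se:.2f})")
```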

All of this demonstrates that such sensitivity analyses are neither perfunctory nor easily decided: they require detailed knowledge of the natural history of the disease, expectations for the disease in the absence of treatment, which post-discontinuation treatments are most likely, and how they would affect the post-discontinuation outcome trajectory.

J2R is one hypothesis that can be tested, its effect observed and reconciled with the primary estimand, which likely assumes MAR. Several alternative controlled MI scenarios have been defined and may be of use, depending on context, and these decompose into imputations of marginal values versus conditional values. Marginal controlled MI scenarios include Copy Differences From Reference, Last Marginal Mean Carried Forward (not the same as LOCF), and the Marginal Delta Method (more on this later). Conditional controlled MI scenarios include Copy Reference, Last Conditional Mean Carried Forward (again, not LOCF), and the Conditional Delta Method.

These are all worth reviewing and considering, because any of them might apply to the specific missing data circumstance you find yourself in. I say this because J2R has become a default sensitivity analysis for missing data without jumping to reference necessarily being the most likely post-dropout behavior of the data. Much like LOCF was once used reflexively to our detriment, J2R is now used reflexively even when it is not appropriate, and this violates the principles underlying such sensitivity analyses. So, while the details of each approach are beyond the scope of this post, the idea you should take away is that many methods exist, and selecting the method or methods (multiple are recommended) most appropriate for your circumstance can be achieved through careful thought and discussion. Note that this process need not be time consuming, and partnering with researchers well versed in these methods can expedite the design of a sensitivity analysis strategy. Ideally, that strategy would explore multiple post-dropout scenarios.

Because it is rare that a single sensitivity analysis can reflect the totality of post-dropout data behavior, multiple scenarios can and should be specified a priori and tested. These would draw from the plausible combinations of the alternative controlled imputation scenarios that could apply to the post-dropout data in your context of use. The accumulated evidence under each such hypothesis could then be used in a tipping point analysis to determine how extreme the counterfactual assumption about post-discontinuation behavior must be before efficacy for the estimand ceases to be statistically significant. If the totality of evidence supports efficacy, then the conclusion would be that the primary efficacy estimates assuming MAR are valid and robust to departures from MAR. Additional alternative de facto imputations are described in Kenward (2015).[2]

One particularly useful controlled imputation scenario is the delta method (either marginal or conditional). The delta methods constitute the core framework employed in tipping point analyses, described next. One of the most elegant and flexible aspects of the delta method is that it can be extended to tipping point analyses that do not rely on imputation and therefore make no explicit assumptions about the nature of the post-dropout data. Such delta implementations simply posit that the observed data depart from what would have been observed, and they increment the observed data in both arms until efficacy is lost, thereby establishing the tipping point, whose relevance and validity can then be evaluated.

Tipping Point Analysis via Delta Method

Under the delta methods, a parameter is added to the expected means post-discontinuation, incrementing imputed data or observed within-arm means by a value, delta, intended to represent hypothetical departures from the estimated means. The delta values are incrementally increased in both arms until efficacy is no longer statistically significant, operationalized either via p-values or via the point at which the 95% confidence interval contains the null value (0 in the case of linear models, 1 in the case of generalized linear models). This concept was nicely illustrated by Liu, Zhou, & Sims (2025),[3] presented here with acknowledgement to the authors. To the multiply imputed data sets, values from 3 to 12 in increments of 3 were respectively added to the treatment arm means and subtracted from the control arm means; only with a delta of 12 did efficacy cease to be statistically significant.

[Figure: tipping point analysis illustration from Liu, Zhou, & Sims (2025), showing the efficacy estimate at delta values of 3, 6, 9, and 12.]
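Mirroring that example, here is a minimal Python sketch of the delta sweep, assuming lower change scores favor treatment; the data, imputation flags, and delta grid are all invented for illustration, and a single imputed data set stands in for the $m$ sets to keep the sketch short.

```python
# A minimal tipping-point sweep over a grid of delta values (invented data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical imputed change scores (lower = better) plus flags marking
# which values were imputed rather than observed.
trt = rng.normal(-5.0, 4.0, 60)
ctl = rng.normal(-1.0, 4.0, 60)
trt_imputed = rng.random(60) < 0.2
ctl_imputed = rng.random(60) < 0.2

for delta in [0, 3, 6, 9, 12]:
    # Penalize treatment imputations by +delta; favor control by -delta.
    t_shift = np.where(trt_imputed, trt + delta, trt)
    c_shift = np.where(ctl_imputed, ctl - delta, ctl)
    stat, p = stats.ttest_ind(t_shift, c_shift)
    print(f"delta={delta:2d}: p={p:.4f}" + ("  <- tipped" if p >= 0.05 else ""))
```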

It is important to clarify that, as with controlled imputation, tipping point analysis is neither an adjustment for missing data nor an alternative estimate of what efficacy would have been had complete data been observed. It is a framework for testing alternative hypothetical scenarios to determine how sensitive the observed or imputed data are to departures from the primary estimates. If statistical significance is lost in the presence of small delta values, then perhaps the estimates are not robustly reliable. One thing to consider, though, is that the only factor manipulated by the delta methods, and by tipping point analyses specifically, is the location parameters (means, proportions, rates, etc.). The dispersion, i.e., the standard errors, is not incremented at all, and efficacy is a function of the ratio of location to dispersion. Moreover, location parameters (e.g., means) are affected far less than dispersion parameters (standard errors; SEs) in the presence of missing data.[1] Thus, while the tipping point method is conceptually appealing, easily implemented, and finding increasing favor with regulatory reviewers, its exclusive focus on location-parameter deltas is a weakness deserving of more refined extensions. Another limitation is the use of arbitrary delta values, as demonstrated in the Liu, Zhou, & Sims (2025)[3] example.

One solution to the arbitrary deltas is the tipping point analysis increasingly used by the United States Food and Drug Administration’s (FDA) division of biometrics, described by Torres et al. (2025).[4] Under this framework, the observed arm-specific estimand among completers (for example, the average change from baseline to the follow-up visit pre-specified as the endpoint, $\mu_{\Delta_{T^c}}$ and $\mu_{\Delta_{C^c}}$ for treatment and comparator, respectively) is incremented by $\delta$ parameters pre-specified to reflect the plausible range of observable differences between completers ($c$) and dropouts ($d$): $\delta_T = \mu_{\Delta_{T^c}} - \mu_{\Delta_{T^d}}$. A benefit of this is that a test can be constructed for the difference in these composite means between arms, $(\mu_{\Delta_{T^c}} + \delta_T) - (\mu_{\Delta_{C^c}} + \delta_C)$, with asymptotic inference via a Wald test. This delta, anchored to the mean of the dropouts, can then be incremented as in a standard tipping point analysis, but it at least starts from a non-arbitrary point.
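As a sketch of that composite-mean comparison, here is a small Python example, with hypothetical values throughout and the deltas treated as pre-specified constants so that the standard error comes from the completers alone.

```python
# A sketch of the composite-mean Wald comparison (all values hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
trt_c = rng.normal(-5.0, 4.0, 50)   # hypothetical completer change scores
ctl_c = rng.normal(-1.5, 4.0, 55)

# Pre-specified starting deltas: plausible completer-minus-dropout mean
# differences within each arm, chosen here purely for illustration.
delta_t, delta_c = 1.5, 0.5

diff = (trt_c.mean() + delta_t) - (ctl_c.mean() + delta_c)
# The deltas are constants, so the variance comes from the completer means.
se = np.sqrt(trt_c.var(ddof=1)/len(trt_c) + ctl_c.var(ddof=1)/len(ctl_c))
z = diff / se                   # Wald statistic
p = 2 * stats.norm.sf(abs(z))   # asymptotic two-sided p-value
print(f"composite difference = {diff:.2f}, z = {z:.2f}, p = {p:.4f}")
```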

Much of the controlled imputation and tipping point analysis framework is motivated from the perspective of pattern mixture models. And pattern mixture models are but one of three modeling frameworks designed to estimate efficacy under the assumption of MNAR rather than MAR. These are described next.

Models Assuming MNAR

Speaking of MNAR, are we all ready for more fun? I’ll take your silence as a “yes”. There are three primary models, all based on distinct factorizations of the joint likelihood of the outcome model and the missingness mechanism: the Heckman selection model, pattern mixture models, and shared parameter models, a special case of which is the joint model.

The joint distribution for the outcome model and missingness mechanism can be expressed as

$$ p(y_{obs},y_{mis},r;\theta_y,\theta_r) $$

Selection models are based on the idea that subjects select either to continue in or to drop out of the study.[5] The selection model factors the joint distribution of the outcome model and missingness mechanism into two parts: the first is the marginal distribution of the outcome model, and the second is the distribution of the missingness mechanism conditioned on the longitudinal outcome.

$$ p(y_{obs},y_{mis},r;\theta_y,\theta_r) = p(y_{obs},y_{mis};\theta_y) p(r|y_{obs},y_{mis};\theta_r)$$

Pattern mixture models are based on the opposite factorization: the conditional distribution of the longitudinal outcomes given the missingness mechanism and the marginal distribution of the missingness mechanism.[6]

$$ p(y_{obs},y_{mis},r;\theta_y,\theta_r) = p(y_{obs},y_{mis}|r;\theta_y) p(r;\theta_r)$$

Unfortunately, both selection and pattern mixture models are computationally intensive, and the pattern mixture model yields a series of counterfactuals to evaluate but no clear conclusion: if parameter estimates for missingness pattern $r=[1,1,1,1,1,0,0]$ diverge from parameter estimates for missingness pattern $r=[1,1,1,0,0,0,0]$, then what do we conclude?

Ideally, we would have a single outcome model summarizing the data adjusted for the missingness mechanism. Enter the joint model, which does just that. The joint model employs shared random effects to relate the measurement model to the missingness mechanism.[7] This decomposition is composed of just three elements: The conditional distribution of the outcome model given the random effects and outcome model parameters; the conditional distribution of the missingness given the random effects and missingness parameters; and the distribution of the random effects given the random effect parameters. And all this is marginalized over the random effects, reducing down to an outcome model and a missingness model that share estimated random effects which express the relationship between the missingness model and the outcome model. And in so doing, this conditioning and explicit linking of the two models permits accurate estimation and inference in the outcome model even when the data are missing at random, missing not at random, or both.

$$\begin{aligned} p(y_{obs},y_{mis},r;\theta_y,\theta_r) &= p(y_{obs},y_{mis}|r;\theta_y) p(r;\theta_r) \\ &= \int p( y_{obs}, y_{mis}| b; \theta_y) p(r | b; \theta_r) p(b|\theta_b) \mathrm{d}b \end{aligned}$$

Conditional on the shared random effects, the outcome model and the missingness mechanism are independent.
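To make the shared-parameter intuition tangible, here is a minimal simulation, with all parameters invented for illustration: a single random effect $b$ drives both the outcome trajectory and the probability of dropout, so missingness depends on unobserved outcome levels (MNAR) and naive completer means drift away from the truth.

```python
# A minimal simulation of the shared-parameter mechanism (invented parameters).
import numpy as np

rng = np.random.default_rng(11)
n, visits = 500, 5
b = rng.normal(0.0, 1.0, n)    # shared random intercept
t = np.arange(visits)

# Outcome: a common trajectory plus the subject's random effect.
y = 2.0 + 0.5 * t + b[:, None] + rng.normal(0.0, 1.0, (n, visits))

# Dropout: the same b raises the per-visit dropout probability, so
# subjects with high (worse) latent levels leave earlier -> MNAR.
p_drop = 1.0 / (1.0 + np.exp(-(-2.0 + 1.2 * b)))
for i in range(n):
    for j in range(1, visits):
        if rng.random() < p_drop[i]:
            y[i, j:] = np.nan
            break

# Observed means sag below the true marginal means 2.0 + 0.5*t at later
# visits: the high-b subjects are exactly the ones who vanished.
print("observed:", np.nanmean(y, axis=0).round(2))
print("truth:   ", (2.0 + 0.5 * t).round(2))
```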

An important distinction among these three modelling frameworks is the flexibility with which they can accommodate intermittent missing data (non-monotone missingness) versus being restricted to discontinuation or drop-out (monotone missingness). For example, the missingness patterns given above for the pattern mixture model were all monotone, e.g., $r=[1,1,1,1,1,0,0]$, in contrast to a non-monotone pattern such as $r=[1,0,0,1,1,0,0]$. As it turns out, only the joint model can easily accommodate both monotone and non-monotone missing data under the MNAR assumption, while both the selection and pattern mixture models, to date, assume that monotone missing data are MNAR and that intermittent (non-monotone) missing data are MAR. And, as you will note when reading the selection and pattern mixture model literature, both repeatedly and exclusively discuss drop-out or discontinuation. Conspicuously absent from these manuscripts is the explicit acknowledgment that the models’ ability to adjust for MNAR data is restricted to the narrow circumstance of drop-out or discontinuation; intermittent missing data are assumed MAR under both models. But, as we know with PRO data, intermittent missing data are not uncommon and could be MNAR. Which raises the question: does the restriction to monotone missing data limit the utility of selection and pattern mixture models for PRO data? I would argue that the answer is yes.
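In code, the monotone versus non-monotone distinction reduces to a simple rule on the $r$-vector; the helper below is ours, written purely for illustration.

```python
# The r-vector notation from above: 1 = observed visit, 0 = missing visit.
def is_monotone(r):
    """True if, once a visit is missed, every later visit is also missed."""
    seen_missing = False
    for observed in r:
        if observed == 0:
            seen_missing = True
        elif seen_missing:   # observed again after a gap -> intermittent
            return False
    return True

print(is_monotone([1, 1, 1, 1, 1, 0, 0]))  # True: drop-out (monotone)
print(is_monotone([1, 0, 0, 1, 1, 0, 0]))  # False: intermittent
```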

And in the next article I describe how the joint model can be augmented to test efficacy while easily accommodating both drop-out and intermittent missing data under the assumption of MNAR within a single modelling framework.

References

[1] Carpenter, JR, Roger, JH, & Kenward, MG. 2013. Analysis of longitudinal trials with protocol deviations: a framework for relevant, accessible assumptions and inference via multiple imputation. Journal of Biopharmaceutical Statistics; 23: 1352-1371.
[2] Kenward, MG. 2015. Controlled multiple imputation methods for sensitivity analyses in longitudinal clinical trials with dropout and protocol deviation. Clinical Investigation; 5(3): 311-320.
[3] Liu, Y, Zhou, K, & Sims, KD. 2025. Tipping point analysis: assessing the potential impact of missing data. JAMA; 334(3): 265-266.
[4] Torres, C, et al. 2025. A tipping point method to evaluate sensitivity to potential violations in missing data assumptions. Pharmaceutical Statistics; 24(3).
[5] Heckman, J. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement; 5: 475-492.
[6] Diggle, P & Kenward, MG. 1994. Informative drop-out in longitudinal data analysis. Applied Statistics; 43(1): 49-73.
[7] Rizopoulos, D. 2012. Joint Models for Longitudinal and Time-to-Event Data, with Applications in R. Boca Raton: Chapman and Hall/CRC.
