Sep 8, 2025

Missing Data Mechanisms in COA Data, How They Are Defined and Why They Matter


“We do not recommend using the last observation carried forward (LOCF) method as it is only valid under an MCAR assumption. Instead, we recommend you conduct sensitivity analyses alternating the mechanism assumed to give rise to your missing data (i.e., MAR and MNAR). This can include controlled imputation and/or tipping point analyses”

FDA Feedback received in 2017

“We recommend you conduct sensitivity analyses examining the robustness of your efficacy findings to violations of the assumption that data are missing at random.”

FDA Feedback received in 2019

In this article we are going to attempt to thread a very difficult needle: explaining missing data theory and why it matters in a manner rigorous enough for classically trained statisticians while simultaneously being accessible and clarifying for applied researchers. The focus will be on clinical outcome assessment (COA) data, but not restricted to validation, as the missing data mechanism considerations outlined here apply equally to missing data occurring when COAs are deployed as endpoints in randomized clinical trials.

Why should you keep reading?

Once upon a time it was LOCF all day every day. As seen in the anonymized quoted FDA feedback above, everything has changed with respect to missing data, except for one thing: 

  • Regardless of what kind of study you run

  • No matter what you do

  • No matter how hard you work

  • No matter how brilliantly you plan

  • No matter how many procedures you put in place to prevent it

  • Even eCOA solutions will not prevent it

your study will have missing data and you will have to do something about it. Period.

That missing data may be intermittent, it may be drop-out, or it may be both, but missing data will happen. And while people will say things like “a low rate of missing data will not bias results” or “less than 10% missing data is not concerning,” that is incorrect: FDA has conducted missing data sensitivity analyses in the presence of as little as 8% missing data.[1] And the atypical, major problem with this issue is that the uncertainty introduced by missing data is not the typical “statistical uncertainty” we are accustomed to: you cannot sample-size your way out of the effect of missing data.[2],[3] Analyses that establish the robustness of your findings to missing data are the only viable response to the uncertainties missing data introduces.

To help us all gain an understanding of these issues and their corresponding solutions, this three-article series consolidates a vast literature to explain how missing data theory functions, provides strategies for linking analysis methods to the missing data mechanism most likely to operate on your data, and importantly outlines the current landscape of regulatory-favored procedures for conducting sensitivity analyses probing the validity of missing data solution assumptions. That latter point is key because, as seen in the anonymized FDA feedback above, requests for such sensitivity analyses from FDA are becoming routine. The last article in this series provides a detailed examination of how the joint survival model can be used to overcome virtually all of the limitations of the methods for missing data routinely employed and described in the first two articles.

So, let’s begin.

Modern missing data theory began with Donald B. Rubin’s seminal 1976 publication.[4] While we all accept that outcomes are random variables that can be assumed to conform to certain probability distributions, Rubin’s insight was that the missing data, or more precisely, the indicator variable that accounted for whether data was missing or observed, could also be conceived of as a random variable. What do we mean by “indicator variable”? Well, let’s say you have a patient reported outcome (PRO) collected longitudinally at 7 assessments and for some participants their vector of PRO values is not complete. Say, for example, someone like Timmy here: 

| Subject | Visit | PRO Score |
| --- | --- | --- |
| Timmy | Baseline | 10 |
| Timmy | 1 | 9 |
| Timmy | 2 | 6 |
| Timmy | 3 | 8 |
| Timmy | 4 | 4 |
| Timmy | 5 | . |
| Timmy | 6 | . |

The indicator variable, i.e., the variable indicating whether the data are missing or observed, is coded 1 when data are present and 0 when missing. This variable is referred to as $r$ in the missing data literature. The manner in which $r$ depends on the outcome “$y$”, here a PRO score, forms the basis for all the missing data mechanisms referred to by the acronyms MCAR, MAR, and MNAR, defined and explained shortly.

| Subject | Visit | PRO Score | $r$ |
| --- | --- | --- | --- |
| Timmy | Baseline | 10 | 1 |
| Timmy | 1 | 9 | 1 |
| Timmy | 2 | 6 | 1 |
| Timmy | 3 | 8 | 1 |
| Timmy | 4 | 4 | 1 |
| Timmy | 5 | . | 0 |
| Timmy | 6 | . | 0 |
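As a concrete sketch (the data are just Timmy's hypothetical scores, with `NaN` marking the missing visits), the indicator $r$ can be computed directly from the outcome vector:

```python
import numpy as np

# Timmy's PRO scores at baseline plus visits 1-6; NaN marks missing visits 5 & 6
y = np.array([10, 9, 6, 8, 4, np.nan, np.nan])

# Missingness indicator r: 1 where y is observed, 0 where y is missing
r = (~np.isnan(y)).astype(int)
print(r)  # [1 1 1 1 1 0 0]
```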

Here the outcome, in this case the PRO score and henceforth for the sake of generality referred to as y, can be decomposed into the observed and missing data as follows:

  1. $y_{obs} = \{\, y : r = 1 \,\}$

  2. $y_{mis} = \{\, y : r = 0 \,\}$

With the complete response vector defined as the combination of the two elements: $y = (y_{obs}, y_{mis})$.

| Subject | Visit | $y_{obs}$ | $y_{mis}$ | $y$ | $r$ |
| --- | --- | --- | --- | --- | --- |
| Timmy | Baseline | 10 |  | 10 | 1 |
| Timmy | 1 | 9 |  | 9 | 1 |
| Timmy | 2 | 6 |  | 6 | 1 |
| Timmy | 3 | 8 |  | 8 | 1 |
| Timmy | 4 | 4 |  | 4 | 1 |
| Timmy | 5 |  | . | . | 0 |
| Timmy | 6 |  | . | . | 0 |

And given Rubin’s (1976) insights, we can express a conditional distribution for the probability of the missing data indicator, $r$, as a function of $y_{obs}$, $y_{mis}$, and some assumed governing parameters. Note that, to avoid overwhelming applied researchers, heavy mathematical notation is omitted, but it should be assumed that we are operating at the level of a single individual's outcome data, referred to as the $i^{th}$ individual's sub-vector of the outcome and subscripted “i”: $y_i$. When aggregated across all subjects, these sub-vectors reconstruct the entire outcome data $Y$. All of this is denoted as $y_{i} \in Y$, and all that says is what we already said: that $y_i$ is an element of $Y$.

 $$ p(r | y_{obs}, y_{mis}; \theta_r) $$

And all this says is what we have already seen: whether and when $r$ takes on a value of one depends on data like Timmy’s, as seen in the last column: 0 where data are missing and 1 where data are observed. And it further assumes that there are some parameters, $\theta_r$ (these depend on the specific mechanism and examples are given below), that can help explain where the 0 and 1 values occur.

For these explanations we will stick with Schafer & Graham’s iconic example[5] because it is so clarifying, though it does not involve COA data. COA analogs are presented thereafter.

How can missing data depend on observed data? Well, let’s say the outcome was blood pressure, and that per protocol any systolic blood pressure exceeding 130 would result in participant withdrawal from study. And let’s say Timmy registers a systolic blood pressure of 180 at visit 4, then all subsequent data would be missing because of the observed value at visit 4.  This is referred to as data missing at random (MAR).
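A minimal sketch of that hypothetical withdrawal rule (the blood pressure values are invented for illustration): once an observed systolic reading exceeds 130, every subsequent visit is missing, so missingness depends only on observed data.

```python
# Hypothetical per-protocol rule: withdraw after any observed systolic BP > 130
bp = [120, 118, 122, 125, 180, 124, 119]  # baseline + visits 1-6 (values invented)

r, withdrawn = [], False
for reading in bp:
    r.append(0 if withdrawn else 1)       # this visit is observed unless already withdrawn
    if not withdrawn and reading > 130:   # the OBSERVED 180 at visit 4 triggers withdrawal
        withdrawn = True
print(r)  # [1, 1, 1, 1, 1, 0, 0] -- visits 5 & 6 missing because of the observed visit-4 value
```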

A COA analog would be per-protocol study termination based on clinician-rated Clinical Dementia Rating Scale global (CDR-Global) in AD trials when participants transition to full dementia. Under this paradigm, the hypothetical MAR scenario consists of Timmy transitioning to full dementia at study visit 4 making his data missing at planned study visits 5 & 6.

How can missing data depend on missing data? I mean, talk about tautology, right?

Let’s stick with Timmy and the blood pressure outcome. Timmy is doing fine through visit 4, but thereafter Timmy has become hypotensive. Had he participated in study visits 5 & 6, the systolic blood pressure recorded may have been 80. But because of his hypotensive status, Timmy is feeling dizzy and therefore does not attend visits 5 & 6. In this case, the data is missing because of the blood pressure issue that prevented Timmy from contributing data. And that data was not recorded in the CRF.  This is referred to as data missing not at random (MNAR).

Continuing with Timmy’s dementia COA analog, under MNAR the hypothetical scenario consists of Timmy transitioning to full dementia after study visit 4 and being so incapacitated that he cannot contribute to data collection at planned study visits 5 & 6. The severity of Timmy’s dementia, which we do not observe, causes Timmy to drop out. By design that dementia severity would have been captured on the visit 5 CDR-Global, but it wasn’t. And that is MNAR.

But just as missing data can depend on the observed data and/or the missing data, it can also depend on neither. For example, suppose Timmy would have contributed valid blood pressure data at visits 5 & 6, but a family emergency required him to travel out of state, thereby creating missing data at visits 5 & 6 completely unrelated to the outcome. This is referred to as data missing completely at random (MCAR).

For those interested, these concepts are formalized by alternative conditional distributions.

  • MCAR:  $ p(r | y_{obs}, y_{mis}; \theta_r) = p(r | \theta_r) $; note that on the right side of the equation $y_{obs}$ & $y_{mis}$ are absent because the missing data does not depend on the outcome.

  • MAR:   $ p(r | y_{obs}, y_{mis}; \theta_r) = p(r | y_{obs}; \theta_r) $; note that on the right side of the equation $y_{mis}$  is absent, because those values are not responsible for the missing data.

    • Note that under MAR, the parameters $(\theta_r)$ could include the MMRM fixed effect coefficients for the repeated measures, the covariance parameters from the MMRM for the repeated measures, or MMRM fixed effects for covariates related to missingness (for example, if males were more likely to drop out than females and sex was included as a covariate).  

  • MNAR: $ p(r | y_{obs}, y_{mis}; \theta_r) = p(r | y_{mis}; \theta_r) $ or $p(r | y_{obs}, y_{mis}; \theta_r) $; note that on the right side of the equation $y_{mis}$ is present either on its own or together with $y_{obs}$, because the unobserved values of $y$, $y_{mis}$, are assumed to be either wholly or partially responsible for the missing data.

    • Note that under MNAR, the parameters $\theta_r$  could include the probability of missing data pattern and the pattern mixture model fixed & random effect coefficients. Or, under a joint model, the random effects and the time to drop-out. 
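To make the three conditional distributions tangible, here is a small simulation sketch (the distributions and coefficients are invented for illustration): the indicator $r$ for a second visit, $y_2$, is drawn under an MCAR, a MAR, and an MNAR rule, and only under MCAR does the observed mean of $y_2$ remain unbiased.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Two correlated visits; the true mean of y2 is 0
y1 = rng.normal(0.0, 1.0, n)
y2 = 0.6 * y1 + rng.normal(0.0, 0.8, n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# MCAR: p(r | theta_r) -- missingness ignores both y1 and y2
r_mcar = rng.random(n) > 0.3

# MAR: p(r | y_obs; theta_r) -- missingness depends only on the OBSERVED y1
r_mar = rng.random(n) > sigmoid(y1)

# MNAR: p(r | y_mis; theta_r) -- missingness depends on the UNOBSERVED y2 itself
r_mnar = rng.random(n) > sigmoid(y2)

print(y2.mean())          # ~0 (full data)
print(y2[r_mcar].mean())  # ~0: MCAR leaves the observed mean unbiased
print(y2[r_mar].mean())   # clearly below 0: MAR biases naive summaries
print(y2[r_mnar].mean())  # clearly below 0: MNAR biases naive summaries
```

Note that the MAR rule still biases the naive observed-data mean; MAR only guarantees that likelihood-based methods conditioning on $y_{obs}$ remain valid, which is exactly the point made in the MAR section below.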

Alright, so we understand the concepts and we can see how these concepts can be expressed in terms of conditional probabilities. But practically, what can be done about this? I mean, we need a solution. As you might anticipate, the solutions get increasingly complex as the missing data mechanism becomes increasingly complex; with the simplest being MCAR and the most complex being MNAR.

MCAR

Fortunately, under MCAR nothing is required to maintain unbiased inference on the outcome.

MAR

What about MAR? A bit trickier than MCAR. Under MAR, descriptive statistics and data visualizations computed on the observed data will misrepresent the complete data. Only likelihood-based analyses have the ability to yield an accurate characterization of the data.

And this is why most COA endpoints measured in longitudinal designs employ efficacy analyses that are likelihood based. For example, the ubiquitous mixed model for repeated measures (MMRM) is used because it is likelihood based and therefore robust to data missing at random.

Focusing on model-based solutions and not imputation, we can see that accurate estimates and inference on the observed data are achievable so long as the outcome model (model for $y$) is correctly specified and the outcome model parameters $\theta_y$ are independent of the missingness parameters $\theta_r$. This property is the definition of the expression “ignorability” of missing data. 

Specifically, so long as the full likelihood for the outcome model and the missingness can be factorized into independent probabilities, ignorability holds. A brief summary of this factorization is given next.

Let the full likelihood $L(\theta)$  be defined as follows:

$$\begin{aligned} L(\theta) &= \int p( y_{obs}, y_{mis}, r; \theta_y, \theta_r) \, \mathrm{d} y_{mis} \\ &= \int p( y_{obs}, y_{mis}; \theta_y) \, p(r | y_{obs}, y_{mis}; \theta_r) \, \mathrm{d} y_{mis} \\ &= \int p( y_{obs}, y_{mis}; \theta_y) \, p(r | y_{obs}; \theta_r) \, \mathrm{d} y_{mis} \quad \text{(under MAR)} \\ &= p( y_{obs}; \theta_y) \, p(r | y_{obs}; \theta_r) \\ &= L(\theta_y) \, L(\theta_r) \end{aligned}$$
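A numerical sketch of why this factorization matters (simulated data with invented parameters, not a real trial): under MAR dropout, the complete-case mean of a second visit $y_2$ is biased, but modeling $p(y_2 \mid y_1)$ on the completers and averaging over everyone's observed $y_1$, which is in spirit what likelihood-based methods like MMRM do, recovers the true mean.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# True model: y2 has mean 5.0 and depends linearly on y1 (parameters invented)
y1 = rng.normal(10.0, 2.0, n)
y2 = 5.0 + 0.8 * (y1 - 10.0) + rng.normal(0.0, 1.0, n)

# MAR dropout: probability of missing y2 rises with the OBSERVED y1
p_miss = 1.0 / (1.0 + np.exp(-(y1 - 10.0)))
obs = rng.random(n) > p_miss

# Naive complete-case mean of y2 is biased downward...
cc_mean = y2[obs].mean()

# ...but fitting p(y2 | y1) on completers and averaging the fitted values
# over ALL subjects' observed y1 (the likelihood-based answer) recovers ~5.0
slope, intercept = np.polyfit(y1[obs], y2[obs], 1)
ml_mean = (intercept + slope * y1).mean()
print(cc_mean, ml_mean)  # cc_mean well below 5.0; ml_mean close to 5.0
```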

But, because valid estimation and inference under this factorization is predicated on the assumption that the outcome model is correctly specified, one often hears the naïve critique that MAR is unreasonable. For example, in two regulatory reviews since 2017 I have received the following comment regarding the use of linear mixed models: “The mixed model is not recommended because it relies on the untestable assumption that data are MAR.” In one of those reviews the reviewer went further to recommend, incorrectly, that a better solution would be to use “linear regression with Huber-White corrected standard errors”.

This criticism disregards five major considerations:

  1. First of all, outside of simulations, there is no such thing as a correctly specified model. So, this criticism is a bit of a straw person. Further, while this idea of a correctly specified model may seem to imply a mythical quest to divine all possible relevant predictors, within the context of randomized controlled trials, the data generation mechanism is mainly derived from the trial design; therefore, the risk of model misspecification when the design is properly modeled is relatively low.

 

  2. Second, having an untestable assumption regarding missing data in one of the most useful modeling frameworks of the modern era certainly cannot be used to justify a demonstrably inferior alternative like linear regression with Huber-White corrected standard errors. Sensitivity analyses may be required, but not the use of inferior modelling frameworks.

 

  3. Third, this criticism ignores all the evidence of robustness of likelihood-based estimation under model misspecification.

 

  • The likelihood estimator remains a consistent estimator of the population parameters even when:

    i. Fixed effects are partially misspecified[6]

    ii. The random effect distribution is misspecified[7]

    iii. Random effect covariances are misspecified[8]

    iv. Heteroscedastic data are present[9]


  4. Fourth, the assumption that the outcome model parameters $\theta_y$ and missingness parameters $\theta_r$ are independent can be violated and likelihood estimators remain consistent and asymptotically normal. However, when this assumption is violated, standard errors will be less efficient than under complete data.


  5. Fifth, as noted by the 2010 panel for The Prevention and Treatment of Missing Data in Clinical Trials, “A more realistic condition […] for many studies is MAR”.[10]

Taking all of this evidence together, the logical conclusion is that your linear mixed model will yield accurate estimates and inference when the data are MAR. 

However, because MAR is an assumption whose validity cannot currently be proven, the sensitivity of models assuming MAR to violations of that assumption is important to understand. Such sensitivity analyses are essential to assessing the validity of the assumption, as a function of the robustness or susceptibility of model estimates and inference to departures from MAR. The next article delves into viable and regulatory-favored mechanisms for conducting such examinations via sensitivity analyses.


REFERENCES

[1] August 2017 Arthritis Advisory Committee FDA Briefing Materials on Pfizer’s application for Xeljanz to treat psoriatic arthritis. (See the appendix starting on page 73 for sensitivity examples; if the file has been archived to the abyss, we have a physical copy and can email it to you.) https://fda.report/media/106613/FDA-Briefing-Information-for-the-August-3--2017-Meeting-of-the-Arthritis-Advisory-Committee.pdf
[2] Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley, Hoboken, NJ, USA (2002). 
[3]  Molenberghs G, Kenward MG. Statistics in practice. In: Missing Data in Clinical Studies. Wiley, Chichester, UK (2007). 
[4] Rubin, D. B. 1976. Inference and missing data. Biometrika 63, 581–92.
[5] Schafer JL, Graham JW. 2002. Missing data: our view of the state of the art. Psychological Methods;7(2):147-77.
[6] Royall, Richard M. 1986. Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54: 221–26.
[7] Verbeke, Geert, and Emmanuel Lesaffre. 1997. The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis 23: 541–56.
[8] Liang, Kung-Yee, and Scott L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22. 
[9] Verbeek, Marno. 2008. A Guide to Modern Econometrics. New York: Wiley.
[10] National Academies of Sciences, Engineering, and Medicine. 2010. The Prevention and Treatment of Missing Data in Clinical Trials. Washington, DC: The National Academies Press. https://doi.org/10.17226/12955. https://nap.nationalacademies.org/read/12955/chapter/6
