Dealing with Missing Data from Sensors and Wearables
Sensors and wearables have huge potential for developing clinical trial measures that help us better understand the effects of new treatment interventions. As an industry, we have developed consensus recommendations and approaches on how to select wearables and sensors that provide data that are robust and reliable enough for regulatory decision making. We have also developed implementation considerations and best practices. These are set out in the work of the Digital Medicine Society (DiMe)1, the ePRO Consortium2, the Drug Information Association3, and in the recent draft guidance on digital health technologies from FDA.4
One gap in our knowledge when using sensors and wearables is how to adequately deal with missing data. This is a particularly important question when we consider data from continuous streaming devices like activity monitors and continuous glucose monitors (CGMs). For CGM data, where “time in range” is one of the key derived measures describing glycaemic control, a common approach to missing data is to exclude patients who provide data for less than 70% of monitoring days across 14 consecutive days.
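As a minimal sketch of this kind of completeness rule, the function below computes time in range only when enough data are present. It makes several assumptions that are not part of the text above: coverage is measured as the share of expected readings that were captured (one possible reading of the 70% rule), the target range defaults to 70 to 180 mg/dL, and the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def time_in_range(
    readings: Sequence[Optional[float]],
    low: float = 70.0,           # assumed lower bound of the target range (mg/dL)
    high: float = 180.0,         # assumed upper bound of the target range (mg/dL)
    min_coverage: float = 0.70,  # completeness threshold from the exclusion rule
) -> Optional[float]:
    """Percent of observed readings within [low, high].

    `readings` holds one slot per expected reading over the monitoring
    period; None marks a missing reading. Returns None when coverage is
    below `min_coverage`, i.e. the patient would be excluded.
    """
    if not readings:
        return None
    observed = [r for r in readings if r is not None]
    coverage = len(observed) / len(readings)
    if coverage < min_coverage:
        return None  # too little data: excluded from the endpoint
    in_range = sum(1 for r in observed if low <= r <= high)
    return 100.0 * in_range / len(observed)
```

Note that the estimate is computed from observed readings only, so it implicitly treats the missing readings as similar to the observed ones.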
For activity monitoring data, a similar approach is commonplace: patients providing too few “valid days” (days with at least a minimum amount of wear time, e.g., 12 hours or more) are excluded from the endpoint calculation and analysis. The rationale for this is that without this quantity of data, the estimates of activity or glycaemic control are considered unreliable.
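The valid-day rule can be sketched in a few lines. This is an illustration under stated assumptions rather than a standard implementation: the 12-hour wear threshold comes from the example above, while the minimum of four valid days and all names are hypothetical choices made for the sketch.

```python
from typing import Optional, Sequence


def activity_endpoint(
    activity_minutes: Sequence[float],
    wear_hours: Sequence[float],
    min_wear_hours: float = 12.0,  # a day is "valid" at or above this wear time
    min_valid_days: int = 4,       # hypothetical minimum number of valid days
) -> Optional[float]:
    """Mean daily activity over valid days, or None if the patient is excluded."""
    if len(activity_minutes) != len(wear_hours):
        raise ValueError("activity_minutes and wear_hours must have equal length")
    # Keep only days that meet the wear-time requirement.
    valid = [m for m, h in zip(activity_minutes, wear_hours) if h >= min_wear_hours]
    if len(valid) < min_valid_days:
        return None  # too few valid days: excluded from the analysis
    return sum(valid) / len(valid)
```

Both thresholds are analysis choices, and each one discards data: days below the wear threshold, and whole patients below the valid-day count.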
Reasons for missing clinical data
Discarding data always seems unsatisfactory, but in the absence of other approaches, it is not surprising that this is typically the way chosen to deal with missingness. On the surface, only including data exceeding a certain threshold of completeness sounds intuitively sensible, but we need to consider the reasons that data are missing.
Statisticians distinguish three types of missingness: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In the context of wearables data, MCAR might occur due to a device malfunction or an error in data transfer, resulting in loss of data. MAR might occur if, for example, missing data were more frequently observed in female than in male participants, perhaps due to the size or form factor of the device affecting its use. MNAR might occur because the patient elects not to use the device at times when they are feeling unwell. The way we deal with missing data may introduce bias depending on the reason for missingness.
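A small simulation can make the three mechanisms concrete. The sketch below is a toy model, and every number in it is an assumption chosen for illustration: activity differs by sex, MCAR drops 30% of records for everyone, MAR drops records more often for female participants, and MNAR drops records more often when the (unseen) activity value is low.

```python
import random
from statistics import mean


def simulate_missingness(n: int = 20000, seed: int = 0) -> dict:
    """Compare the true mean activity with means computed from incomplete data."""
    rng = random.Random(seed)
    rows = []  # (is_female, activity, kept_mcar, kept_mar, kept_mnar)
    for _ in range(n):
        female = rng.random() < 0.5
        # Toy assumption: mean activity differs by sex.
        activity = rng.gauss(25.0 if female else 35.0, 10.0)
        kept_mcar = rng.random() >= 0.30  # same loss rate for everyone
        kept_mar = rng.random() >= (0.50 if female else 0.10)  # depends on observed sex
        kept_mnar = rng.random() >= (0.60 if activity < 20.0 else 0.10)  # depends on the value itself
        rows.append((female, activity, kept_mcar, kept_mar, kept_mnar))

    share_female = mean(1.0 if r[0] else 0.0 for r in rows)
    # Under MAR, averaging within sex and reweighting by the known sex mix
    # uses only observed information to correct the naive estimate.
    female_mean = mean(r[1] for r in rows if r[3] and r[0])
    male_mean = mean(r[1] for r in rows if r[3] and not r[0])
    return {
        "true": mean(r[1] for r in rows),
        "mcar": mean(r[1] for r in rows if r[2]),
        "mar_naive": mean(r[1] for r in rows if r[3]),
        "mar_adjusted": share_female * female_mean + (1 - share_female) * male_mean,
        "mnar": mean(r[1] for r in rows if r[4]),
    }


results = simulate_missingness()
```

In this toy setup the MCAR mean stays close to the true mean, the naive MAR mean drifts upward but can be corrected using the observed covariate, and the MNAR mean is biased upward in a way that the observed data alone cannot reveal.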
An example of how to handle missing activity data
Let’s consider this in the context of rules that include or exclude data according to a completeness threshold. Catellier et al5 report an interesting illustration.
In their study of continuous activity data collected over seven consecutive days among school children, they estimated the time spent in moderate to vigorous physical activity (MVPA) first using all data, and then under several valid-day definitions that excluded days with less than eight, ten, or twelve hours of daily wear time.
The overall dataset was shown to underestimate MVPA as it contained some days with only a few hours of wear time. However, there were also differences in the MVPA estimated using the “valid days” rules and the authors inferred that excluding invalid days may introduce bias due to differences in activity between valid and invalid days. This would certainly be the case if data were missing not at random.
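The sensitivity of an estimate to the valid-day definition can be illustrated with a short sketch. The seven days of data below are invented for the example and are not taken from Catellier et al; the function name is hypothetical. A threshold of 0 hours corresponds to using all data.

```python
def mvpa_by_threshold(days, thresholds=(0, 8, 10, 12)):
    """Mean daily MVPA minutes under each minimum-wear-time rule.

    `days` is a list of (wear_hours, mvpa_minutes) pairs for one participant.
    """
    estimates = {}
    for threshold in thresholds:
        kept = [mvpa for wear, mvpa in days if wear >= threshold]
        estimates[threshold] = sum(kept) / len(kept) if kept else None
    return estimates


# Invented toy data: (wear_hours, mvpa_minutes) for seven days.
toy_days = [(14, 60), (13, 50), (11, 40), (9, 30), (7, 15), (3, 5), (12, 45)]
estimates = mvpa_by_threshold(toy_days)
```

With these toy values the estimate rises from 35.0 minutes using all days to 45.0 and 48.75 minutes under the eight-hour and ten-hour rules, and rises again under the twelve-hour rule. Whether the higher figures are closer to the truth depends on why the short-wear days are short.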
Emerging methodologies
In our paper6, we explore emerging statistical approaches for addressing missingness in continuously streamed sensor data. We look at the use of within-patient imputation techniques that use information from complete segments of each day’s time series profile to estimate values in incomplete segments.
These may be suitable when data are missing at random, as we can assume that the values in the missing data segment are from the same distribution as the data from complete segments. Not so, however, if data are missing not at random.
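To illustrate the general idea of within-patient imputation, the sketch below fills each missing time slot with the mean of the same slot on that patient's other days. This is a deliberately simple stand-in chosen for illustration, not the method described in the paper, and the function name and data layout are assumptions.

```python
from typing import List, Optional


def impute_within_patient(
    days: List[List[Optional[float]]],
) -> List[List[Optional[float]]]:
    """Fill missing slots using the patient's own data from other days.

    `days` holds one list per day with the same number of time slots
    (e.g. hourly values); None marks a missing slot. A slot that is
    missing on every day stays None.
    """
    if not days:
        return []
    n_slots = len(days[0])
    slot_means: List[Optional[float]] = []
    for slot in range(n_slots):
        observed = [day[slot] for day in days if day[slot] is not None]
        slot_means.append(sum(observed) / len(observed) if observed else None)
    # Keep observed values; replace missing ones with the slot mean.
    return [
        [value if value is not None else slot_means[slot]
         for slot, value in enumerate(day)]
        for day in days
    ]
```

This kind of fill-in assumes the missing values resemble the observed ones for the same patient and time of day, which is exactly the assumption that fails when data are missing not at random.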
We also explore emerging approaches, including functional data analysis and deep learning methods, with the aim of generating more discussion and research on the optimal ways to deal with missing data when estimating endpoints derived from continuous sensor or wearable data.
Dealing with missingness requires a thoughtful statistical approach to generate robust and reliable inferences. As we design trials to collect continuous data from sensors and wearables, limiting missing data must be an important consideration. The amount of missing data can be affected by a number of factors, including the chosen device, placement location, wear interval, and use of real-time reminders and nudges to drive wear compliance. In addition, it is important to collect the reason for missingness, as this helps to identify whether data are missing at random or not at random, which enables us to correctly classify intercurrent events and account for them in the approaches we adopt.
Read the full Elsevier article.