title: Predicting and Visualizing Daily Mood of People Using Tracking Data of Consumer Devices and Services
authors: Reiser, Christian
date: 2022-02-08

Users can easily export personal data from devices (e.g., weather station and fitness tracker) and services (e.g., screentime tracker and commits on GitHub) they use, but they struggle to gain valuable insights. To tackle this problem, we present the self-tracking meta app called InsightMe, which aims to show users how data relate to their wellbeing, health, and performance. This paper focuses on mood, which is closely associated with wellbeing. With data collected by one person, we show how the person's sleep, exercise, nutrition, weather, air quality, screentime, and work correlate with the average mood the person experiences during the day. Furthermore, the app predicts the mood via multiple linear regression and a neural network, achieving an explained variance of 0.55 and 0.50, respectively. We strive for explainability and transparency by showing users the p-values of the correlations, drawing prediction intervals, and illustrating how the original data influence predictions. In addition, we conducted a small A/B test on two ways of visualizing these explanations. The source code and app are available online.

We know that our environment and actions substantially affect our mood, health, and intellectual and athletic performance. However, there is less certainty about how much our environment (e.g., weather, air quality, noise) or behavior (e.g., nutrition, exercise, meditation, sleep) influences our happiness, productivity, sports performance, or allergies. Furthermore, we are sometimes surprised that we are less motivated, our athletic performance is poor, or disease symptoms are more severe. This paper focuses on daily mood. Although negative moods have essential regulating functions, such as signaling the need for help or preventing harmful behavior like going on buying sprees, taking risks, or making foolish investments [1], other studies show that bad moods can also have unfavorable consequences, such as less resistance to temptations, especially to unhealthy food [2], impaired learning capabilities [3], and inhibited creative thinking [4].

Our ultimate goal is to know which variables causally affect our mood, so that we can take beneficial actions. However, causal inference is generally a complex topic and not within the scope of this paper. Hence, we started with a system that computes how past behavioral and environmental data (e.g., weather, exercise, sleep, and screentime) correlate with mood and then uses these features to predict the daily mood via multiple linear regression and a neural network. The system explains its predictions by visualizing its reasoning in two different ways. Version A is based on a regression triangle drawn onto a scatter plot, and version B is an abstraction of the former, in which the slope, height, and width of the regression triangle are represented in a bar chart. We created a small A/B study to test which visualization method enables participants to interpret data faster and more accurately.

The data used in this paper come from inexpensive consumer devices and services, which are passive and thus require minimal cost and effort to use. The only manually tracked variable is the average mood at the end of each day, which was recorded via the app.
This section provides an overview of relevant work, focusing on mood prediction (II-A) and related mobile applications with tracking, correlation, or prediction capabilities (II-B).

In the last decade, affective computing has explored predicting mood, wellbeing, happiness, and emotion from sensor data gathered through various sources. Harper and Southern [5] investigate how a unimodal heartbeat time series, measured with a professional ECG device, can predict emotional valence while the participant is seated. Choudhury et al. detect major depressive disorder in Twitter users who posted more than 4500 tweets on average, with an average accuracy of ~70% [6]. Several studies estimated mood, stress, and health with data from multimodal wearable sensors, a smartphone app, and daily manually reported behaviors such as academic activities and exercise, claiming maximum accuracies of 68.48% [7], 74.3% [8], and 82.52% [9], all with a baseline of 53.94%. Another study scored 78.7% with a baseline of 50.4% [10]. All the studies mentioned above are less practical for nonprofessional users committed to long-term everyday usage because they require expensive professional equipment, time-consuming manual reporting of activity durations, or frequent social media activity. Therefore, we focus on cheap and passive data sources that require minimal attention in everyday life. One study meeting these criteria shows that mood can be predicted from passive data, specifically keyboard and application data of mobile phones, with a maximum accuracy of 66.59% (62.65% without text) [11]. However, this project simplifies mood prediction to a classification problem with only three classes. Furthermore, compared to a high baseline of more than 43% (due to class imbalance), the prediction accuracy of about 66% is relatively low.

Several apps allow users to track their moods but lack correlation and prediction features [12]-[19]. Some health apps allow correlating symptoms with food and behavior but still do not offer prediction [20], [21]. Apps capable of prediction are [22], which estimates the activity of the sympathetic nervous system from heart rate and heart rate variability, and [23], which likewise calculates stress, energy, and productivity levels from heart data. Further, Fitbit allows users to log how they feel and computes a 'Stress Management' score that takes the manually logged feeling, data about sleep, electrodermal activity, and exercise into account [24]. While these apps are capable of prediction, they are specialized in a few data types, which exclude mood, happiness, and wellbeing. The product description of the smartwatch app 'Happimeter' claims to "get your body signals to predict your mood with machine learning" [25]. However, we could not test the app, as it requires the operating system Wear OS; moreover, the app has a user rating of only 1.5 out of 5 stars on Google Play and has not been updated for more than a year [25].

This project aims to use non-intrusive, inexpensive sensors and services that are robust and easy to use over several years. Meeting these criteria, we tracked one person with a Fitbit Sense smartwatch, indoor and outdoor weather stations, a screentime logger, external variables like moon illumination, season, and day of the week, manual tracking of mood, and more. The reader can find a list of all data sources and explanations in the appendix (Section VIII).
This section describes how the data processing pipeline aggregates raw data, imputes missing data points, and exploits the past of the time series. Finally, we explore conspicuous patterns of some features. The sampling rates of the raw data typically vary between one sample every five minutes (e.g., heart rate) and roughly one sample per week (e.g., body weight and VO2max).

1) Data Aggregation: The goal is to have a sampling rate of one sample per day. In most cases, the sampling rate is greater than one sample per 24 hours, and we aggregate the data to daily intervals by taking the sum, 5th percentile, 95th percentile, and median. We use these percentiles instead of the minimum and maximum because they are less noisy, and we found them to be more predictive.

2) Data Imputation: The sampling frequency of body weight and VO2max is usually lower than one sample per 24 hours. Because body weight and VO2max represent physical quantities that change relatively slowly, we assume a linear change, allowing linear interpolation between consecutive measurements to obtain a daily frequency. If many values are missing for a day or a feature, we drop that day or feature, respectively. Otherwise, data imputation fills missing values with the feature's average.

3) Time series: As the dataset is a time series, and yesterday's features could also affect today's mood, we added all of yesterday's features to the set of today's predictors. We also include the moods of previous days as predictors, up to the largest lag that still carries significant information given the moods of the days in between. As shown in Figure 1, the partial autocorrelation [26] determines these lags: we include all lags from one day ago up to, but not including, the first insignificant lag. In our case, this means the mood values of one to four days ago.

4) Standardization: Standardization rescales the features to have a mean of 0 and unit variance.

The dataset has many outliers because the sensors and services are cheap consumer devices. For example, the estimated metabolic energy output, shown in Figure 2, has outlying values at about 1000 kcal and above 4000 kcal. Moreover, Figure 3 shows a suspicious CO2 spike at 5000 ppm. A closer look into the raw sensor data, depicted in Figure 4, reveals an improbable plateau at 5000 ppm. The causal explanation is that the sensor's range ends at 5000 ppm, so all values greater than 5000 ppm are falsely recorded as 5000 ppm. The distribution of the wake-up time looks Gaussian except for one suspicious spike at 320 minutes after midnight. However, an alarm clock set to 5:20 am makes this spike plausible. Improbable values in the dataset are not corrected manually because, due to our strict privacy policy, we do not have access to the data in the actual mobile application. Instead, we rely on robust statistics by aggregating the data via the 5th and 95th percentiles instead of the minima and maxima. Our experiments have shown that these percentiles are more predictive than the maxima.

The app computes the Pearson correlation coefficient and p-values between all attributes. Because each attribute is compared with every other attribute, we correct the p-values according to the Benjamini-Hochberg procedure [27] to control the false discovery rate under multiple testing. We declare a result significant for p < 0.05. Users can visually explore the data via a plotted time series with a seven-day moving average and manually inspect the relationship between two variables through scatter plots.
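To make the aggregation and imputation steps concrete, the following is a minimal sketch with pandas; the file names, column names, and the choice of heart rate and body weight as examples are illustrative assumptions, not the app's actual implementation.

```python
import pandas as pd

# Hypothetical raw data: heart rate sampled every few minutes and
# body weight measured roughly once per week.
hr = pd.read_csv("heart_rate.csv", index_col="time", parse_dates=True)["bpm"]
weight = pd.read_csv("weight.csv", index_col="time", parse_dates=True)["kg"]

# 1) Aggregate to one sample per day using robust statistics
#    (sum, 5th percentile, 95th percentile, median) instead of min/max.
daily = hr.resample("D").agg(
    ["sum", lambda x: x.quantile(0.05), lambda x: x.quantile(0.95), "median"]
)
daily.columns = ["hr_sum", "hr_p05", "hr_p95", "hr_median"]

# 2) Body weight changes slowly, so linear interpolation between
#    consecutive measurements yields a plausible daily value.
weight_daily = weight.resample("D").mean().interpolate(method="linear")

# Remaining missing values are imputed with the feature's average
# (days or features with too many gaps would be dropped beforehand).
features = daily.join(weight_daily.rename("weight_kg"))
features = features.fillna(features.mean())
```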
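The lag selection of step 3 can be reproduced with the partial autocorrelation function from statsmodels. A minimal sketch, assuming the daily mood ratings are stored in an array called mood; the 14-lag horizon is an arbitrary assumption:

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

pacf_vals = pacf(mood, nlags=14)

# A lag counts as significant if its partial autocorrelation falls
# outside the approximate 95% confidence band of +-1.96/sqrt(N).
threshold = 1.96 / np.sqrt(len(mood))

# Include all lags from one day ago up to, but not including, the
# first insignificant lag.
selected_lags = []
for lag in range(1, len(pacf_vals)):
    if abs(pacf_vals[lag]) < threshold:
        break
    selected_lags.append(lag)

print(selected_lags)  # [1, 2, 3, 4] for the dataset used in this paper
```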
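Similarly, the pairwise correlation test with false-discovery-rate control could look as follows; df stands for the imputed daily feature table and is an assumed name:

```python
from itertools import combinations
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

# Pearson correlation and raw p-value for every pair of attributes.
pairs, rvals, pvals = [], [], []
for a, b in combinations(df.columns, 2):
    r, p = pearsonr(df[a], df[b])
    pairs.append((a, b))
    rvals.append(r)
    pvals.append(p)

# The Benjamini-Hochberg correction controls the false discovery rate
# across all pairwise tests; corrected p < 0.05 counts as significant.
reject, p_corr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
significant = [(pair, r) for pair, r, keep in zip(pairs, rvals, reject) if keep]
```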
Because there is a temporal dependency between observations, standard cross-validation, which assigns samples randomly to the train or test set, would use data from the future to forecast the past, which is impossible in the real-life application. We use time-series splits to avoid this fallacy. However, many splits are inefficient w.r.t. data use, because only the last split uses all training data. Ultimately, we create only one split, a simple train-test split in which the training set contains the older data points and the test set the most recent ones.

1) Elastic net regularization: The parameters of the multiple linear regression are estimated on the training dataset while applying elastic net regularization and sample weighting. We use a combined L1 and L2 weight penalty called elastic net regularization [28], with

min_w 1/(2n) ||y − Xw||_2^2 + α ρ ||w||_1 + α(1 − ρ)/2 ||w||_2^2

being the objective function to minimize, where ρ is the L1 ratio, α the strength of the penalty terms, w the weights, X the features, y the targets, and n the number of samples. For ρ = 0, the penalty simplifies to ridge regularization, and for ρ = 1 to lasso regularization. We search for the optimal ρ and α via cross-validation.

2) Sample weighting: We assume that the factors influencing a person's mood change over time. To account for this, we use exponential sample-weight decay, as shown in Figure 6. The weight of a sample follows an exponential decay that is scaled to a value of 1 for today's data point and reaches zero after about 82 years; max(x, 0) ensures that the sample weight never becomes negative.

3) Neural network: The neural network has two fully connected hidden layers, each with a leaky rectified linear unit [29] as activation function. The first and second hidden layers have 16 and 8 neurons, respectively. Because we regress on a scalar, the output layer has one unit. AdamW [30] is the optimizer, minimizing the mean squared error for 4125 epochs with a learning rate of 10^-4 and a weight decay of one. The number of epochs was determined by early stopping using cross-validation [31]. We searched for the best neural network architecture and hyperparameters manually through cross-validation.

Fig. 6. Exponential sample-weight decay discounting older data, as they might be less valuable for predicting today's mood.

Further details are listed in Table III in the appendix. Elastic net regularization with a penalty strength α = 0.12 and L1 ratio ρ = 1 leads to the best prediction performance on our dataset. Table I shows all regression weights w with w ≠ 0; note that lasso regularization selected only nine features to predict the mood. The 95% prediction interval is ±2.3 on the original scale from 1 to 9 and ±1.2 on the standardized scale with unit variance. The average mean squared error on the standardized test set is 0.45, meaning 55% of the original variance can be explained. The effect of sample weighting is negligible. The neural network reaches an average mean squared error of 0.50 on the standardized test set, meaning half of the original variance can be explained.
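As an illustration of the validation and regression setup described above, here is a minimal sketch with scikit-learn, whose ElasticNet implementation minimizes the same objective; the 80/20 split ratio and the candidate grid are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import TimeSeriesSplit

# X, y: standardized daily features and moods in chronological order.
# One chronological split: older data for training, newest for testing,
# so the future is never used to predict the past.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Cross-validated search for the penalty strength alpha and L1 ratio rho,
# using time-series splits within the training data as well.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=TimeSeriesSplit(5))
model.fit(X_train, y_train)

mse = np.mean((model.predict(X_test) - y_test) ** 2)
print(f"test MSE: {mse:.2f}, alpha={model.alpha_:.2f}, rho={model.l1_ratio_}")
```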
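The text does not state the exact decay constant of the sample weighting, so the following sketch uses one plausible parameterization: an exponential decay that is shifted and rescaled so that today's weight is exactly 1 and the weight hits zero after roughly 82 years (about 30,000 days), with max(x, 0) clamping anything older. All constants are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def sample_weight(days_ago, horizon_days=30_000.0, rate=1e-4):
    """Exponential decay, rescaled so that w(0) = 1 and w(horizon) = 0.
    The decay rate is an illustrative assumption."""
    raw = np.exp(-rate * np.asarray(days_ago, dtype=float))
    floor = np.exp(-rate * horizon_days)
    # max(., 0) ensures weights never become negative beyond the horizon.
    return np.maximum((raw - floor) / (1.0 - floor), 0.0)

# X_train is in chronological order with one sample per day: the oldest
# sample lies len(X_train) - 1 days in the past, the newest 0 days.
weights = sample_weight(np.arange(len(X_train) - 1, -1, -1))
reg = ElasticNet(alpha=0.12, l1_ratio=1.0)
reg.fit(X_train, y_train, sample_weight=weights)
```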
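The described neural network can be written down almost verbatim in PyTorch; only the input dimension and the negative slope of the leaky ReLU (left at PyTorch's default here) are not specified in the text:

```python
import torch
import torch.nn as nn

class MoodNet(nn.Module):
    """Two fully connected hidden layers (16 and 8 units) with leaky
    ReLU activations and one linear output unit for the scalar mood."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16), nn.LeakyReLU(),
            nn.Linear(16, 8), nn.LeakyReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, x):
        return self.net(x)

model = MoodNet(n_features=X_train.shape[1])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1.0)
loss_fn = nn.MSELoss()

X_t = torch.tensor(X_train, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)

# 4125 epochs were determined via early stopping with cross-validation.
for epoch in range(4125):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()
```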
Screenshot (a) of Figure 7 shows how the app visualizes each feature as a time series, allowing the user to spot changes and trends over time through a seven-day moving average. Screenshot (b) of Figure 7 shows an example of a scatter plot, enabling the user to explore visually how two variables relate to each other. In addition, it draws a linear regression line and indicates the degree of the linear relationship by visualizing the correlation coefficient in a bar. Figure 8 shows a black box with whiskers on a scale, which is the mood estimate with its 95% prediction interval. This provides the user not only with the prediction but also with how much they can rely on its accuracy.

Above the prediction in Figure 8, we explain to the user how multiple linear regression calculates the prediction. Each row represents the contribution of one of the selected features. The row contains a red or green bar if the contribution is negative or positive, respectively, and the size of the bar indicates the magnitude of the contribution. The final mood prediction is simply the sum of all contributions. To understand more about the contribution of a feature, the user can tap on it and see either a bar chart or a regression triangle drawn onto the scatter plot. The bar chart shows that the contribution (green or red bar) of a feature is the product of the feature's weight (deep teal) and the difference between today's value and the feature's average value (light teal); a code sketch of this decomposition follows at the end of this section. The triangle drawn onto the scatter plot shows the same information but with more context. The triangle's horizontal edge represents the difference between the average and today's value of the feature. The triangle's slope represents the feature's weight, which corresponds to a regularized regression line of the two variables. Finally, the vertical distance between both lines depicts the contribution.

The bar and triangle chart contain redundant information. Therefore, we conducted a small A/B test with 10 participants to determine which chart conveys the information more accurately and faster. Seven of the participants are male, and three are female. Their ages range from 19 to 32, and all of them have an engineering background. The participants answered four single-choice questions concerning the bar chart (A) and four similar but not identical questions about the triangle chart (B). We ensured proper testing by passively observing the participants. Each participant worked on both part A and part B; however, half of the participants completed part A first and the other half part B first, to control for order effects. We measured the accuracy and the time required to answer the questions of each part and report the averages over participants and questions. As shown in Table II, the bar chart results in a slightly higher accuracy of 90% and a marginally faster completion time of 43 seconds, compared to the regression triangle chart with an accuracy of 85% and 45 seconds. While these results are not significant, the participants also commented that they favor the bar chart, as the length of the bar representing the weight is more accessible than the slope of the triangle, especially if the slope or the triangle is small.

1) Correlated predictors: A principal component analysis shows in Figure 10 that only 25 components explain more than 1% of the total variance. Examples of correlated predictors are:
• all fourfold-aggregated variables (i.e., the mean, median, 5th, and 95th percentile of a variable)
• indoor and outdoor weather
• time in bed and time asleep
• walking minutes, heart points, and exertion points
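A minimal sketch of this principal component analysis with scikit-learn, assuming X is the standardized feature matrix from above:

```python
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)

# Count the components that each explain more than 1% of the total
# variance; strongly correlated predictors keep this number small.
n_informative = int((pca.explained_variance_ratio_ > 0.01).sum())
print(n_informative)  # 25 for the dataset used in this paper
```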
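As promised above, here is a sketch of the contribution decomposition behind the bar and triangle charts. It exploits the identity b + w·x = (b + w·m) + w·(x − m): each feature contributes its weight times the difference between today's value and the feature's average m, on top of the prediction for an average day. Variable and function names are our assumptions:

```python
import numpy as np

def contributions(weights, intercept, today, feature_means):
    """Per-feature contributions of a linear model.
    Each feature contributes weight * (today - average); the prediction
    equals the baseline (the output for an average day) plus their sum."""
    baseline = intercept + weights @ feature_means
    contrib = weights * (today - feature_means)
    return baseline, contrib

# Example with the elastic net fitted earlier; green bars correspond to
# positive entries of contrib, red bars to negative ones.
baseline, contrib = contributions(
    reg.coef_, reg.intercept_, X_test[0], X_train.mean(axis=0)
)
assert np.isclose(baseline + contrib.sum(), reg.predict(X_test[:1])[0])
```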
2) Multiple linear regression versus neural network: Multiple linear regression performed better on the test set than the neural network. Neural networks can have an advantage over multiple linear regression because they can approximate nonlinear relationships. The downside, however, is that they need more training data to optimize the additional parameters. In our case, the training set is probably too small, leading to overfitting and difficulty generalizing to new data. We expect the neural network to increase its performance with a growing dataset. Besides, an enhanced architecture and improved hyperparameters could lead to better predictions of the neural network. An advantage of multiple linear regression is its good explainability, as illustrated in Figure 9, which is less intuitive for neural networks [32].

1) Limited prediction quality: The following factors limit the prediction quality:
• unmeasured variables that influence a person's mood
• noisy sensor data and non-optimal imputation of missing values
• the assumption of multiple linear regression that features relate linearly to mood, neglecting nonlinear mechanisms
• a training set that might be too small, especially for the neural network

2) Assumed linear mood scale: Our method of asking for the user's mood on an absolute scale from 1 to 9 assumes a linear relationship between these values. However, the genuine relationship might not be linear because, for extreme values, there may be a higher degree of change on a comparative scale than on an absolute scale [33]. Comparative surveys would reduce these biases. However, we decided against them because they require more of the user's time.

3) Recency and fading affect bias: Asking for the user's average mood at the end of the day suffers from a potential recency bias [34], where recent events of the evening have a stronger effect than more distant events of the morning. Furthermore, it is prone to the fading affect bias, where negative memories fade faster than positive ones [35]. Asking for a rating multiple times per day would reduce these biases; however, we decided against it because it requires more effort and is less sustainable over long periods.

4) Survey limitations: The results of the survey are not conclusive because of the small number of participants. Furthermore, the selection of participants might not represent the actual distribution of the users w.r.t. age, gender, and background.

5) Causal inference: Features correlating strongly with mood, as well as the predictors of the multiple linear regression, could potentially causally affect mood. While this could be the case and would be interesting to know, correlation does not imply causation, and good predictors are not necessarily causes. For example, the positive correlation between mood and exercise could arise because exercise causes a better mood, or because a good mood leads someone to exercise more. Furthermore, both could affect each other, or there could be a confounder like good weather, which causes someone to exercise more and improves mood independently. In some instances, we can remove directions from a causal graph. For example, when there is a positive correlation between the variable weekend and good mood, it is unlikely that the good mood causes the day to be a weekend. Unfortunately, this study does not allow for causal inference. Today, randomized controlled trials are still the gold standard for establishing causal conclusions [36].

Fig. 11. Sample predictions: predictions are in red, ground-truth values in green. The blue bar represents the 95% prediction interval. The predictions seem to go in the right direction but are too close to the mean at 6.2.

1) Stronger feature selection and weaker weight penalty: Figure 11 shows sample predictions. The predictions seem to go in the right direction but are too close to the mean at 6.2. We hypothesize that the weight penalty is too strong, pushing predictions toward the mean, while some feature selection remains critical, since removing regularization entirely (α = 0) performs worse.
Further evidence is the elastic net's optimal L1 ratio of ρ = 1, meaning that performance is best with the lowest allowed weight penalty while feature selection is maximal. Fortunately, the objective of elastic net regularization, which operates between the L1 and L2 norm, can be generalized to the Lp norm with 0 < p < ∞, for example with a penalty of the form α Σ_j |w_j|^p, allowing an even more extreme ratio of feature selection to weight penalty. Figure 12 shows how feature selection becomes stronger and the penalty on large parameters weaker as p approaches zero. Future work could explore whether prediction performance improves for p < 1, such as regularization with the L0.5 norm (a code sketch follows after point 3). However, this has the downside of leaving the realm of convex optimization [37].

2) Improved data imputation: A common problem in long-term studies is missing sensor data. While we impute missing values with the feature's average or by linear interpolation, we plan to impute with a deep multimodal autoencoder to enable better mood prediction [39].

3) Forecasting tomorrow's mood: This project explored predicting the mood of the same day, but we also plan to forecast tomorrow's mood. A study shows this is possible with a mean absolute error of 10.8 for workers and 17.8 for students, while the mood's standard deviation is 17.14 [40].
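As a sketch of the Lp-regularization idea from point 1, the snippet below fits a linear model with an L0.5 penalty via plain gradient descent. Because the objective is non-convex and the penalty gradient is singular at zero, a small epsilon smooths it; all constants are illustrative assumptions rather than tested settings:

```python
import numpy as np

def fit_lp_regression(X, y, p=0.5, alpha=0.1, lr=1e-3, epochs=5000, eps=1e-8):
    """Linear regression with an L_p weight penalty (0 < p < inf).
    For p < 1 the penalty drives weights toward exact sparsity even more
    aggressively than the lasso, at the price of non-convexity."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad_loss = X.T @ (X @ w - y) / n
        # d/dw |w|^p = p * sign(w) * |w|^(p-1); eps avoids the pole at 0.
        grad_pen = alpha * p * np.sign(w) * (np.abs(w) + eps) ** (p - 1.0)
        w -= lr * (grad_loss + grad_pen)
    return w
```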
In this work, we presented a meta app that can import data from consumer devices and services and also allows for manual tracking. The app lets the user explore the data via plotted timelines and the relationships between variables through scatter plots and correlation coefficients. The app predicts the mood, or any other chosen target variable, by automatically aggregating the data into daily features and selecting the best ones to predict the user's mood. This project shows that multiple linear regression can explain more than half of the original variance. We strive for transparency by conveying information about confidence through p-values and prediction intervals, and we created the app as an open-source project. We hope the app helps users understand themselves better and improve their wellbeing, health, and physical & cognitive performance.

I thank Benedikt V. Ehinger and Tanja Blascheck for helpful discussions and suggestions.

Appendix: Data sources
• Moon illumination
• Daytime (the time between sunrise and sunset, which is shorter in fall/winter and longer in spring/summer)
• OpenWeatherMap, which gives outside weather measurements for the GPS location:
  - temperature
  - heat index (human-perceived equivalent temperature, which includes humidity and wind speed)
  - air pressure
  - humidity
  - wind speed
  - cloud cover
  - precipitation (rain, snow)
• Fitbit fitness tracker (pre-upgrade), which estimates:
  - steps taken
  - burned calories
  - heart rate
  - resting heart rate
  - VO2max (maximum rate of oxygen consumption)
  - sleep revitalization
  - sleep duration score
  - sleep restlessness
  - vertical meters exercised (measured in floors)
• Later upgrade to a Fitbit Sense smartwatch, which adds the following estimates:
  - responsiveness points: a proprietary assessment of how well the sympathetic and parasympathetic nervous systems are in balance, which takes heart rate variability (HRV), elevated resting heart rate (RHR), sleeping heart rate above RHR, and electrodermal activity (EDA) into account
  - exertion points (similar to heart points)
  - the temperature of the wrist during sleep
  - sleep points (sleep rating)
  - stress points (average of sleep, exertion, and responsiveness points)
• Netatmo indoor weather station:
  - temperature
  - humidity
  - CO2 in ppm
  - noise in dB
  - air pressure in Pa
• Nutrition tracking with MyFitnessPal:
  - carbohydrate intake
  - fat intake
  - protein intake
  - sodium intake
  - fiber intake
  - sugar intake
  - cholesterol intake
  - total calorie intake
• Bodyweight (Renpho scale)
• Screentime tracking with RescueTime:
  - total screen time
  - productive screen time
  - distracting screen time
  - neutral screen time

References (titles as recovered):
Bipolar disorder - Symptoms and causes
Positive Mood and Resistance to Temptation: The Interfering Influence of Elevated Arousal
How do we learn in a negative mood? Effects of a negative mood on transfer and learning
The Effects of Positive and Negative Mood on Divergent-Thinking Performance
A Bayesian Deep Learning Framework for End-To-End Prediction of Emotion from Heartbeat
Predicting Depression via Social Media
Predicting students' happiness from physiology, phone, mobility, and behavioral data
Multi-task, Multi-Kernel Learning for Estimating Individual Wellbeing
Multi-task Learning for Predicting Health, Stress, and Happiness
Personalized Multitask Learning for Predicting Tomorrow's Mood, Stress, and Health
Multimodal Privacy-preserving Mood Prediction from Mobile Data: A Preliminary Study
MoodPrism - Mental health and wellbeing app
Moodily - Mood Tracker, Depression Support - Apps on Google Play
MoodPanda - Your supportive mood diary
Daylio - Journal, Diary and Mood Tracker
Depression - Apps on Google Play
Mood Patterns - Mood tracker & diary with privacy
iOS - Health
Pattern - Correlate, Health Diary, Mood-Tracker - Apps on Google Play
Features
Best Heart Rate Variability Monitor & App
Welltory - guide to a life of health and productivity
Stress Management - Stress Watch & Monitoring | Fitbit
Robust estimation of (partial) autocorrelation
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Regularization and variable selection via the elastic net
Empirical Evaluation of Rectified Activations in Convolutional Network
Decoupled Weight Decay Regularization
Automatic early stopping using cross validation: quantifying the criteria
Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models
Assessment of chronic pain. I. Aspects of the reliability and validity of the visual analogue scale
Mitigation of recency bias in audit judgment: The effect of documentation
Chapter Three - The Fading Affect Bias: Its History, Its Implications, and Its Future
Randomised controlled trials - the gold standard for effectiveness research
L1/2 Regularization: A Thresholding Representation Theory and a Fast Solver
Intro to Regularization
Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction
Forecasting stress, mood, and health from daytime physiology in office workers and students