Machine learning of language use on Twitter reveals weak and non-specific predictions | npj Digital Medicine

There is growing interest in the power of artificial intelligence for improving healthcare provision, early intervention, and diagnosis. But large amounts of data are needed to develop and train models, which can be arduous to gather. Social media has been suggested as a convenient and readily available source of such data: social media platforms are in widespread use, users produce high volumes of data regularly and spanning many years, and these data often contain rich personal and emotional information of putative relevance to their mental state. Although several studies have examined this in recent years, there are substantial limitations to the methods in widespread use36,39, including but not limited to the validity of the diagnostic classifications and the rigour of the machine learning methods employed. Here, we collected Twitter data and 9 validated self-report questionnaires assessing mental health from over 1000 participants. We used gold-standard machine learning methods with out-of-sample testing to establish the predictive power of models trained to predict depression and other aspects of mental health, using linguistic features derived from Tweets.

A model developed to predict individual differences in self-report depression explained 2.5% of variance when tested out of sample. A model using only age and gender, however, slightly outperformed the text-feature-only model, illustrating that a similar level of depression prediction can be achieved using just these two data points. It is worth noting that age and gender are not routinely available on Twitter but were gathered as part of our survey. When age, gender, and text features were included in the same model, it still explained only approximately 4% of variance in depression severity. We examined the specificity of this depression model on 8 other questionnaire total scores gathered from the same participants. We found that although the model had some small predictive value for 6 of the other aspects of mental health studied here (generalised anxiety, schizotypy, obsessive-compulsive disorder, eating disorder, apathy, and social anxiety), it was not able to explain variance in alcohol abuse or impulsivity. Furthermore, we found no associations between any aspect of Twitter use, e.g., word count, and the residuals of the depression model’s predictions. Failures in model performance therefore seem to be random, explained neither by lower engagement nor by number of social connections.

We tested whether previously identified transdiagnostic symptom dimensions, which tend to perform better than these questionnaires in fitting cognitive test performance40, might improve signal and/or specificity. This was not the case. After controlling for shared variance among the transdiagnostic dimensions, we found that most text features were specific to the residuals of each dimension. Perhaps most strikingly, 1st person singular pronouns have consistently been found to be a key characteristic of depression-relevant language22,41, but when controlling for shared variance, we found that increased use of 1st person pronouns was actually most associated with the compulsivity and intrusive thought dimension. Overall, generalised anxiety and schizotypy were the best performing models, while the alcohol abuse model had close to zero out-of-sample performance. Hierarchical clustering revealed that language use associated with alcohol abuse and eating disorders was most dissimilar to the other disorders. Most prior social media research has focused on associations between language use and alcohol usage at the group, rather than individual, level42. A possible explanation for the low predictive value of the alcohol model is that few people in our study scored high enough to qualify as alcohol dependent.
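The residual approach described above (removing variance shared with the other dimensions before model fitting) can be sketched as follows. This is a minimal illustration on synthetic data; the array names `dimensions` and `liwc_features` and the choice of a ridge model are assumptions for illustration, not the study's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins: 3 correlated transdiagnostic dimension scores
# and 20 LIWC-style text features per participant (hypothetical data).
dimensions = rng.normal(size=(n, 3)) @ np.array([[1.0, 0.6, 0.4],
                                                 [0.6, 1.0, 0.5],
                                                 [0.4, 0.5, 1.0]])
liwc_features = rng.normal(size=(n, 20))

# Residualise one dimension (say, compulsivity) on the other two,
# removing the variance it shares with them.
target = dimensions[:, 0]
others = dimensions[:, 1:]
residuals = target - LinearRegression().fit(others, target).predict(others)

# Train a regularised model to predict the residuals from text features;
# any signal found is then specific to this dimension, not shared variance.
model = RidgeCV().fit(liwc_features, residuals)
print(residuals.shape)  # one residual score per participant
```

By construction, the residuals are uncorrelated with the other dimensions, so text features that predict them cannot merely be tracking comorbidity.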

A depression classification model trained on the presence of depression-relevant keywords had substantially better predictive performance than a model trained on dichotomised self-reported depression on a validated instrument. Prior studies have shown that regular expressions, i.e., keyword matching, can be used to identify depression with a high degree of accuracy43,44. To our knowledge, however, no studies have compared the relative predictive performance of a depression-keyword-trained model to one trained on depression self-report scores within the same sample. Although self-report measures are more difficult to acquire pragmatically from a large sample, they represent an important and clinically validated ground truth. Our results indicate the potential pitfalls of defining cases of mental illness through keyword-based methods: a form of content-based circularity can arise when social media posts are used to define caseness and then to train and evaluate machine learning models. Our data suggest that persons more likely to discuss depression in Tweets have a distinct pattern of associated language use, but they do not necessarily suffer from clinical depression: only 50% of these participants met the clinical cut-off for depression. These findings underscore the need to use valid ground truth estimates of mental health in developing models of clinical relevance.
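As a toy illustration of the keyword-based approach to defining caseness, and of the circularity it can introduce, consider the sketch below. The pattern and helper function are hypothetical, not the regular expressions used in refs. 43,44:

```python
import re

# Illustrative pattern only; real keyword studies use larger curated sets.
DEPRESSION_KEYWORDS = re.compile(
    r"\b(depress(ed|ion)?|diagnosed with depression)\b", re.IGNORECASE
)

def keyword_labelled(tweets):
    """Label a user as a 'case' if any tweet matches the keyword pattern.

    Note the circularity risk: the same text later fed to the model as
    input also defined the label, so the model can simply relearn the
    labelling rule rather than detect clinical depression.
    """
    return any(DEPRESSION_KEYWORDS.search(t) for t in tweets)

print(keyword_labelled(["I was diagnosed with depression last year"]))  # True
print(keyword_labelled(["Long day at work, need sleep"]))               # False
```

A user matched by such a pattern is talking about depression, which, as the results above show, is not the same as meeting a clinical cut-off for it.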

Exploratory analyses found that elevated rates of replying and Tweeting were broadly associated with mental health, correlating with all questionnaire total scores, except alcohol abuse and eating disorders. Inconsistent evidence exists around whether people with greater depression severity are more8 or less5 active on social media. We found that participants with greater obsessive-compulsive severity had more followees while people with more severe eating disorder symptoms had more account followers. Depressed individuals have consistently been shown to Tweet more at night than during the day5,45,46. Later Tweet times were associated with depression severity, but also apathy, impulsivity, obsessive compulsive symptoms, and schizotypy. Impulsivity had the strongest association with the insomnia index, in line with prior research showing a positive association between impulsivity and sleep disturbances47,48. Besides findings related to the number of followees and followers, Twitter metadata, like language use, was generally not specific to any one mental health condition.

People with depression have been found to use language differently from healthy controls. Most studies, however, compare people with one mental health disorder to healthy controls22,25,49,50,51; few have examined the specificity of different aspects of language use across disorders. The non-specific patterns of language use observed both in prior work29,30,31 and in the current study are likely related to the high comorbidity rates among disorders. We found that only by removing the shared variance among disorders could we identify which aspects of language use were specific to each mental health dimension. Major depressive disorder is positively associated with a variety of other mental health conditions including panic disorder, agoraphobia, generalised anxiety disorder, post-traumatic stress disorder, obsessive-compulsive disorder, and separation anxiety disorder52. For example, a patient diagnosed with major depression is 8.2 times more likely to have a concurrent diagnosis of generalised anxiety than someone without depression53. In our study, we found that depression and anxiety had the most similar language use of any pair of disorders. Depression symptoms overlap strongly with those of other disorders and are associated with numerous symptoms in other diagnostic categories54. In a network of Diagnostic and Statistical Manual of Mental Disorders-IV symptoms, depression symptoms (insomnia, psychomotor agitation/retardation, and depressed mood) were the most connected, with links to over 28% of other symptoms in the network55. The spread of symptoms across disorders makes it unlikely that individual text features, or even combinations of text features, could ever be specific to categorical disorders, a finding in line with the growing consensus that these diagnostic categories are overlapping and warrant revision28.

Social media is not a one-way street. While the content of social media posts reflects the underlying mental health of the user, interactions on the platforms, both passive and active, can act to either improve or worsen mental health. When users experience a stressful event, they are more likely to disclose this information on social media. Self-disclosure has been shown to moderate the adverse effects of a stressful event and to lead to enhanced life satisfaction and lower depression via enhanced social support56. However, in a separate study, Reddit users who transitioned to talking about suicide had elevated levels of self-disclosure but received less social support and engagement than users who did not57. Furthermore, specific types of social support are more likely to lead to improvements in mental health, e.g., use of the phrase ‘be tough’58. Increasing awareness about these types of comments would help friends, family, and content moderators to know what to say, and what not to say, to someone experiencing mental health difficulties. While there are benefits to self-disclosure, these can only be realised if the user is able to communicate free of stigma and receive adequate support. The effects of self-disclosure on social media highlight the need to follow users longitudinally and to consider factors beyond language use, e.g., social network structure, when predicting mental health. Considering the availability of online social support could help triage users with the same predicted risk of mental illness; users with less social support should be prioritised for receiving help.

Most Twitter data are generated by a small subset of users: 80% of Tweets are written by only 10% of users59. We found some evidence that machine learning language models perform slightly better when trained on subsets of users with more Tweets. This might suggest that in an even more select sample, e.g., those in the top 10% of users overall, one could produce more reliable predictions. However, two things are important to remember here. First, even in our top quartile, the variance explained only rose to a high of 4.3%; additional gains are unlikely to take this to the realm of real-world clinical utility. Similarly, increasing the minimum word count per user only modestly affected the percent variance explained: at a minimum threshold of 400 words, 6.4% of variance was explained, while a threshold of 500 words performed worse at 3.4%. Second, those users are not representative of social media users in general, so even if such performance could be achieved, these models are unlikely to be generalisable. An interesting possibility is that the signal may be more meaningfully improved if private sources of text, such as text messages, could be harnessed. This would have the additional benefit of increasing the amount of data available for each user while simultaneously being more relevant to a user’s true mental health status.

Although we demonstrated that social media data has low predictive power at an individual level, this should be contextualised as part of the broader landscape of effect sizes in mental health science. For example, well-established correlates of mental health problems such as adverse childhood experiences yield an area under the curve of only 0.58 in predicting mental health problems at age 1860. A recent preprint showed that resting-state and structural brain-wide associations to psychopathology are exceedingly small, with no reliable correlation exceeding 0.1661. That these observations lack value as individual predictors does not make them devoid of meaning. Mental health is exceedingly complex, and combinations of a range of multimodal data sources will likely be required to transform these small effects into meaningful N-of-1 predictions. Twitter data, by itself, has already proven an interesting testbed for nascent theories of mental health such as network theory, which, for example, has struggled to acquire large enough longitudinal datasets to test some of its core predictions62. We recently found, for example, that using social media posts as a proxy for experience sampling allowed us to study a large cohort of individuals through a transition to a depressed state, detecting subtle network signatures of depression vulnerability63.

Mental health detection from social media offers the potential for generating continuous insights into mental health at the population and individual level, but also poses a unique set of ethical challenges. Large-scale analyses of social media data are typically exempt from requiring participant consent due to the public nature of the data and the lack of experimental intervention. Because of this exemption, social media users are often unaware of whether or not their data are included in research and, when asked, tend to be uncomfortable with the idea that their Twitter data could be used for research purposes without their knowledge64. While it is impractical to ask for consent in all circumstances, requiring consent whenever possible ensures that participants have safeguards for how their data are used. Predicting an individual’s mental health outside of a clinical context inherently poses the question of how to act on that information and whether there is in fact an obligation to act65. Unlike clinicians, software developers are not obligated to intervene if their algorithm detects that a person is struggling with their mental health. If the developers are not obligated to intervene, would the burden fall on family members, friends or the individuals themselves? Even if a patient consented to having their social media feed monitored by their physician, a high rate of false positives could overwhelm a clinician and impede their ability to effectively allocate care. Furthermore, there is a potential for misuse of mental health predictions by bad actors who do not consider the best interests of the user. Passive and automatic detection of mental illness could lead to targeted advertisements of prescription medication66 or result in an increase in health insurance premiums. A final concern relates to algorithmic bias based on the data used to train these models.
Social media users tend to be younger, more affluent, and hold more left-leaning political views than the general population59,67. Furthermore, social media research is strongly focused on predominantly English-speaking countries, yet there is evidence that people from different cultures behave differently online; for example, users from China and India post questions online more frequently than users from the US and UK68. Extrapolating models to users very different from those the models were trained on could lead to systematic biases that impact predictive performance for groups not included in the training data.

Prior studies have had larger Twitter datasets in terms of the number of posts per user. For example, de Choudhury et al.5 had a mean of 4500 posts per user in a 1-year period, while our study had a mean of about 1100 posts, including likes, per user. As mentioned above, models perform better when they are provided with more training data per user. Indeed, that study achieved greater predictive power than reported here. However, there were other differences across our studies too: our sample was twice as big, and we used an independent training set to build our model and then evaluated it on a held-out test set. Compared to simple K-fold cross-validation using the entire dataset5, this procedure is less likely to overfit the data and overestimate predictive performance. Another potential limitation of our study is that our text feature analysis was restricted to categories from the LIWC library. Some evidence exists that more data-driven approaches, e.g., topic analysis, could slightly improve predictive ability over closed libraries18,69. More sophisticated machine learning models, such as convolutional neural networks, have the potential to outperform more commonly used algorithms, although with the limitation of needing substantially more data70. While these methods might indeed yield improvements in performance, the use of LIWC has key advantages. LIWC is a closed library that has been well-validated and studied across a range of communication media from diary entries22 to spoken word71. This means that the numerical values and classifications assigned to individual words in LIWC do not change from dataset to dataset, as is often the case with topics and neural networks72,73. This makes the insights derived here more reproducible and generalisable to new datasets that may be of keen interest in the future, such as text messages and email communications.
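The evaluation procedure referred to above, fitting on an independent training set and scoring once on a held-out test set, can be sketched as follows on synthetic data. The feature and outcome arrays, the ridge model, and the split proportion are placeholders for illustration, not the study's actual configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))            # stand-in for per-user LIWC features
y = 0.2 * X[:, 0] + rng.normal(size=1000)  # deliberately weak-signal outcome

# Single held-out split: the test set is never touched during fitting.
# K-fold CV on the full dataset instead reuses every observation for both
# model selection and evaluation, which risks overestimating performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RidgeCV().fit(X_train, y_train)
print(round(r2_score(y_test, model.predict(X_test)), 3))
```

The printed out-of-sample R² is small, as expected when the true signal explains only a few percent of variance, mirroring the effect sizes reported above.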

Regarding the choice of social media platform, it is nonetheless a limitation that our study was confined to Twitter. Recent evidence has shown that Facebook may be more predictive of mental health conditions than Twitter74. We selected Twitter because it is the most used social media platform for studying mental health, comprising approximately 40% of studies on the subject, while Facebook makes up only about 8%36. It remains a limitation that these results could reflect a relative lack of predictive performance that is particular to Twitter. Because we did not have binary diagnostic information, we did not attempt multi-class classification of mental health diagnoses, i.e., assigning participants a label of depression vs. anxiety, obsessive-compulsive disorder, etc. Instead, we tried to continuously predict a participant’s score on a range of self-report questionnaires probing different aspects of mental health. Therefore, rather than differentiating users with one diagnosis or another, we instead attempted to quantify the similarity of language use between self-report symptoms of highly comorbid conditions. We think this dimensional approach has many advantages, but it creates a limitation in how directly these data can be applied to diagnoses assigned by a clinician. Finally, subjects in this study reported mental health symptoms at the point of study entry, and we analysed data corresponding to the 12-month period directly prior to this. This necessitates taking a ‘trait’ perspective on the mental health symptoms we assessed, and it is likely that our model is diluted by variations in state/episodic features of depression. However, in a recent study, we found that individuals’ use of depression-relevant text features did not in fact change significantly across within-subject periods of mental health and wellness, suggesting this may not be a major issue63.

We found that language use patterns on Twitter that relate to depression symptom severity cannot be used to develop predictive models with high accuracy on an individual subject basis. A model trained to predict depression is also non-specific, being additionally predictive of several mental health symptom profiles. Although performance was poor at the individual subject level, the effect sizes observed are not out of proportion with other routinely studied cross-sectional observations in psychiatry. The addition of age and gender improved performance of our depression model, suggesting that the combination of various sources of multimodal data (with individually small effect sizes) is a viable path forward to improving the predictive power of this class of models. Furthermore, controlling for other mental health conditions and training models on the resultant residuals is a promising method for finding aspects of language use specific to a given condition. To our knowledge, ours is the first study to train machine learning algorithms on the residuals of mental health dimensions in order to identify unique patterns of language. This approach highlights the benefits of using self-report questionnaires to measure mental health, since it is not possible for studies with a binary classification of cases, i.e., healthy control vs. case, to account for shared variance between disorders. Although classification studies are able to identify cases of mental illness, they are unlikely to be able to determine which aspects of language are specifically and uniquely different in a particular condition. Determining specific changes in language patterns and use is crucial to the utility of text data for diagnostic purposes, regardless of data source.

Nevertheless, we do not believe that social media should be used in a diagnostic setting, both because of privacy concerns on behalf of the user and because of the relatively low quality of prediction. Despite the low signal, by virtue of the availability of large amounts of data, the analysis of social media data remains a useful tool for testing theories of mental health that are difficult to test using conventional means. Should people be concerned that their mental health status can be unintentionally revealed by the content of their Tweets? We think the data do not support this as a meaningful risk at present.