Scalable and Quantifiable Mental Health Signals


First, we replicated previous findings that used depression scales and social media data to find linguistic differences between depressed people and neurotypical controls. We then demonstrated that machine learning models could differentiate depressed individuals from their age- and gender-matched controls by examining only their social media data.

Some of the signals these models depend on to make that distinction led to interpretable explanations and insight into the relationship between language and mental health.6,7 For example, depressed people tend to use “I” more than their neurotypical counterparts; this may be an effect of depressed people being more inwardly focused and/or an indication that they partake in solitary activities more often (thus some activities that might otherwise have been described using “we” are instead described using “I”).
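To make the pronoun signal concrete, the following is a minimal sketch of how a first-person pronoun rate might be computed from a user's posts. It is not the feature extraction used in the cited studies: the word lists, the tokenizer, and the function names (`tokenize`, `pronoun_rates`) are illustrative assumptions, and the example posts are invented.

```python
import re

# Illustrative word lists (an assumption, not the lexicon used in the cited work).
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
FIRST_PERSON_PLURAL = {"we", "us", "our", "ours", "ourselves"}

# A word is a run of letters, optionally followed by an apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text, normalize curly apostrophes, and return word tokens."""
    normalized = text.lower().replace("\u2019", "'")
    return TOKEN_PATTERN.findall(normalized)


def pronoun_rates(posts):
    """Return singular and plural first-person pronoun rates per token."""
    tokens = [tok for post in posts for tok in tokenize(post)]
    if not tokens:
        return {"singular": 0.0, "plural": 0.0}
    # Drop the contraction suffix so that "i'm" counts as "i" and "we're" as "we".
    stems = [tok.split("'")[0] for tok in tokens]
    singular = sum(stem in FIRST_PERSON_SINGULAR for stem in stems)
    plural = sum(stem in FIRST_PERSON_PLURAL for stem in stems)
    return {"singular": singular / len(tokens), "plural": plural / len(tokens)}


if __name__ == "__main__":
    # Hypothetical posts, invented for illustration only.
    solitary = ["I went to the movies by myself.", "I'm staying in tonight."]
    social = ["We went hiking and our dog came along."]
    print("solitary:", pronoun_rates(solitary))
    print("social:", pronoun_rates(social))
```

A rate like this would be only one of many features available to a classifier, and a difference between two groups in such a rate does not by itself establish why the difference arises.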

We then demonstrated that the outputs of these models have some predictive validity. For example, post-traumatic stress disorder (PTSD)-like language was more prevalent at US military installations than in civilian areas, and bases whose members deployed frequently as “boots-on-the-ground” to Iraq and Afghanistan had higher rates of PTSD-like language than bases whose members deployed less frequently.8 Similarly, people with eating disorders tend to use more language focused on anxiety and eating.6

Taken together, these findings indicate that there are some quantifiable signals relevant to mental health that we are able to capture via automated analysis. Most of the research in this area has focused on common conditions (eg, depression) as they are prevalent enough that random samples of the general population can yield sufficient numbers to support machine learning. However, our techniques also work for significantly rarer conditions like schizophrenia9 or eating disorders.6