(Un)natural Language Processing
Samuel Johnson (1709-1784) is described by the Oxford Dictionary of National Biography as “arguably the most distinguished man of letters in English history”. He was a lexicographer as well as a poet, essayist, moralist and literary critic, and his A Dictionary of the English Language, first published in 1755, continues to influence the English we speak today.
When Johnson wrote that “Every man is of importance to himself, and, therefore, in his own opinion, to others”, he may have foretold our modern-day obsession with social media and the information that can be gleaned from online posts.
The valuable information within a Facebook status, a product review on Amazon or a tweet is largely contained in text. Our brains are able to interpret a wealth of information on mood, sentiment and context from this text, but the process through which we translate language into information is extremely complex and not easily replicated in machine code. Automated natural language processing can turn text into data without the need for human interpretation, but this is no easy feat and its success varies widely.
LOL and other words
We increasingly turn to social media platforms to express our opinions on a vast range of issues including products, companies, politics and relationships. For example, the annual How Africa Tweets report has shown a 34-fold increase in the number of tweets originating in Africa between 2012 and 2015, with 10% of tweets centered on political issues. But while English is the de facto language of social media in Africa, regional dialects, slang and the use of non-English words can result in content that would be uninterpretable to modern English speakers, let alone to Samuel Johnson.
Companies and governments now try to stay in touch with public opinion and respond to users in real time. To do this, however, incoming text data needs to be processed automatically, efficiently and accurately, and translated into actionable insights. Two main analytical approaches are used to interpret the sentiment of text from social media, and each requires adaptation before it can be applied to text that does not conform to typical English language norms and standards.
Lexicon-based methods use an existing lexicon or dictionary of words, each with an associated numerical sentiment or mood score, to classify text (e.g. disappointed = -1). After summing up all the scores in a body of text and accounting for intensifiers (e.g. very disappointed = -1 × 2) or negation (e.g. not disappointed = -1 × -1) we arrive at an overall score. The stumbling block in using lexicon-based methods is the reliance on an existing lexicon of words and their sentiment. Currently we are limited to lexicons for a few widely spoken languages. Expanding a lexicon to incorporate regional or industry-appropriate words and their associated sentiment is conceptually simple, but it requires painstaking effort and usually demands manual encoding.
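The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production scorer. Apart from "disappointed = -1", which comes from the example above, the lexicon entries, the intensifier weights and the function name `score_text` are assumptions made for the sketch. It also assumes that an intensifier or negation only affects the word immediately after it.

```python
import re

# Toy lexicon. Only "disappointed" = -1 comes from the article;
# every other entry and weight is an illustrative assumption.
LEXICON = {"disappointed": -1, "terrible": -2, "good": 1, "great": 2}
INTENSIFIERS = {"very": 2, "extremely": 3}  # multiply the next score
NEGATIONS = {"not", "never"}                # flip the sign of the next score


def score_text(text):
    """Sum the lexicon scores in `text`, applying modifiers to the next word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    multiplier = 1
    for token in tokens:
        if token in INTENSIFIERS:
            multiplier *= INTENSIFIERS[token]
        elif token in NEGATIONS:
            multiplier *= -1
        else:
            # Words missing from the lexicon contribute 0.
            total += LEXICON.get(token, 0) * multiplier
            # Simplifying assumption: modifiers do not carry past one word.
            multiplier = 1
    return total


for phrase in ("disappointed", "very disappointed", "not disappointed"):
    print(phrase, "=>", score_text(phrase))
```

A word that is missing from `LEXICON`, such as a slang term or a regional expression, simply scores zero, which is exactly the limitation described above: the method is only as good as the lexicon behind it.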
Machine learning methods, on the other hand, automatically associate words with their meaning or sentiment by comparing the frequency of words in a piece of text to a known quantification of the sentiment or feeling expressed by the text as a whole. After ‘training’ the method using texts with known sentiment, new texts can be classified using the established relationships between words and the overall feeling. The strength of this approach over lexicon-based methods is that the text classifier can be adapted to a new dialect or context by changing the training data to include pre-classified texts written in that dialect or context. However, this again requires manual intervention to interpret sentiment, though at the level of whole texts rather than word by word.
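As a rough illustration of learning sentiment from word frequencies, the sketch below uses a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. Naive Bayes is one common choice for this kind of task and is used here purely as an example; the article does not name a specific algorithm. The training texts, the labels and the function names `train` and `classify` are all invented for the sketch.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_texts):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of training texts
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-classified texts; a real project would need many more.
TRAINING = [
    ("great phone love it", "positive"),
    ("really good service", "positive"),
    ("terrible battery very disappointed", "negative"),
    ("bad service never again", "negative"),
]
MODEL = train(TRAINING)

print(classify("good phone", MODEL))
print(classify("very bad battery", MODEL))
```

Adapting this classifier to a new dialect means replacing `TRAINING` with labelled texts in that dialect; the code itself does not change, but someone still has to label those texts by hand.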
If we are fortunate enough to be classifying texts written in a widely spoken language, on a topic for which a wealth of data already exists (English-language movie reviews are one example), our task is greatly simplified by the existing lexicons and pre-classified texts that are applicable.
There’s nothing natural about natural language processing
Unfortunately, most natural language processing tasks, including some of our current projects, do not fit this description. Language is constantly evolving and mutating through time and place. Words take on new meanings, meanings find new words and entirely new words come into existence. Delivering reliable tools for the rich interpretation of social media text across geographies is challenging and depends on methods that are robust to this evolution. However, as Samuel Johnson once said, “Few things are impossible to diligence and skill. Great works are performed not by strength, but perseverance.”