6th February, 2018
By Phil Massie
Racial and gender bias are unfortunately very much a part of many people’s daily lives. From unrepresentative workforces and pay gaps, to unfair legal verdicts and denial of credit, biased decision-making can have wide-ranging impacts within and upon our societies. There are benefits to farming these sorts of decisions out to Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These benefits include increased speed and reduced cost. It is tempting to think that algorithms are supremely objective. Surely people’s individual biases, assumptions and experiences play no part in an algorithm’s decisions… right?
Sadly, this is often not true. In many cases the algorithms underlying AIs are trained on historical data that themselves incorporate historical biases. This often comes about from a lack of the time, money or motivation needed to collect newer, more representative data. So if the bias is in the data, the bias will be in the model.
For instance, social media news feeds will typically filter news based on our preferences and previous behaviours. This tends to result in reinforcement of our beliefs through the exclusion of contradictory findings or opinions.
Word embeddings, a common approach that represents word or phrase associations as vectors, are often used to improve performance in natural language processing tasks. These word embedding data sets are, however, trained on existing corpora of text and as such reflect the biases in those corpora. For instance, researchers showed that commonly used word embeddings trained on Google News articles incorporated ‘gender stereotypes to a disturbing extent.’ Because of the popularity of these data sets, the biases find their way into all manner of applications.
In some districts in the United States, the legal system uses an AI called COMPAS which predicts a defendant’s likelihood of committing future crimes. Superficially, the model appears to perform well, with high-risk repeat offenders correctly predicted at similar rates across races. ProPublica, however, found some problems with the system when they looked a little deeper. The model’s false positives (when the model predicted that a defendant would reoffend but they didn’t) fell disproportionately on black defendants, while its false negatives (when the model predicted that a defendant wouldn’t reoffend but they did) fell disproportionately on white defendants.
Simply put, black defendants were more likely to be wrongly classified as high risk and white defendants were more likely to be wrongly classified as low risk. The systemic racial bias evident in the US criminal justice system’s historical data was manifesting in the model’s predictions. The model developers responded to ProPublica’s article and ProPublica responded again. The exchange is an interesting read if you have the time.
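To make the distinction concrete, here is a minimal sketch of the kind of check described above: computing false positive and false negative rates separately for each group. The records are invented toy values, not COMPAS data, and the function name and the group labels "A" and "B" are hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Compute false positive and false negative rates per group.

    `records` is an iterable of (group, predicted_high_risk, reoffended)
    tuples. This layout is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, predicted, actual in records:
        if predicted and actual:
            counts[group]["tp"] += 1
        elif predicted and not actual:
            counts[group]["fp"] += 1  # flagged high risk, did not reoffend
        elif not predicted and actual:
            counts[group]["fn"] += 1  # flagged low risk, did reoffend
        else:
            counts[group]["tn"] += 1

    rates = {}
    for group, c in counts.items():
        did_not_reoffend = c["fp"] + c["tn"]
        did_reoffend = c["fn"] + c["tp"]
        rates[group] = {
            "false_positive_rate": c["fp"] / did_not_reoffend if did_not_reoffend else None,
            "false_negative_rate": c["fn"] / did_reoffend if did_reoffend else None,
        }
    return rates


# Invented toy records: (group, predicted_high_risk, reoffended).
TOY_RECORDS = (
    [("A", True, True)] * 3 + [("A", False, True)] * 1
    + [("A", True, False)] * 2 + [("A", False, False)] * 2
    + [("B", True, True)] * 2 + [("B", False, True)] * 2
    + [("B", True, False)] * 1 + [("B", False, False)] * 3
)

if __name__ == "__main__":
    for group, r in sorted(error_rates_by_group(TOY_RECORDS).items()):
        print(group, r)
```

In this toy data the two groups have the same overall accuracy (five of eight predictions correct), yet group A is wrongly flagged as high risk twice as often as group B, while group B is wrongly flagged as low risk twice as often as group A. A single headline metric can hide exactly this sort of asymmetry.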
Avoiding these biases requires having clear views of both the prejudicial signals in data sets and their potential impacts. Establishing and maintaining this sort of overview allows us to navigate around entrenched biases when training models, (re)engineer the biases out of our data and recognise biased decisions when they occur.
For instance, the team who quantified the inherent bias in the word embedding data have developed approaches to linearly separate legitimate gender associations, like ‘female’ and ‘queen’, from stereotyped ones, like ‘female’ and ‘receptionist’. By re-engineering the data, they were able to significantly reduce gender bias in the data without compromising its value. They have also developed metrics for measuring inherent bias in embeddings and are investigating applying their approach to racial and cultural biases as well.
It is also important to understand what we are trying to avoid. In the criminal justice model example, involving civil rights advocates during model development may have enabled the developers to circumvent the data’s inherent biases by recognising the relevant patterns in the results. After all, data scientists aren’t necessarily well versed in civil rights issues.
Biases present in society are likely to manifest in models trained to mimic society. As we deploy algorithms into increasingly important decision-making roles, it becomes incumbent upon us to acknowledge systemic social biases so that we can account for them in our models. Unless we recognise such problems, we can never hope to avoid them.
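One way to picture this kind of re-engineering is to estimate a "gender direction" in the embedding space and remove that component from words that ought to be gender-neutral. The sketch below is a simplified illustration of that idea, not necessarily the team's exact procedure: the three-dimensional vectors are invented toy values, the direction is taken from a single he/she pair, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


def gender_direction(embeddings, male_word="he", female_word="she"):
    """Estimate a gender direction from one word pair (a simplification)."""
    male, female = embeddings[male_word], embeddings[female_word]
    return unit([m - f for m, f in zip(male, female)])


def neutralize(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    projection = dot(vector, direction)
    return [a - projection * d for a, d in zip(vector, direction)]


# Invented toy embeddings; real embeddings have hundreds of dimensions.
EMB = {
    "he": [1.0, 0.0, 0.5],
    "she": [-1.0, 0.0, 0.5],
    "queen": [-0.9, 0.8, 0.1],         # gendered by definition: left alone
    "receptionist": [-0.6, 0.1, 0.9],  # should be gender-neutral
}

if __name__ == "__main__":
    g = gender_direction(EMB)
    before = dot(EMB["receptionist"], g)
    after = dot(neutralize(EMB["receptionist"], g), g)
    print("gender component before:", before)
    print("gender component after:", after)
```

Words that are gendered by definition, such as ‘queen’, are deliberately left untouched here; only the occupation word is projected off the gender direction, which is what lets the useful associations survive while the stereotyped one is removed.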
At Ixio Analytics, we always integrate closely with our clients and stakeholders to avoid promoting prejudicial biases.