You’re relaxing at night after a long day’s work, whiling away the time by scrolling through Facebook, Instagram or Twitter on your phone. Suddenly, you see a comment so deeply offensive that you’re shocked anyone could have the heart to write it. Has this happened to you? Chances are it has. Social media is filled with hate speech that is sexist, racist, homophobic and antisemitic.
But why does such content not get removed? After all, we hear so much about powerful AI that can write stories, create deepfakes and understand sentiment; why can the same AI not instantly delete hate speech, or better yet, prevent it from being posted at all? The answer is rather surprising. It turns out that it is incredibly easy to circumvent AI and mislead it into thinking that hateful content isn’t actually hate speech.
A study conducted by Rajvardhan Oak, an expert in machine learning systems for cybersecurity, shows that by manipulating words, spacing and punctuation in text, an adversary can fool even the most powerful classifier. And one doesn’t need deep technical expertise to do it, either; Oak’s experiments showed that a trick as simple as dropping a letter from an offensive word leads the AI to classify the text as not offensive. Drawing on an example from Oak’s paper, the tweet “go back to where you came from, these fucking immigrants are destroying america” is scored as toxic (and rightly so) with 95% confidence by an AI classifier. However, Oak demonstrates that the sentence “go back to where you came from, these fuking immgrants are destroyin america” (note the letters dropped from some words!) is scored as non-toxic with 63% confidence by the same AI model! Surprising, isn’t it?
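To see why such a tiny change works so well, consider how many text classifiers represent their input: as a bag of known words, each carrying a learned weight. The sketch below is purely illustrative (pure Python, with made-up weights; it is not Oak’s model or data), but it captures the failure mode: a misspelled word falls outside the classifier’s vocabulary, so the toxic signal it carried simply disappears.

```python
# Hypothetical per-word toxicity weights -- invented for illustration only.
TOXIC_WEIGHTS = {"fucking": 0.9, "destroying": 0.4, "immigrants": 0.2}

def toxicity_score(text: str) -> float:
    """Sum the weights of known toxic tokens; unknown tokens contribute 0."""
    return sum(TOXIC_WEIGHTS.get(tok, 0.0) for tok in text.lower().split())

original = "go back to where you came from, these fucking immigrants are destroying america"
perturbed = "go back to where you came from, these fuking immgrants are destroyin america"

print(toxicity_score(original))   # 1.5 -- all three toxic tokens match
print(toxicity_score(perturbed))  # 0.0 -- every toxic token is now out-of-vocabulary
```

Real classifiers are far more sophisticated than this toy scorer, but the underlying weakness is the same: a model can only weigh patterns it has seen during training, and “fuking” is, to it, a brand-new word with no history.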
The study of how to fool AI is a highly technical branch of machine learning known as adversarial machine learning. Much research has been done into choosing the perfect manipulations to apply to a sentence so that it is misclassified as non-toxic, or to slightly modify an image at the pixel level so that the picture of a monkey is detected as a human face. But Oak’s work shows that anyone with a computer can post hate speech and evade detection by misspelling words or splitting them into two parts – they do not need to do the research or understand the science behind it.
So where does this leave us, and what are we to do? Well, for starters, we need to stop treating AI as an all-knowing God, and we should not leave ourselves completely at a model’s mercy. Just as AI picks up trends that humans cannot, humans recognize things that might fool an AI (no human in their right mind would call the second sentence in the example above non-toxic). We need to advocate for human-in-the-loop decision making, where AI is used simply as a diagnostic tool but the ultimate decision is left to a human being. And as users of social media, we must diligently report offensive content when we find it – this helps fine-tune the platform’s systems that decide how much to trust the AI. At the end of the day, Artificial Intelligence is just that; it’s artificial.