One of the most common misconceptions about bias in the world of Machine Learning is: “If I don’t use age, gender, race, or similar factors in my model, it is not biased.” Well, that is simply not true.
Even though the people holding this opinion know that Artificial Intelligence can ‘learn’ and compute relationships between data, they do not realize that other captured features can act as proxies for those sensitive data types. These proxies are called confounding variables and, as the term indicates, such unintended variables can confuse the model into producing biased results.
For example, if a model includes the brand and version of an individual’s mobile phone, that data can reveal the ability to afford an expensive phone, a characteristic that can imply a certain level of income. If income is not a factor we want to use directly in the decision, imputing that information from data such as the type of phone or the size of the purchases an individual makes introduces bias into the model. High purchase amounts can indicate that an individual is likely to make similar transactions over time, again imputing income bias.
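This proxy leak is easy to demonstrate on synthetic data. The sketch below is purely illustrative (the variable names, distributions, and coefficients are assumptions, not figures from any real study): a model is trained without ever seeing income, yet because phone price tracks income, the model's predictions remain strongly correlated with the excluded attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical synthetic data: income is the sensitive attribute we exclude.
income = rng.normal(50_000, 15_000, n)

# Phone price is a proxy: it tracks income, plus some noise.
phone_price = 0.01 * income + rng.normal(0, 50, n)

# The outcome (e.g. an approval decision) is secretly driven by income.
outcome = (income > 55_000).astype(float)

# "Bias-free" model: income is dropped, only phone_price is used.
X = np.column_stack([np.ones(n), phone_price])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
pred = X @ beta

# The predictions still correlate strongly with the excluded attribute.
corr = np.corrcoef(pred, income)[0, 1]
print(f"correlation between predictions and excluded income: {corr:.2f}")
```

Dropping the sensitive column changed nothing of substance: the model simply reconstructed income from its proxy.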
Research into the effects of smoking provides another example of confounding variables. In decades past, research essentially drew the correlation that if you smoke, your probability of dying in the next four years is fairly low, which could be read to mean that smoking is fine. The confounding variable in this conclusion was the age distribution of smokers. At the time, the smoking population contained many younger smokers whose cancers would only develop later in life; the older smokers were already deceased and thus did not make up part of the data sample. The analytic model therefore contained overwhelming bias and created a distorted perception of the safety of smoking.
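The survivorship effect described above comes down to simple arithmetic. In this sketch, the mortality rates and sample sizes are invented for illustration only: the same per-age-group risks produce a reassuringly low overall rate when the sample skews young, and a much higher rate in a balanced sample.

```python
# Hypothetical 4-year mortality rates by age group (illustrative numbers).
mortality = {"young": 0.01, "old": 0.20}

# Biased historical sample: older smokers are already deceased,
# so the surveyed smoking population skews heavily young.
biased_sample = ["young"] * 900 + ["old"] * 100

# A representative sample of the full population is more balanced.
balanced_sample = ["young"] * 500 + ["old"] * 500

def expected_mortality(sample):
    """Average 4-year mortality over a sample of age groups."""
    return sum(mortality[age] for age in sample) / len(sample)

print(expected_mortality(biased_sample))    # roughly 0.029: smoking "looks fine"
print(expected_mortality(balanced_sample))  # roughly 0.105: far riskier
```

The per-group risks never changed; only the composition of the sample did. That is the confounder doing the work.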
In the 21st century, a model could produce similar bias by concluding that, since far fewer young people smoke cigarettes than 50 years ago, nicotine addiction levels must be down too, overlooking that nicotine now reaches consumers through products other than cigarettes. The challenge of delivering truly ethical AI requires closely examining each data class separately.
As data scientists, we must demonstrate that AI and machine learning technologies do not subject specific populations to bias, and we must actively search for confounding variables. To reach that goal, the relationships learned by machine learning and AI need to be exposed. This is part of the broader trend toward Responsible AI.
Explainability is paramount to the responsible use of AI and machine learning, and fortunately, algorithms for explaining machine learning models go back more than 30 years. Now is the time to implement them broadly, before unregulated algorithms spread.
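One simple, long-established way to expose what a model has learned is permutation importance: shuffle one feature at a time and measure how much the model's error grows; a large increase means the model leans heavily on that feature. The sketch below is a minimal illustration on synthetic data, where the feature names and the linear model are assumptions chosen to echo the proxy example above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical features: one is an income proxy, the other pure noise.
income_proxy = rng.normal(0, 1, n)
noise = rng.normal(0, 1, n)
y = income_proxy + 0.1 * rng.normal(0, 1, n)

# Fit a simple linear model on both features.
X = np.column_stack([income_proxy, noise])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(features):
    return np.mean((features @ beta - y) ** 2)

baseline = mse(X)

# Permutation importance: shuffle each column in turn and record
# how much the error increases relative to the unshuffled baseline.
increases = {}
for i, name in enumerate(["income_proxy", "noise"]):
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])
    increases[name] = mse(Xp) - baseline
    print(f"{name}: error increase = {increases[name]:.3f}")
```

Shuffling the income proxy degrades the model sharply while shuffling the noise column barely matters, exposing exactly which relationship the model relies on, which is the kind of transparency Responsible AI demands.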
By Scott Zoldi, FICO. He writes in his own capacity.