Is data mining, in principle, discriminatory?

We live in what is referred to as the information age, and we’ve seen rapid development in science and technology. Advancements in technology shape our future in powerful and largely unaccountable ways. Are these advancements inevitable, or can we control the technologies that we get, anticipate their implications, prevent hazards and share their benefits? Recently, the main focus has been on artificial intelligence and machine learning. Machine learning algorithms are in our homes and our hands all the time. We’re used to asking Siri questions and expecting answers within seconds, and we’re used to ordering through Alexa. We expect recommendations based on our preferences, and the list goes on. But how is all this possible? How do these machine learning algorithms work? It seems that we only ask these kinds of questions when there seem to be problems, or when these devices don’t make the right choices for us.

Nowadays, we’ve seen problems with algorithms that operate in more sensitive domains, such as the criminal justice system and the field of medical testing. When machine learning algorithms started to be applied to humans instead of vectors representing images, studies showed that algorithms were not always behaving “fairly”.

It turns out that training machine learning algorithms with the standard maximization objectives, meaning maximizing prediction accuracy on the training data, sometimes resulted in algorithms that behaved in a way that a human observer would deem unfair, often especially towards a certain minority. “Discussions of algorithmic fairness have increased in the last several years, even though the underlying issues of disparate impact on protected classes have been around for decades” (SIIA Releases Brief on Algorithmic Fairness). That is partly because more data is available to be used, especially with the growth of Internet usage. Every time we are connected to the Internet, whether scrolling through social media, using the Google search bar, ordering online, or doing any other activity, we are, consciously or not, leaving behind a digital footprint ready to be used by different programs that collect and store our data for different purposes, some ethical and some not.
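To make this more concrete, here is a minimal sketch in Python of how a model can look accurate overall while performing much worse for a smaller group. The labels, the group names "A" and "B", and the helper functions are all made up for illustration; this is toy data, not the result of any study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def group_accuracy(groups, y_true, y_pred, name):
    """Accuracy restricted to the members of one group."""
    idx = [i for i, g in enumerate(groups) if g == name]
    return accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])


# Hypothetical toy data: 8 people in majority group "A", 2 in minority group "B".
groups = ["A"] * 8 + ["B"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# Hypothetical predictions: right for everyone in "A", wrong for everyone in "B".
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

overall = accuracy(y_true, y_pred)
acc_a = group_accuracy(groups, y_true, y_pred, "A")
acc_b = group_accuracy(groups, y_true, y_pred, "B")

print("overall:", overall, "group A:", acc_a, "group B:", acc_b)
```

Because group "B" is small, its errors barely move the overall number, so an objective that only maximizes overall accuracy has little incentive to fix them.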

Programs with algorithmic calculations adjust themselves as they are exposed to new data and evolve not only from the original design of the program, but also from the weights developed by their exposure to earlier training data. Computational machines and analytical tools are being trained to recognize and leverage statistical patterns in the data. Data mining is one method that computer scientists use to sort through large sets of data, identify patterns and predict future trends. When machine learning algorithms and data mining are used, they can lead to statistical discrimination. Carlos Castillo, in his presentation on Algorithmic Discrimination, gives some examples that separate statistical discrimination from the non-statistical kind. For example: not hiring a highly-qualified woman because women have a higher probability of taking parental leave (statistical discrimination), or not hiring a highly-qualified woman because she has said that she intends to have a child and take parental leave (non-statistical discrimination).

Here is another example that he suggests to us:

[Slide from Carlos Castillo’s presentation on Algorithmic Discrimination]

As Barocas and Selbst argue in Big Data’s Disparate Impact, data mining is, by definition, always a form of statistical discrimination. The very point of data mining is to provide a rational basis upon which to distinguish between individuals and to reliably confer to the individual the qualities possessed by those who seem statistically similar. Based on this principle, it is important to take a closer look and discuss some ways that statistical discrimination could be avoided.

Data mining looks to locate statistical relationships in data sets. In particular, it automates the process of discovering useful patterns, revealing regularities upon which subsequent decision making can rely. The accumulated set of discovered relationships is commonly called a “model,” and these models can be employed to automate the process of classifying entities or activities of interest, estimating the value of unobserved variables, or predicting future outcomes. By exposing so-called “machine learning” algorithms to examples of the cases of interest (previously identified instances of fraud, spam, default, and poor health), the algorithm “learns” which related attributes or activities can serve as potential proxies for those qualities or outcomes of interest.

The process of data mining towards solving a problem includes multiple steps: defining the target variable, labeling and collecting the training data, using feature selection, making decisions on the basis of the resulting model, and picking out proxy variables for protected classes. As Mark MacCarthy suggests in his study, there are two steps to define statistical concepts of fairness: first, identify a statistical property of a classification scheme; second, define the fairness notion at stake as equalizing the performance of this statistical property with respect to a protected group.
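The two steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: I picked the false positive rate as the statistical property (it is one possible choice among several), and the group names, labels, and function names are hypothetical toy values rather than anything taken from MacCarthy's study.

```python
def false_positive_rate(y_true, y_pred):
    """Step 1: a statistical property of the classifier.

    Among people whose true label is 0, the share who were predicted 1.
    Returns None if the group has no true negatives.
    """
    preds_for_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_for_negatives:
        return None
    return sum(preds_for_negatives) / len(preds_for_negatives)


def rate_by_group(groups, y_true, y_pred, metric):
    """Step 2 (first half): compute the same property separately per group."""
    result = {}
    for name in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == name]
        result[name] = metric([y_true[i] for i in idx],
                              [y_pred[i] for i in idx])
    return result


# Hypothetical toy data: 4 people in a protected group, 4 in the other group.
groups = ["protected"] * 4 + ["other"] * 4
y_true = [0, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1]

fpr = rate_by_group(groups, y_true, y_pred, false_positive_rate)

# Step 2 (second half): fairness in this sense asks for the gap to be near zero.
gap = abs(fpr["protected"] - fpr["other"])
print(fpr, "gap:", gap)
```

In this toy data the protected group is wrongly flagged twice as often as the other group, so the classifier would not satisfy this particular notion of fairness, even though other statistical properties might tell a different story.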

I found an interesting video that explains the problems with algorithmic fairness in cases where algorithms are used to decide whether defendants awaiting trial are too dangerous to be released back into the community.


Yona, Gal. “A Gentle Introduction to the Discussion on Algorithmic Fairness.” Towards Data Science, Towards Data Science, 5 Oct. 2017


Mark MacCarthy, Standards of Fairness for Disparate Impact Assessment of Big Data Algorithms (April 2, 2018)

Mark MacCarthy, “The Ethical Character of Algorithms—and What It Means for Fairness, the Character of Decision-Making, and the Future of News,” The Ethical Machine (blog), March 15, 2019.

Solon Barocas and Andrew D. Selbst, Big Data’s Disparate Impact, 104 California Law Review 671 (2016)