Final Paper: De-Blackboxing Facial Recognition

Chloe Wawerek

De-Blackboxing Facial Recognition as the Key to Ethical Concerns


Facial recognition has received attention in recent news over biases in machines’ ability to detect faces and the repercussions these biases could have for minorities. After reviewing the design principles behind facial recognition through material on the history and evolution of AI, I am confident that the ethical issues facing facial recognition lie not in the technology itself but in how humans shape the technology. This review, coupled with case studies of research conducted by Pew and MIT, shows that skewed data affects how algorithms process and learn, which leads some technology to have what are called ingrained biases. This, however, is easy to solve once facial recognition has been de-blackboxed, which is what I aim to do in this paper.

  1. Introduction

There are certain things that we take for granted as humans. The abilities to see, speak, and comprehend are just a few things that we can do but have difficulty explaining how we do them. Scientists are trying to replicate in computers what minds can do, like the above-mentioned tasks, in a broad field better known as AI. However, everything in computing technology is an embodiment of electricity designed to represent 1s or 0s, also known as bits. Strings of bits are then defined by programs to be symbols, such as numbers. Computing therefore represents how humans impose a design on electricity so that it performs as a logic processor. As a result, AI programs do not have a human’s sense of relevance, i.e. common sense, because, among many other things, they do not know how to frame a situation: implications tacitly assumed by human thinkers are ignored by the computer because they have not been made explicit (Boden). In the words of Grace Hopper, “The computer is an extremely fast moron. It will, at the speed of light, do exactly what it is told to do, no more, no less.” So the question is: how did humanity come to concern itself with the ability of computers to recognize faces? I want to examine how facial recognition works and de-blackbox this ability down to the design processes that set the foundation for this technology. Starting from the concept of computation, I will trace the evolution of facial recognition to highlight the root issues with this technology. Fundamentally, computers have a vision problem because they cannot understand visual images as humans do. Computers need to be told exactly what to look for and what not to look for when identifying images and solving problems; hence, extremely fast morons. With this in mind, we need to look deeper into why issues exist when it is humans who set the precedent for what computers should see versus what they do see.

  2. Facial Recognition

2.1 Computation to AI

The designs for computing systems and AI have been developed by means of our common human capacities for symbolic thought, representation, abstraction, modeling, and design (Irvine). Computation systems are human-made artifacts composed of elementary functional components that act as an interface between the functions performed by those components and the surroundings in which they operate (Simon). Those functions combine, sequence, and make active symbols that mean (“data representations”) and symbols that do (“programming code”) in automated processes for any programmable purpose (Irvine). Computers, then, are nothing more than machines for following instructions, and those instructions are what we call programs and algorithms. Roughly speaking, all a computer can do is follow lists of instructions such as the following:

  • Add A to B
  • If the result is bigger than C, then do D; otherwise, do E
  • Repeatedly do F until G
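
As a rough illustration, the three kinds of instruction above can be written in a few lines of Python. The function name, the values, and the choice of "halve the number" as the repeated step are my own placeholders, not taken from any source.

```python
def run_instructions(a, b, c):
    """Follow the three kinds of instruction from the list above."""
    # Add A to B
    result = a + b

    # If the result is bigger than C, then do D; otherwise, do E
    if result > c:
        action = "D"
    else:
        action = "E"

    # Repeatedly do F until G
    # (placeholder: F = halve the value, G = the value drops below 1)
    steps = 0
    value = float(result)
    while value >= 1:
        value = value / 2
        steps += 1

    return result, action, steps


print(run_instructions(2, 3, 4))
```

Nothing here resembles intelligence; the computer only adds, compares, and repeats, exactly as told.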

Computers, then, can reliably follow very simple instructions very, very quickly, and they can make decisions if those decisions are precisely specified (Wooldridge). If we are to build intelligent machines, then their intelligence must ultimately reduce to simple, explicit instructions like these, which raises the question: can we produce intelligent behavior simply by following lists of instructions? Well, AI takes inspiration from the brain. If we can understand how the brain performs the information processing in which it still surpasses engineered products (vision, speech recognition, learning), we can define solutions to these tasks as formal algorithms and implement them on computers (Alpaydin). Currently, a machine is said to have AI if it can interpret data, potentially learn from the data, and use that knowledge to adapt and achieve specific goals. However, this definition allows different interpretations of AI: strong versus weak. Strong AI is when a program can understand in a way similar to a human, whereas weak AI is when a program can only simulate understanding. Scientists are still wrestling with the issue of AI comprehension, which involves understanding the human world and the unwritten rules that govern our relationships within it, by testing programs with Winograd schemas (Wooldridge). In a Winograd schema, changing a single word changes whom an ambiguous pronoun refers to, so answering correctly requires common-sense knowledge rather than grammar alone.

Example: Question – Who [feared/advocated] violence?

Statement 1a: The city councilors refused the demonstrators a permit because they feared violence.

Statement 1b: The city councilors refused the demonstrators a permit because they advocated violence.

These problems consist of building computer programs that carry out tasks that currently require brain function, like driving cars or writing interesting stories. To do so, scientists use a process called machine learning, which aims to construct a program that fits a given data set by creating a learning program that is a general model with modifiable parameters. Learning algorithms adjust the parameters of the model by optimizing a performance criterion defined on the data (Alpaydin). In layman’s terms, machine learning algorithms give computers the ability to learn from data and then make predictions and decisions, maximizing correct classifications while minimizing errors. A machine learning algorithm involves two steps to choose, from a set of possible functions, the function that best explains the relationships between features in a dataset: training and inference.

  1. The first step, training, involves allowing a machine learning algorithm to process a dataset and choose the function that best matches the patterns in the dataset. The extracted function is encoded in a computer program in a particular form known as a model. The training process proceeds by taking inputs, creating outputs, and comparing those outputs to the correct outputs listed for the examples in the dataset. Training is finished, and the model is fixed, once the machine learning algorithm has found a function that is sufficiently accurate, meaning that the outputs it generates match the correct outputs listed in the dataset.
  2. The next step is inference, in which the fixed model is applied to new examples for which scientists do not know the correct output value and therefore want the model to generate an estimate of this value on its own.
    1. A machine learning algorithm uses two sources of information to select the best function. One is the dataset, and the other is a set of assumptions (the inductive bias) that leads it to prefer some functions over others, irrespective of the patterns in the dataset. The dataset and the inductive bias counterbalance each other: a strong inductive bias pays less attention to the dataset when selecting a function (Kelleher).
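
The two steps can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not an algorithm from Kelleher: the dataset is made up, the set of possible functions is limited to straight lines through the origin (a deliberately strong inductive bias), and the names train and infer are hypothetical.

```python
def train(dataset, candidate_slopes):
    """Training: pick the function y = w * x that best matches the dataset."""
    def total_error(w):
        # Sum of squared differences between generated and correct outputs
        return sum((w * x - y) ** 2 for x, y in dataset)

    # The slope with the smallest error becomes the fixed model
    return min(candidate_slopes, key=total_error)


def infer(model_w, new_x):
    """Inference: apply the fixed model to an input with no known output."""
    return model_w * new_x


# Made-up labeled examples (input, correct output), roughly y = 2x
dataset = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
# Inductive bias: only these straight lines through the origin are considered
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

model = train(dataset, candidates)
estimate = infer(model, 10)
print(model, estimate)
```

Because only six candidate functions are allowed, the inductive bias does most of the work here; a richer set of candidates would let the dataset shape the result more.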

Neural networks, a commonly used form of machine learning algorithm that takes inspiration from some structures that occur in the brain, are what this paper will focus on in its de-blackboxing of facial recognition. A neural network uses a divide-and-conquer strategy to learn a function: each neuron in the network learns a simple function, and the overall (more complex) function defined by the network is created by combining these simpler functions. In brief, neural networks are organized in layers of neurons connected by links. Each neuron takes a series of inputs and combines them to emit a signal as an output; both inputs and outputs are represented as numbers. Between the input and output layers are hidden layers whose neurons sum their weighted inputs and then add a bias. An activation function (also called a transfer function) is then applied to that sum, performing a final mathematical modification to produce the neuron’s output. The weights and biases are initially set to random numbers when a neural network is created; an algorithm then trains the network using labeled examples from the training data. For image networks, training starts from scratch by initializing the filters at random and then changing them slightly, through a mathematical process, each time the system is told what the actual image is, e.g. a toad versus a frog. Because the correct answers are supplied, this is supervised learning.
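
To make the weighted sum, bias, activation function, and training loop concrete, here is a minimal sketch of a single neuron in Python. The sigmoid activation, the learning rate, the number of passes, and the tiny logical-OR dataset are all my own assumptions chosen for illustration; real networks combine many such neurons in layers.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Sum the weighted inputs, add the bias, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train_neuron(examples, epochs=2000, learning_rate=0.5, seed=0):
    """Start from random weights and nudge them using labeled examples."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in examples[0][0]]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        for inputs, label in examples:
            error = neuron_output(weights, bias, inputs) - label
            # Move each weight a little in the direction that shrinks the error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical labeled data: output 1 if either input is 1 (logical OR)
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(examples)
predictions = [round(neuron_output(weights, bias, x)) for x, _ in examples]
print(predictions)
```

The neuron is never told the rule for OR. It only sees inputs paired with correct answers and adjusts its numbers until its outputs agree with them.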

2.2 Computer Vision

Computer vision is the extraction of high-level understanding from digital videos and images. The first step is therefore to make a digital photo, and to do so we need a digital camera. When a photo is taken, light from the scene passes through the camera’s lens, diaphragm, and open shutter to hit millions of tiny microlenses that capture the light and direct it properly. The light then goes through a hot mirror that lets visible light pass and reflects invisible infrared light that would distort the image. The remaining light goes through a layer that measures the colors captured. This layer mimics human eyesight in that it distinguishes only visible light and identifies the colors red, green, and blue, another explicit presence of human design in our computational systems. The usual design is the Bayer array, a matrix of green, red, and blue filters arranged so that no two filters of the same color sit side by side, with twice as many green filters as red or blue. Finally, the light strikes the photodiodes, which measure its intensity. The light first hits the silicon at the “P-layer,” which transforms the light’s energy into electrons, creating a negative charge. This charge is drawn into the diode’s depletion area by the electric field that the negative charge creates with the positive charge of the “N-layer.” Each photodiode collects photons of light for as long as the shutter is open; the brighter a part of the photo is, the more photons have hit that section. Once the shutter closes, the pixels hold electrical charges proportional to the amount of light received. The charges are then read out by one of two processes: CCD (charge-coupled device) or CMOS (complementary metal-oxide semiconductor). In either process, the pixels’ charges go through an amplifier that converts this faint static electricity into a voltage in proportion to the size of each charge (White). The voltage is then converted into digital data, with color values most commonly written as hex codes.
Data is always something with humanly imposed structure, that is, an interpretable unit of some kind understood as an instance of a general type. Data is inseparable from the concept of representation. In simplest terms, each color is composed of three numbers, one each for red, green, and blue, and each number can take 256 values, from 0 to 255. So, to alter a picture’s colors, one needs to change the numbers associated with those colors. Black is 0 for all three, which is the absence of color, and white is 255 for all three.
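
A short Python sketch shows how a color is nothing more than three numbers and how the same numbers can be written as a hex code. It assumes the common 8-bit encoding in which each channel runs from 0 to 255; the function names are my own.

```python
def rgb_to_hex(red, green, blue):
    """Write an (R, G, B) triple, each value 0-255, as a hex color code."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each channel must be between 0 and 255")
    return "#{:02X}{:02X}{:02X}".format(red, green, blue)


def brighten(pixel, amount):
    """Alter a color by changing its numbers, capped at the maximum of 255."""
    return tuple(min(255, channel + amount) for channel in pixel)


black = (0, 0, 0)        # absence of color
white = (255, 255, 255)  # every channel at its maximum

print(rgb_to_hex(*black), rgb_to_hex(*white))
print(brighten((250, 10, 0), 10))
```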

There are several methods that a computer can then use to extract meaning from digital images and gain vision. The ultimate goal is context sensitivity, which means being aware of its surroundings, i.e. understanding social and environmental factors so that the machine reacts appropriately. To get there, machine learning relies on pattern recognition. Pattern recognition consists of classifying data into categories determined by decision boundaries. The process starts with sensing/acquisition. This step uses a transducer such as a camera or microphone to capture signals (e.g., an image) with enough distinguishing features. The next step, preprocessing, makes the data easier to segment, for example by converting each pixel into a number between 0 and 1 by dividing its RGB values by 255, the maximum value. It is followed by segmentation, which partitions an image into regions that are meaningful for a particular task: the foreground, comprising the objects of interest, and the background, everything else. In this step the program uses either region-based segmentation, in which similarities are detected, or boundary-based segmentation, in which discontinuities are detected. Following segmentation is feature extraction, where features are identified. Features are characteristic properties of the objects whose values should be similar for objects in a particular class and different from the values for objects in another class (or from the background). The last step is classification, which evaluates the evidence presented by the features and decides which class each object should be assigned to, depending on whether the values of its features fall inside or outside the tolerance of that class.
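
The steps after sensing can be sketched as a tiny Python pipeline. Everything in it is a simplifying assumption made for illustration: the "image" is a small made-up grid of brightness values, segmentation is a plain threshold, the only features are area and aspect ratio, and the classes "wide" and "tall" are hypothetical.

```python
def preprocess(image):
    """Preprocessing: scale raw 0-255 brightness values to the range 0-1."""
    return [[value / 255 for value in row] for row in image]


def segment(image, threshold=0.5):
    """Segmentation: mark pixels as foreground (True) or background (False)."""
    return [[value > threshold for value in row] for row in image]


def extract_features(mask):
    """Feature extraction: area and aspect ratio of the foreground region."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, is_foreground in enumerate(row) if is_foreground]
    if not coords:
        return {"area": 0, "aspect_ratio": 0.0}
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {"area": len(coords), "aspect_ratio": width / height}


def classify(features):
    """Classification: the decision boundary is an aspect ratio of 1.0."""
    if features["area"] == 0:
        return "nothing"
    return "wide" if features["aspect_ratio"] > 1.0 else "tall"


# Made-up grayscale image: a bright horizontal bar on a dark background
image = [
    [10, 10, 10, 10, 10],
    [10, 200, 220, 210, 10],
    [10, 10, 10, 10, 10],
]
features = extract_features(segment(preprocess(image)))
label = classify(features)
print(features, label)
```

Each stage hands a simpler description to the next: raw numbers become a foreground mask, the mask becomes two measurements, and the measurements become a category.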

For computer vision, the algorithmic methods built on pattern recognition include color mark tracking, which searches pixel by pixel through RGB values for the color it is looking for. Prewitt operators are used to find the edges of objects (as when a self-guided drone is flying through an obstacle course) by searching in patches. To do so, scientists employ a technique called convolution, in which a rule is created that defines an edge by a number indicating the color difference between the pixels on the left and the pixels on the right. Building on this concept, the Viola-Jones face detection method uses the same technique to identify multiple features that indicate a face, such as lines for noses and islands for eyes, by scanning every patch of pixels in a picture (CrashCourse). The last method, and the one we will focus on, is convolutional neural networks (ConvNets). This method follows the neural network concept explained in section 2.1 but has many layers, each of which outputs a new image through different learned convolutions that respond to edges, corners, shapes, simple objects (mouths/eyebrows), and so on, until a final layer puts all the previous convolutions together. ConvNets are not required to be many layers deep, but they usually are in order to recognize complex objects and scenes, which is why the technique is considered deep learning.
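
The edge-finding rule can be sketched in Python with the horizontal Prewitt kernel, which subtracts the pixels on the left of a patch from the pixels on the right. The small grayscale image is made up for illustration, and, as is common in image processing, the kernel is applied directly to each patch without flipping it.

```python
# Horizontal Prewitt kernel: right column minus left column
PREWITT_X = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def apply_kernel(image, kernel):
    """Slide a 3x3 kernel over every 3x3 patch of a grayscale image."""
    height, width = len(image), len(image[0])
    output = []
    for r in range(height - 2):
        row = []
        for c in range(width - 2):
            total = 0
            for kr in range(3):
                for kc in range(3):
                    total += kernel[kr][kc] * image[r + kr][c + kc]
            row.append(total)
        output.append(row)
    return output


# Made-up image: dark on the left, bright on the right
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
edges = apply_kernel(image, PREWITT_X)
for row in edges:
    print(row)
```

Patches that lie entirely in the dark or entirely in the bright region produce 0, while patches that straddle the boundary produce a large number, which is how the rule marks an edge.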

The image taken from Andrej Karpathy’s blog on ConvNets shows how ConvNets operate. On the left is the input image. The ConvNet is fed raw image pixels, which are represented as a 3-dimensional grid of numbers. For example, a 256×256 image would be represented as a 256×256×3 array (the last 3 for red, green, and blue). Then convolutions are performed, meaning that small filters slide over the image spatially. These filters respond to different features in the image, such as an edge, an island, or a region of a specific color. The first column shows 10 responses, one for each of the 10 filters that help identify what the image is. In this way the original (256,256,3) image is transformed into a (256,256,10) “image,” in which the original image information is discarded and the ConvNet keeps only the 10 filter responses at every position in the image. Each of the next 14 columns is produced by repeating the same operation on the column before it. This gradually detects more and more complex visual patterns until the last set of filters puts all the previous convolutions together and makes a prediction (Karpathy).
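
The change in shape from 3 color channels to 10 filter responses can be checked with a small pure-Python sketch. It is not Karpathy’s code: the 3×3 filter size, the zero padding at the borders, the ReLU step that keeps only positive responses, and the random (untrained) filter values are my own assumptions, and an 8×8 image stands in for the 256×256 example so the loops run quickly.

```python
import random


def conv_layer(image, filters):
    """Apply each 3x3 filter at every position, zero-padding the borders."""
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    output = [[[0.0] * len(filters) for _ in range(width)]
              for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for f, filt in enumerate(filters):
                total = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        # Positions outside the image count as zero
                        if 0 <= rr < height and 0 <= cc < width:
                            for d in range(depth):
                                total += filt[dr + 1][dc + 1][d] * image[rr][cc][d]
                # ReLU: keep only positive responses
                output[r][c][f] = max(0.0, total)
    return output


def random_filters(count, depth, rng):
    """Filters start as random numbers; training would adjust them."""
    return [[[[rng.uniform(-1, 1) for _ in range(depth)]
              for _ in range(3)] for _ in range(3)] for _ in range(count)]


def shape(volume):
    """Return (height, width, depth) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
layer1 = conv_layer(image, random_filters(10, 3, rng))    # 3 channels -> 10
layer2 = conv_layer(layer1, random_filters(10, 10, rng))  # 10 -> 10
print(shape(image), shape(layer1), shape(layer2))
```

The second call takes the 10 responses of the first layer as its input, which mirrors how each later column in the figure is computed from the column before it.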

Pattern recognition is not 100% accurate. The choice of the features that create the decision boundaries and decision space results in a confusion matrix that tells what the algorithm got right and wrong. One obstacle to perfect accuracy is the “Curse of Dimensionality”: the more features we add to make decisions more precise, the more complicated the classification becomes. For this reason experts employ the “Keep It Simple Scientist” method. Faces are even more difficult than other images because differences in pose and lighting, or additions like hats, glasses, or even beards, cause significant changes in the image and in how algorithms understand it. However, scientists can program algorithms like ConvNets to be mostly right by identifying features and through repeated training on labeled examples, which helps the algorithm gradually figure out what to look for, a form of supervised learning.
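
A confusion matrix is simple to compute once the correct labels and the algorithm’s predictions are lined up. The Python sketch below uses made-up labels purely for illustration; the class names and function names are my own.

```python
def confusion_matrix(actual, predicted, classes):
    """Count how often each actual class was assigned each predicted class."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Share of examples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Made-up labels for illustration only
actual = ["face", "face", "face", "not_face", "not_face", "face"]
predicted = ["face", "not_face", "face", "not_face", "face", "face"]

matrix = confusion_matrix(actual, predicted, ["face", "not_face"])
print(matrix)
print(accuracy(matrix))
```

The off-diagonal counts show the two kinds of mistake separately: a face that was missed and a non-face that was wrongly accepted.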

3. Conclusion

Facial recognition is nothing but pattern recognition. ConvNets are just one of many methods that organizations use to recognize faces. Computers are given an algorithm that learns from a training data set before being applied to a test set. These algorithms only extrapolate predictions from the training data, trying to get the closest approximation to whatever we want. When the outputs fail, we are not getting a good correspondence between what we put in and reality. We then need to go back, redesign, and get better approximations in order to get accurate projections. It is not the technology itself that is wrong but the data humans feed it: garbage in, garbage out. Thus, we see ethical issues today regarding AI perpetuating racial and gender biases. Research from Pew shows large disparities in how well facial recognition technology identifies males and females, which stem from faulty data. Research published in 2018 (Buolamwini and Gebru) showed glaring discrepancies in accuracy between darker and lighter skin tones. Now that we know the design principles behind facial recognition, it is clear that accurate training data that reflects the population on which this technology will be used is key to solving this issue. To that end, organizations should diversify both their training data and the field itself by encouraging and supporting people from underrepresented racial and gender groups. Governments should enact regulations to ensure transparency and accountability in AI technology and prevent the use of facial recognition in justice and policing without a standard for accuracy. Other concerns derive from organizations obtaining images without the consent of the people in them and using those images in their facial recognition databases. This, though, is a fault not of the technology itself but of how organizations apply it. As such, similar solutions revolve around regulation and transparency. The future does not need to look bleak if people gain a shared understanding of what really drives these issues.
The first step is understanding that facial recognition is not a blackbox that cannot be demystified. It is instead just extremely fast pattern recognition utilizing algorithms on sometimes skewed data. Understanding the design principles behind anything can better shape solutions to problems that exist.  


Alpaydin, Ethem. Machine Learning: The New AI. MIT Press, 2016.

Besheer Mohamed, et al. “The Challenges of Using Machine Learning to Identify Gender in Images.” Pew Research Center: Internet, Science & Tech, 5 Sept. 2019.

Boden, Margaret. AI: Its Nature and Future. Oxford University Press, 2016.

Buolamwini, Joy, and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. p. 15.

CrashCourse. Computer Vision: Crash Course Computer Science #35. 2017. YouTube.

Dougherty, Geoff. Pattern Recognition and Classification. Springer. Accessed 26 Feb. 2021.

Irvine, Martin. “CCTP-607: Leading Ideas in Technology: AI to the Cloud.” Google Docs. Accessed 3 May 2021.

Karpathy, Andrej. “What a Deep Neural Network Thinks about Your #selfie.” Andrej Karpathy Blog, 25 Oct. 2015.

Kelleher, John. Deep Learning. MIT Press, 2019.

Simon, Herbert. The Sciences of the Artificial. 3rd ed., MIT Press, 1996.

White, Ron, and Tim Downs. How Digital Photography Works. 2nd ed., Que Publishing, 2007.

Wooldridge, Michael. A Brief History of Artificial Intelligence. 1st ed., Flatiron Books, 2020.