Category Archives: Week 6

Pattern Recognition and Computing Power

Initially, I found the intersection between statistics and machine learning very perplexing, but this week’s materials have made it clearer. According to the assigned CrashCourse videos, one primary task of machine learning is to learn, from a set of “labeled data,” decision boundaries that classify as accurately as possible; the results are summarized in a “confusion matrix” that shows where the classifier was right and wrong (Machine Learning & Artificial Intelligence 2021). As more “features” are added, a more complicated algorithm, such as an SVM (support vector machine), is required to draw those boundaries. However, it has also become clear that while these methods can analyze large amounts of data and classify it very accurately, machine learning, like medicine, is still an imperfect science (Alpaydin 58). Nowhere is this more evident than in the Karpathy article.
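To make the idea of a confusion matrix concrete for myself, here is a minimal sketch in Python. The labels and predictions are made up for illustration (they are not from the video or the readings), and the function names are my own.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are the true labels, columns are the predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Share of examples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labeled data, for illustration only
labels = ["moth", "butterfly"]
actual = ["moth", "moth", "moth", "butterfly", "butterfly", "butterfly"]
predicted = ["moth", "moth", "butterfly", "butterfly", "butterfly", "moth"]

matrix = confusion_matrix(actual, predicted, labels)
print(matrix)            # counts per (true label, predicted label) pair
print(accuracy(matrix))  # fraction of correct classifications
```

The off-diagonal cells are the mistakes, which is why a classifier can be compared to another one by looking at this table.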

In the Karpathy article, machine learning applied to image interpretation is illustrated with a dataset of selfies gathered from a social media platform. The model Karpathy trained, a convolutional neural network (ConvNet), ranked selfies according to what it had learned from a limited set of parameters in order to pick out the “best” ones; a separate algorithm, t-SNE, was used to lay out similar selfies near one another (Karpathy 2015). Yet these parameters were very limited and did not take into account the multitude of features which might culminate in what could be considered the “best.” For example, one of the parameters used to determine the quality of a selfie was the number of likes received, which is hugely subjective and does not take into account the ratio of male to female followers. Additionally, women on average interact most with other women on social media, whereas men on average are more likely to comment on or like women’s posts (Fowler 2017). This bias was apparent when the top 100 selfies determined by Karpathy’s algorithm were entirely of women. This likely points to an obstacle that machine learning still has to overcome: accounting for a far wider range of features.


In the Dougherty reading, classification was broken down into supervised learning, unsupervised learning and Bayes decision theory. Each of these methods demands a different amount of computing power. Which method is the most efficient in terms of computing power (Dougherty 19)? Additionally, are the methods interchangeable, or is each exclusive to certain kinds of classification?

In the Alpaydin reading, document categorization, bag of words and deep learning were all mentioned, in particular in relation to gathering social media metadata (Alpaydin 69-70). All three have been utilized in disinformation campaigns, so why is this same technology failing to halt the disinformation campaigns which still ravage social media platforms? Lastly, in reference to handwritten characters, Alpaydin said, “…there is still no computer program today that is as accurate as humans for this task” (Alpaydin 58). Has this changed since 2016, when the book was written? Or is handwriting recognition still far from where it could be in accuracy and classification?


Alpaydin, Ethem. Machine Learning: The New AI. MIT Press, 2016.

Dougherty, Geoff. Pattern Recognition and Classification: An Introduction. Springer, 2013.

Fowler, Danielle. “Women Are More Popular On Instagram Than Men According To New Study.” Grazia, Accessed 1 Mar. 2021.

Karpathy, Andrej. What a Deep Neural Network Thinks about Your #selfie. Accessed 1 Mar. 2021.

Machine Learning & Artificial Intelligence: Crash Course Computer Science #34. Accessed 1 Mar. 2021.


Biases in AI

This was such an interesting topic to dive further into because it perfectly illustrates what I’d describe as one of today’s most widely used yet still “black-boxed” phenomena: pattern recognition, especially as it is applied to computer vision and images. Karpathy’s article doesn’t only break down the functions and uses of Convolutional Neural Networks; through his findings he also manages to show how something so computerized can still be very human in terms of the societal biases it brings into play.

“In machine learning, the aim is to fit a model to the data,” explains Alpaydin (Alpaydin, 2016, 58). Computers don’t just know what to do. Someone, a human, has to feed them directions and instructions in order for them to actually do something. The computer will follow whatever set of instructions the human makes for it and execute the commands it is given. This means that a human, with all of their opinions, biases, beliefs, experience, etc., encodes into a non-human thing the ability to execute commands based on human characteristics, capabilities, ways of knowing and understandings. Karpathy’s “experiment” shows exactly how there are biases in algorithms, including his own, a topic I looked into closely during my undergrad (it is one of Dr. Sample’s most studied topics) and through my research on uses of ML and NLP in IPAs, focusing on speech, language recognition and more.

Karpathy explains how “a ConvNet is a large collection of filters that are applied on top of each other”. In these convolutional neural networks, an artificial neuron, which is the building block of a neural network, takes a series of inputs, multiplies each by a specified weight, and then sums those values all together (CrashCourse, #35, 2017). To break it down, artificial neural networks have artificial neurons that basically take numbers in and spit more numbers out (CrashCourse, #34, 2017). You have the input layer, the hidden layers and the output layer. The hidden layers are pretty much where it all happens: for every neuron in each layer, the computer sums the weighted inputs, adds the bias and applies the activation function (CrashCourse, #34, 2017). The more hidden layers a network has, the “deeper” it is, which is where the term deep learning comes from. Neural networks can learn to find their own useful kernels and learn from those. In the same way, ConvNets use banks of these neurons to process image data.
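The neuron computation described above (weighted sum, plus bias, through an activation function) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the sigmoid activation and all the weights are hypothetical choices, not values from the videos or from Karpathy.

```python
import math

def sigmoid(x):
    """One common activation function; squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Multiply each input by its weight, sum, add the bias, then activate."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(total)

def layer(inputs, weight_rows, biases):
    """A layer is just several neurons looking at the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical numbers, for illustration only
inputs = [1.0, 2.0]
hidden = layer(inputs, [[0.5, -0.25], [0.1, 0.3]], [0.0, -0.2])
output = neuron(hidden, [1.0, -1.0], 0.0)
print(hidden, output)
```

Numbers go in, numbers come out; stacking more `layer` calls between input and output is what makes the network deeper.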

As an image runs through the network, convolutions happen over and over again: small filters slide over the image spatially, digging through its different layers in order to find different features. This operation is repeated over and over again, “detecting more and more complex visual patterns until the last set of filters is computing the probability of entire visual classes in the image” (Karpathy, 2015). This is the part where we come in and tell the AI how to use these filters: when, where, what we want out of them, etc. We train the ConvNets to know what to keep and what to omit by telling them what is, in a way, good or bad, pretty or ugly, etc. “Practice makes perfect” is a great saying to apply here, as these neural networks learn through repeated training and trial and error. The more data points you have, the more information you can collect, which means that the more data you have, the less uncertainty there is about the classification, layering and choices made. However, since not all data points are equally informative and some can’t be classified confidently, the ML model can identify where its uncertainty is highest and ask a human to label those examples, learning from them. Through this active learning, the model keeps taking in newly labeled inputs until it reaches the wanted result and outcome (Dougherty, 2013, 3).
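The active-learning idea at the end of the paragraph above, where the model asks a human about the example it is least sure of, could look roughly like the sketch below. It assumes a two-class model that outputs a probability of “good”; the function names and the stand-in model are hypothetical and only there to show the selection step.

```python
def most_uncertain(probabilities):
    """Index of the prediction closest to 0.5, i.e. the least confident one."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))

def active_learning_round(unlabeled, predict_proba, ask_human):
    """Pick the most uncertain example, remove it from the pool, get a label."""
    probs = [predict_proba(x) for x in unlabeled]
    i = most_uncertain(probs)
    example = unlabeled.pop(i)
    return example, ask_human(example)

# Stand-ins for illustration: each "example" is already its own probability,
# and the "human" always answers "good".
pool = [0.95, 0.52, 0.10]
picked, label = active_learning_round(pool, lambda x: x, lambda x: "good")
print(picked, label, pool)
```

The newly labeled example would then be added to the training set before the next round.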

For face recognition, the input is the captured image, which is stored as pixels, each defined by a color stored as a combination of the three additive primary colors, RGB, as we saw in our previous lessons (Alpaydin, 2016, 65; CrashCourse #35). With biometrics we then get the ability to recognize and authenticate people by using their characteristics, both behavioral and physiological. Of course, this also helps with training computers to recognize mood and emotions and not just one’s identity, so that they learn to pick up on and adapt to their user’s mood and feelings.

During classification, in the segmentation and labelling stage, the image is separated into regions that are used for each particular task: the foreground, which contains the objects of interest, and the background, which is everything else and will be disregarded. Labelling the objects then comes into play, which obviously makes it easier in future to immediately categorize or extract whatever is needed from an image; ironically, we can even say that labelling, in many cases, foreshadows the biases that can be found in algorithms. The feature extraction that follows is when characteristic properties of the objects come into play, distinguishing them from, or grouping them with, other objects according to their similarities and differences (Dougherty, 2013, 4-13). This further shows how biases are created even in tech, precisely because tech is basically a reflection of societal biases, issues and human systems of classification.

I couldn’t stop thinking about how much Karpathy’s experiment reminded me of how, a few years ago (and by few I mean even 2-3 years ago), if you Googled “beautiful girls,” for a few scrolls the only photos would be those of generic (pun intended: “generated”) white women, because that is what the algorithms identified as beautiful (honestly, not much has changed now either). A computer doesn’t know what is “pretty”, “ugly”, “good” or “evil”. Humans have input and labelled recognizable patterns and standards of beauty, bringing to the surface the racism and biases that are very much present in our world, but also the underrepresentation of minority groups and BIPOC in tech. Even in Karpathy’s results, one can see the obvious majority of who is in these selfies.


Based on his explanations of what was categorized as a good and a bad image, I would definitely like to ask him what distinctions were made, and how. Also, how is a selfie of Ellie Goulding (a famous singer) in there if he supposedly threw out photos with too many or too few likes compared to others, and people with too many or too few followers?

Based on his worst selfies, one of the criteria is “low lighting”; however, is it just the low lighting that is categorized as bad, or is dark skin also included in that? “Darker photos (which usually include much more noise as well) are ranked very low”. This also speaks to the issue of Snapchat or Instagram filters and their inability to find features on people with darker skin in order to apply the filter to them.


P.S. Check back in the future for a more updated list on cool articles and readings about biases in algorithms! Need to do some digging and go through my saved material and notes from previous years! 


Ethem Alpaydin, Machine Learning: The New AI. Cambridge, MA: The MIT Press, 2016.

Crash Course Computer Science, no. 34: Machine Learning & Artificial Intelligence

Crash Course Computer Science, no. 35: Computer Vision

Crash Course AI, no. 5: Training an AI to Read Your Handwriting

Geoff Dougherty, Pattern Recognition and Classification: An Introduction (New York: Springer, 2012). Excerpt: Chaps. 1-2.

Andrej Karpathy, “What a Deep Neural Network Thinks About Your #selfie,” Andrej Karpathy Blog (blog), October 25, 2015,

Fascinating and Powerful Pattern Recognition

This week, a practical application of ML/AI is introduced: pattern recognition. Recently, Convolutional Neural Networks (ConvNets), a Deep Learning algorithm, have become the most popular way to achieve pattern recognition. In his blog, Karpathy also used a ConvNet to decide what makes a good selfie. Following the classification system in Dougherty’s Pattern Recognition and Classification, Karpathy first gathered a massive dataset of 5 million images with the selfie hashtag. Then, for pre-processing, he used another ConvNet to keep the 2 million images that contain at least one face. He supplied the standard for a good selfie (can we say the standard he put in is a hyper-parameter?) so that the network could learn features/kernels. It is worth noting that the standard he took in the experiment, which I was skeptical about before I read the article, is a fair ranking with certain weights for the audience, likers, and followers (it would be potentially useful for influencers on social media; and since time may be a critical factor, can we also add it to the standard?). After those steps, he had a sufficient dataset to train his ConvNet model with Caffe, a deep learning framework. The model then processed the dataset in its hidden layers to give the classification results.

This experiment illustrates the statement from Crash Course #35: “abstraction is the key to building complex systems, and the same is true in computer vision.” The abstraction in the experiment is complicated since there are so many features in a photo to consider, and that’s really fascinating. Since, according to Dougherty, document recognition is also a part of pattern recognition, I am wondering how an algorithm could tell the difference, during translation, between characters in Japanese and characters in Chinese without their codes (for example, “学生” means “student” in both Chinese and Japanese). I know pattern recognition algorithms are context-sensitive, but does that mean the translating algorithm needs to be trained on both Chinese and Japanese datasets?

Weekly Takeaways

Karpathy’s article provides readers with an interesting example for deblackboxing Convolutional Neural Networks and, in this case, how they can be used for feature detection, pattern recognition, and probabilistic inference/prediction when classifying selfie images. To put it simply, Karpathy introduces readers to ConvNets with an example involving animals. Basically, a ConvNet is a large collection of filters that are applied on top of each other. After we have trained the network, we test it by sending in a raw image, which is represented as a 3-dimensional grid of numbers, and one operation, convolution, is repeated over and over a few tens of times (depending on how many layers we’ve decided to run it through). Small filters slide over the image spatially, and then yet another set of filters is applied to the previous filtered responses. The goal of this process is to gradually detect more and more complex visual patterns until the last set of filters is computing the probability of entire visual classes in the image.

To get into greater detail about what Karpathy explains through selfies, I will refer to the Crash Course videos and start with the idea of preprocessing. When raw images are fed in to be tested, the computer reads the image by looking at its pixels. Each pixel is a grayscale value between 0 and 255. To normalize the pixel values and make them easier for the neural network to process, we divide each value by 255, which gives us a number between 0 and 1 for each pixel in each image. Carrie Anne of Crash Course does a great job of going into detail about this pixel processing. In effect, an artificial neuron, which is the building block of a neural network, takes a series of inputs, multiplies each by a specified weight, and then sums those values all together. These input weights are equivalent to kernel values; neural networks can learn their own useful kernels that are able to recognize interesting features in images. A kernel contains the values for a pixel-wise multiplication, the sum of which is saved into the center pixel. A convolution is the operation of applying a kernel to a patch of pixels in order to create a new pixel value. Convolutional neural networks use banks of those neurons to process image data and, after the image has been digested by different learned kernels, output a new image.
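To check my own understanding of the two steps above (dividing by 255, then sliding a kernel over the image), I wrote a small pure-Python sketch. The tiny image and the vertical-edge kernel are made-up examples, and the kernel here is hand-written rather than learned. As in ConvNets, the kernel is applied without flipping it.

```python
def normalize(image):
    """Scale grayscale values from 0..255 down to 0..1."""
    return [[pixel / 255 for pixel in row] for row in image]

def convolve(image, kernel):
    """Slide a square kernel over the image; each position becomes one
    new pixel: the sum of the pixel-wise products under the kernel."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    result = []
    for r in range(out_height):
        row = []
        for c in range(out_width):
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(total)
        result.append(row)
    return result

# Made-up 3x5 image: dark on the left, bright on the right
image = [[0, 0, 0, 255, 255]] * 3

# Hand-written kernel that responds to dark-to-bright vertical edges
vertical_edge = [[-1, 0, 1]] * 3

edges = convolve(normalize(image), vertical_edge)
print(edges)  # large values where the brightness changes, 0 where it is flat
```

A trained ConvNet would learn many such kernels instead of having them written by hand.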

The issues I found with Karpathy’s article I will raise in the form of questions:

  • Karpathy writes that we don’t know what the filters should be looking for, instead, we initialize them all randomly and then train them over time….could that be a waste of resources/time/space? How do we “not know” what the filters should be looking for when it is clear to the human brain that the ‘things’ we want to differentiate are distinct in some way/shape/form?
  • Karpathy also says our algorithm needs to be trained on a very diverse but similar set of data, but in the example he used, how many variations of frogs/dogs, or more broadly, humans are there? I think there is a huge potential exclusivity, and I find it interesting that the best-ranked selfies are almost entirely skinny white females.
  • Also, this may be an unimportant point, but I dislike the way Karpathy labeled the training data, and I believe faulty reasoning could lead to reliance upon technology that is improperly trained. In his example, it should be about more than how many followers you have or likes you received. What if this is the user’s 10th post of the day or 13th selfie in a row and people are not inclined to like the photo? What if Instagram was down, or the algorithm hid the photo from viewers, and it did not get many likes? I don’t think describing factors like this as “an anomaly” and claiming “it will be right more times than not” is a fair enough argument when the stakes are higher.
  • While this was not raised in Karpathy’s article but rather the Artificial Intelligence Crash Course video, the speaker said it is extremely expensive to label a dataset and that is why for this lab we used prelabeled data. Could financial concerns like this lead to larger-scale ethical issues? Is there any way around this?


CrashCourse. Computer Vision: Crash Course Computer Science #35, 2017.
———. How to Make an AI Read Your Handwriting (LAB) : Crash Course Ai #5, 2019.
“What a Deep Neural Network Thinks about Your #selfie.” Accessed March 1, 2021.

Analysis of Karpathy’s Article Key Points – Heba Khashogji

Machine Learning (ML) and Deep Learning (DL) can be used to analyze tremendous amounts of data, extract useful information and make decisions about it (Machine_Learning&Artificial_Intelligence, 2017), for example classifying e-mails, recommending videos, predicting diseases and recognizing handwriting ((LAB):CrashCourseAi#5, 2019). Applied to computer vision, ML gives computers the ability to extract high-level understanding from digital images (CrashCourseComputerScience#35, 2017).

Such a model first appeared back in 1993, but the first actual use came in 2012, due to the development of GPUs and the massive increase in data sizes (ImageNet, for example) (Karpathy, 2015).

A ConvNet takes a 256x256x3 image as input and produces a probability for each output (class). The class with the highest probability is chosen. At each layer, the ConvNet performs convolution using filters, picking up information like edges, color, etc. (CrashCourseComputerScience#35, 2017). More complex features are extracted as we go deeper and deeper into the network. In the training process, filters are initialized randomly and trained until the network learns to match each image with the correct class (Karpathy, 2015). Training a deep network is complicated and takes much more time than training traditional models. Still, the accuracy is much better, thanks to deep networks’ ability to handle massive data (ALPAYDIN, 2016).
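The last step described above, turning the network’s outputs into one probability per class and choosing the highest, can be sketched as follows. Using softmax for this is my assumption (it is a common choice, but the sources cited here do not name it), and the raw scores and class names are invented.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, class_names):
    """Return the class with the highest probability, and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Invented raw outputs for a two-class network
name, prob = predict([2.0, 0.5], ["good selfie", "bad selfie"])
print(name, round(prob, 3))
```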

Karpathy ConvNet to Classify Selfie Images

Karpathy applied the following vital steps to classify selfie images into good and bad:

  1. Gathering images tagged with #Selfie word (5 million images).
  2. Organizing the dataset: Karpathy divided the dataset into 1 million good and 1 million bad selfies based on factors like the number of people who have seen the selfie, the number of likes, the number of followers and the number of tags. The selfies were ranked in groups of 100; the top half of each group was stored as good selfies and the rest as bad ones.
  3. Training: Karpathy selected the VGGNet pre-trained model and used Caffe to train it on the collected selfie dataset. ConvNet tuned its filters in a way that best allows the separation of the good and bad selfies under a well-known method called supervised learning (Dougherty, 2013).
  4. Results: The author selected the best 100 selfies out of 50,000 ranked by the ConvNet. Based on the ConvNet results, he offered some advice for taking a good selfie, like being female, having the face occupy about one third of the image, cutting off the forehead, showing long hair, etc. He concluded that the style of the image was the key feature of a good selfie.
  5. Extensions: The author also performed three further tasks. The first was ranking celebrities’ selfies. Although specific factors characterized the best selfies, counterexamples, such as selfies of men and selfies with illumination problems, appeared among the best ones. The second task was to apply the t-SNE algorithm, which takes the images and clusters them so that images that are similar under a measure like the L2 norm end up near one another. The results showed clusters like sunglasses, full-body shots and mirror selfies. The third task was to discover the best crop of a selfie. Karpathy randomly cropped each image and fed the fragments to the ConvNet, which picked the best crop. He found that the ConvNet prefers selfies in which the head takes up about 30% of the image and the forehead is chopped off.

In some cases, the ConvNet selected rude crops. Karpathy also inserted a spatial transformation layer before the ConvNet and backpropagated into six parameters defining an arbitrary crop. This extension didn’t work well; the optimization sometimes got stuck. He also tried to constrain the transform, but it didn’t help. The good news is that when the transform has only three bounded parameters, a global search for the best one is affordable (Karpathy, 2015).
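The random-crop search described in the third extension can be illustrated with a short sketch. It is not Karpathy’s code: the scoring function below is a hypothetical stand-in (average brightness) for the trained ConvNet, the image is a made-up grid of numbers, and the crops are square, so each one is described by just three numbers (top, left, size).

```python
import random

def crop(image, top, left, size):
    """Cut a size-by-size square out of a 2D grid of pixel values."""
    return [row[left:left + size] for row in image[top:top + size]]

def mean_brightness(patch):
    """Hypothetical stand-in for the ConvNet's 'good selfie' score."""
    values = [pixel for row in patch for pixel in row]
    return sum(values) / len(values)

def best_random_crop(image, score, tries=50, seed=0):
    """Try random square crops and keep the one with the highest score."""
    rng = random.Random(seed)  # fixed seed so the search is repeatable
    height, width = len(image), len(image[0])
    best = None
    for _ in range(tries):
        size = rng.randint(2, min(height, width))
        top = rng.randint(0, height - size)
        left = rng.randint(0, width - size)
        s = score(crop(image, top, left, size))
        if best is None or s > best[0]:
            best = (s, top, left, size)
    return best

# Made-up 6x6 "image" of brightness values
image = [[(r * c) % 7 for c in range(6)] for r in range(6)]
best_score, top, left, size = best_random_crop(image, mean_brightness)
print(best_score, top, left, size)
```

With only three bounded numbers per crop, trying many candidates like this is cheap, which is the point about global search made above.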

  6. Availability: Anyone on Twitter can use the “deepself” bot designed by Karpathy to analyze their selfie and get a score for how good it is.

References:

(LAB):CrashCourseAi#5. (2019). Retrieved from YouTube:

ALPAYDIN, E. (2016). Machine Learning: The New AI. Cambridge: Massachusetts Institute of Technology.

CrashCourseComputerScience#35. (2017). Retrieved from Youtube:

Dougherty, G. (2013). Pattern Recognition and Classification. New York: Springer Science+Business Media.

Karpathy, A. (2015). Retrieved 2020, from

Machine_Learning&Artificial_Intelligence. (2017). Machine Learning & Artificial Intelligence. Retrieved from YouTube:

Simplify the problem through the layers

The article sets out to solve a problem: what makes a good selfie? The features that tell whether a selfie is good or not are abstract representations, which means they are hard to describe. It is easy for us to say, by feeling, that this is a good selfie, but difficult to say what makes it good. It is just like how we can recognize the letter ‘A’ in many different kinds of handwriting but cannot explain well why it is an A. Luckily, a deep neural network is good at exactly this: finding the rules and features hidden in a class and applying them to new instances. In fact, Karpathy treated the question of what makes a good selfie as a two-class classification problem, dividing selfies into a good class and a bad class, which is easier for a deep neural network to solve.

To deal with the problem, the first step is to acquire data suitable for the ConvNet. In the video How to make an AI read your handwriting (LAB): Crash Course Ai #5, to let the machine read the handwriting correctly, the presenter first has to digitize the data and then make it similar to the letters in EMNIST, the training set. The selfie data does not need to be digitized since it is already digital, but it still has some restrictions; for example, each image should contain at least one face.

The next step is feature extraction. Just as nuts and bolts can be discriminated by area, and apples and bananas by circularity, the classification of selfies also depends on different features. From a human perspective there may be a big difference between distinguishing fruits and distinguishing good selfies from bad ones, but for the machine there is no essential difference. It’s just that the more complex distinction may require a lot of filters, not just ones for circularity or size. As for how a dimension or filter is chosen, the machine works it out through training. As the article says, the filters start out random and, depending on the results, a mathematical process changes all of them. The data goes through the filters and finally becomes some set of values at the output.
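To see what classifying on a single hand-picked feature like circularity looks like, here is a small sketch. It uses the standard circularity measure 4πA/P² (1 for a perfect circle, smaller for elongated shapes); the shapes, the 0.8 threshold and the function names are my own assumptions for illustration, not values from Dougherty.

```python
import math

def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1 for a circle, less for elongated shapes."""
    return 4 * math.pi * area / perimeter ** 2

def classify_fruit(area, perimeter, threshold=0.8):
    """Round shapes are called apples, elongated ones bananas.
    The threshold is a made-up value for illustration."""
    return "apple" if circularity(area, perimeter) >= threshold else "banana"

# A circle of radius 3 (apple-like) and a 1-by-4 rectangle (banana-like)
circle = (math.pi * 3 ** 2, 2 * math.pi * 3)
rectangle = (1 * 4, 2 * (1 + 4))

print(classify_fruit(*circle), classify_fruit(*rectangle))
```

For selfies nobody writes such a feature by hand; the filters that play this role are learned during training.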

On the other hand, the selfies given as input are treated as pixels and pass through input layer → hidden layer → output layer. In my understanding, the hidden layers are a set of layers. There are many filters in each layer, each for a particular task; or we can say each filter only gets excited about its specific, corresponding feature. The image itself is segmented to isolate different objects from each other and from the background, and the different features are labeled. The filters then detect whether the data has the corresponding features. It seems as if the image is divided into many pieces and each filter answers yes or no (binary) to determine whether it outputs some value to the next layers or filters, according to the image’s features. (But to be honest, I feel a little confused by the concepts of layer, column, filter, object and feature.)


1 What are the invariances of a selfie image? Can you give some examples?

2 It seems that the features of good selfies in the article were concluded from the results by the author rather than by the machine. Is it possible that the machine’s standard for a good selfie is different from the human’s conclusions? The machine just deals with pixels and may not understand a rule like “Face should occupy about 1/3 of the image.” For example, in the article there is one selfie where the machine simply got rid of the “self” part of the selfie to get its version of a good selfie.

3 The article said t-SNE is a wonderful algorithm that takes some number of things and lays them out in such a way that nearby things are similar. I am curious about how t-SNE would deal with movies. Compared with selfies, movies are dynamic and contain a much larger amount of data.


Alpaydin, E. (n.d.). Machine Learning. MACHINE LEARNING, 225.

Dougherty, G. (2013). Pattern Recognition and Classification. Springer New York.

ML to the Trolley Problem

In Karpathy’s article, pattern recognition is presented as an application of ML. Karpathy introduced how a Convolutional Neural Net can distinguish good selfies from bad ones by applying filters over and over again, starting from random ones. Based on my understanding, the whole process lets the computer sift through a massive amount of information and capture the common features, then turn those common features into a standard for distinguishing good from bad. According to the video inserted in the article, each filter has a different purpose. Some can identify faces; others are designed to capture clothing, which, as Karpathy said, is like teaching a child to identify common figures. Another thing I noticed is the binary function it uses. I was surprised that a simple binary function could carry out such a complicated process. If I understand correctly based on the previous material, Karpathy might use the binary function to let the machine “know” whether a selfie is good or bad by dividing all the characteristics into questions like “Does the picture have one face or more?”, “Is it a male or female?”, “Does the face occupy a large portion of the picture or not?” The machine then makes the decision based on the answers to these questions. That’s why Karpathy listed the final standard of what makes a selfie good or bad, the process of classification, and the identified features.

Machines can learn thanks to the algorithms built into them. These give them the ability to analyze the regularities, or what we call patterns, in a database and go on to make predictions and decisions. This leads to a question I have been concerned about for a long time (though it is pretty irrelevant): does ML let machines make more rational decisions than humans, for example on the Trolley Problem? Since the computer calculates at very high speed, what if the machine calculated and compared the people’s circumstances: based on family and health records, how long each person has left to live and how likely each is to commit a crime; and, based on education and financial records, each person’s contribution to society? The machine would then determine who gets to survive the accident.


“What a Deep Neural Network Thinks about Your #selfie.” n.d. Accessed February 26, 2021.

CrashCourse. 2017a. Machine Learning & Artificial Intelligence: Crash Course Computer Science #34

The Rumble of A Roomba

I think the most interesting thing for me this week, which was highlighted by the Karpathy article, is how much we need to adjust the data we input into a system before creating a machine learning model. A uniformity must exist with a reduction of noise to accurately embody the true answer to the problem we are attempting to solve.

This most acutely reminds me of a Roomba, or robotic vacuum. People, including myself, will sing the praises of owning one of these devices, as it keeps the floor clean regularly without the need for external monitoring, fulfilling a role which I would normally take on now that I have two quite furry companions. The thing about these vacuums, though, is that I have to make sure everything I don’t want it to run into is off the floor, that there is no water about, and that there isn’t any string to get caught up in its gears or motors. Essentially, I have made my apartment a system in which the robot vacuum can work in peace without any obstruction.

The same goes for this learning model which Karpathy uses. These photos are taken and cleaned in some way to create a uniformity which the system itself can understand and process, with anything deviating from that uniformity not being captured.

What does this tell us about the pictures? That the system knows how to pick out certain types of well-made, popular photos which people have uploaded to the internet. This is amazing! But does it teach us anything about the photos themselves or how to take photos? No, as these are techniques that could be learned by studying photography and design.

What I worry about is not the clean practices of learning but the messiness of the world. Something has to give: will the world become more orderly to accommodate the model? Or will the model eventually be strong enough to handle the messiness of the world? I am sure the answer is somewhere in between, but it will be interesting to see where we find ourselves on that spectrum.

Pattern Recognition: The Foundations of AI/ML- Chirin Dirani

The readings and videos for this week add another level of understanding to the foundations of AI and ML. Learning about pattern recognition is another step toward deblackboxing computing systems and AI. According to Geoff Dougherty, pattern recognition means that when we put many samples of an image into a program for analysis, the program should recognize a pattern specific to the input image and identify that pattern as a member of a category or class the program already knows. Because there are many categories or classes, we have to assign a particular image to a certain class, and this is what we call classification. The recognition process works by training convolutional neural networks (ConvNets) to help the program recognize the pattern. These ConvNets can be applied to many image recognition problems, like recognizing handwritten text, spotting tumors in CT scans, monitoring traffic on roads and much more. Dougherty emphasizes the fact that pattern recognition “is used to include all objects that we might want to classify.” The materials for this class provide many case studies of the applications of pattern recognition through ConvNets. I will start with Andrej Karpathy’s piece on how to take the best selfie, then elaborate on digital image analysis as I understood it from the Crash Course episode on computer vision, and end with the Crash Course video on using pattern recognition in Python code to read our handwriting.

The first case is the interesting article by Andrej Karpathy. He tried to find what makes a perfect selfie by using convolutional neural networks (ConvNets). For Karpathy, ConvNets “recognize things, places and people in personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things.” Karpathy introduced the basics of what convolutional neural networks do, but he focused more on his applied techniques of using pattern recognition in digital image analysis. By training a ConvNet, he built a program that was able to pick out the 100 best selfies. Although this is an ideal case study for pattern recognition using ConvNets, I finished the article with more questions than I had when I started it. The fact that we can feed ConvNets images and labels of whatever we like convinced me that these ConvNets will learn to recognize whatever labels we want. This pushed me to question how objective or subjective these ConvNets are. In other words, would the outputs change according to the gender, race, orientation and motives of the human feeding the inputs?

To bridge the gap in understanding how convolutional neural networks work in decision making and pattern recognition (facial recognition here), which is missing from Karpathy’s article, I relied on the Crash Course episode on ML/AI. According to this video, the ultimate objective of ML is to use computers to make decisions about data. Algorithms give computers the ability to learn from data and then make those decisions. To start with, the decision process is called classification, and the algorithm that does it is called a classifier. To train machine learning classifiers to make good predictions, we need training data. Machine learning algorithms separate the labeled data with decision boundaries, working to maximize correct classifications and minimize wrong ones. A decision tree is one example of an ML technique; it amounts to dividing the decision space into boxes. Some ML techniques depend on statistics to make confident decisions, while others have no origins in statistics. One of the latter is the artificial neural network, inspired by the neurons in our brains. Similar to brain neurons, artificial neurons receive inputs from other cells, process those signals, and then release their own signal to other cells. These cells form huge interconnected networks able to process complex information. Rather than chemical and electrical signals, artificial neurons take numbers as input and release numbers as output. They are organized into layers connected by links, forming a network of neurons. There are three kinds of layers: the input layer, one or more hidden layers, and the output layer. There can be many hidden layers, and this is where the term deep learning comes from. The video also distinguishes two kinds of AI. The first kind is sophisticated but not generally intelligent (weak or narrow AI): these algorithms do one thing and are intelligent only at specific tasks, such as finding faces or translating texts.
The second kind is general-purpose AI (strong AI). Such algorithms would take in large amounts of information and learn faster than humans, for example through reinforcement learning.
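The idea of a decision boundary that maximizes correct classifications can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the measurements, the labels "A" and "B", and the function names are all made up, and a single threshold on one feature is the simplest possible boundary, not the method used in the video.

```python
# Hypothetical labeled data: (measurement, label). All values are invented.
samples = [
    (28, "A"), (30, "A"), (33, "A"), (36, "A"),
    (35, "B"), (40, "B"), (44, "B"), (47, "B"),
]

def count_correct(threshold, data):
    """Count samples classified correctly by the rule: below threshold -> A."""
    correct = 0
    for value, label in data:
        predicted = "A" if value < threshold else "B"
        if predicted == label:
            correct += 1
    return correct

def fit_threshold(data):
    """Try a boundary midway between neighboring values and keep the best one."""
    values = sorted(v for v, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: count_correct(t, data))

best = fit_threshold(samples)
print(best, count_correct(best, samples), "of", len(samples))
```

Because the two classes overlap (one "B" sample is smaller than the largest "A" sample), no single boundary gets every sample right, which is why the classifier can only maximize correct answers rather than guarantee them.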

The second case is the image analysis process. We feed an image into a program as input; once a face in the image is isolated, more specialized computer vision algorithms can be applied to pinpoint facial landmarks. Facial landmarks capture the geometry of the face, like the distance between the eyes or the size of the nose or lips. Emotion recognition algorithms can then interpret emotion and give computers the ability to understand when the face is happy, sad or maybe frustrated. Just as levels of abstraction are used in building complicated computing systems, they are used for facial recognition. At the hardware level, cameras provide improved sight; camera data is then used to train algorithms that crunch pixels to recognize a face, and the outputs of those algorithms are processed further to interpret facial expressions.
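To make the "geometry of the face" concrete, here is a small sketch of turning landmark points into measurements. The landmark names and pixel coordinates are hypothetical; a real system would get them from a landmark detector, which is not shown here.

```python
import math

# Hypothetical landmark coordinates (x, y) in pixels; a detector would supply these.
landmarks = {
    "left_eye": (30.0, 40.0),
    "right_eye": (70.0, 40.0),
    "nose_tip": (50.0, 60.0),
    "mouth_left": (38.0, 80.0),
    "mouth_right": (62.0, 80.0),
}

def distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def face_geometry(points):
    """Turn raw landmark positions into simple geometric measurements."""
    eye_distance = distance(points["left_eye"], points["right_eye"])
    mouth_width = distance(points["mouth_left"], points["mouth_right"])
    eye_mid = (
        (points["left_eye"][0] + points["right_eye"][0]) / 2,
        (points["left_eye"][1] + points["right_eye"][1]) / 2,
    )
    nose_length = distance(eye_mid, points["nose_tip"])
    # Ratios relative to the eye distance stay the same if the face is scaled.
    return {
        "eye_distance": eye_distance,
        "mouth_to_eye_ratio": mouth_width / eye_distance,
        "nose_to_eye_ratio": nose_length / eye_distance,
    }

geometry = face_geometry(landmarks)
print(geometry)
```

Numbers like these are the kind of features a later algorithm could use, one level of abstraction above the raw pixels.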

The last case is the Crash Course video on programming ConvNets to recognize handwritten letters and convert them into typed text. In this case, a language called Python is used to write the code. The issue here is what Ethem Alpaydin called “The Additional Problem of segmentation,” which is how to write code that figures out where one letter ends and another begins. The neural network is trained to recognize a pattern instead of memorizing a specific shape. To do so, the following steps should be implemented:

  1. Create a labeled dataset to train the neural network by splitting data into training sets and testing sets. 
  2. Create a neural network. It should be configured with an input layer, some number of hidden layers, and an output that gives a number corresponding to its letter prediction. 
  3. Train, test and tweak the code until it’s accurate enough.
  4. Scan handwritten pages and use the newly trained neural network to convert into typed text.
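
The first three steps can be sketched in plain Python. This is a toy stand-in and not the code from the video: the 3x3 "letters" are invented, and a single artificial neuron (a perceptron) replaces the multi-layer network so that the example stays short. All function names are my own.

```python
import random

# Hypothetical 3x3 "handwritten" glyphs, flattened row by row (1 = ink).
PROTOTYPES = {
    0: [0, 1, 0, 0, 1, 0, 0, 1, 0],  # label 0 stands for the letter "I"
    1: [1, 0, 0, 1, 0, 0, 1, 1, 1],  # label 1 stands for the letter "L"
}

def make_dataset():
    """Step 1a: labeled samples, each prototype plus every one-pixel smudge."""
    data = []
    for label, pixels in PROTOTYPES.items():
        data.append((pixels[:], label))
        for i in range(len(pixels)):
            noisy = pixels[:]
            noisy[i] = 1 - noisy[i]  # flip one pixel
            data.append((noisy, label))
    return data

def split(data, test_fraction=0.25, seed=0):
    """Step 1b: shuffle, then split into a training set and a testing set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def predict(weights, bias, pixels):
    """Step 2: one neuron that outputs a number standing for its letter guess."""
    total = bias + sum(w * p for w, p in zip(weights, pixels))
    return 1 if total > 0 else 0

def train(train_set, max_epochs=100):
    """Step 3: nudge weights after each mistake until an epoch has none."""
    weights, bias = [0.0] * 9, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for pixels, label in train_set:
            error = label - predict(weights, bias, pixels)
            if error != 0:
                mistakes += 1
                weights = [w + error * p for w, p in zip(weights, pixels)]
                bias += error
        if mistakes == 0:
            break
    return weights, bias

def accuracy(weights, bias, data):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(1 for pixels, label in data if predict(weights, bias, pixels) == label)
    return hits / len(data)

dataset = make_dataset()
train_set, test_set = split(dataset)
weights, bias = train(train_set)
print("train accuracy:", accuracy(weights, bias, train_set))
print("test accuracy:", accuracy(weights, bias, test_set))
```

The testing set is kept out of training on purpose: accuracy on samples the network has never seen is what tells us whether it learned a pattern rather than memorizing specific shapes.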

In conclusion and to reemphasize what we said in previous classes, computer systems, AI and ML are useful but can not be intelligent like humans. It is all about understanding computing design layers. By understanding the process of pattern recognition today, we reveal  another level in this system and as Professor Irvine says, “There is no magic, no mysteries — only human design for complex systems.”


Crash Course Computer Science, no. 34: Machine Learning & Artificial Intelligence

Crash Course Computer Science, no. 35: Computer Vision

Crash Course AI, no. 5: Training an AI to Read Your Handwriting

Ethem Alpaydin, Machine Learning: The New AI. Cambridge, MA: The MIT Press, 2016.

Geoff Dougherty, Pattern Recognition and Classification: An Introduction (New York: Springer, 2012).

Professor Irvine Introduction Intro to Computing Design Principles & AI/ML Design


Learning to Read

Before delving into the key issues and points from Karpathy’s article, we need to deconstruct pattern recognition to its basics. Pattern Recognition is a subset of Machine Learning because it is a process that gives computers the ability to learn from data that can then be used to make predictions and decisions. This process consists of classifying data into categories determined by decision boundaries. The goal is to maximize correct classification while minimizing errors. To do so it goes through a step-by-step process, laid out most notably in Dougherty’s reading. The entire method boils down to this:

  • Sensing/Acquisition – uses a transducer such as a camera or microphone to capture signals (e.g., an image) with enough distinguishing features.
  • Preprocessing – makes the data easier to segment, for example by converting each pixel into a number, dividing the pixel’s RGB value by 256.
  • Segmentation – partitions a signal into regions that are meaningful for a particular task—the foreground, comprising the objects of interest, and the background, everything else.
    1. Region-based = similarities are detected.
    2. Boundary-based = discontinuities are detected.
  • Feature Extraction –
    1. Features are characteristic properties of the objects whose values should be similar for objects in a particular class and different from the values for objects in another class (or from the background). Features can be continuous (numbers) or categorical (nominal, ordinal).
  • Classification – assigns objects to categories based on their feature information: it evaluates the evidence presented and decides which class each object should be assigned to, depending on whether the values of its features fall inside or outside the tolerance of that class.
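
A minimal sketch of the steps after sensing, using a tiny made-up grayscale image, might look like this. Everything here is an assumption for illustration: the pixel values, the threshold, the two features, and the class tolerances are invented, and the sketch scales pixels by 255 (the largest 8-bit value) so that they fall between 0 and 1.

```python
# Hypothetical 5x5 grayscale "image" (0 = black, 255 = white) from a sensor.
image = [
    [10, 12, 11, 9, 10],
    [11, 200, 210, 205, 12],
    [9, 198, 220, 207, 10],
    [10, 11, 12, 10, 11],
    [12, 10, 9, 11, 10],
]

def preprocess(img, scale=255.0):
    """Preprocessing: rescale raw pixel values to the range 0..1."""
    return [[p / scale for p in row] for row in img]

def segment(img, threshold=0.5):
    """Segmentation: bright pixels are foreground (True), the rest background."""
    return [[p > threshold for p in row] for row in img]

def extract_features(mask):
    """Feature extraction: area and width/height ratio of the foreground.

    Assumes the mask contains at least one foreground pixel.
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {"area": len(coords), "aspect": width / height}

# Hypothetical class tolerances: an allowed (low, high) range per feature.
CLASSES = {
    "wide bar": {"area": (4, 10), "aspect": (1.2, 3.0)},
    "tall bar": {"area": (4, 10), "aspect": (0.3, 0.8)},
}

def classify(features):
    """Classification: pick the first class whose tolerances all fit."""
    for name, tolerances in CLASSES.items():
        if all(lo <= features[k] <= hi for k, (lo, hi) in tolerances.items()):
            return name
    return "unknown"

features = extract_features(segment(preprocess(image)))
print(features, classify(features))
```

An object whose features fall outside every tolerance comes back as "unknown", which mirrors the idea that classification depends on whether feature values fall inside or outside the tolerance of a class.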

The first four steps I interpret as preparing the data and the features that the algorithm will apply to the data, and the final step is where the action occurs in a simple and fast manner. Using a picture as an example, this happens by sending the data in each pixel through this process. Now that we know what pattern recognition consists of, we can examine Karpathy’s explanation of Convolutional Neural Networks (ConvNets), which are just another form of pattern recognition, specifically a type of classification method. Other methods include decision trees, forests (which are just compilations of decision trees), support vector machines, and neural networks.

To understand ConvNets we should start with understanding neural networks. Neural networks are organized in layers connected by links; each neuron takes a series of inputs and combines them to emit a signal as an output, with both inputs and outputs represented as numbers. Between the input and output layers are hidden layers, whose neurons sum the weighted inputs, add a bias, and then apply an activation function (transfer function), a final mathematical modification that produces the neuron’s output. The weights and biases are initially set to random numbers when a neural network is created; then an algorithm trains the network using labeled examples from the training data. In a ConvNet, training likewise starts from scratch by initializing the filters at random and then changing them slightly through a mathematical process each time the system is told what the image actually is, e.g. a toad vs. a frog (supervised learning?). A ConvNet follows the same principle but has more hidden layers performing more data analysis in order to recognize complex objects and scenes; the use of many hidden layers is what is termed deep learning.
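
The weighted sum, bias, and activation function can be written out as a short sketch. The weights and biases below are hand-picked, hypothetical values chosen so the arithmetic is easy to follow (a real network would start from random numbers and adjust them during training), and the sigmoid is just one common choice of activation function.

```python
import math

def sigmoid(x):
    """Activation function: squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each neuron sums its weighted inputs, adds a bias, activates."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs

# Hypothetical fixed weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

def forward(inputs):
    """Pass numbers through the hidden layer, then through the output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)

print(forward([1.0, 1.0]))
```

Training would mean repeatedly comparing this output with the correct label and nudging the weights and biases to shrink the difference; that part is left out of the sketch.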

Karpathy highlighted this through a practical example using selfies, which I found both amusing and enjoyable. I think the key point he raises, echoed in the other readings, is that pattern recognition is not 100% accurate. The choice of features shapes the decision space and its decision boundaries, and the resulting confusion matrix tells us what the algorithm got right and wrong. A related difficulty is the “Curse of Dimensionality”: the more features we add to make decisions more precise, the more complicated classification becomes, and so experts employ the K.I.S.S. method. However, we can train algorithms like ConvNets to be mostly right by identifying features and, through repetitive training, helping the algorithm gradually figure out what to look for; this, I believe, is termed supervised learning, or maybe reinforcement learning? In sum, a ConvNet is a form of pattern recognition used as a tool for machine learning. It still has obstacles to overcome, but it is already being used to convert handwriting into text, spot tumors in CT scans, monitor traffic flows on roads, and guide self-driving cars. The possibilities are endless!
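
To show what a confusion matrix records, here is a small sketch that tallies right and wrong answers for a two-class problem. The "good"/"bad" labels and both lists are invented for illustration and do not come from Karpathy’s data.

```python
def confusion_matrix(actual, predicted, positive="good"):
    """Tally true/false positives and negatives for a two-class problem."""
    matrix = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            matrix["TP" if truth == positive else "FP"] += 1
        else:
            matrix["FN" if truth == positive else "TN"] += 1
    return matrix

def accuracy(matrix):
    """Share of all predictions that were correct."""
    return (matrix["TP"] + matrix["TN"]) / sum(matrix.values())

# Hypothetical labels for eight selfies: what they really were vs. the guesses.
actual = ["good", "good", "good", "bad", "bad", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good", "bad", "bad", "good"]

matrix = confusion_matrix(actual, predicted)
print(matrix, accuracy(matrix))
```

The off-diagonal entries (false positives and false negatives) are exactly the cases where the algorithm was wrong, which is why the matrix says more than a single accuracy number.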


Given the definitions of supervised vs. unsupervised learning, does that mean supervised learning is pattern recognition? Does that then mean unsupervised learning does not exist? If it does exist, what are some examples?

Where does reinforcement learning fall: under supervised or unsupervised learning?

Are features another term for bias and weights?


Alpaydin, Ethem. 2016. Machine Learning: The New AI. MIT Press Essential Knowledge Series. Cambridge, MA: MIT Press.

CrashCourse. 2017a. Machine Learning & Artificial Intelligence: Crash Course Computer Science #34.

———. 2017b. Computer Vision: Crash Course Computer Science #35.

———. 2019. How to Make an AI Read Your Handwriting (LAB) : Crash Course Ai #5.

“Dougherty-Pattern Recognition and Classification-an Introduction-2013-Excerpt-1-2.Pdf.” n.d. Google Docs. Accessed February 26, 2021.

“What a Deep Neural Network Thinks about Your #selfie.” n.d. Accessed February 26, 2021.