Weekly Takeaways

Karpathy’s article gives readers an interesting example for deblackboxing Convolutional Neural Networks (ConvNets): in this case, how they can be used for feature detection, pattern recognition, and probabilistic inference/prediction to classify selfie photos. To put it simply, Karpathy introduces readers to ConvNets with an example involving animals. Basically, a ConvNet is a large collection of filters that are applied on top of each other. After the network has been trained, we test it by sending in a raw image, which is represented as a 3-dimensional grid of numbers. The image then passes through a series of convolutions, a single operation that is repeated a few tens of times (depending on how many layers we’ve decided to run it through). Small filters slide over the image spatially, and then yet another set of filters is applied to the previous filtered responses. The goal of this process is to gradually detect more and more complex visual patterns until the last set of filters computes the probability of entire visual classes in the image.
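To make the "filters applied on top of each other" idea concrete, here is a minimal sketch in plain Python. It is not Karpathy's code: the function names, the single-channel 5x5 image, and every filter weight are made up for illustration, and a real ConvNet would learn its weights and work on a 3-dimensional grid with many filters per layer. The sketch only shows the shape of the process: slide filters over the image, apply more filters to the filtered responses, and let the last set of filters produce one probability per class.

```python
import math


def slide_filter(grid, kernel):
    """Slide a square kernel over a 2-D grid (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(grid) - k + 1
    out_w = len(grid[0]) - k + 1
    return [
        [
            sum(grid[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(grid):
    """Keep positive responses and zero out negative ones."""
    return [[max(0.0, v) for v in row] for row in grid]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image, feature_kernels, class_kernels):
    """Apply each feature filter to the previous responses, then let the
    final filters (one per class) score the whole remaining map."""
    responses = image
    for kernel in feature_kernels:
        responses = relu(slide_filter(responses, kernel))
    # Each class kernel is the same size as the remaining map,
    # so it yields exactly one score.
    scores = [slide_filter(responses, k)[0][0] for k in class_kernels]
    return softmax(scores)


# Hypothetical 5x5 single-channel image with values already scaled to 0-1.
image = [
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
]

# Made-up weights; a trained network would learn these instead.
feature_kernels = [
    [[-1.0, 1.0], [-1.0, 1.0]],   # 5x5 -> 4x4
    [[0.5, 0.5], [0.5, 0.5]],     # 4x4 -> 3x3
]
class_kernels = [
    [[0.1] * 3 for _ in range(3)],    # score for hypothetical class A
    [[-0.1] * 3 for _ in range(3)],   # score for hypothetical class B
]

probabilities = classify(image, feature_kernels, class_kernels)
print(probabilities)
```

The two printed numbers are the network's (untrained, arbitrary) probabilities for the two hypothetical classes; training would adjust the kernel weights until those probabilities match the labels.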

To go into greater detail about what Karpathy explains through selfies, I will refer to the Crash Course videos and start with the idea of preprocessing. When a raw image is fed in for testing, the computer reads it pixel by pixel. Each pixel is stored as a grayscale value between 0 and 255. To normalize the pixel values, we divide each one by 255, which gives a number between 0 and 1 for every pixel in every image and makes the data easier for the neural network to process. Carrie Anne of Crash Course does a great job of going into detail about this pixel processing. In effect, an artificial neuron, which is the building block of a neural network, takes a series of inputs, multiplies each by a specified weight, and then sums those values all together. These input weights are equivalent to kernel values, so neural networks can learn their own useful kernels that recognize interesting features in images. A kernel contains the values for a pixel-wise multiplication over a patch of the image, and the sum of those products is saved as the new value of the patch’s center pixel. Applying a kernel to a patch of pixels in this way to create a new pixel value is called a convolution. Convolutional neural networks use banks of these neurons to process image data and, after the image has been digested by different learned kernels, output a new image.
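The preprocessing and convolution steps described above can be sketched in a few lines of plain Python. This is an illustration rather than the Crash Course lab code: the function names, the 4x4 image, and the hand-picked vertical-edge kernel are all assumptions, and a trained network would learn its kernel values instead of having them written in. Like most ConvNet code, the sketch does not flip the kernel before multiplying.

```python
def normalize(image):
    """Scale 0-255 grayscale pixel values into the range 0-1."""
    return [[pixel / 255 for pixel in row] for row in image]


def neuron(inputs, weights):
    """Multiply each input by its weight and sum the results."""
    return sum(x * w for x, w in zip(inputs, weights))


def convolve(image, kernel):
    """Slide a square kernel over the image. Each output pixel is the
    weighted sum of the patch under the kernel (no padding, stride 1)."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            # Flatten the k-by-k patch so one "neuron" can process it.
            patch = [image[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(neuron(patch, flat_kernel))
        output.append(out_row)
    return output


# Hypothetical 4x4 grayscale image: dark on the left, bright on the right.
raw = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Hand-picked vertical-edge kernel, used here only for illustration.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve(normalize(raw), edge_kernel)
print(feature_map)  # [[3.0, 3.0], [3.0, 3.0]]
```

Every 3x3 patch in this small image straddles the dark-to-bright edge, so every output value is a strong positive response. Because the sketch uses no padding, the output image is smaller than the input; each output pixel corresponds to the center pixel of the patch it was computed from.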

I will raise the issues I found with Karpathy’s article in the form of questions:

  • Karpathy writes that we don’t know what the filters should be looking for; instead, we initialize them all randomly and then train them over time. Could that be a waste of resources, time, or space? How do we “not know” what the filters should be looking for when it is clear to the human brain that the ‘things’ we want to differentiate are distinct in some way, shape, or form?
  • Karpathy also says our algorithm needs to be trained on a very diverse but similar set of data, but in the example he used, how many variations of frogs and dogs, or more broadly, of humans are there? I think there is a huge potential for exclusion, and I find it interesting that the best-ranked selfies are almost entirely of skinny white females.
  • Also, this may be an unimportant point, but I dislike the way Karpathy labeled his training data, and I believe faulty reasoning like this could lead to reliance on technology that is improperly trained. In his example, a good selfie should be about more than how many followers you have or how many likes you received. What if this is the user’s 10th post of the day or 13th selfie in a row, and people are not inclined to like the photo? What if Instagram was down, or the algorithm hid the photo from viewers, and it did not get many likes? I don’t think describing factors like this as “an anomaly” and claiming “it will be right more times than not” is a fair enough argument when the stakes are higher.
  • Although this point was raised not in Karpathy’s article but in the Artificial Intelligence Crash Course video, the speaker said that labeling a dataset is extremely expensive, which is why we used prelabeled data for this lab. Could financial concerns like this lead to larger-scale ethical issues? Is there any way around this?


CrashCourse. Computer Vision: Crash Course Computer Science #35, 2017. https://www.youtube.com/watch?v=-4E2-0sxVUM.
———. How to Make an AI Read Your Handwriting (LAB) : Crash Course Ai #5, 2019. https://www.youtube.com/watch?list=PL8dPuuaLjXtO65LeD2p4_Sb5XQ51par_b&t=67&v=6nGCGYWMObE&feature=youtu.be.
Karpathy, Andrej. “What a Deep Neural Network Thinks about Your #selfie,” 2015. Accessed March 1, 2021. https://karpathy.github.io/2015/10/25/selfie/.