Simplify the problem through the layers

The article sets out to answer the question of what makes a good selfie. The features that tell whether a selfie is good or not are abstract representations, which means they are hard to describe. It is easy for us to say that a selfie is good based on our feeling, but it is difficult to state exactly what makes it good. It is just like how we can recognize the letter ‘A’ in many different kinds of handwriting but cannot explain well why it is an A. Luckily, this is what a deep neural network is good at: finding the rules and features hidden in a class and applying them to new instances. In fact, Karpathy framed the question of what makes a good selfie as a two-class classification problem, namely how to divide selfies into a good class and a bad class, which is easier for a deep neural network to solve.

To deal with the problem, the first step is to acquire data suitable for the ConvNet. In the video How to make an AI read your handwriting (LAB): Crash Course AI #5, for the machine to understand the handwriting correctly, the host first has to digitize the data and then make it look similar to the letters in EMNIST, the training set. The selfie data do not need this treatment since they are already digital, but they still come with some restrictions, for example that each image should contain at least one face.
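The step of making new data look like the training set can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the video: the function names `resize_nearest`, `normalize` and `prepare` are hypothetical, the scan is assumed to be a grid of grayscale values from 0 to 255, and the target size of 28 x 28 is the image size of EMNIST.

```python
def resize_nearest(pixels, size=28):
    """Resize a 2D grid of grayscale values to size x size
    by nearest-neighbour sampling (hypothetical helper)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * src_h // size][c * src_w // size] for c in range(size)]
        for r in range(size)
    ]


def normalize(pixels):
    """Scale 0-255 grayscale values to the range 0.0-1.0."""
    return [[v / 255.0 for v in row] for row in pixels]


def prepare(pixels, size=28):
    """Bring a scanned image into the same shape and value range
    as the training images (assumed here: 28 x 28, values in 0..1)."""
    return normalize(resize_nearest(pixels, size))


# Made-up 56 x 56 "scan" standing in for a digitized handwritten letter.
scan = [[(r + c) % 256 for c in range(56)] for r in range(56)]
ready = prepare(scan)
print(len(ready), len(ready[0]))
```

The same idea would apply to selfies in principle: every image has to end up with the size and value range the network was trained on.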

The next step is feature extraction. Just as nuts and bolts can be told apart by area and apples and bananas by circularity, the classification of selfies also depends on different features. From a human perspective there may be a big difference between distinguishing fruits and distinguishing good selfies from bad ones, but for the machine there is no essential difference. It is just that the more complex distinction may require a lot of filters, not only ones about circularity or size. As for how a dimension or filter is chosen, the machine works it out by trial through training. As the article says, the filters start out random and, depending on the results, a mathematical process changes all of them. The data go through the filters and finally become some kind of values at the output.
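To make the idea of a feature and of training from a random start more concrete, here is a small sketch. It is only an illustration under assumptions: the shape measurements are made up, circularity is computed with the common formula 4πA/P², and the learning rule is a simple perceptron, which is a much simpler stand-in for the process that adjusts the filters of a ConvNet. All function and variable names are hypothetical.

```python
import math
import random


def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1.0 for a perfect circle, smaller for elongated shapes."""
    return 4.0 * math.pi * area / (perimeter ** 2)


# Made-up (area, perimeter) measurements; label 1 = "apple", 0 = "banana".
SHAPES = [
    ((math.pi * 4.0, 2.0 * math.pi * 2.0), 1),  # circle, radius 2
    ((math.pi * 9.0, 2.0 * math.pi * 3.0), 1),  # circle, radius 3
    ((16.0, 16.0), 1),                          # 4 x 4 square
    ((5.0, 12.0), 0),                           # 1 x 5 rectangle
    ((8.0, 18.0), 0),                           # 1 x 8 rectangle
    ((4.0, 10.0), 0),                           # 1 x 4 rectangle
]


def predict(w, b, x):
    """Classify one feature value with the current weight and bias."""
    return 1 if w * x + b > 0 else 0


def train_perceptron(samples, epochs=10000, lr=0.1, seed=0):
    """Start from random parameters and nudge them after every mistake."""
    rng = random.Random(seed)
    w = rng.uniform(-0.5, 0.5)
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        mistakes = 0
        for x, label in samples:
            error = label - predict(w, b, x)
            if error != 0:
                w += lr * error * x
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return w, b


FEATURES = [(circularity(a, p), label) for (a, p), label in SHAPES]
W, B = train_perceptron(FEATURES)
print([predict(W, B, x) for x, _ in FEATURES])
```

Here a single hand-picked feature is enough. For selfies nobody knows the right features in advance, which is why the network has to learn many filters instead of being given one formula.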

On the other hand, the selfies given as input are treated as pixels and pass through the network as input layer → hidden layers → output layer. In my understanding, the hidden layers are a set of layers. Each layer contains many filters, each for a particular task; in other words, each filter only gets excited about one specific, corresponding feature. As for the image itself, it is segmented to isolate the different objects from each other and from the background, and the different features are labeled. The filters then detect whether the data have the corresponding features. It seems as if the image is divided into many pieces, and each filter answers yes or no (binary) to decide, according to the image’s features, whether to output some value to the next layers or filters. (But to be honest, I feel a little confused about the concepts of layer, column, filter, object and feature.)
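The way a single filter "gets excited" can be sketched as below. This is a minimal illustration, assuming a tiny 5 x 5 grayscale image and two hand-written 2 x 2 filters; in a real ConvNet the filter values are learned, and the names `convolve2d`, `relu` and `excitement` are hypothetical. Note that in this sketch the filter's answer is a graded number rather than a strict yes or no, which is how ConvNet activations are usually described.

```python
def convolve2d(image, kernel):
    """Slide the kernel over every position where it fits fully
    inside the image and return the map of responses."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def excitement(image, kernel):
    """Strongest response of one filter anywhere in the image."""
    return max(max(row) for row in relu(convolve2d(image, kernel)))


# Tiny image: dark on the left, bright on the right (one vertical edge).
IMAGE = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

VERTICAL_EDGE = [[-1, 1], [-1, 1]]     # responds to dark-to-bright, left to right
HORIZONTAL_EDGE = [[-1, -1], [1, 1]]   # responds to dark-to-bright, top to bottom

print(excitement(IMAGE, VERTICAL_EDGE))
print(excitement(IMAGE, HORIZONTAL_EDGE))
```

The vertical-edge filter responds only where the image actually contains a vertical edge, while the horizontal-edge filter stays silent. The response maps of one layer then serve as the input of the next layer, which is one way to read input layer → hidden layers → output layer.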


1 What is the invariance of a selfie image? Can you give some examples?

2 It seems that the features of good selfies in the article were concluded from the results by the author rather than by the machine. Is it possible that the machine’s standard for a good selfie is different from the human’s conclusion? The machine just deals with pixels and may not understand a rule like “the face should occupy about 1/3 of the image.” For example, in the article there is one selfie where the machine simply got rid of the “self” part of the selfie to produce its own version of a good selfie.

3 The article says t-SNE is a wonderful algorithm that takes some number of things and lays them out in such a way that nearby things are similar. I am curious about how t-SNE would deal with movies. Compared with selfies, movies are dynamic and contain a much larger amount of data.


Alpaydin, E. (n.d.). Machine Learning. MACHINE LEARNING, 225.

Dougherty, G. (2013). Pattern Recognition and Classification. Springer New York.