This article asks whether, and how, a recommendation system can recommend content on a topic that has never appeared in the system. To answer the question, the article briefly describes how recommendation systems work and concludes that, without relevant data, a system cannot recommend relevant content. It then turns to the external data available to an app and deblackboxes the digital fingerprint to show that tracking users and sharing data make it possible to improve recommendations. Finally, the article discusses data privacy and expresses some concerns.
Recommendation systems built on machine learning bring us a lot of convenience. Shopping apps recommend products we need, video apps recommend videos that attract us, and search engines guess what we want to search for. The prediction or inference of our needs is usually easy to understand, since it is based on behavioral data that we create ourselves. For example, if I buy a science fiction novel, the shopping app recommends other science fiction novels; if I click on a cat-related video, the app recommends more cat videos. Admittedly, different recommendation algorithms lead to different recommendation strategies, but most of these strategies are explainable and understandable from a human perspective. In daily life, however, we may run into the following situation: we discuss a topic with friends (maybe on other apps, or even in person), and the topic has never been discussed or searched for on a given app, yet after a while that app recommends advertisements or videos related to the topic. This coincidence naturally raises a question: can a recommendation system recommend content on a topic that has never appeared in the system, or is our mobile app monitoring us all the time and extracting keywords for recommendation? This article tries to answer this question. First, it explains the composition and data sources of a recommendation system in general. Then, starting from the data sources, it explains how large Internet companies build user portraits in multiple dimensions and deblackboxes the methods used to track users. Finally, it discusses the impact of mobile phone fingerprints (digital fingerprints) on data privacy.
How does the recommendation system work?
Before discussing whether a recommendation system can make recommendations so accurate that they feel like monitoring, we need to briefly describe how a recommendation system works. Simply put, a recommendation system has three parts: data, algorithm, and architecture. The data provides information and is the input of the recommendation system. It contains user and content attributes as well as user behavior and preference information, such as clicking on a certain type of video or purchasing a certain type of goods. The algorithm provides the logic for processing the data, that is, how to turn the data into the desired output. Take Collaborative Filtering, one of the most commonly used algorithms in recommendation systems, as an example. Collaborative Filtering is based on an assumption: if A and B have similar historical rating patterns or behavior habits on some content, then they will have similar interests in other content. In its user-based form, it generally uses a nearest-neighbor algorithm: it calculates the distance between users from their historical preference information, and then uses the weighted ratings of the target user’s nearest neighbors to predict the target user’s preference for a specific product. The system recommends products or content to the target user based on the result. The architecture specifies how data flows and is processed, that is, how data travels from the client to the storage unit (database) and then back to the client.
In other words, the recommendation system categorizes raw data to form user portraits, attaches model tags or labels (i.e., patterns) to each user, and then recommends content using various algorithms, such as the Collaborative Filtering algorithm just mentioned.
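The nearest-neighbor idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the users, items and ratings are invented, cosine similarity stands in for the distance measure, and the prediction is a plain similarity-weighted average. It is not the implementation used by any particular app.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"scifi_novel_1": 5, "scifi_novel_2": 4, "cookbook": 1},
    "bob": {"scifi_novel_1": 5, "scifi_novel_2": 5, "scifi_novel_3": 5, "gardening": 2},
    "carol": {"scifi_novel_1": 1, "cookbook": 5, "gardening": 5, "scifi_novel_3": 2},
}


def similarity(a, b):
    """Cosine similarity between two rating dicts (unrated items count as 0)."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(user, item, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings for item."""
    neighbours = sorted(
        ((similarity(ratings[user], r), r[item])
         for other, r in ratings.items()
         if other != user and item in r),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # nobody similar has rated the item: nothing to predict from
    return sum(sim * rating for sim, rating in neighbours) / total


def recommend(user):
    """Items the user has not rated yet, best predicted rating first."""
    candidates = {i for r in ratings.values() for i in r} - set(ratings[user])
    scored = [(item, predict(user, item)) for item in candidates]
    return sorted(((i, s) for i, s in scored if s is not None),
                  key=lambda pair: pair[1], reverse=True)


print(recommend("alice"))
```

In this toy data, alice’s ratings resemble bob’s far more than carol’s, so bob’s opinion dominates the prediction. Note also that `predict` returns `None` for an item that no neighbour has rated: with no data about an item, this kind of algorithm has nothing to recommend.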
Fig1, data processing
As Fig1 above shows, the original data contains four aspects:
User data refers to the user’s personal information, such as gender, age, registration time, mobile phone model, etc.
Content data refers to the content provided by the app. For example, the content data of shopping apps such as Taobao and Amazon relates to products and product reviews. The content data of video apps such as TikTok and Netflix relates to videos and video reviews.
User behavior logs refer to what the user did on the app, such as what videos they searched for, what videos they shared, or what product they purchased.
External data is data provided by other apps. A single app can only collect one aspect of a user’s preference data. For example, a video app can only describe what type of content a user prefers in the video field. But if data from other types of apps is integrated, the dimensions of the user’s data are greatly enriched.
Fact labels are obtained by cleaning the original data. They include static and dynamic portraits:
Static portrait refers to the attributes of the user which are independent of the product scene, such as age and gender. Such information is relatively stable.
Dynamic portrait refers to the user’s behavior data on the app. Explicit behaviors (where the user clearly expresses a preference) include likes, shares, etc. It is worth mentioning that for comments, NLP is needed to determine whether the user’s attitude is positive, negative or neutral. Implicit behaviors (where the user does not clearly express a preference) include how long the user watches a video, clicks, etc.
Model labels are obtained from the fact labels through weighted calculation and cluster analysis: each dimension is given a weight, a score is calculated, and users are then grouped (cluster analysis) according to the result.
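The step from fact labels to model labels can be sketched as follows. The signal weights, categories and users are all hypothetical, and a small hand-written k-means (seeded with the first k users so that the result is deterministic) stands in for whatever cluster analysis a real system would use.

```python
# Hypothetical weight per behavioural signal (assumed values, not from a real system).
WEIGHTS = {"like": 3.0, "share": 5.0, "click": 1.0, "watch_minutes": 0.5}
CATEGORIES = ["cats", "scifi"]

# Hypothetical fact labels: user -> category -> raw signal counts.
fact_labels = {
    "u1": {"cats": {"like": 4, "share": 1, "click": 10, "watch_minutes": 30},
           "scifi": {"click": 1}},
    "u2": {"cats": {"like": 3, "click": 8, "watch_minutes": 20}},
    "u3": {"cats": {"click": 1},
           "scifi": {"like": 5, "share": 2, "click": 12, "watch_minutes": 40}},
    "u4": {"scifi": {"like": 2, "click": 9, "watch_minutes": 25}},
}


def interest_vector(facts):
    """Weighted score per category: sum of weight * count over all signals."""
    return [
        sum(WEIGHTS[signal] * count for signal, count in facts.get(cat, {}).items())
        for cat in CATEGORIES
    ]


def kmeans(points, k, iterations=10):
    """Tiny k-means; the first k points are used as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


users = list(fact_labels)
vectors = [interest_vector(fact_labels[u]) for u in users]
model_labels = dict(zip(users, kmeans(vectors, k=2)))
print(model_labels)
```

The cluster number attached to each user plays the role of a model label: users with the same number can be treated as one audience, for example "cat video fans" or "science fiction fans".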
In short, the recommendation system processes the data layer by layer using various models and algorithms, and then returns the corresponding recommendation results. But in any case, the recommendation system cannot give recommendation results out of nothing. It needs various data as input, processes the data according to algorithms designed by humans, and returns results according to certain logic. Therefore, for a single app, if we have never discussed a topic on the app (that is, the recommendation system has no corresponding data), it is reasonable to expect that the app will not return recommendations on that topic.
However, Fig1 shows that the data source is not limited to the app itself. If there is corresponding external data, the recommendation system has the ability to recommend content corresponding to that external data. In fact, technically speaking, large Internet companies such as Google, Alibaba and ByteDance usually have multiple apps in different fields, which can share user data and expand the dimensions of user portraits through user account information and digital fingerprints. Take Alibaba as an example. Ali’s apps include maps, health, payment, a video platform and even Weibo, a social platform, so Ali’s portrait of Chinese users can cover many dimensions. For different apps that share a common account, records can be matched directly by account in the database. However, some Ali-owned apps, such as AutoNavi Maps, do not require users to log in. Does Ali have a way to track these users? The answer is yes. For users who use the app without logging in to a personal account, the app can identify or track them by the fingerprint of the smartphone.
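A minimal sketch of how records from several apps might be joined into one portrait is given below. The three apps, the field names and the `device_fp` values are hypothetical. The only idea taken from the text is that a record can be matched either by account or, when the user is not logged in, by a device fingerprint that has been seen together with an account elsewhere.

```python
from collections import defaultdict

# Hypothetical records from three apps owned by the same company.
map_app = [{"device_fp": "fp-9c1e", "recent_places": ["pet store"]}]  # not logged in
shop_app = [{"account": "user42", "device_fp": "fp-9c1e", "purchases": ["scifi novel"]}]
video_app = [{"account": "user42", "liked_topics": ["cats"]}]

ID_FIELDS = ("account", "device_fp")


def merge_portraits(*sources):
    """Group records from all sources under one key per person."""
    # Pass 1: learn which fingerprint belongs to which account.
    fp_to_account = {}
    for records in sources:
        for record in records:
            if "account" in record and "device_fp" in record:
                fp_to_account[record["device_fp"]] = record["account"]

    # Pass 2: file every record under the account if known, else the fingerprint.
    portraits = defaultdict(dict)
    for records in sources:
        for record in records:
            fp = record.get("device_fp")
            key = record.get("account") or fp_to_account.get(fp, fp)
            if key is None:
                continue  # no identifier at all: the record cannot be linked
            for field, values in record.items():
                if field not in ID_FIELDS:
                    portraits[key].setdefault(field, []).extend(values)
    return dict(portraits)


print(merge_portraits(map_app, shop_app, video_app))
```

Here the map record carries no account, but its fingerprint also appears in the shopping record, so all three records end up in the single portrait of `user42`.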
How to track users?
Existing tracking mechanisms are usually based on either tagging or fingerprinting (Klein & Pinkas, n.d.). Tracking here means roughly the same as the words recognize or identify used above.
The typical tagging method is cookies. A cookie is a small piece of text sent by the server to the client browser and stored locally on the user’s device, which the server then uses to identify the user. Their main use is to remember helpful things like your account login info, or what items were in your online shopping cart (Cover Your Tracks, n.d.). But now, on both PC browsers and mobile phones, many users choose to delete or block cookies, which makes cookies much less effective at identifying users. Fingerprinting, by contrast, stores nothing on the device; it identifies the device from measurable properties of the device itself.
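The tagging idea can be sketched with Python’s standard `http.cookies` module. The cookie name `visitor_id`, the one-year lifetime and the in-memory `visits` dictionary are assumptions made for illustration; a real site would sit behind a web framework and a database.

```python
import uuid
from http.cookies import SimpleCookie

visits = {}  # visitor_id -> number of requests seen (stands in for a database)


def handle_request(cookie_header):
    """Identify a visitor from the Cookie header, tagging new visitors.

    Returns (visitor_id, value for a Set-Cookie header or None).
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if "visitor_id" in cookie:
        # Returning visitor: the browser sent back the tag we issued earlier.
        visitor_id = cookie["visitor_id"].value
        set_cookie = None
    else:
        # New (or cookie-less) visitor: issue a fresh random tag.
        visitor_id = uuid.uuid4().hex
        tag = SimpleCookie()
        tag["visitor_id"] = visitor_id
        tag["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # one year, in seconds
        set_cookie = tag["visitor_id"].OutputString()
    visits[visitor_id] = visits.get(visitor_id, 0) + 1
    return visitor_id, set_cookie


first_id, header = handle_request(None)                    # first visit, gets tagged
same_id, _ = handle_request("visitor_id=" + first_id)      # recognised on return
new_id, _ = handle_request(None)                           # cookies deleted: a stranger again
print(first_id == same_id, first_id == new_id)
```

The last call shows the weakness described above: once the cookie is deleted, the server sees a brand-new visitor and the history attached to the old tag is orphaned.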
Figure 2, some measurements of fingerprints, source: https://amiunique.org/fp
All the measurements in Figure 2 serve to determine how unique the user is. The Canvas and WebGL measurements are worth mentioning. When a 2D or 3D picture is drawn on different operating systems and browsers, on PC or mobile phone, the generated image content is not exactly the same, even if it looks the same to our eyes. So by extracting the picture information of Canvas and WebGL, and combining it with the other measurements, we can often uniquely identify and track the user.
Deblackboxing the digital fingerprint
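How a set of measurements becomes a single identifier can be sketched as a hash over the collected attributes. The attribute names and values below are hypothetical, and `canvas_hash` is assumed to be a digest of the rendered Canvas pixels that was computed on the client; Python cannot render a browser canvas itself.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of device measurements into one stable identifier."""
    # Sorting the keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical devices that look identical except for how they render a canvas.
device_a = {
    "user_agent": "Mozilla/5.0 (example)",
    "screen": "1170x2532",
    "timezone": "UTC+8",
    "language": "zh-CN",
    "canvas_hash": "a3f1",
}
device_b = dict(device_a, canvas_hash="a3f2")

print(fingerprint(device_a)[:16])
print(fingerprint(device_b)[:16])
```

Even a one-character difference in a single measurement yields a completely different fingerprint, which is why tiny rendering differences are useful for telling devices apart. No single attribute needs to be unique; it is the combination that narrows down the device.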
Figure 3, Generic methodology of digital fingerprints, source: (Baldini & Steri, 2017)
Meanwhile, we can also deblackbox the digital fingerprint by following the browser’s fingerprint recognition process.
Figure 4, Browser fingerprint recognition process
Looking at the two figures together, digital fingerprint recognition involves three entities: the mobile phone on the client side (corresponding to the Browser), the app on the server side (corresponding to the Website), and the database (SQL). In fact, fingerprints of mobile phones can use, in addition to measurements similar to those of browser fingerprints (such as device information and user configuration), many measurements of the phone’s hardware components. But all the data needs to be digitized before proceeding to the next step. Therefore, apps usually rely on digital information that can be obtained directly for identification.
Back to the original question. When user portraits are enriched, they include not only behavioral data but also interpersonal relationship data and data about the relationships between your devices (PC, phone and so on) and your accounts. For example, if you shared a shopping link with a friend a long time ago, your user portrait and your friend’s user portrait will be considered related. So when you discuss a topic with your friend, your friend may have left data about the topic online. The recommendation system can then draw on the relationship between you and your friend, as well as other data such as location or being on the same local area network. It is therefore plausible that, after the discussion, the recommendation system recommends the relevant content to your friend and, at the same time, to you.
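The three-entity process (client, server, database) can be sketched with Python’s built-in `sqlite3` module. The table layout, the `recognize` function and the phone attributes are assumptions made for illustration, and the sketch matches fingerprints exactly, which is a simplification.

```python
import hashlib
import json
import sqlite3

# The database entity: one row per known device.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE devices (device_id INTEGER PRIMARY KEY, fp TEXT UNIQUE, visits INTEGER)"
)


def recognize(attributes):
    """Server side: hash the client's measurements and look the device up.

    Returns (device_id, is_new_device).
    """
    canonical = json.dumps(attributes, sort_keys=True)
    fp = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    row = db.execute("SELECT device_id FROM devices WHERE fp = ?", (fp,)).fetchone()
    if row is None:
        cursor = db.execute("INSERT INTO devices (fp, visits) VALUES (?, 1)", (fp,))
        return cursor.lastrowid, True
    db.execute("UPDATE devices SET visits = visits + 1 WHERE fp = ?", (fp,))
    return row[0], False


# The client entity: hypothetical measurements collected on a phone.
phone = {"model": "ExamplePhone 12", "os": "14.4", "screen": "1170x2532", "timezone": "UTC+8"}

print(recognize(phone))  # first contact: stored as a new device
print(recognize(phone))  # later contact, no login, no cookie: same device_id
```

Because the match is exact, an operating system update that changes one attribute would make the phone look like a new device. A practical system would need some tolerance for such changes, which this sketch does not attempt.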
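The reasoning above can be made concrete with a small sketch. The link signals, their weights, the threshold and the topics are all invented for illustration; the sketch only shows how a topic could reach your recommendations through a related portrait without you ever mentioning it to the app.

```python
# Hypothetical weight of each signal that links two portraits (assumed values).
LINK_WEIGHTS = {"shared_link": 0.4, "same_wifi": 0.3, "co_located": 0.3}

# Hypothetical signals observed between pairs of users.
link_signals = {
    frozenset({"you", "friend"}): {"shared_link", "same_wifi"},
    frozenset({"you", "coworker"}): {"co_located"},
}

# Topics each person recently searched for or clicked on (you: nothing relevant).
recent_topics = {
    "you": set(),
    "friend": {"camping tent"},
    "coworker": {"office chair"},
}


def link_strength(a, b):
    """Sum the weights of all signals connecting two portraits."""
    signals = link_signals.get(frozenset({a, b}), set())
    return sum(LINK_WEIGHTS[s] for s in signals)


def borrowed_topics(user, threshold=0.5):
    """Topics taken from strongly linked portraits that the user has not touched."""
    topics = set()
    for other, their_topics in recent_topics.items():
        if other != user and link_strength(user, other) >= threshold:
            topics |= their_topics - recent_topics[user]
    return sorted(topics)


print(borrowed_topics("you"))
```

In this toy data only the friend is linked strongly enough, so the friend’s recent topic is the one that spills over into your recommendations, while the coworker’s does not.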
Discussion of data privacy and sharing personal data
In this article, we asked a question based on an everyday phenomenon: whether a mobile app can make recommendations so accurate that they feel like monitoring. First, we introduced the basic composition and operation of the recommendation system and concluded that it cannot give recommendation results out of nothing. It needs various data as input and processes that data according to algorithms designed by humans, so the results must be related to the input data. From the perspective of data sources, we deblackboxed the process of digital fingerprinting and argued that data sharing between apps in different fields, together with user tracking techniques, can enrich user portraits and make accurate recommendations possible. Finally, the article expresses concern about the impact of digital fingerprints on data privacy and considers that data privacy in the mobile phone field needs more research and corresponding restrictive measures.
Baldini, G., & Steri, G. (2017). A Survey of Techniques for the Identification of Mobile Phones Using the Physical Fingerprints of the Built-In Components. 19(3), 29.
Eckersley, P. (n.d.). How Unique Is Your Web Browser? 19.
Klein, A., & Pinkas, B. (n.d.). DNS Cache-Based User Tracking. 15.
Laperdrix, P., Rudametkin, W., & Baudry, B. (n.d.). Beauty and the Beast: Diverting modern web browsers to build unique browser fingerprints. 18.
Zheng, T., Zhang, X., Qin, Z., Li, B., Liu, X., & Ren, K. (n.d.). Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer. 18.
Cover Your Tracks. (n.d.). Retrieved May 13, 2021, from https://coveryourtracks.eff.org/learn
Anand, S. A., & Saxena, N. (n.d.). Speechless: Analyzing the Threat to Speech Privacy from Smartphone Motion Sensors. 18.
FP-STALKER: Tracking Browser Fingerprint Evolutions. (n.d.). 14.
Das, A., Borisov, N., & Chou, E. (n.d.). Every Move You Make: Exploring Practical Issues in Smartphone Motion Sensor Fingerprinting and Countermeasures. 21.
Hauk, C. (2021, January 14). Browser Fingerprinting: What Is It and What Should You Do About It? Pixel Privacy. https://pixelprivacy.com/resources/browser-fingerprinting/