Author Archives: Hao Guo

This is why you can’t find Google Cloud Computing Monopoly on Google

Cloud computing is not a single technology but an operational model that pools networked computing resources. Originally, the cloud was a way to hide the complexity of the connections between end-to-end data links. Now the term applies to many kinds of server-side resources, such as any type of physical electronic device (What Is Cloud Computing? 2019). The fundamental architecture of cloud computing falls into three service models: SaaS, PaaS, and IaaS. 1) Software as a Service (SaaS) is an on-demand service mainly for end users, who do not need to install the application but can access it through a website and interact with multiple users at the same time. Products like Google Docs and Microsoft 365 are typical SaaS services: they are cheap to work with, and their computing resources are managed by the vendor. The benefit is that there is no platform limitation. Users can use the service from anywhere at any time and collaborate with others simultaneously without technical barriers, because the vendor takes care of them. The downside is that the most convenient functions of a SaaS service depend heavily on internet access. That is to say, without the internet, Google Docs is like any static note. 2) Platform as a Service (PaaS) benefits developers by providing a programmable, operational system where they can build and run their programs without worrying about the underlying framework. Users are responsible for managing their data and application resources; the vendor takes care of the rest. 3) Infrastructure as a Service (IaaS) is designed for administrators; it is highly configurable but also complex to operate. Data storage, virtualization, servers, and networking are all the vendor’s responsibility (Cloud Computing Services Models – IaaS PaaS SaaS Explained, 2017). The layers through which the cloud works are: (Cloud Service Provider)-(Router)-(Network Cloud)-(Router)-(Cloud User/Host).

It is funny how some authors think the existence of supercomputers or cloud computers takes us back to the mainframe age. The invention of personal electronic devices was a milestone in regaining control over our private data. But since cloud computing became popular, our data has gone back to being centralized (Frantsvog et al., 2021). For a post about cloud computing monopoly, there is no better example than Google. Think about how many fields Google’s products and services cover, and how many antitrust lawsuits the company has to face each year. I won’t list them all, but I will post a screenshot to make my point; remember that even that is not all there is (Browse All of Google’s Products & Services – Google, n.d.).

Another funny thing is that when I “googled” ‘Google cloud computing monopoly’, barely anything showed up, not even in Google Scholar. And as you may have noticed, Google is now launching its own payment method and e-shopping market. All the data collected from us is finally being put to use. By analyzing our frequently visited locations, routes, payment history, purchasing habits, and search history, along with stored personal information such as the passwords for each website, Google probably knows us better than we know ourselves. The convenience comes with intrusions on, and leaks of, private information. There have been cases where Google users lost access to their own data. Holding over 90% of the internet search market, Google provides the best free services and uses them to collect, analyze, and trade data for more profit (Vellante, 2020). Another downside of a cloud computing monopoly is that governments also have difficulty accessing this data from the third party under the current CNDA regulation (Snapp, 2021). Who gets access to the data becomes the biggest concern. And the lack of transparency turns this supposedly open data into a Blackbox where no one knows exactly what happens inside (Moss, 2020).



Browse All of Google’s Products & Services – Google. (n.d.). Google.

Cloud Computing Services Models – IaaS PaaS SaaS Explained. (2017, April 6). [Video]. YouTube.

Frantsvog, D., Seymour, T., & John, F. (2021). View of Cloud Computing. The Clute Institute.

Moss, S. (2020, October 7). House reports on tech monopolies: Here’s what it says about Amazon Web Services. DCD.

Snapp, S. (2021, April 7). What To Do About the Extreme Monopoly Implications of Hyperscale Public Cloud Providers. Brightwork Research & Analysis.

Vellante, D. (2020, November 3). Google’s antitrust play: Get your head out of your ads – and double down on cloud and edge. SiliconANGLE.

What is Cloud Computing? (2019, December 1). [Video]. YouTube.

Wikipedia contributors. (2021, May 7). Cloud computing. Wikipedia.

De-Blackboxing AI/ML in Credit Risk Management

Hao Guo


This essay discusses applications of Artificial Intelligence (AI) in the financial field, specifically in the banking sector’s credit risk management. It starts with a short description of fundamental AI and financial background knowledge, followed by the shortcomings of traditional banking, and then emphasizes the importance of applying AI techniques in credit risk assessment along with its positive consequences. The major component focuses on introducing the different ML models applied in credit risk management and then de-Blackboxes the AI technique by analyzing each ML model in detail. The example of the Zen Risk toolkit developed by Deloitte helps visualize the de-Blackboxing process through a real-life case. The essay closes by listing current concerns and potential risks of applying AI-generated models in the financial field, to provide a more comprehensive analysis.


In the era of data explosion, artificial intelligence (AI), a tool that is great at processing massive amounts of data in limited time, has been used in various fields to save manpower and material resources. Credit risk management is a part of the financial field composed mainly of data and models, so the integration of artificial intelligence technology has become an inevitable trend. AI-generated models can improve efficiency and increase productivity while reducing operating costs. AI and ML are game-changers in risk management because they can properly address the risks facing financial institutions that handle a large volume of complex information every day (Trust, 2020). Even though AI has many promising applications in the financial sector, only 32% of financial services firms have applied it, mainly for data prediction, financial product recommendations, and voice recognition. For now, the most common AI usage in the financial industry is chatbots that provide simple financial guidance to customers. A more complex application of AI in banking is identifying and eliminating fraud. However, the real potential market lies in risk management, which is closely tied to financial institutions’ revenue (Archer Software, 2021). To help financial institutions, especially banks, stay competitive in the market, make more profit, and provide more comprehensive services for customers, de-Blackboxing AI-generated models becomes a priority. How will AI/ML models transform the banking system and further refine credit risk management? What is the Blackbox in this particular process? How do people interpret it? What are the benefits and drawbacks of this technique? Those are the questions waiting to be addressed.

  1. AI usage in Financial systems

1.1 Artificial Intelligence

Artificial Intelligence is the technology that allows a machine to mimic the decision-making and problem-solving abilities of human beings by processing massive amounts of data with rapid, iterative algorithms and eventually finding patterns in it automatically. In short, as Alan Turing described it, AI is “a system that acts like humans.” The most established applications we encounter today, like Siri and Amazon’s recommendation system, are ANI (Artificial Narrow Intelligence), which differs from AGI (Artificial General Intelligence) because it specializes in one particular field instead of mastering a variety of sectors (Artificial Intelligence (AI), 2021).

1.2 Financial Systems

A financial system is a network of financial institutions, such as insurance companies, stock exchanges, and investment banks, that allows organizations and individuals to transfer capital (Corporate Finance Institute, 2021). With the data explosion, processing financial records has become more complex and time-consuming than ever before. The most striking feature of the past financial system is its high dependency on human ingenuity (Part of the Future of Financial Services series from the World Economic Forum and Deloitte, 2018).

1.3 The Application of AI in Financial Service Spectrums

The introduction of AI brought sky-high efficiency to the financial system. According to the CB Insights report (2018), over 100 companies indicated that applying AI improved their performance in many respects. Figure 1 maps out some of these companies and the areas where they apply AI. Most AI financial services fall into nine categories: 1) Credit Scoring / Direct Lending 2) Assistants / Personal Finance 3) Quantitative & Asset Management 4) Insurance 5) Market Research / Sentiment Analysis 6) Debt Collection 7) Business Finance & Expense Reporting 8) General Purpose / Predictive Analytics 9) Regulatory, Compliance, & Fraud Detection. Artificial Intelligence reaches across nearly the entire financial service spectrum. Credit risk management, categorized under Credit Scoring / Direct Lending, is the top priority among these many AI application areas.

Figure 1. The AI in Fintech Market Map, 2017

  2. AI in Banking Majors

2.1 Banks Income Structure

The core of the banking business model is lending. Banks earn money from lending instruments and customer-facing activities. Their sources of profit include customer deposits, mortgages, personal loans, lines of credit, bank fees, interbank lending, and currency trading (Survivor, 2021). Most financial services in the banking sector are tied to lending, which relies heavily on the credit of the obligor. Back in 2018, 29% of customers said they preferred using credit cards for daily spending (Schroer, 2021). As of 2020, 44% of U.S. consumers carried mortgages, and the number is growing steadily at a rate of 2% annually (Stolba, 2021). Borrowers’ failure to repay can drive banks into bankruptcy.

2.2 The Shortcoming of Traditional Lending Assessment

The audit process for issuing a loan requires a lot of manpower because customer files are often crowded with noise. The slightest misjudgment can result in a wrong decision, causing lost profit and harming the borrower’s interests. For individual borrowers, a risk profile can affect their lives to a great extent: for example, whether they can drive and live safely, their chances of getting an education, and their chances of receiving medical treatment. For business borrowers, the risk picture is more complex because their data spans a variety of parameters, so generating a holistic risk profile takes longer and costs more manpower and material resources. Credit risk can affect a borrower’s financial status, cost the lender its capital, and damage both parties’ reputations (PyData, 2017). Suboptimal underwriting, inaccurate portfolio monitoring methodologies, and inefficient collection models can aggravate these lending problems (Bajaj, n.d.).

Figure 2. Overview of common steps in the lending process ([Graph], n.d.)

2.3 The Importance of Applying AI in Lending Assessment

Processing a large number of credit assessments in limited time is the first problem for banks to solve. Because credit information exists as dynamic data, AI has full room to work, since its prominent feature is interpreting massive data in a short time with near-perfect accuracy (Bajaj, n.d.). AI helps banks streamline and optimize credit decisions across a wider range and transform noisy information into quantitative data that better portrays consumers’ risk profiles. AI-generated WAP (mobile banking application) can help banks understand their borrowers’ financial conditions more deeply, but with more privacy, by monitoring users’ financial behavior. AI-driven assessment can better analyze borrowers’ banking data, track their financial activities, avoid issuing risky loans, and reduce the possibility of credit fraud (Use Cases of AI in the Banking Sector, 2021). AI is replacing many financial positions, such as data science analysts and FRMs (financial risk managers), by providing safer, smarter, and more effective financial services to consumers (Schroer, 2021).

Figure 3. Machine learning surfaces insights within large, complex data sets, enabling more accurate risk (McKinney, n.d.)

  3. AI in Credit Risk Management

In the video (RISKROBOT TM – Explainable AI Automation in Credit Risk Management – SPIN Analytics Copyright 2020, 2018), the presenter introduces RISKROBOT as a classic example of an AI technique for computing credit risk and gives a cursory description of the steps the AI needs to portray a consumer’s credit risk profile and generate a report. In another presentation published by PyData (2017), the speaker takes ZOPA as an example to introduce the credit risk management process built on their ML technique. Comparing the two, it is quite obvious that the fundamental procedures for applying AI techniques in credit risk management are relatively similar.

3.1 AI Decision-Making’s Involvement in Credit Risk Management

Unlike human-staffed banks, which take days or even weeks to evaluate and process loan formalities, AI-driven banks provide extensively automated, nearly real-time services for individual borrowers and SME lending. While following local data-sharing regulations, AI-assisted banks generate more accurate assessments by evaluating clients’ data more quickly and extensively, drawing on both traditional sources, such as bank transaction activity, FICO scores, and tax return histories, and new sources, such as general location data and utility information. AI’s involvement in credit risk decision-making observes clients from more sophisticated perspectives and decreases the possibility of offering a risky loan by screening out potential fraudsters (Agarwal et al., 2021).

Credit Qualification

Instead of using a rule-based linear regression model, AI-driven banks build complex models that analyze both structured and unstructured data, collected from users’ browsing histories and social media, to perform an objective and comprehensive analysis of individuals and SMEs that lack official credit records or authentic credit reports. While the ML quantitative model is being built and refined, customers with significant loan risk characteristics are automatically filtered out by early algorithms. Potential defaulters with unstable financial profiles require manual verification in the early stage; the ML model later learns to handle them by categorizing more comparable cases during its self-auditing process (Agarwal et al., 2021).

Limit Assessment and Pricing

AI/ML techniques allow banks to analyze a borrower’s off-the-record financial condition by applying optical character recognition (OCR) to extract data from unconventional documents, such as e-commerce expenditures in a customer’s email and their telecom records. The ML model in this step can assess loan applicants’ actual financial capacity, propose a more rational loan amount that does not exceed the borrower’s ability to repay, and then use NLP (natural language processing) to determine the repayment interest (Agarwal et al., 2021).

Fraud Management

ML models are also effective at detecting the five costliest types of fraud: 1) identity theft, 2) employee fraud, 3) third-party or partner fraud, 4) customer fraud, and 5) payment fraud such as money laundering (Agarwal et al., 2021). A Chinese bank, Ping An, applies facial recognition to gauge how trustworthy borrowers’ financial statements are. Its AI-driven facial recognition mobile software can detect and process 54 subtle expressions in 1/15 to 1/25 of a second by tracing eye movements (Weinland, 2018).

Figure 4. The combination of AI and analytics enhances the onboarding journey for each new customer. (McKinsey & Company, 2021)

3.2 ML Models in Credit Risk Management

Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm often used for classification (Ray, 2020). Following the principle of Structural Risk Minimization (SRM), an SVM finds the hyperplane that best separates two classes, using a linear model in a high-dimensional space. SVM helps analyze credit risk by classifying lending decisions within a rational range (Iyyengar, 2020).
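To make the idea of a separating hyperplane concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss. It is not the method used by any of the cited sources; the two features, the eight toy borrowers, and the hyperparameters are all assumptions chosen for illustration.

```python
# Toy linear SVM trained by sub-gradient descent on the hinge loss.
# Feature names, data points, and hyperparameters are hypothetical.

def train_linear_svm(samples, labels, epochs=2000, learning_rate=0.01, reg=0.01):
    """Return weights w and bias b of a separating hyperplane w.x + b = 0."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Correct with enough margin: only apply regularisation.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def classify(w, b, x):
    """+1 means 'likely to default', -1 means 'likely to repay'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Each borrower: (debt-to-income ratio, share of late payments), both in 0..1.
borrowers = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.6), (0.75, 0.7),
             (0.2, 0.1), (0.3, 0.2), (0.1, 0.3), (0.25, 0.15)]
outcomes = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(borrowers, outcomes)
print(classify(w, b, (0.85, 0.95)), classify(w, b, (0.15, 0.1)))
```

A production system would more likely rely on a tested library implementation and on kernels for non-linear boundaries; the hand-written loop is only meant to show what "finding the hyperplane" involves.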

Figure 5. (Kaggle: Credit Risk (Model: Support Vector Machines), 2020)

Decision Tree (DT)

Decision trees (such as CRT, CHAID, QUEST, and C5.0) make predictions by applying decision rules extracted from data features, forming a tree-like structure that ends in decision nodes corresponding to the input variables. Starting from the top (root) node, the model follows each branch, which represents a specific feature of the borrower, down to the predicted value (the credit risk) (Fenjiro, 2018).
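The root-to-leaf walk described above can be sketched in a few lines. The tree below is written by hand rather than learned from data, and its features, thresholds, and risk labels are hypothetical.

```python
# A hand-written decision tree; in practice the rules would be learned from data.
# Features, thresholds, and labels are hypothetical.
TREE = {
    "feature": "credit_score", "threshold": 600,
    "below": {"label": "high risk"},
    "at_or_above": {
        "feature": "debt_to_income", "threshold": 0.4,
        "below": {"label": "low risk"},
        "at_or_above": {"label": "medium risk"},
    },
}


def predict_risk(node, borrower):
    """Follow one branch per feature test until a leaf (a node with a label)."""
    while "label" not in node:
        side = "below" if borrower[node["feature"]] < node["threshold"] else "at_or_above"
        node = node[side]
    return node["label"]


print(predict_risk(TREE, {"credit_score": 720, "debt_to_income": 0.25}))
```

Because every prediction is just a path of readable comparisons, this kind of model is the easiest of the three to explain to a borrower.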

Figure 6. Decision Tree in loaning approval case, 2018


Neural Networks (NN)

A neural network is a processor that simulates the activity of the human brain to collect detected information and store knowledge. It has three major kinds of layers: the input layer, the hidden layer, and the output layer. Other ML models, like MA (Metaheuristic Algorithms), are also suitable for credit risk analysis. However, which one to apply depends on the situation at hand, based on the misclassification level, the accuracy of the algorithms, and the computational time (Iyyengar, 2020).
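The three layers can be illustrated with a forward pass through a tiny network. The weights below are fixed by hand purely for illustration (a real network would learn them from data), and the three input features are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(weights, biases)]


# Hypothetical hand-picked weights: 3 inputs -> 2 hidden units -> 1 output.
HIDDEN_WEIGHTS = [[2.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [[3.0, -3.0]]
OUTPUT_BIASES = [0.0]


def default_probability(features):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(features, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


# Inputs: [debt-to-income ratio, share of late payments, scaled income].
risky = default_probability([0.9, 0.8, 0.2])
safe = default_probability([0.1, 0.0, 0.9])
print(round(risky, 2), round(safe, 2))
```

The hidden-layer values have no direct business meaning, which is one reason neural networks are harder to explain than decision trees.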

Figure 7. The neural network layers for credit risk evaluation

3.3 De-Blackbox AI in Credit Risk Management

Case study of Zen Risk

The AI techniques used in credit risk management are a double-edged sword: they perform extremely efficiently, but the process is not transparent enough for either lenders or borrowers to get to the bottom of it. Deloitte designed a de-Blackboxing tool specifically to demystify the AI-driven credit risk assessment process. The platform, called Zen Risk, aims to help clients access, compare, and study the most modern ML models so that they can better understand and analyze the AI techniques applied in credit risk management and make more accurate predictions. As a de-Blackboxing toolkit, Zen Risk promises its clients complete transparency, an auditable process, and clear output. The Zen Risk case study opens the Blackbox from the perspectives of the applied models and features, along with general outcome explanations and individual forecast explanations (Phaure & Robin, 2020).

It starts with an advanced data pre-processing stage, where data filtering, classification, cleansing, and outlier identification happen. Clients with prepared data sources can then determine the best-matching ML model to use (like the NN, DT, SVM, and MA mentioned above). Zen Risk visualizes the model selection process so users can better understand what happens when different ML models process the data. The solution can be a single model or, when a more comprehensive investigation is needed, a hybrid of models. Straightforward solutions generally rely on models like boosting (such as LightGBM) and neural networks. Heterogeneous classifiers, individual classifiers, and homogeneous classifiers are the most common methods used in this stage. In complex situations, applying explanation methods (like LIME and SHAP) is necessary to examine how the model interacts with the data in more depth (Phaure & Robin, 2020).

Taking the tree-like model as an example, the algorithm run in the first stage can present the importance of each feature by assessing its quantitative value. The computation captures the impact that manipulating a variable has on the model evaluation metric: the more the model quality decreases when the variable is altered, the greater that variable’s importance and influence. For example, if value = 600 is the threshold for rejecting a loan, then the credit amount and age features are much more strongly correlated with whether a loan is approved than the loan purpose feature is (Phaure & Robin, 2020).
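The idea of measuring importance by the drop in model quality when a variable is manipulated is commonly implemented as permutation importance. The sketch below is one possible reading of that description, not Zen Risk’s actual algorithm; the rule-based model, the eight borrowers, and the thresholds are hypothetical.

```python
import random


def model(row):
    """Hypothetical stand-in for a trained model: 1 = reject, 0 = approve."""
    return 1 if row["credit_amount"] > 8000 or row["age"] < 25 else 0


def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(rows, labels, feature, repeats=20, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(shuffled, labels))
    return sum(drops) / repeats


raw = [(2000, 45, 0), (12000, 30, 1), (3000, 22, 2), (9000, 50, 0),
       (1500, 35, 1), (15000, 23, 2), (4000, 28, 0), (7000, 60, 1)]
rows = [{"credit_amount": a, "age": g, "loan_purpose": p} for a, g, p in raw]
labels = [model(r) for r in rows]  # toy labels taken from the model itself

importance = {f: permutation_importance(rows, labels, f)
              for f in ("credit_amount", "age", "loan_purpose")}
print(importance)
```

Shuffling `loan_purpose` never changes a prediction because the toy model ignores it, so its importance is exactly zero, which mirrors the contrast drawn above between credit amount, age, and loan purpose.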

Figure 8. Deloitte artificial intelligence credit risk

LIME is a local, post-hoc, model-agnostic explanation technique. It explains an individual prediction of a Blackbox ML model in credit risk assessment by means of an approximate, easy-to-interpret model (Misheva, 2021). Unlike tree models, the model behind LIME stays a Blackbox; users can only study what is happening inside through a similar transparent model (like linear regression or a decision tree) fitted around it. The figure below shows LIME applied to an XGBoost model to explain whether the input data (the borrower) is potentially risky or not (Phaure & Robin, 2020).
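The local-surrogate idea can be sketched without any library: perturb the borrower’s features, query the Blackbox, weight each sample by its closeness to the borrower, and fit a weighted linear model. This is a hand-rolled illustration of the principle, not the `lime` package; the `black_box` function, the two features, and every parameter value are assumptions.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque model: default probability from two scaled features."""
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] - 2.0 * x[1])))


def solve(a, b):
    """Solve the linear system a.z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def local_surrogate(instance, n_samples=500, spread=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = random.Random(seed)
    k = len(instance) + 1  # intercept plus one coefficient per feature
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for _ in range(n_samples):
        sample = [v + rng.gauss(0.0, spread) for v in instance]
        dist2 = sum((s - v) ** 2 for s, v in zip(sample, instance))
        weight = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        row = [1.0] + sample
        y = black_box(sample)
        for i in range(k):
            xtwy[i] += weight * row[i] * y
            for j in range(k):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve(xtwx, xtwy)


# Borrower described by (debt-to-income ratio, scaled income).
intercept, coef_dti, coef_income = local_surrogate([0.5, 0.5])
print(round(coef_dti, 2), round(coef_income, 2))
```

The signs of the fitted coefficients are the explanation: a positive coefficient pushes this particular borrower’s prediction toward default, a negative one pulls it away.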

Figure 9. Deloitte artificial intelligence credit risk LIME model presented by XGBOOST Model 

The most promising method is Shapley value analysis (SHAP), which calculates how much each feature contributes to an individual prediction, something that is hard to accomplish with a simple linear model. Unlike LIME, which presents various factors for an individual, SHAP presents a unique value for each feature that gives a direct answer for a specific individual. In the SHAP function, f represents the full set of features and i represents the added feature. The next figure shows the SHAP result for a borrower who is too risky to be granted a loan (Phaure & Robin, 2020).
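For a handful of features, Shapley values can be computed exactly by averaging each feature’s marginal contribution over all subsets of the other features. The sketch below does that for a hypothetical three-feature scoring function; the function, the borrower, and the baseline "average" customer are assumptions, and features outside a subset are simply replaced by baseline values, which is one common simplification. The `shap` library uses faster approximations.

```python
from itertools import combinations
from math import factorial

FEATURES = ["debt_to_income", "late_payments", "income"]


def risk_score(x):
    """Hypothetical scoring model with one interaction term."""
    return (0.5 * x["debt_to_income"] + 0.3 * x["late_payments"]
            + 0.2 * x["debt_to_income"] * x["late_payments"]
            - 0.4 * x["income"])


def value(subset, instance, baseline):
    """Model output when only the features in `subset` take the borrower's values."""
    mixed = {f: (instance[f] if f in subset else baseline[f]) for f in FEATURES}
    return risk_score(mixed)


def shapley_values(instance, baseline):
    """phi_i = sum over S of |S|!(n-|S|-1)!/n! * [v(S + i) - v(S)]."""
    n = len(FEATURES)
    phi = {}
    for i in FEATURES:
        others = [f for f in FEATURES if f != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i}, instance, baseline)
                without_i = value(set(subset), instance, baseline)
                total += weight * (with_i - without_i)
        phi[i] = total
    return phi


borrower = {"debt_to_income": 0.8, "late_payments": 1.0, "income": 0.2}
average = {"debt_to_income": 0.3, "late_payments": 0.0, "income": 0.5}
phi = shapley_values(borrower, average)
print(phi)
```

The contributions add up to the difference between the borrower’s score and the baseline score, which is what lets SHAP give one definite number per feature for a specific individual.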

  4. Concerns

Non-transparency and Data Bias

32% of financial institutions express fears about applying AI techniques and ML models in credit risk assessment. AI-driven models are accurate in their final outputs (loan decisions), but the complex calculation turns the entire reasoning process into a Blackbox that is hard to decipher. It is difficult for financial institutions to explain to borrowers why they did not qualify beyond giving a numerical result, and equally hard for financial staff to report to their superiors why the models produced these predictions (scores) (Kerr-Southin, 2021). The non-transparency of AI-generated models makes it even harder to detect and correct data bias, which could deepen discrimination.


The algorithms in an AI-generated credit risk assessment model are programmed by human beings, and the programmer’s proficiency directly affects the model’s performance. Model risk can severely harm a financial institution because the loss is often too large to recover. Seemingly minor mistakes, such as hiring inexperienced modelers and operators, skipping back-testing, or operational problems in the model, could result in irretrievable damage. One large US bank lost $6 billion due to value-at-risk model risk. Because of regulations protecting customer and company privacy, such modeling failures keep piling up but cannot be studied publicly, which blocks learning from past experience. Constant trial and error has become the only effective solution at present (McKinsey & Company et al., 2015).


A detailed analysis of several common AI lending risk models in the financial field shows that, because large amounts of complex data are involved, machines under the correct model control can reach conclusions faster and more accurately than ever, but it becomes harder for humans to explain the reasons behind them. AI/ML in credit risk management frankly cannot be de-Blackboxed all the way to the bottom; we can only analyze individual models to understand more. It is worth noting that although this technology has brought many benefits to financial organizations and individuals, such as savings in manpower, materials, and time, there are also many hidden concerns and potential risks, like the deepening of discrimination caused by its lack of transparency.


  1. Agarwal, A., Singhal, C., & Thomas, R. (2021, March). AI-powered decision making for the bank of the future. McKinsey & Company.
  2. Archer Software. (2021, January 18). How AI is changing the risk management? Cprime | Archer.
  3. Artificial Intelligence (AI). (2021, May 4). IBM.
  4. Bajaj, S. (n.d.). AI, machine learning, and the future of credit risk management. Birlasoft.
  5. CB Insights. (2018, July 20). The AI In Fintech Market Map: 100+ Companies Using AI Algorithms To Improve The Fin Services Industry. CB Insights Research.
  6. Corporate Finance Institute. (2021, January 27). Financial System.
  7. Deloitte France. (2018, December 11). Zen Risk [Video]. YouTube.
  8. Fenjiro, Y. (2018, September 7). Machine learning for Banking: Loan approval use case. Medium.
  9. [Graph]. (n.d.). SAS.
  10. Iyyengar, A. (2020, August 18). 40% of Financial Services Use AI for Credit Risk Management. Want to Know Why? Aspire Systems.
  11. Kabari, L. G. (n.d.). The neural network layers for credit risk evaluation [Graph].
  12. Kaggle: Credit risk (Model: Support Vector Machines). (2020). [Graph]. Kaggle.
  13. Kerr-Southin, M. (2021, January 22). How FIs use AI to manage credit risk. Brighterion.
  14. McKinney. (n.d.). Figure 3. Machine learning surfaces insights within large, complex data sets, enabling more accurate risk [Graph].
  15. McKinsey & Company. (2021). Figure 4. The combination of AI and analytics enhances the onboarding journey for each new customer. [Graph].
  16. McKinsey & Company, Härle, P., Havas, A., Kremer, A., Rona, D., & Samandari, H. (2015). The future of bank risk management.
  17. Misheva, B. H. (2021, March 1). Explainable AI in Credit Risk Management. ArXiv.Org.
  18. Part of the Future of Financial Services series from the World Economic Forum and Deloitte. (2018, September 7). The new physics of financial services: How artificial intelligence is transforming the financial ecosystem. Deloitte United Kingdom.
  19. Phaure, H., & Robin, E. (2020, April). deloitte_artificial-intelligence-credit-risk.pdf.
  20. PyData. (2017, June 13). Soledad Galli – Machine Learning in Financial Credit Risk Assessment [Video]. YouTube.
  21. Ray, S. (2020, December 23). Understanding Support Vector Machine(SVM) algorithm from examples (along with code). Analytics Vidhya.
  22. RISKROBOT TM – Explainable AI Automation in Credit Risk Management – SPIN Analytics Copyright 2020. (2018, November 27). [Video]. YouTube.
  23. Schroer, A. (2021, May 8). AI and the Bottom Line: 15 Examples of Artificial Intelligence in Finance. Built In.
  24. Stolba, S. L. (2021, February 15). Mortgage Debt Sees Record Growth Despite Pandemic. Experian.
  25. Survivor, T. W. S. (2021, February 10). How Do Banks Make Money: The Honest Truth. Wall Street Survivor.
  26. The AI In Fintech Market Map. (2017, March 28). [Graph]. CBINSIGHTS.
  27. Trust, D. B. (2020, October 17). Applying AI to Risk Management in Banking and Finance. What’s the latest? Deltec Bank & Trust.
  28. Use Cases of AI in the Banking Sector. (2021, April 21). USM.
  29. Weinland, D. (2018, October 28). Chinese banks start scanning borrowers’ facial movements. Financial Times.

Final Reflection on AI/ML, Information and Data

In recent years, the application of more complex AI has been increasing. However, it is dangerous and unreliable to make decisions based on unexplainable techniques, because people prefer to adopt techniques they can fully trust. Interpreting the machine better can help people make more reasonable predictions, correct the discrimination that exists in the model, and provide a more explainable model. AI shares more concepts with philosophy than perhaps any other scientific technique, because it involves more questions of consciousness (McCarthy, 2012). However, AI is, after all, a set of complex algorithms computed by humans. It is more like a representation of intelligence, or a deep explanation of intelligence, than intelligence itself, and it is not even close to self-awareness.

If I had to sum up what I have learned in this class in one sentence, it is that data and information are used to feed AI and further develop ML, which then output more data and information. The fun fact is that humans train machines so hard that they could replace us in the majority of tasks and leave us to be the boss. However, with the rapid development of AI/ML, less transparency in the technology, and more reliance on it, it will soon become hard to provide actual reliable evidence. We are creating a huge black box that we think we understand because we feed it data that we collected and we are able to analyze the outcome. Humans have become even more confident because we think we fully understand the concepts and the algorithms, since they were built by us. But the problem is, do we? Are the outcomes extracted from AI more accurate and just? Or have we just become used to disguising the truth with so-called absolute accuracy because the information was produced by a machine?

My concerns are fully presented in the documentary “Coded Bias”. In it, a facial recognition system tested by an African American female researcher is unable to accurately recognize people with dark skin, especially women. It is ironic that people are planning to rely entirely on this technique. As Cathy O’Neil says in the video, algorithms use historical data to make predictions about the future. That is even more true when it comes to Deep Learning. It looks like the future is being controlled by the group of people who collect the data and know the code.


Artificial Intelligence (Stanford Encyclopedia of Philosophy). (2018, July 12). Stanford Encyclopedia of Philosophy.

Lipton, Z. C., & Steinhardt, J. (2018, July 27). Troubling Trends in Machine Learning Scholarship.

Coded Bias | Trailer | Independent Lens | PBS. (2021, March 22). Independent Lens.

McCarthy, J. (2012). The Philosophy of AI and the AI of Philosophy. Professor John McCarthy.

Data Overload

Before 2000, things were mainly stored on analog media, like music on tapes and movies on film. Since 2002, as poverty has declined, the demand for efficient sharing has increased dramatically. Because more people have the opportunity to own electronic devices, the digital age has begun. What differentiates big data from ordinary data is the 5 V’s: large volume, more variety, high velocity, veracity, and value. Volume is usually the main factor in determining whether something counts as big data (more than terabytes or petabytes). The term big data was popularized by John Mashey in the 1990s. Currently, big data is used across a wide range of areas. Besides cloud computing, the technologies for analyzing the data include ML and natural language processing, as we mentioned in our previous class. Visualization is also considered an expression of big data. The point of big data techniques is to deal with massive amounts of data that traditional data processing equipment is unable to handle. However, storage has been growing exponentially ever since the 1980s. Even though one-third of our data is text and still images, it is still getting very crowded. Speaking of applications, I personally think the NFT craze of 2021 is a sign of the big data explosion. It may represent a new stage of big data. The claim of ownership is what is really behind NFTs, and if the idea of “digital assets” becomes convincing for the majority, it will inevitably cause internet traffic. What is worse is that most of the space will be occupied by “recreations” (just something that came to my mind).

Every aspect of our life relates to applications of big data. For example, the medical community uses it to give patients personalized healthcare, more specifically by extracting data from a large database to build a detailed mechanistic model for an individual patient. The detailed data makes the treatment more robust and its curative effect more efficient. However, besides data bias, data traffic is another major challenge. A single breast tomosynthesis takes nearly 450 MB, which is comparable to high-resolution commercial photography (besides, a professional photographer takes about 1000 – 2000 photos per shoot, and most of them sit unused in the cloud). Another example of big data usage is the recommendation systems we experience every day on YouTube, Facebook, online shopping sites, and so on. Take Netflix: what surprised me most is not just that it recommends categories that seem to interest you, but that it even switches the cover art to tempt its audience into changing their minds. If a viewer watches a lot of romantic movies, the service will pick a scene of a kissing couple as the cover, even if it’s a horror or action movie. So even if you passed on a movie the first time because you didn’t like the content, it may still convince you to open it by simply changing the cover. And we as the audience fail to notice most of the time. Other examples include Google Maps, dating apps, and governments using big data to keep records and monitor crime rates. In environmental work, official institutions can predict natural disasters five days ahead by observing existing data. Other applications occur in education, marketing, social media, etc.

Question: What’s the usage of big data in blockchain? 


Intro to Big Data: Crash Course Statistics #38. (2018, November 14). [Video]. YouTube.

Viceconti, M., Hunter, P., & Hose, R. (2015). Big Data, Big Knowledge: Big Data for Personalized Healthcare. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 1–2.

What Is Big Data? (2016, March 7). [Video]. YouTube.

Wikipedia contributors. (2021, April 12). Big data. Wikipedia.

Numbers don’t lie?

AI is all about training computers to find patterns in massive amounts of data. As AI gets more involved in our daily life, the logic and ethics behind each algorithm should become more transparent. Technology itself is neither right nor wrong, but its decisions and outcomes are closely related to, or even entirely dependent on, the data provided by its human creators. Since it is so easy for AI to predict and even steer our life-changing decisions (dating apps, housing rentals, debt…), it is important to understand the relationship between the creator and the created (HEWLETT PACKARD ENTERPRISE – Moral Code: The Ethics of AI, 2018, 03:15-05:21). Because AI is an outcome of human action, bias can enter when (Hao, 2020):

  1. Framing the problem
  2. Collecting the data (The data is one-sided/it reflects the bias itself)
  3. Preparing the data (Subjective)

It’s hard to fix bias in AI because, first, by the time it is noticed it is often too late to fix; second, the complex algorithm has already learned what it was taught, so fixing the root doesn’t change the branches.

AI amplifies human bias and deepens existing stereotypes. As in the examples in the source materials, AI predicts that people who live in certain areas are more likely to commit crimes, and those areas happen to be populated mostly by people of color or the poor. AI becomes racist without being told to. Another example is Amazon, whose AI filtered out female candidates because it was trained to hire new employees based on historical hiring data, which was dominated by white males. In this case, AI becomes sexist not by being told in words but by learning from past facts.
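The hiring example can be sketched in a few lines. This is only an illustration under loud assumptions: the records, the group labels, the threshold, and the function names `fit_hire_rates` and `recommend` are all invented here, and a real hiring model is far more complex. The point is only that a rule learned from one-sided history repeats that history.

```python
from collections import Counter

# Hypothetical historical hiring records: (group, was_hired).
# All numbers are invented for illustration only.
history = (
    [("male", True)] * 80 + [("male", False)] * 20
    + [("female", True)] * 5 + [("female", False)] * 15
)

def fit_hire_rates(records):
    """Learn the past hire rate of each group from the records."""
    hired = Counter(group for group, was_hired in records if was_hired)
    total = Counter(group for group, _ in records)
    return {group: hired[group] / total[group] for group in total}

def recommend(group, rates, threshold=0.5):
    """Recommend a candidate only if their group was usually hired before."""
    return rates[group] >= threshold

rates = fit_hire_rates(history)
print(rates)                       # {'male': 0.8, 'female': 0.25}
print(recommend("male", rates))    # True
print(recommend("female", rates))  # False
```

Nobody wrote a rule saying "reject women"; the rejection comes entirely from the one-sided records the rule was fitted on.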

One specific issue is how AI has deepened gender discrimination, especially with the boom in fake-face generation techniques. According to The Verge, face-generating and deepfake face-swapping technologies are mostly used to create fake pornography (Vincent, 2019). As you can imagine, most of the victims are female. And the unavoidable truth is that even when the videos or pictures are known to be fake, what’s out there is already out there. Eliminating the item doesn’t necessarily eliminate its existence. The same goes for politics. Since “it is getting harder to spot a deep fake video”, this ethical issue can only get worse without legal restrictions to follow (It’s Getting Harder to Spot a Deep Fake Video, 2018, 03:15-05:21). One misunderstanding of AI is that, because it is generated by algorithms, we assume it is impersonal and more impartial or rational than humans, because we believe numbers/data don’t lie.

Question: What tools can we use to reduce the AI bias? 


Hao, K. (2020, April 2). This is how AI bias really happens—and why it’s so hard to fix. MIT Technology Review.

HEWLETT PACKARD ENTERPRISE – Moral Code: The Ethics of AI. (2018, June 29). [Video]. YouTube.

It’s Getting Harder to Spot a Deep Fake Video. (2018, September 27). [Video]. YouTube.

Vincent, J. (2019, February 15). uses AI to generate endless fake faces. The Verge.

How Siri Works

Natalie Guo

Q1: What is Siri?

A1: Siri is an Intelligent Virtual Assistant (IVA) or Intelligent Personal Assistant (IPA), or a chatbot in common words.

Q2: What does Siri do?

A2: Siri can perform phone actions and act as a natural language interface based on voice commands. It can also carry out remote instructions or work together with third-party apps to better satisfy users’ needs.

Q3: What techniques does Siri need to accomplish the tasks above?

A3: Speech Recognition Engine + Advanced ML tech + Convolutional Neural Network + Long Short-term Memory + Knowledge Navigator + text-to-speech voice based on deep learning technology.

Siri doesn’t “recognize our voice or understand our commands”; it translates the input into digital data/text that it can process and match against its database. One possible approach is to “recognize” the input as pieces of sound waves. Each cluster of waves may represent a specific word, and when those clusters are combined, they form a sentence. Inside the black box, a huge database collects a massive number of “voice wave” samples that let Siri learn which cluster represents which natural language meaning (How does Siri work?, 2011).

Then the algorithm behind Siri, Natural Language Processing driven by ML techniques, takes over (please correct me if I’m wrong). Siri was made to pick up keywords and important phrases. During the text-to-speech process, a tool called Praat, which is used at Nuance, can take the waveform, turn it into a spectrogram, and create phonetic labels (which identify the vowels), stress labels, and pitch labels, and these further decide which parts get selected for the spoken response.

The article Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant explains this in detail. First, the DNN-powered (Deep Neural Network) voice trigger keeps listening on the device so it can hear the user say “Hey Siri” at any moment; then it computes a confidence score to decide whether you actually want to wake Siri up. Siri uses two stages: one for detection and the other for checking.
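The detect-then-check idea can be sketched as follows. This is a minimal illustration, not Apple's implementation: the function name `two_pass_trigger`, both threshold values, and the idea of passing the second scorer in as a callable are assumptions made for the example.

```python
def two_pass_trigger(first_score, second_score_fn,
                     first_threshold=0.6, second_threshold=0.85):
    """Wake up only if a cheap detector and a stricter checker both agree.

    first_score:      confidence from the small always-on detector (0..1)
    second_score_fn:  callable that runs the larger checker on demand
    Both thresholds are made-up values for illustration.
    """
    if first_score < first_threshold:
        # The cheap detector rejects, so the expensive checker never runs.
        return False
    return second_score_fn() >= second_threshold

# Usage with stand-in scores instead of real audio models.
print(two_pass_trigger(0.3, lambda: 0.99))  # False: rejected in the first pass
print(two_pass_trigger(0.7, lambda: 0.90))  # True: both passes agree
print(two_pass_trigger(0.7, lambda: 0.50))  # False: the checker disagrees
```

Running the expensive check only after the cheap one fires is what lets a device listen all the time without spending much power.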

Question: Siri seems to have trouble when the user suddenly changes the command, or when it has to punctuate a run-on sentence in which several subjects occur. Why does this happen, and what do we need to work on to make it better?


Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant. (n.d.). Apple Machine Learning Research. Retrieved March 15, 2021, from

How does Siri work ? (2011, December 20). [Video]. YouTube.

Inside Nuance: the art and science of how Siri speaks. (2013, September 17). [Video]. YouTube.

This Is The Algorithm That Lets Siri Understand Your Questions | Mach | NBC News. (2017, June 28). [Video]. YouTube.

NLP and Google Translate

NLP is an interdisciplinary field that combines knowledge from linguistics, computer science, and artificial intelligence. NLP has two main categories. The first is Natural Language Understanding, which covers how a computer can pull meaningful information out of a cluster of words and categorize it, like Gmail recognizing spam emails and moving them to the spam folder for us. The second is Natural Language Generation, which is more complicated since it is also responsible for understanding the context and extracting the key information from it. The purpose is to make the machine understand human natural language, so that it can build a connection that allows interaction between humans and machines. Several methods have been used to achieve this goal. Morphology: this allows the computer to learn from different word roots and categorize words based on shared similarities. Distributional semantics: learning the meaning of a word from how often it appears with other words in the same sentence or context. There are also encoder-decoder models, where a computer encodes a sentence into an internal representation and then decodes that representation into an output. However, since there are too many details for the machine to remember, RNNs (recurrent neural networks) came into use, where a computer learns just a representation of each word and uses it to make predictions. For example, whenever I use Google Docs and repeatedly write about one subject, it automatically offers suggestions, usually phrases or terms, that are highly relevant to the topic I’m writing about. Right now, since I keep talking about AI-related things, Google Docs gives me suggestions like “computer science”, “machine learning”, and “natural language processing” whenever I type something starting with “computer”, “machine”, or “natural”, even when that is not what I intended to type.
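The suggestion behavior can be imitated with plain word counts. This is a deliberately simple stand-in, not how Google Docs works and not an RNN: it just counts which word most often follows another in a tiny invented corpus. The names `train_bigrams` and `suggest` and the corpus itself are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count which word follows which in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def suggest(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = following.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Invented mini-corpus standing in for "what I have been writing about".
corpus = [
    "machine learning finds patterns",
    "machine learning needs data",
    "machine translation needs data",
    "natural language processing needs data",
]
model = train_bigrams(corpus)
print(suggest(model, "machine"))  # learning
print(suggest(model, "natural"))  # language
print(suggest(model, "giraffe"))  # None
```

Even this crude counter shows the effect described above: the more often a phrase appears in the text it has seen, the more eagerly it is suggested.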

Since the computer can’t understand human natural languages, it is necessary to transform the characters into a language the computer can understand, for example, numbers like 0 and 1. Phrase structure rules are the grammar for computers. A parse tree tags every word of a sentence and reveals how the sentence is structured, which allows the computer to access, process, and respond to the information more efficiently. How does Google Translate work exactly? We cannot use word-for-word translation, because natural language needs a reasonable word order to be understood. So the first step is to encode the natural language into numbers that computers can operate on, using what is called a vector mapper. Then, to get the language we want, the broken-up sentence has to be assembled into a whole again: we simply reverse the process, from the recurrent neural network back to the vector mapper. Throughout the process, the neural network allows the computer to learn patterns from a massive number of real examples based on actual conversations and writing.
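The "words to numbers and back" step can be sketched like this. It is a simplification under stated assumptions: real translation systems map each word to a learned vector of many numbers, while this example assigns a single integer ID per word. The names `build_vocab`, `encode`, and `decode` are invented for the illustration.

```python
def build_vocab(sentences):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into the list of numbers the computer works on."""
    return [vocab[word] for word in sentence.lower().split()]

def decode(ids, vocab):
    """Reverse the mapping: turn numbers back into words."""
    inverse = {index: word for word, index in vocab.items()}
    return " ".join(inverse[index] for index in ids)

vocab = build_vocab(["the cat sat", "the dog sat"])
print(vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog sat", vocab))   # [0, 3, 2]
print(decode([0, 1, 2], vocab))       # the cat sat
```

The order of the numbers carries the word order, which is why translating word by word and ignoring the sequence loses information.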

Questions:  How does the computer “understand” the grammar exactly? Or in fact, it doesn’t, it just finds the pattern after all. 



CS Dojo Community. (2019, February 14). How Google Translate Works – The Machine Learning Algorithm Explained! YouTube.

CrashCourse. 2017. Natural Language Processing: Crash Course Computer Science #36

CrashCourse. 2017. Machine Learning & Artificial Intelligence: Crash Course Computer Science #34

“Machine Translation.” 2021. In Wikipedia

“Natural Language Processing.” 2021. In Wikipedia

A Neural Network for Machine Translation, at Production Scale. (2016, September 27). Google AI Blog.


ML to the Trolley Problem

In Karpathy’s article, pattern recognition is presented as a demonstration of ML. Karpathy introduces how a Convolutional Neural Net can distinguish good and bad profiles by applying filters over and over again. Based on my understanding, the whole process lets the computer extract common features from a massive amount of information and then use those common features as a standard for telling good from bad. According to the video embedded in the article, each filter has a different purpose. Some identify faces; others are designed to capture clothing, as Karpathy said, like teaching a child to identify common figures. Another thing I noticed is the binary function it uses. I was surprised that a simple binary function could handle such a complicated process. If I understand the material correctly, Karpathy might use the binary function to let the machine “know” whether a profile is good or bad by breaking the characteristics down into questions like “Does the picture have one face or more?”, “Is it male or female?”, and “Does the face occupy a large portion of the picture or not?” The machine then makes its decision based on the answers to these questions. That’s why Karpathy lists, at the end, what makes a profile good or bad, the process of classification, and the identified features.
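To make "applying a filter" concrete, here is a minimal sketch of sliding one small filter over a tiny grayscale image. The image, the filter values, and the function name `apply_filter` are all invented for illustration; a real convolutional network learns thousands of such filters and stacks them in layers.

```python
def apply_filter(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the products.

    This is the sliding-window operation used in convolutional layers
    (technically cross-correlation, since the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 3x4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1]]  # responds where brightness jumps from left to right

print(apply_filter(image, vertical_edge))
# [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The output is nonzero only in the column where dark meets bright, so this one filter "captures" a vertical edge and ignores everything else.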

Machines can learn thanks to the algorithms built inside them. These give them the ability to analyze regularities, or what we call patterns, in the data and then make predictions and decisions. This leads to a question I have been concerned about for a long time (pretty irrelevant): does ML let machines make more rational decisions than humans, for example on the Trolley Problem? Since the computer calculates very fast, what if the machine compared the people’s conditions, for example using family and health records to estimate how long each person has left to live or how likely they are to commit a crime, and using education and financial records to estimate each person’s contribution to society, and then determined who gets to survive the accident?


“What a Deep Neural Network Thinks about Your #selfie.” n.d. Accessed February 26, 2021.

CrashCourse. 2017a. Machine Learning & Artificial Intelligence: Crash Course Computer Science #34

Unicode as a master key

Data, as a layer above information, is something given. “Data,” to be interpretable, is something that can be named, classified, sorted, and be given logical predicates or labels (attributes, qualities, properties, and relations) (Irvine, 2021). Since data needs to be representable in a form that works across many platforms and devices, and has to be transformed from a string of pure data into signs that humans can read as symbols, the need for a shared standard makes sense. The problem arose decades ago when scientists first wanted to share data with each other. In simple words, the architects pre-scripted the language while building the computer, so the device could, to no small degree, process the data on its own. The scientists simply needed to decode the result using their codebook. However, because each computer was designed around a different codebook, sharing and exchanging data became a problem. It required technologists to learn more about different data types, and it blocked connections between devices (Instructions & Programs: Crash Course Computer Science #8, 2017, 03:15-05:21).

Unicode was designed to solve these problems. As Professor Irvine mentioned: “Unicode is the data ‘glue’ for representing the written characters (data type: ‘string’) of any language by specifying a code range for a language family and standard bytecode definitions for each character in the language” (Irvine, 2021). Based on my understanding, the text we see is not actually stored word by word; it is still a set of pictures combined together. Take a Chinese character as an example. When we see a character:

尛 = + +

Each part in the square is built up from a sequence of binary code and rendered in the selected font. After processing by hardware and software, we see the final combination of all three images drawn as a complete character, pixel by pixel. Unicode provides interpretable data that is accessible to electronic devices, and that is also the reason why we need to use Unicode.

Many devices had the problem that they could only process a single language system, usually Latin-based characters, rather than multiple language systems, simply because those characters share more basic data in common. To solve this problem, developers use Unicode strings, which allow devices to decode and interpret the language on their own and then communicate with each other. In this way, using Unicode saves storage and makes it easier for users to run software locally.
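How a character becomes a shared number, and then bytes, can be checked directly in Python. The sketch below uses only built-ins (`ord`, `str.encode`, `bytes.decode`); the helper name `describe` and the choice of sample characters are assumptions for illustration. UTF-8 is one common encoding of Unicode, picked here as an example.

```python
def describe(ch):
    """Return the Unicode code point label and the UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8")

for ch in ["A", "ß", "€", "尛"]:
    label, utf8 = describe(ch)
    # Every device that follows the standard decodes the same bytes
    # back into the same character.
    assert utf8.decode("utf-8") == ch
    print(ch, label, utf8.hex(" "), f"({len(utf8)} bytes)")
```

Running it shows that "A" takes one byte while the Chinese character takes three, so in UTF-8 the storage cost depends on the character. The code point identifies the character only; the shape drawn on screen comes from the font.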


Irvine, Martin. (2021). “Introduction to Data Concepts and Database Systems.”

Instructions & Programs: Crash Course Computer Science #8. (2017, April 12). [Video]. YouTube.

Information with and without meanings

To understand information theory and its application in engineering, we must set aside the everyday meaning of “information”, which refers to the knowledge obtained from investigation and instruction. Signal transmission or processing happens whenever the content we send or receive, in any form such as text, sounds, images, films… and even the files we upload, is converted into electrical signals. The main mechanism is simple, but stacking simple steps in layer upon layer produces complex transmission functions. First, the message source takes the content, refers to the codebook, and turns the content, detached from its meaning, into electrical signals; then the abstract signals pass through physical wires or tubes to their destination. Before finishing this “end to end” signal journey, the code is translated back into content with meanings that apply to the social environment, using the same codebook (Denning & Martell, 2015).

As Professor Irvine mentioned in his article: “The model provides an essential abstraction layer…The meaning and social uses of communication are left out of the signal transmission model because they are everywhere assumed or presupposed as what motivates using signals and E-information at all” (Irvine, 2020). I think another practical reason why the signal-code transmission model is not a description of meaning may relate to the limits of signal transmission. According to Shannon’s A Mathematical Theory of Communication, a large amount of signal loses accuracy and strength over long-distance transmission, and interference such as noise also disturbs its precision. There are two ways to solve this. The first is to use enough energy to boost the signal so that it is strong enough not to be affected by irrelevant surrounding signal sources. The second is to have bandwidth large enough to let these massive amounts of signal pass through without being degraded.

Speaking of big data, the meanings carried in social content amount to a lot of information once converted into signals. That’s why Shannon uses the “bit” rather than a decimal or other system. By reducing the input choices, it leaves machines less opportunity to make mistakes and gives them more capacity to represent meanings and content. Because the more we use as input, the less we get as output. And the reason why information theory is only sufficient as a substrate is that, without the semiotic meanings humans use every day, encoding, decoding, and transmitting lose their purpose.
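Shannon's measure of information in bits can be computed for any message without knowing what the message means, which is exactly the point about meaning being left out of the model. The sketch below applies the standard entropy formula, H = sum of p * log2(1/p) over the symbols; the function name `entropy_bits` and the sample messages are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Average information per symbol in bits (Shannon entropy)."""
    counts = Counter(message)
    total = len(message)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(entropy_bits("AAAA"))  # 0.0  one possible symbol, nothing to learn
print(entropy_bits("ABAB"))  # 1.0  two equally likely symbols, one bit each
print(entropy_bits("ABCD"))  # 2.0  four equally likely symbols, two bits each
```

The function treats the letters as bare symbols: swapping every "A" for any other character changes the meaning a human might read but leaves the number of bits untouched.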



Denning, P. J., & Martell, C. H. (2015). Great principles of computing. The MIT Press.

Irvine, M. (2020). Introduction to Computer System Design. 


From machine to machine learning.

The most valuable takeaway for me was learning the difference between a computer, a computer system, AI, machine learning, and deep learning (AI ⊃ ML ⊃ DL). “Computer” was originally a job title for people who calculate; it later came to mean an artifact that automates information processing, and eventually a machine that can understand the information. In Peter J. Denning and Craig H. Martell’s Great Principles of Computing, the authors broke down the statement that “Computer is just coding or programming” by tracing the long progress of computing development and its controversial evolution.

            Early 1970s, “Computer science equals programming.”

            In the 1970s, “Computing is the automation of information processes.”

            Late 1970s, “Computing as the study of ‘what can be automated.’”

            1980s, “Understanding their information processes and what algorithms might govern them.”

Looking back at history makes me even more surprised at how rapidly computing technology has developed and how fast people keep up with all these updates and react to such changes. But still, “with the bounty come anxieties.” Kashmir Hill’s article The Secretive Company That Might End Privacy as We Know It clearly spells out our concerns. Using ML as a tool to help law enforcement should be a way to decrease crime rates and process cases even faster by replacing human labor with tireless machines. However, because these machines can have “unintended operations”, the results aren’t always right, especially for specific groups of people, and the idea of extracting a face from every photo or even video it appears in freaks people out. This face recognition technology hasn’t been generally authorized.

It reminded me of a case from several years ago about how the public worries about privacy while “enjoying” the convenience that privacy-intruding technology brings, like location sharing and tagging. Apple at that time took advantage of this and started advertising how much it values its customers’ privacy. The ironic thing is that it still cooperates with Google and many other information-collecting companies to spy on its customers and predict their preferences for more profit, yet it refused to provide a password to law enforcement for gathering evidence in a crucial murder case, just to prove to the public how much it “values their privacy”.

Another takeaway I extracted was from John D. Kelleher’s Deep Learning, about how machine learning is designed to learn patterns from massive data, with a calibration provided so that machines “understand” what is right or wrong. It reminds me of how humans learn from the beginning. Our past experiences (knowledge, relationships with others, rewards…) are the massive database; we learn from the past to find the patterns, so we know what to do and what not to do, and what works and what doesn’t.

AI and Giraffe – Hao Guo

AI, as an apparatus that follows simple instructions or algorithms written by humans, will not become a threat at all. On the contrary, it frees us from tedious, time-consuming jobs and gets work done more effectively, precisely, and a lot faster. I had this typical “hope statement” when I first started the readings. According to the crash course videos, the concept of AI has already been applied to many fields, and it is making profound progress, from financial areas like loan lending to the medical industry with x-ray examination. Many AI users, even people with zero programming background, involuntarily interact and negotiate with AI every day. As what’s going on behind these screens and machines remains a black box, the media and businesses have taken advantage of this to make more profit. They exaggerate AI from a device that runs on code written by humans into an independent thing that generates and runs its own code. In fact, however, the smart machines we use in daily life are not even near the actual concept of AGI. As Michael Wooldridge mentioned in his book A Brief History of Artificial Intelligence, it took us decades to get from zero to one. The AI we worry will wipe out all human beings is nothing more than machines following instructions. The Youtuber John Green showed in his video how his robot companion can recognize his face but cannot hold a meaningful conversation, or sense its environment and provide appropriate responses, without receiving specific instructions.

Does being less human make AI useless and less threatening to humans? As Professor Irvine said: “America has a long history of creating a ‘new’ technology with a combination of hype, hope, and hysteria.” AI wiping out human beings is not a question of faith, but a matter of time. That’s my “hysteria” statement after I watched Ethem Alpaydin’s Machine Learning: The New AI and the documentary Do You Trust This Computer. Michael Wooldridge believes it is hard for AI to exceed human intelligence because AI requires tremendous space and power to process such a massive amount of data. And looking back at history, what we have achieved today took much more effort and many more failures than we could imagine. However, as the Youtuber John Green said: “History reminds us that revolutions are not so much events as they are processes.” In other words, no matter how long it took us to get from a room-sized fast calculator to a pocket-sized smartphone, it only takes one singularity to turn an AI into a real AGI, like the big bang or the first appearance of living creatures.

The singularity of AI development somehow reminds me of studies of giraffes. Research indicates that giraffes’ necks were just as short as other animals’ at the very beginning. However, due to climate change (the cause remains unclear, but climate change is considered one candidate), there were fewer edible plants left on the ground, so these short-necked giraffes were forced to grow longer necks to reach food in higher places, along with an extra blood-pumping organ to help their hearts send sufficient blood through their extremely long necks to their heads. However, among all the fossils we have discovered, we can only find the original short-necked giraffes; not one fossil shows an intermediate stage of their evolution, such as a giraffe with a medium-length neck. (Sounds so irrelevant, it just occurred to me.)

And the last thing I want to say after examining all the materials is that they don’t necessarily need to be smarter than us to destroy us. Like a nuclear bomb, they are not smart but can wipe out every living creature through one mistake made by human beings.