The many contexts and uses of the terms “information” and “data” make these terms perplexing and confusing outside an understood context. Using the method of thinking in levels and our contexts for defining data concepts, outline for yourself the concept of “data” and its meaning in two of the data systems we review this week. One “system” is the encoding of text data in Unicode for all applications in which text “data” is used; others are database management systems.
What is Data Science?
Data science incorporates a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields, such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining typically deals with the analysis of structured data and often implies a focus on commercial applications.
A Brief History of Data Science.
The term data science can be traced back to the 1990s. However, the fields that it draws upon have a much longer history. One thread in this longer history is the history of data collection; another is the history of data analysis. In this section, we review the main developments in these threads and describe how and why they converged into the field of data science. Of necessity, this review introduces new terminology as we define and name the important technical innovations as they arose. For each new term, we provide a brief explanation of its meaning; we return to many of these terms later in the book and give a more detailed description of them. We begin with a history of data collection, then provide a history of data analysis, and, finally, cover the development of data science (Kelleher and Tierney, 2018).
Document and Evidence
The word information commonly refers to bits, bytes, books, and other signifying objects, and it is convenient to refer to this class of objects as documents, using a broad sense of that word. Documents are essential because they are considered evidence.
The Rise of Data Sets.
Academic research projects typically generate data sets, but in practice, it is generally impractical for anyone else to attempt to make further use of these data, even though significant research funders now mandate that researchers have a data management plan to preserve generated data sets and make them accessible.
Finding operations depend heavily on the names assigned to document descriptions and the named categories to which documents are assigned. Naming is a language activity and so inherently a cultural activity. For that reason, we give a brief overview of the issues, tensions, and compromises involved in describing collected documents. The notation can be codes or ordinary words. Linguistic expressions are necessarily culturally grounded and so unstable, and for that reason they conflict with the need for stable, unambiguous marks if systems are to perform efficiently.
The First Purpose of Metadata: Description
The primary and original use of metadata is to describe documents. There are various types of descriptive metadata, including technical (to describe the format, encoding standards, etc.) and administrative. These descriptions help in understanding a document’s character and in deciding whether to make use of it. Description can be useful even if nonstandard terminology is used.
The Second Use of Metadata: Search
Thinking of metadata as a description of individual documents reflects only one of the two roles of metadata. The second use of metadata is different: it emerges when you start with a query or with the description rather than the document (that is, with the metadata rather than the data) when searching in an index. This second use of metadata is for finding, search, and discovery (Buckland, 2017).
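The second role of metadata can be illustrated with a minimal sketch in Python. It assumes a tiny, hypothetical catalogue (the document identifiers, metadata terms, and the function name `build_index` are illustrative and do not come from Buckland) and inverts it so that a search starts from the description and arrives at the documents.

```python
from collections import defaultdict

# Hypothetical catalogue: document identifier -> descriptive metadata terms.
catalogue = {
    "doc1": ["data science", "history"],
    "doc2": ["metadata", "indexing"],
    "doc3": ["data science", "machine learning"],
}

def build_index(catalogue):
    """Invert the catalogue: metadata term -> sorted list of document identifiers."""
    index = defaultdict(list)
    for doc_id, terms in catalogue.items():
        for term in terms:
            index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Searching starts from the metadata term, not from a document.
index = build_index(catalogue)
matches = index.get("data science", [])
print(matches)
```

In the first role, one reads `catalogue["doc1"]` to learn what a document is about; in the second, one reads `index["data science"]` to find which documents carry that description.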
Both “information” and “data” are used in general and undifferentiated ways in ordinary and popular discourse. Still, to advance in our learning about AI and all the data science topics that we will study, we need to be clear on the specific meanings of these terms and concepts. In ordinary language, “data” is a vague and ambiguous term. We must also untangle and differentiate the uses and contexts for “data,” a key term in everything computational, including AI and ML.
No Data without Representation.
In whatever context and application, “data” is inseparable from the concept of representation. A good slogan would be “no data without representation” (which can be said of computation in general). By “representation”, we mean a computable structure, usually of “tokens” (instances of something representable) corresponding to “types” (categories or classes of representation, roughly corresponding to a symbolic class like text character, text string, number type, matrix/array of number values, etc.) (Irvine, 2021).
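To make the idea of representation concrete for text data, here is a minimal sketch, assuming Python 3 and its standard `unicodedata` module. It shows how each character in a string corresponds to a Unicode code point and name, and how that character is then represented as bytes under the UTF-8 encoding. The sample string and the helper name `describe` are illustrative choices, not taken from Irvine.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name, UTF-8 bytes) for each character."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            f"U+{ord(ch):04X}",               # code point in the usual U+XXXX notation
            unicodedata.name(ch, "UNKNOWN"),  # official character name, if one exists
            ch.encode("utf-8"),               # byte-level representation
        ))
    return rows

# Three characters written with escapes: Latin A, e with acute accent, euro sign.
sample = "A\u00e9\u20ac"
for row in describe(sample):
    print(row)
```

In this sketch the three characters take one, two, and three bytes respectively in UTF-8, which suggests why the character (the type) should be kept distinct from any particular byte sequence that represents it.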
Knowledge of database technology increases in importance every day. Databases are used everywhere: They are fundamental components of e-commerce and other Web-based applications. They lie at the core of organization-wide operational and decision support applications. Databases are also used by thousands of workgroups and millions of individuals. It is estimated that there are more than 10 million active databases in the world today.
This book aims to teach the essential relational database concepts, technology, and techniques that you need to start a career as a database developer. This book does not teach everything that matters in relational database technology. Still, it will give you sufficient background to create your own databases and to participate as a member of a team developing larger, more complex databases (Kroenke et al., 2017).
The data type of an attribute (numeric, ordinal, nominal) affects the methods we can use to analyse and understand the data. This includes both the basic statistics we use to describe the distribution of values that an attribute takes and the more complex algorithms we use to identify the patterns of relationships between attributes. At the most basic level of analysis, numeric attributes allow arithmetic operations. The typical statistical analysis applied to numeric attributes is to measure the central tendency (using the mean value of the attribute) and the dispersion of the attribute’s values (using the variance or standard deviation statistics).
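As a small worked example of these statistics, the following sketch uses Python's standard `statistics` module on a hypothetical numeric attribute. The values are invented for illustration, and the choice of the population (rather than sample) variance and standard deviation is an assumption made here, not something the source specifies.

```python
import statistics

# Hypothetical numeric attribute: ages of five people (illustrative values only).
ages = [23, 35, 31, 47, 29]

mean_age = statistics.mean(ages)       # central tendency
variance = statistics.pvariance(ages)  # dispersion: population variance
std_dev = statistics.pstdev(ages)      # dispersion: population standard deviation

print(mean_age, variance, std_dev)
```

For these values the mean is 33, the squared deviations from the mean sum to 320, so the population variance is 320 / 5 = 64 and the standard deviation is 8.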
Machine Learning 101
The primary tasks for a data scientist are defining the problem, designing the data set, preparing the data, deciding on the type of data analysis to apply, and evaluating and interpreting the results of the data analysis. What the computer brings to this partnership is the ability to process data and search for patterns in the data. Machine learning is the field of study that develops the algorithms that computers follow to identify and extract patterns from data. ML algorithms and techniques are applied primarily during the modelling stage of CRISP-DM (the Cross-Industry Standard Process for Data Mining). ML involves a two-step process.
First, an ML algorithm is applied to a data set to identify useful patterns in the data. Second, once a model has been created, it is used for analysis. (Kelleher and Tierney, 2018).
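The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses ordinary least-squares fitting of a straight line as a stand-in for "an ML algorithm", and the function names `fit_line` and `predict` and the data values are hypothetical, not taken from Kelleher and Tierney.

```python
def fit_line(xs, ys):
    """Step 1: learn a model (slope, intercept) from a data set by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Step 2: use the learned model on a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data set that follows the pattern y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)         # step 1: extract the pattern
prediction = predict(model, 10)  # step 2: apply it to an unseen value
print(model, prediction)
```

Here the learned model recovers a slope of 2 and an intercept of 1, so the prediction for an input of 10 is 21. The point of the sketch is the separation between creating the model from data and then using it.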
- Kelleher, J., & Tierney, B. (2018). Data Science. The MIT Press: London.
- Buckland, M. (2017). Information and Society. The MIT Press: London.
- Irvine, M. (2021). Universes of Data: Distinguishing Kinds and Uses of “Data” in Computing and AI Applications.
- Kroenke, D., Auer, D., Vandenberg, S. L., & Yoder, R. C. (2017). Database Concepts. Pearson: NY.