What exactly is data mining?

by Eric Cruet

One of my pet peeves has always been using a technological acronym (or, for that matter, any acronym) without knowing what the letters mean.  Lately, phrases such as big data and data mining have become part of the technology lingo.  But what do these terms really mean?

Big Data

In a 2001 research report [1] and related lectures, META Group (now Gartner) analyst Doug Laney characterized data growth challenges and opportunities as three-dimensional: volume, velocity, and variety.  IBM [2] later extended the definition with a fourth dimension, veracity.  The resulting “4Vs” definition states that:

Big data is high-volume, high-velocity, high-variety information assets of questionable veracity that demand cost-effective, innovative forms of information processing for enhanced insight, trust, data assurance, and decision making.  Some examples:

High Volume: As of 2012, the size of data sets that were feasible to process in a reasonable amount of time topped out on the order of exabytes.  Currently, applications in meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research, as well as Internet search, finance, and business informatics, easily amass terabytes, even petabytes, of information.  Sheer volume becomes a limit in itself: the datasets simply cannot be processed in a reasonable amount of time (see the back-of-the-envelope sketch after the examples below).  For instance:

  • Turning 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Converting 350 billion annual meter readings to better predict power consumption
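
To make the volume problem concrete, here is a rough back-of-the-envelope calculation in Python.  The dataset size and disk throughput below are illustrative assumptions, not measurements:

```python
# Why volume alone becomes a bottleneck: time for one sequential pass.
# All figures are illustrative assumptions, not measurements.

DATASET_BYTES = 12 * 10**12        # assume 12 TB of tweets per day
DISK_THROUGHPUT = 200 * 10**6      # assume ~200 MB/s sequential read

one_machine_hours = DATASET_BYTES / DISK_THROUGHPUT / 3600
print(f"One machine, one pass: {one_machine_hours:.1f} hours")

# With 100 machines scanning disjoint shards (perfect parallelism assumed)
cluster_minutes = DATASET_BYTES / (100 * DISK_THROUGHPUT) / 60
print(f"100 machines in parallel: {cluster_minutes:.1f} minutes")
```

Even a single pass over one day’s tweets ties up a lone machine for most of a day, which is why big-data processing leans on parallel, distributed scans.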

High Velocity: Big data refers not only to huge datasets, but also to high-volume streams of data that arrive continuously.

The difference between a stream and a database is that the data in a stream is lost if you do not do something about it immediately.

For time-sensitive processes such as catching fraud, the data must be processed as it streams into your enterprise; a minimal one-pass sketch follows the examples below.  Real scenarios include:

  • Scrutinizing 5 million trade events created each day to identify potential fraud
  • Analyzing 500 million daily call detail records in real time to predict customer churn faster
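
Here is a minimal sketch of that one-pass constraint: each event is seen exactly once and never stored, so any statistic must be maintained incrementally.  The synthetic event generator and the flagging threshold are purely illustrative assumptions:

```python
import random

def trade_stream(n=1_000_000):
    """Synthetic stand-in for a live feed of trade amounts."""
    for _ in range(n):
        yield random.lognormvariate(3.0, 1.0)

count, mean, flagged = 0, 0.0, 0
for amount in trade_stream():
    count += 1
    mean += (amount - mean) / count          # incremental running mean
    if count > 1000 and amount > 50 * mean:  # crude "potential fraud" rule
        flagged += 1

print(f"Processed {count:,} events exactly once; flagged {flagged} outliers.")
```

Nothing is buffered: once the loop moves past an event, that event is gone, which is exactly the property that distinguishes a stream from a database.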

High Variety: Big data includes any type of data, structured and unstructured: text, sensor data, audio, video, click streams, log files, and more.  Interesting and unexpected patterns emerge when these data types are analyzed together:

  • Monitoring hundreds of live video feeds from surveillance cameras to target points of interest
  • Exploiting the 80% growth in image, video, and document data to improve customer satisfaction

Questionable Veracity: With security threats on the rise, governments, scientists, and businesses find it harder to trust the information they use to make decisions.  How can you act on information you don’t trust?  Establishing trust in big data becomes a bigger challenge as the variety and number of sources grow.

Data Mining 

When you mention data mining, images of digging deep into some endless pit of data come to some people’s minds.  In practice, data mining typically deals with datasets larger than what fits in a system’s main memory, so the algorithms that process them must be designed around explicit budgets for CPU time, memory, and I/O; a chunked-processing sketch follows below.
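As a minimal illustration of working within a memory budget, the sketch below computes word counts over a file that may be far larger than RAM by reading it in fixed-size pieces.  The file name and chunk size are illustrative assumptions:

```python
from collections import Counter

def count_words(path, chunk_size=64 * 2**20):
    """One-pass word count that never holds more than one chunk in memory."""
    counts = Counter()
    leftover = ""                       # word possibly split across chunks
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)  # read ~64M characters at a time
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk ends mid-word, save the fragment for next round.
            leftover = words.pop() if not chunk[-1].isspace() else ""
            counts.update(words)
    if leftover:
        counts[leftover] += 1
    return counts

# counts = count_words("tweets.txt")    # hypothetical input file
```

The same chunk-and-aggregate pattern is what distributed frameworks generalize across many machines.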

Depending on whom you ask, you will get a different definition of data mining.  I prefer an algorithmic point of view: data mining is about applying algorithms to data.  But a widely accepted definition is that it is the process of discovering “models” that fit the data.

Statisticians were the first to use the term “data mining” [3].  Originally, “data mining” or “data dredging” was a derogatory term for attempts to extract information that the data did not actually support.  Today, however, “data mining” has taken on a positive meaning.  Statisticians now view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn (see the sketch below).
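
Here is a minimal sketch of that statistical view: treat the visible data as draws from an underlying distribution and “mine” the data by fitting that distribution’s parameters.  The sample values below are synthetic, chosen only for illustration:

```python
import statistics

# Visible data: assume it was drawn from some underlying normal distribution.
data = [4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

# Maximum-likelihood fit of a Gaussian: sample mean and population std dev.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

print(f"Fitted model: Normal(mu = {mu:.2f}, sigma = {sigma:.2f})")
```

The “model” here is the fitted normal distribution itself; once estimated, it can be used to judge how surprising any new observation is.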

In closing, data mining is about finding a model that “fits” the data, as opposed to digging through data in the hope of finding something you are not quite sure you are looking for.  It is not synonymous with machine learning, although some data mining tasks appropriately use machine-learning algorithms; this makes particularly good sense when we are not sure what we are looking for in the data (as mentioned above).  In the next post we will cover the specifics of a well-established statistical model.

References:

[1] Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group (now Gartner). 6 February 2001.

[2] Zikopoulos, P. (2013). IBM: The Big Data Platform (pp. 34-58). McGraw-Hill.

[3] Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge University Press.