Data, Representation and Visualization


By Eric Cruet

I. Introduction

“Use a picture. It’s worth a thousand words.” Arthur Brisbane, 1911

What constitutes an awesome visualization? Some think of it as flashy graphics, while others look for busy charts and colorful graphs. A better general definition is one that provides a clear visual frame to represent data, in a way that allows the observer to “see” a trend, outline, pattern, outlier, or other significant information that would otherwise have been imperceptible from the source data alone.

Mediology is based on the differentiation of transmission and communication. According to Régis Debray, to communicate means to transport information in space within one and the same space-time-sphere. To transmit means to transport information in time between different space-time-spheres. Communication is a moment in a longer process and a fragment of a larger whole that we call transmission [1]. Based on Debray’s definition, visualization is a medium as opposed to a specific tool. A tool generates bar charts and graphs. A medium has the ability to communicate emotion, curiosity, activity, energy, and granularity. For instance, the scaled pictorial representations of human anatomy in Gray’s Anatomy have stood the test of time, communicating the same information across generations of medical students globally.

Data is the basis of any visualization and, as such, an abstraction of information and facts. The data set is a collection of snapshots of the desired data at one point in time and usually serves as the basis for the visualization. Statistics are used to manipulate and analyze the set, since collectively the data points in the set generate means, medians, and standard deviations. But what is most important is the context associated with the data and the results, in other words, what they represent. They translate into descriptions of people, places, and things that allow the comparison and contrast of specific items. When you drill down on the data, you obtain individual details about members and objects of the population. All of the above can be used to tell visual stories and to make data that usually look like columns and rows of numbers feel human and relatable.

II. Data

When you ask most people what data is, they reply with a vague description, usually related to a file, an application, or numbers. Some might mention spreadsheets or databases. These are all containers and formats that data comes in, but they give it very little context. That’s where representation and visualization come in.

William Cleveland and Robert McGill are often cited for their work on perception and accuracy in statistics [2]. Their ranking of visual cues by how accurately people decode them, with position along a common scale (as in scatterplots) first, followed by length, angle, and then slope, comes from this work. Edward Tufte is also credited with identifying some of the first basic rules of design. But his most important rule was that “most principles of design should be greeted with some skepticism” [3].

Data is undergoing a paradigm shift.  There is more to the term “big data” than the quantity. Most of our institutions were established under the assumption that decisions would be made with information that was scarce, exact, and causal in nature.  This situation is changing rapidly now that amounts of data are huge, can be quickly processed, and some degree of uncertainty is acceptable [4].  In certain operating scenarios, correlations are more important than causality.  Most importantly, many times we are interested in data streams, as opposed to data snapshots.

So the type, source, and volume of data influence the way information is represented, communicated, and visualized. The following example contains statistical data for traffic fatalities [5] in the US in table format:

This is the basic table containing the source data.  In order for it to tell us more than just counts of fatalities, a process of representation and visualization needs to take place.  This entails the application of computational methods, good design principles, some basic rules about art layout, color, and the use of templates, and decisions about the level of granularity for the type of information you wish to relay.

 

III. Representation

There is value in looking at data beyond the mean, median, or total because these measurements only tell part of the story.  Many times, aggregates or values around the middle of the distribution hide the interesting details that really need focus for decision making or illustrative purposes.
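As a minimal illustration of how a mean can hide detail, the Python sketch below compares two samples that share the same mean and median but have very different spreads. The numbers are invented for this example and are not taken from the traffic data; the helper name `summarize` is likewise just a label chosen here.

```python
import statistics

# Two invented samples with the same mean; values are illustrative only.
steady = [48, 49, 50, 51, 52]
erratic = [10, 20, 50, 80, 90]

def summarize(values):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

steady_summary = summarize(steady)
erratic_summary = summarize(erratic)

for name, summary in (("steady", steady_summary), ("erratic", erratic_summary)):
    print(name, summary)
```

Both samples report a mean and median of 50, so only the standard deviation (or a plot of the individual points) reveals how different they are.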

Outliers, which sit outside the centrally situated values, may also need attention. Changes over time sometimes indicate that something positive (or negative) is happening (or about to happen) in the system under observation. Regular occurrences or patterns can help you anticipate future events, and granularity can be adjusted depending on variability. The graph below [10] is an example of a creative representation of the data table in the previous section:

Although these are snapshots, they provide a different perspective on the same data by communicating alternative information. One glance at the chart tells you that traffic fatalities have decreased substantially over time. Key milestones are listed by year of significance. It’s a different take on what could be a boring line chart.

Finally, a poster [10] drilling down into a comparison of traffic vs. total fatalities data for 2008 – 2009:


The poster format is well suited for this type of information.  It utilizes a variety of graphs and charts to represent data.  It summarizes the information well and has a good level of detail.  The use of color is appropriate and the shapes and sizes complement each other.

When you look at these representations, they look much better than columns and rows of statistics one after the other.

IV. Visualization

Visualization has been around for centuries, but it is relatively new as a field of study.  Even the experts in the field have not settled on what exactly comprises it.  One of the topics of debate is: when and where does visualization become art?

The answers to these questions vary depending on whom you ask. But rather than think of the field as composed of disparate categories that work independently from one another, it is better thought of as a continuous spectrum that stretches from statistics to data art [9]. Although you can find examples at each extreme, most of what you commonly see is a mixture of both. The best examples of visualization work are most likely to be found where statistics, design, and aesthetics are in balance.

My post in Week 1 deals with mapping large scales of change. Much of mankind’s preoccupation has been with changes in the sciences, technology, sociology, and economics. More recently, the concern has shifted to variations in climate, global financial states, the effect of technology on society, and the increasing use of unlawful violence intended to coerce or intimidate governments or societies, i.e., terrorism.

Traditionally, network, graph, and cluster analysis are the mathematical tools used to understand specific instances of the data generated by these scenarios at a given point in time. But without methods to distinguish between real patterns and statistical error, which can be significant in large data sets, these approaches may not be ideal for studying change.  Also, patterns and trends can be better ascertained by observing behaviour over time, as opposed to at a specific point in time. By looking at a time series and assigning weights to individual networks, we can determine meaningful structural differences vs. random fluctuations [2].

In the follow-up post in Week 2, the clever use of alluvial diagrams [2] by M. Rosvall and C. T. Bergstrom in their research entitled “Mapping change in large networks” is a good example of how accurate statistics, good design, and simple artwork can reveal interesting, otherwise hidden patterns in the data. Using bibliometrics, which applies quantitative analysis and statistics to find patterns of publication within a given field or body of literature, they tracked citation patterns among scientific journals. This allowed them to map idea flows and how the flow of ideas influenced changes in the science disciplines over time. The resulting diagram and link to the research can be found below:

 

Just at a glance, what is evident from the “picture” is the fact that from 2000 – 2010, the neurosciences emerged as a new “discipline” from the fields of neurology and molecular and cell biology. In this case the visualization served as the data analysis tool, revealing the changes that the research hypothesis was trying to uncover.

Along the lines of using visualization as a data analysis tool, the collaboration team of Fernanda Viégas and Martin Wattenberg (at http://hint.fm) has invented artistic, creative ways of using visualization to express data. Among the impressive examples on their site is history flow, a tool that allows you to explore the history of any Wikipedia entry over time.

As shown below, the visual looks like an inverted stacked area chart where each layer represents a body of text. As time passes, layers are added or removed, and you can see the change in overall size via the total vertical height of the full stack:

The image above is the diagram for the Wikipedia article on abortion. The black gashes show points where the article has been deleted and replaced with offensive comments. This type of vandalism turns out to be common on controversial articles. The authors performed statistical analysis in 2003 to investigate the issue of online vandalism [6], and discovered that the median lifetime of certain types of vandalism is measured in minutes. This is an alternate use of the alluvial diagram shown previously, but in a different context. Whereas in the previous case it was mapping changes in science, in this instance it is mapping changes to the bodies of text in Wikipedia articles.
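The stacking idea behind such a chart can be sketched in a few lines of Python. This is not the authors' implementation; the revision data, the author names, and the function name `stack_layers` are all hypothetical, chosen only to show how per-author text lengths become stacked bands whose total height tracks the size of the article.

```python
def stack_layers(revisions, authors):
    """For each revision, return (author, bottom, top) bands stacked in author order.

    `revisions` is a list of dicts mapping author -> characters of surviving text.
    """
    stacked = []
    for revision in revisions:
        bottom = 0
        bands = []
        for author in authors:
            height = revision.get(author, 0)
            bands.append((author, bottom, bottom + height))
            bottom += height
        stacked.append(bands)
    return stacked

# Hypothetical revision history: the third revision mimics a mass deletion.
authors = ["ann", "bob", "cy"]
revisions = [
    {"ann": 400},
    {"ann": 380, "bob": 250},
    {"cy": 30},
    {"ann": 380, "bob": 250},
]

layers = stack_layers(revisions, authors)
total_heights = [bands[-1][2] for bands in layers]
print(total_heights)  # [400, 630, 30, 630]
```

In this toy history, the sudden drop to 30 characters and the immediate recovery is the kind of event that would show up as a gash in the drawn stack.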

Another great example of using visualization as a tool is this wind map, which provides a living portrait of the wind currents over the U.S.  Clicking on the map will take you to the real time instance.  Check it out:

Finally, researchers in cognitive science are using Diffusion Tensor Imaging (DTI), an MRI-based neuroimaging technique which makes it possible to visualize the location, orientation, and anisotropy of the brain’s white matter tracts. Once they have a suitable group of sample volunteers, they test for neuropsychological factors such as general cognition, memory, and information processing speed. In addition, metrics such as fiber counts, length, diffusion rate, and diffusion anisotropy are statistically correlated to support the data. The statistical relationship to age is usually modeled using a linear regression. The fiber’s direction is indicated by the tensor’s main eigenvector. This vector can be color-coded, yielding a cartography of the tracts’ position, direction (red for right-left, blue for foot-head, green for anterior-posterior), and anisotropy (as indicated by the tract’s brightness). In the following study, the researchers provide a visual assessment of the white matter maturation for 80 subjects of distinct ages [7]:

 

This image illustrates the significant age related differences between tract-based bundles in the brain.  Red and blue indicate negative and positive correlation respectively.
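To make the regression step concrete, here is a small sketch of an ordinary least-squares fit of a tract metric against age. The ages and anisotropy values are invented for illustration and are not from the cited study, and `fit_line` is a name chosen here. A negative slope corresponds to a negative correlation, which is what would be drawn in red under the coloring described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: subject ages (years) and a tract's mean anisotropy.
ages = [25, 35, 45, 55, 65, 75]
anisotropy = [0.52, 0.51, 0.49, 0.48, 0.45, 0.43]

slope, intercept = fit_line(ages, anisotropy)
color = "red" if slope < 0 else "blue"  # red = negative, blue = positive
print(f"slope per year: {slope:.4f}, color: {color}")
```

A real analysis would also test whether the slope is statistically significant before coloring a bundle; that step is omitted here.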

 

 

 

This diagram shows the significant age related effects in connectivity based bundles.  Red and blue indicate negative and positive correlation, respectively.  Light gray connections had no significant effects, and the higher the saturation in the color, the more significant the age related effect in the result.  The population average bundle volume (sum of fiber lengths) is mapped to cord thickness.  Total bundle volume of each grey matter region is mapped proportionately to arc length.

Key: L=Left, R=Right, F=Frontal, T=Temporal, P=Parietal, O=Occipital, S=Subcortical
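The direction-based color coding described earlier can be sketched for a single voxel as follows. This is a minimal illustration, not the pipeline used in the cited study. It assumes a symmetric, positive-definite 3x3 tensor with one clearly dominant direction, an axis convention of x = right-left, y = anterior-posterior, z = foot-head, and fractional anisotropy as the brightness measure. The voxel values and all function names are hypothetical.

```python
import math

def fractional_anisotropy(tensor):
    """FA from a symmetric 3x3 tensor, using the deviatoric-to-total norm ratio."""
    mean_diffusivity = (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3
    total = sum(tensor[i][j] ** 2 for i in range(3) for j in range(3))
    deviatoric = sum(
        (tensor[i][j] - (mean_diffusivity if i == j else 0)) ** 2
        for i in range(3) for j in range(3)
    )
    return math.sqrt(1.5 * deviatoric / total) if total else 0.0

def principal_direction(tensor, iterations=100):
    """Approximate the main eigenvector by power iteration (unit length).

    Assumes a nonzero tensor with a unique largest eigenvalue.
    """
    vector = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        product = [sum(tensor[i][j] * vector[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in product))
        vector = [c / norm for c in product]
    return vector

def direction_color(tensor):
    """RGB in [0, 1]: red = right-left (x), green = anterior-posterior (y),
    blue = foot-head (z), scaled by FA so anisotropic voxels look brighter."""
    fa = fractional_anisotropy(tensor)
    x, y, z = principal_direction(tensor)
    return (abs(x) * fa, abs(y) * fa, abs(z) * fa)

# Hypothetical voxel: diffusion is strongest along the x (right-left) axis.
voxel = [[1.7e-3, 0.0, 0.0],
         [0.0, 0.3e-3, 0.0],
         [0.0, 0.0, 0.3e-3]]
red, green, blue = direction_color(voxel)
print(red, green, blue)
```

For this voxel the result is almost pure red, since the dominant diffusion direction lies along the assumed right-left axis. A perfectly isotropic tensor would have an FA of zero and render black.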

In closing, these DTI scans can also derive neural tract directional information from the data using 3D or multidimensional vector algorithms based on six or more gradient directions, which is sufficient to compute the diffusion tensor. The diffusion model is a rather simple model of the diffusion process, assuming homogeneity and linearity of the diffusion within each image voxel. From the diffusion tensor, diffusion anisotropy measures such as the fractional anisotropy (FA) can be computed. Moreover, the principal direction of the diffusion tensor can be used to infer the white-matter connectivity of the brain (i.e., tractography: trying to see which part of the brain is connected to which other part). Here’s a video clip on 3D DTI:

 

V. Interpretations

The intention of visualization is to communicate results to a wider audience.  Imagine you are a tour guide.  Put yourself in the tourist’s position.  You’re on a tour of a city where historic events have occurred over centuries.  What would the tourist want out of the tour?  He wants to know about when and where key events happened, who the main characters were, and why the buildings have particular shapes or colors.  All tour guides have their own personality, but they should stay on course and on the subject that the tourist paid to hear about.  Above all else, the tourist wants the guide to be factual and truthful in his account of events.  If he doesn’t know the answer to a question, he should be honest and say so.

When leading a tour of data through visualization, presenters (or representers) should assume similar responsibilities. It’s your duty to point out key highlights, provide background information, stay focused, and eliminate confusion. Always aim your content at your target audience, and remember: speak the truth and nothing but the truth.

“The naked truth is always better than the best dressed lie”

Ann Landers (1918 – 2002) 

Appendix A: 3D Visualizations – Follows References Section

References:

[1] Debray, Régis “Qu’est-ce que la médiologie?” Trans. Martin Irvine. Le Monde Diplomatique, August 1999, p32.

[2] Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531-554.

[3] Tufte, E. R., & Graves-Morris, P. R. (1983). The visual display of quantitative information (Vol. 2). Cheshire, CT: Graphics press.

[4] Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution that Will Transform how We Live, Work, and Think. Eamon Dolan/Houghton Mifflin Harcourt.

[5] http://www.census.gov/compendia/statab/cats/transportation/motor_vehicle_accidents_and_fatalities.html

[5] Rosvall, M., & Bergstrom, C. T. (2010). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0008694

[6] Viégas, F. B., Wattenberg, M., & Dave, K. (2004, April). Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 575-582). ACM.

[7] Cabeen, R. P., Bastin, M. E., & Laidlaw, D. H. (2013). A Diffusion MRI Resource of 80 Age-varied Subjects with Neuropsychological and Demographic Measures. ISMRM.

[8] http://www.technologyreview.com/photoessay/411056/the-brain-unveiled/

[9] Yau, N. (2013). Data Points: Visualization That Means Something. John Wiley & Sons.

[10] http://www.caranddriver.com/features/safety-in-numbers-charting-traffic-safety-and-fatality-data

Appendix A: 3D Visualizations 

Diffusion spectrum imaging [8], developed by neuroscientist Van Wedeen at Massachusetts General Hospital, analyzes magnetic resonance imaging (MRI) data in new ways, letting scientists map the nerve fibers that carry information between cells. This image, generated from a living human brain, shows a reconstruction of the entire brain. The red fibers in the middle and lower left are part of the corpus callosum, which connects the two halves of the brain.

This image, generated from a living human brain, shows a subset of fibers. The red fibers in the middle and lower left are part of the corpus callosum, which connects the two halves of the brain.

Mapping Diffusion

Neural fibers in the brain are too tiny to image directly, so scientists map them by measuring the diffusion of water molecules along their length. The scientists first break the MRI image into “voxels,” or three-dimensional pixels, and calculate the speed at which water is moving through each voxel in every direction. Those data are represented here as peanut-shaped blobs. From each shape, the researchers can infer the most likely path of the various nerve fibers (red and blue lines) passing through that spot.

This image is the isolated optic tract, which relays visual signals from the eyes to the visual cortex, from the brain of an owl monkey. The blue lines at lower right represent nerve fibers connecting the eyes to the lateral geniculate nucleus (marked by the white ball), a pea-size ball of neurons that acts as a relay station for visual information. Those signals are then sent to the visual cortex, at the back of the head, via the blue and purple fibers that arc across the brain.

 

Edited on Microsoft Surface RT