Especially Big Data

We’re pleased to announce the launch of our new podcast: Especially Big Data. While it deals with data both big and small, the podcast explores the many subjects data can document, which can include everything from what you ate for breakfast this morning to the variability of global ocean currents. In other words, people collect data to track almost everything these days, and we’re here to tell you everything it can explain, while also shedding light on everything it leaves out.

In our pilot episode, The truth about grandma’s perfume, we dive into the implications and ethics of documentation, and explore whether your daily dose of rhino horn extract should be recorded for “medicinal” or “commercial” purposes. Tune in and tell us what you think.

You can keep up with our latest episodes here.

One million forks in a centrifuge

For the past few weeks, I have served as Fathom’s in-residence explorer of 3D printed information design with Formlabs’ Form1+ printer. Because my goal was to focus on the physical medium and form, I tried to stay away from directly 3D-ifying data visualizations that already exist in 2D (think extruded line graphs, bar graphs, etc.), or from arbitrarily mapping data points onto 3D space for the sake of aesthetics. Instead, I zeroed in on the features of physical objects that cannot be expressed on a screen, breaking them into two categories: material and interaction. More on forks later.

To encode meaning into material itself, I considered what makes us want to pick up and touch certain objects, as well as the subtle features that cue our immediate understanding of an object – what it’s for, who uses it, how old it is. In terms of interactions, I analyzed tasks that are very efficient for humans in the 3D world: quickly sorting and distributing many small objects, using peripheral vision, rotating and seeing multiple views of an object, and focusing on a small detail while keeping the bigger picture visible in the background.

Most of my experiments employed texture as a way to show what an object represents. If the texture is legible and successfully references familiar objects, there is no need for a key: one piece would clearly represent grain, another meat, though both are made of 3D printed plastic. My inspiration for the food-related textures came from a series of chocolates designed by Nendo, a Japan-based studio, which led me to a massive list of Japanese words with no English equivalent, all used to describe the textures of different foods.

One example of an efficient human task that really stuck with me was the motion of quickly sorting silverware into compartments. Thinking along the lines of categorization, I remembered the card game SET from my childhood, and set out to design a 3D version. In this design, I also allowed for a hierarchy of information: certain characteristics and trends are discernible at a glance, while more subtle details reveal themselves through a closer look at a subset of pieces.

In a more spontaneous experiment, I tested what kind of objects people are compelled to pick up and play with in order to understand them, designing a sort of handheld clock with multiple hands that nest within each other. No one (myself included) could quite figure out what it should be, though maybe that is partly what made it compelling. I also discovered some of the limitations of SLA printing, which is the method of 3D printing that uses a UV laser to harden a vat of liquid resin one layer at a time. For starters, the laser has difficulty printing articulating parts that require a certain clearance between them, as the parts tend to fuse together during the printing process. It is also challenging to clean support material out of inner channels.

My final series of explorations was based on data from a national public libraries survey, containing indicators on library use and spending from 1992 to 2013. I chose to really push the idea that objects can tell stories, with the example of sea glass in mind. Sea glass experts can tell how old a piece of glass is, what kind of bottle it came from, and where it was made – all from subtle cues like texture, color, thinness, purity, and knowledge of trade routes.

I let the content of the data itself inform my design by referencing the metal type used in letterpress printing, with each piece of type representing a state in 2003 (left) and 2013 (right). The height of each piece reflects how much money each state’s libraries spent on printed material per capita (as of 2013), and the amount of wear of each piece on the right indicates how much spending on printed material was cut in the last decade.

Tackling such an open-ended prompt, I feel I only scratched the surface of 3D printed information design, and further exploration is definitely in order. Some other ideas thrown around the studio were to make metal casts from 3D printed parts, design for different abilities (e.g. blindness), use mechanical properties like friction to sort how easily different parts slide across a table, and create a big data analogue to the silverware in a drawer: one million forks in a centrifuge.

The Preservation of Favoured Traces, now as a poster and book

Hot off the press, and just in time for the holidays, are two print projects that look at the six editions of Charles Darwin’s On the Origin of Species. The project was originally developed as an interactive piece, and with these prints we continue our tradition of producing and selling unique printed artifacts. As always, all of the proceeds from this work will be donated to charities focused on education, science, music, art, food, and homelessness.

Charles Darwin first published On the Origin of Species in 1859, and continued revising it for more than ten years. As a result, his final work reads as a composite, containing more than a decade’s worth of shifting approaches to his theory of evolution. In fact, it wasn’t even until his fifth edition—published ten years after the original book—that he introduced the concept of “survival of the fittest.” By color-coding each word of his final text by the edition in which it first appeared, our interactive and print representations of Darwin’s work trace the evolution of his thoughts and revisions, showing that even today’s most established scientific theories underwent major changes before becoming accepted or absolute ideas.
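The color-coding idea can be illustrated with a small sketch. This is not the code behind the project; it assumes each edition is available as plain text and uses Python's standard `difflib` to carry a word's "first appearance" tag forward whenever the word survives unchanged into the next edition. The function name `tag_first_appearance` and the three sample sentences are made up for illustration and are not Darwin's text.

```python
from difflib import SequenceMatcher


def tag_first_appearance(editions):
    """Return (word, edition_number) pairs for the last edition.

    editions: list of strings, oldest first; edition numbers start at 1.
    A word's number is the edition in which it first appeared.
    """
    words = editions[0].split()
    tags = [1] * len(words)
    for number, text in enumerate(editions[1:], start=2):
        new_words = text.split()
        # Default assumption: every word is new in this edition.
        new_tags = [number] * len(new_words)
        matcher = SequenceMatcher(a=words, b=new_words, autojunk=False)
        for op, a0, a1, b0, b1 in matcher.get_opcodes():
            if op == "equal":
                # Unchanged run: keep the tags inherited from earlier editions.
                new_tags[b0:b1] = tags[a0:a1]
        words, tags = new_words, new_tags
    return list(zip(words, tags))


# Invented sample "editions", oldest first.
editions = [
    "species change slowly over time",
    "species change slowly by natural selection over time",
    "species change by natural selection or survival of the fittest over time",
]
tagged = tag_first_appearance(editions)
for word, edition in tagged:
    print(edition, word)
```

Each tag could then be mapped to one of six colors. A word-level diff is only one possible choice; a real text would also need decisions about punctuation, moved passages, and spelling changes.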

The original interactive version was built in tandem with exploratory and teaching tools, enabling users to see changes both at the macro level and word by word. The printed poster lets you see the patterns where edits and additions were made and, for those with good vision, read all 190,000 words on one page. For those interested in curling up and reading at a more reasonable type size, we’ve also created a book.

The poster measures 24 by 36 inches and is printed on Bright White Finch Fine 80lb cover. The type is set in Bell Centennial, a typeface that maintains legibility at very small sizes. We pushed the limits of this typeface at a very tiny 2.8 points. This allows the color-coded type to be seen as stripes of color from far away, but when you get very close to the print you can still make out all the words. The poster is printed using six unique Pantone spot colors to allow the text the cleanest detail possible. Purchase a copy of the poster here.

The 6 by 9 inch hardcover book is printed on demand and sold through Blurb. Printing on demand means each book is printed individually for you. The type is set in Century Schoolbook at 9.75 points for a more pleasant reading experience, and the book keeps the color-coded edition highlights. Purchase your copy here.

To see the full project description and to explore the interactive version, please visit our project page.

Happy holidays from all of us at Fathom!

Metric spaces and information distance

In my last blogpost, I showed some visualizations generated by usage data from our tool Mirador. These visualizations rely on the calculation of a “distance” between variables in a dataset, and Information Theory allows us to define such distance, as we will see below.

The notion of distance is essential to most visual representations of data, and we are intuitively, possibly innately, familiar with it. If we are in two- or three-dimensional space, we can use the Euclidean distance to determine how far apart any pair of points is. In two dimensions, for points p_{1} = (x_{1}, y_{1}) and p_{2} = (x_{2}, y_{2}), it is defined as d(p_{1}, p_{2}) = sqrt((x_{1} - x_{2})^2 + (y_{1} - y_{2})^2).

But how do we define “distance” between more abstract entities, such as random variables? Mathematically, a distance function in an arbitrary set is a function that gives a real number for any pair of objects from the set, and satisfies the following “metric” properties:

• d(p, p) = 0 for any p. The distance of any element with itself is always zero.
• d(p, q) = 0 if and only if p = q. The distance between two objects can only be zero when the two objects are identical, and vice versa.
• d(p, q) \leq d(p, w) + d(w, q). This last property is called the triangle inequality, and geometrically means that going directly from p to q is never longer than going through an intermediate object w:

View this post on a larger screen for the interactive version of the visualization above.

Any function that satisfies these three properties is called a distance. (Strictly speaking, a distance must also be symmetric, d(p, q) = d(q, p), which holds for every function discussed in this post.) The Euclidean distance discussed before is one such function, but there are other distance functions in 2-D or 3-D space that are not Euclidean, for example the Manhattan and Chebyshev distances.
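As a rough illustration, the three distance functions named above can be written in a few lines of Python and spot-checked against the triangle inequality on random points. The helper `satisfies_triangle_inequality` is a made-up name, and a numerical check on sample points is only a sanity check, not a proof.

```python
import random


def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    # Sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(p, q))


def satisfies_triangle_inequality(dist, points, tol=1e-12):
    """Check d(p, q) <= d(p, w) + d(w, q) for every triple of sample points."""
    return all(
        dist(p, q) <= dist(p, w) + dist(w, q) + tol
        for p in points
        for q in points
        for w in points
    )


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
for dist in (euclidean, manhattan, chebyshev):
    print(dist.__name__, satisfies_triangle_inequality(dist, points))
```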

Thus, in 2-D or 3-D space there are several distance functions we can use to quantify how far apart pairs of elements are. However, if we are working with sets of elements that are not 2-D or 3-D vectors, it can be harder to get a sense of “distance” between “points” in the space. I found it very interesting that we can actually define a proper distance function between arbitrary random variables. In a previous post, I gave an informal introduction to the Shannon entropy H(X), a mathematical measure of the amount of “surprise” received upon measuring a random variable X. This definition led us to the concept of mutual information I(X, Y), which quantifies the level of statistical dependency between two variables X and Y.

We concluded that I(X, Y) = H(X) + H(Y) - H(X, Y), which we can visualize as the area shared between the marginal entropies H(X) and H(Y), as depicted in this diagram.
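To make the formula concrete, here is a minimal sketch that computes I(X, Y) from a joint probability table. It assumes discrete variables and a joint distribution given as a dictionary `{(x, y): probability}`; the function names and the two toy distributions are illustrative only.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def marginals(joint):
    """Sum a joint table {(x, y): p} down to the tables for X and for Y."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    px, py = marginals(joint)
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())


# Two fair coins tossed independently: knowing one says nothing about the other.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# One fair coin read twice: the two variables always agree.
identical = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))
print(mutual_information(identical))
```

The independent pair should give zero bits of mutual information, while the identical pair gives the full entropy of a fair coin, one bit.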

The mutual information varies between 0, when the two variables are independent, and H(X, Y), when they are statistically identical. So what about the remainder of subtracting I(X, Y) from the joint entropy H(X, Y)? It is 0 when the variables are identical, and takes the maximum value H(X, Y) when they are totally unrelated. Could it be then that the following quantity:

D(X, Y) = H(X, Y) - I(X, Y)

is our distance function? We can use a simple Venn diagram to represent this function graphically:

View this post on a larger screen for the interactive version of the visualization above.

The smaller the intersection is (the less dependent the variables are), the larger the area of the disjoint pieces will be, and so the larger the distance D(X, Y). When the variables are statistically independent, the intersection is empty and the distance reaches its maximum value H(X, Y).
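A small sketch of the candidate distance shows the two extremes described above. It assumes discrete variables and a joint distribution given as a dictionary `{(x, y): probability}`; `information_distance` and the three toy distributions are names and data invented for this example.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def information_distance(joint):
    """D(X, Y) = H(X, Y) - I(X, Y) for a joint table {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    h_x = entropy(px.values())
    h_y = entropy(py.values())
    h_xy = entropy(joint.values())
    mutual = h_x + h_y - h_xy  # I(X, Y)
    return h_xy - mutual


identical = {(0, 0): 0.5, (1, 1): 0.5}                             # X always equals Y
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}       # two separate fair coins
noisy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}       # Y mostly follows X

for name, joint in [("identical", identical), ("noisy", noisy), ("independent", independent)]:
    print(name, round(information_distance(joint), 4))
```

The identical pair should sit at distance zero, the independent pair at the joint entropy of two bits, and the noisy pair somewhere in between.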

To find out whether D(X, Y) really is a distance, we need to prove that it satisfies the three metric properties. From the Venn diagram itself we can quickly verify the first two: when the two circles overlap completely, the difference between the area of the union and the area of the intersection is exactly 0, which means that D(X, X) = 0. We already discussed that if two variables are statistically identical then the mutual information is equal to the joint entropy, and so D(X, Y) = 0. For the converse, we just need to note that if the area of the intersection is the same as that of the union, then the only possibility is that the two circles coincide, hence X = Y.

The final part is to check the triangle inequality, meaning that we need to verify that:

D(X, Y) \leq D(X, Z) + D(Z, Y)

This looks like the most challenging step! However, we can put together a simple graphical proof inspired by the previous pictorial representation of our candidate distance function. Since this “informational distance” is precisely the portion of the joint entropy that is not shared between the two variables, we could represent the situation with three variables also via a Venn diagram as follows:

View this post on a larger screen for the interactive version of the visualization above.

By hovering over the elements of the inequality, we can see that the sum D(X, Z) + D(Z, Y) is greater than or equal to D(X, Y), since it covers the entire area of the union of the three circles, with the exception of the intersection between all of them.
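The inequality can also be spot-checked numerically. The sketch below assumes three discrete variables with three levels each, draws random joint distributions, and counts how often D(X, Y) exceeds D(X, Z) + D(Z, Y). All function names are invented for this example, and a random search like this can only fail to find a counterexample; it does not replace a proof.

```python
import random
from itertools import product
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def marginal(joint, axes):
    """Sum a joint table {(x, y, z): p} down to the given coordinate positions."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def distance(joint, i, j):
    """D = H(i, j) - I(i, j), which equals 2 H(i, j) - H(i) - H(j)."""
    h_i = entropy(marginal(joint, (i,)).values())
    h_j = entropy(marginal(joint, (j,)).values())
    h_ij = entropy(marginal(joint, (i, j)).values())
    return 2 * h_ij - h_i - h_j


def random_joint(rng, levels=3):
    """A random joint distribution over three variables with `levels` values each."""
    weights = {o: rng.random() for o in product(range(levels), repeat=3)}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


rng = random.Random(1)
violations = 0
for _ in range(1000):
    joint = random_joint(rng)
    d_xy = distance(joint, 0, 1)
    d_xz = distance(joint, 0, 2)
    d_zy = distance(joint, 2, 1)
    if d_xy > d_xz + d_zy + 1e-9:  # small tolerance for floating-point error
        violations += 1
print("violations:", violations)
```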

This visual demonstration relies on identifying the circles with the Shannon entropies of each variable, and the intersecting areas with the corresponding mutual informations. Do you think this identification is valid? Send me an email if you have some thoughts about these assumptions, or any other comments!
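One way to sidestep the question is an algebraic argument that does not rely on areas at all. The sketch below uses the conditional entropy H(X | Y) = H(X, Y) - H(Y), a standard quantity that this post has not introduced, together with two standard facts: adding a variable never decreases entropy, and conditioning on an extra variable never increases it.

```latex
\begin{align*}
D(X, Y) &= H(X, Y) - I(X, Y) = 2H(X, Y) - H(X) - H(Y) \\
        &= H(X \mid Y) + H(Y \mid X), \\[4pt]
H(X \mid Y) &\leq H(X, Z \mid Y) = H(Z \mid Y) + H(X \mid Z, Y)
             \leq H(Z \mid Y) + H(X \mid Z), \\
H(Y \mid X) &\leq H(Z \mid X) + H(Y \mid Z)
             \quad \text{(same argument with $X$ and $Y$ swapped)}, \\[4pt]
D(X, Y) &\leq \bigl[ H(X \mid Z) + H(Z \mid X) \bigr]
            + \bigl[ H(Z \mid Y) + H(Y \mid Z) \bigr]
         = D(X, Z) + D(Z, Y).
\end{align*}
```

This matters because, with three variables, the quantity that would correspond to the overlap of all three circles can be negative for some distributions, so the diagram is probably best read as a mnemonic rather than a literal picture of areas.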

Check out this essay by Jim Bumgardner on Information Theory and Art, published in Issue #3 of the Mungbeing online magazine.

Here is another good example of combining online text with interactive illustrations of statistical concepts, in this case on the topic of P-hacking.

Implementation details
I used Processing and Miralib to generate the videos and images in the previous post, p5.js for the interactive snippets embedded in the blogpost, and MathJax for the mathematical formulas.

Search processes in correlation space

This new post continues a series of writings (1, 2) on discovering correlations in complex datasets. Some of the ideas I have discussed so far have made their way into Mirador, a tool for visual exploratory analysis developed in collaboration with the Sabeti Lab at Harvard University and the Broad Institute. By visualizing “information distance” to construct a geometric representation of statistical correlation, I will describe the usage patterns within the interface of Mirador. Keep reading for the details!

As part of the development process of Mirador, last year we organized a contest where users were invited to submit their correlation findings in four public datasets: Behavioral Risk Factor Surveillance System, National Health and Nutrition Examination Survey, Lahman’s Baseball Database, and the World Bank Development Indicators.

One key goal of Mirador is to enable users to find new hypotheses without prior knowledge of where the interesting correlations might be. In order to reach this goal, we created an interface to simplify the process of searching through arbitrary combinations of variables and defining any subsamples of interest. The interface of Mirador can be seen as a probe to navigate a very large, virtually infinite space of possible correlations. A visualization of the trajectories of search processes could be very interesting in itself since it may reveal the breadth of data explored in Mirador, and perhaps elucidate the efficacy of the user interface.

Relying on Information Theory to quantify the separation between variables in a dataset, I carried out a few experiments in visualizing the search trajectories from submissions received during the Mirador Data Competition. I found the mathematical aspect of these experiments very interesting as well. It turns out we can actually formalize a notion of distance between random variables, and use it to construct a spatial representation of these trajectories (we will look into the details of this distance function in the next post).

In short, each trajectory is generated by taking the variables under inspection in Mirador, and placing springs between each pair of variables so that the rest lengths of the springs are equal to the “information distance” between the variables. In the video below, the paths of the variables in a selected pair are rendered in blue, while the text at the top indicates the subsamples set by the user during the exploration of the dataset:
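The spring idea can be sketched in plain Python. This is not the Processing code used for the videos: it is a minimal relaxation loop, assuming a dictionary of pairwise rest lengths, and the three distances below are made-up values rather than Mirador output. The function name `relax` and its parameters are likewise hypothetical.

```python
import random
from itertools import combinations
from math import hypot


def relax(rest_lengths, n, steps=2000, stiffness=0.1, seed=0):
    """Place n particles in 2-D so pairwise distances approach the rest lengths.

    rest_lengths: dict {(i, j): length} with i < j for every pair of particles.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        for i, j in combinations(range(n), 2):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            length = hypot(dx, dy) or 1e-9  # avoid dividing by zero
            # Hooke's law: a stretched spring pulls its ends together,
            # a compressed one pushes them apart.
            shift = stiffness * (length - rest_lengths[(i, j)]) / length
            pos[i][0] += shift * dx
            pos[i][1] += shift * dy
            pos[j][0] -= shift * dx
            pos[j][1] -= shift * dy
    return pos


# Made-up "information distances" between three variables.
rest_lengths = {(0, 1): 1.0, (0, 2): 1.5, (1, 2): 2.0}
layout = relax(rest_lengths, 3)
for (i, j), target in rest_lengths.items():
    actual = hypot(layout[j][0] - layout[i][0], layout[j][1] - layout[i][1])
    print((i, j), "target", target, "layout", round(actual, 3))
```

Three variables can always be laid out exactly because the distances satisfy the triangle inequality. With more variables a flat layout generally cannot match every distance, so the springs settle into a compromise.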

The full videos for each trajectory are available following these links: Behavioural Risk Factors, Lahman’s Baseball Database, and World Bank Development Indicators. In these videos, the path of each particle representing a variable in the dataset disappears once the user removes that variable from the current view in Mirador. The static images below correspond to the same trajectories, but all the paths are aggregated so that the final result gives an overall representation of the entire search process:

The representation essentially lets us compare how a user’s perception of statistical relatedness varies with actual information distance. What I found striking in all these trajectories is that selected correlations are never the closest by distance. A reason for this could be that variables in very close proximity typically correspond to “trivial” associations (for example: age and education level). Potentially meaningful correlations exist in an intermediate range of distances that would make entirely automated search difficult, and thus require manual inspection from the user based on her expert knowledge or intuition.

While working on these visualizations, I kept thinking about the collision trajectories of atomic particles, as can be seen through a bubble chamber:

Although the images and videos came out differently (more random than I would have preferred, but still with each dataset having unique patterns), I think this is a relevant visual reference for the idea of the trajectories in correlation space. One could think of the variables in a dataset as some kind of elementary “data particles,” with the search process being the laws that define the movement in this space.

In the next blogpost, I will go over the mathematics behind the information distance, with the aid of some interactive visual demonstrations implemented with p5.js!


Founded in 2010 by Ben Fry, Fathom Information Design works with clients to understand complex data through interactive tools and software for mobile devices, the web, and large format installations. Out of its studio in Boston, Fathom partners with Fortune 500s and non-profit organizations across sectors, including health care, education, financial services, media, technology, and consumer products.

How can we help? hello@fathom.info.