# Metric spaces and information distance

In my last blogpost, I showed some visualizations generated by usage data from our tool Mirador. These visualizations rely on the calculation of a “distance” between variables in a dataset, and Information Theory allows us to define such distance, as we will see below.

The notion of distance is essential to most visual representations of data, and we are intuitively, possibly innately, familiar with it. In two- or three-dimensional space we can use the Euclidean distance. For two points in the plane, p_{1} = (x_{1}, y_{1}) and p_{2} = (x_{2}, y_{2}), it is defined as d(p_{1}, p_{2}) = sqrt((x_{1} – x_{2})^2 + (y_{1} – y_{2})^2), and it gives the distance between any pair of points in the space.

But how do we define “distance” between more abstract entities, such as random variables? Mathematically, a distance function in an arbitrary set is a function that gives a real number for any pair of objects from the set, and satisfies the following “metric” properties:

• d(p, p) = 0 for any p. The distance from any element to itself is always zero.
• d(p, q) = 0 if and only if p = q. The distance between two objects can only be zero when the two objects are identical, and vice versa.
• d(p, q) \leq d(p, w) + d(w, q). This last property is called the triangle inequality, and geometrically means that going directly from p to q is never longer than going through an intermediate object w:

Any function that satisfies these three properties is called a distance. The Euclidean distance discussed before is one such function, but there are other distance functions in 2-D or 3-D space that are not Euclidean, for example the Manhattan and Chebyshev distances.
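To make the three distance functions named above concrete, here is a minimal Python sketch. The function names and the sample points are my own choices for illustration, and the brute-force check only spot-tests the triangle inequality on a handful of points; it is not a proof.

```python
import math
from itertools import product


def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    # Sum of the absolute coordinate differences ("city block" distance).
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(p, q))


def satisfies_triangle_inequality(dist, points, tol=1e-12):
    # Check d(p, q) <= d(p, w) + d(w, q) for every triple of sample points.
    return all(
        dist(p, q) <= dist(p, w) + dist(w, q) + tol
        for p, q, w in product(points, repeat=3)
    )


# Arbitrary sample points, chosen only for illustration.
points = [(0, 0), (3, 4), (-1, 2), (5, -2)]

for dist in (euclidean, manhattan, chebyshev):
    print(dist.__name__, dist(points[0], points[1]),
          satisfies_triangle_inequality(dist, points))
```

The same pair of points ends up at a different distance under each function, which is a reminder that "distance" depends on the function we pick, not only on the points.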

Thus, in 2-D or 3-D space there are several distance functions we can use to quantify how far apart any two elements are. However, if we are working with sets of elements that are not 2-D or 3-D vectors, it can be harder to get a sense of “distance” between “points” in the space. I found it very interesting that we can actually define a proper distance function between arbitrary random variables. In a previous post, I gave an informal introduction to the Shannon entropy H(X), a mathematical measure of the amount of “surprise” received upon measuring a random variable X. This definition led us to the concept of mutual information I(X, Y), which quantifies the level of statistical dependency between two variables X and Y.

We concluded that I(X, Y) = H(X) + H(Y) – H(X, Y), which we can visualize as the area shared between the marginal entropies H(X) and H(Y), as depicted in this diagram.
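As a small numerical illustration of this formula, the sketch below computes I(X, Y) from a joint probability table. It assumes the joint distribution is known exactly and given as a nested list `joint[x][y]`; the function names and the two example tables are hypothetical, and entropies are measured in bits.

```python
import math


def entropy(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    # I(X, Y) = H(X) + H(Y) - H(X, Y) for a joint table joint[x][y].
    px = [sum(row) for row in joint]           # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]     # marginal distribution of Y
    pxy = [p for row in joint for p in row]    # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)


# Two fair binary variables that are independent of each other.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Two fair binary variables that always take the same value.
identical = [[0.5, 0.0],
             [0.0, 0.5]]

print(mutual_information(independent))  # no shared information
print(mutual_information(identical))    # all information is shared
```

For the independent table the mutual information vanishes, while for the identical table it equals the full entropy of either variable (one bit).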

The mutual information varies between 0, when the two variables are independent, and H(X, Y), when they are statistically identical. So what about the remainder of subtracting I(X, Y) from the joint entropy H(X, Y)? It is 0 when the variables are identical, and takes the maximum value H(X, Y) when they are totally unrelated. Could it be then that the following quantity:

D(X, Y) = H(X, Y) – I(X, Y)

is our distance function? We can use a simple Venn diagram to represent this function graphically:

The smaller the intersection is (the less correlated the variables are), the larger the area of the disjoint pieces, and hence the larger the distance D(X, Y). When the variables are entirely uncorrelated, the intersection is empty and the distance reaches its maximum value H(X, Y).
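The two extreme cases can be checked numerically. The following sketch estimates D(X, Y) = H(X, Y) – I(X, Y) from paired samples, using simple frequency counts as probabilities. That plug-in estimate is an assumption of this example, as are the function names and the toy data; it is not necessarily how Mirador computes the quantity.

```python
import math
from collections import Counter


def entropy(samples):
    # Empirical Shannon entropy in bits, using observed frequencies.
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def information_distance(xs, ys):
    # D(X, Y) = H(X, Y) - I(X, Y), with I(X, Y) = H(X) + H(Y) - H(X, Y).
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy from paired samples
    mutual = h_x + h_y - h_xy
    return h_xy - mutual


xs = [0, 0, 1, 1]
ys_independent = [0, 1, 0, 1]  # every (x, y) combination appears once
ys_relabeled = [1, 1, 0, 0]    # a one-to-one relabeling of xs

print(information_distance(xs, xs))              # identical variables
print(information_distance(xs, ys_independent))  # unrelated variables
print(information_distance(xs, ys_relabeled))    # relabeled copy
```

Note that the relabeled copy is also at distance zero from `xs`: for this quantity, "identical" means that each variable completely determines the other, not that the values are literally equal.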

To find out, we need to prove that this function does indeed satisfy the three metric properties. From the Venn diagram itself we can quickly verify the first two. When the two circles overlap completely, the difference between the area of the union and the area of the intersection is exactly 0, which means that D(X, X) = 0. We already discussed that if two variables are statistically identical then the mutual information is equal to the joint entropy, and so D(X, Y) = 0. For the converse, we just need to note that if the area of the intersection is the same as that of the union, then the only possibility is that the two circles coincide, hence X = Y.

The final part is to check the triangle inequality, meaning that we need to verify that:

D(X, Y) \leq D(X, Z) + D(Z, Y)

This looks like the most challenging step! However, we can put together a simple graphical proof inspired by the previous pictorial representation of our candidate distance function. Since this “informational distance” is precisely the portion of the joint entropy that is not shared between the two variables, we could represent the situation with three variables also via a Venn diagram as follows:

By hovering over the elements of the inequality, we can see that the sum D(X, Z) + D(Z, Y) is greater than or equal to D(X, Y), since it covers the entire area of the union of the three circles, with the exception of the intersection between all of them.
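For readers who prefer algebra to pictures, here is one standard way to reach the same inequality without the diagram. It is offered as a complementary sketch, not as the argument used above. It introduces conditional entropies such as H(X | Y), a notation that does not appear elsewhere in this post, and relies on three textbook facts: the chain rule, the fact that adding a variable never decreases entropy, and the fact that conditioning on more variables never increases it.

```latex
\begin{align*}
D(X, Y) &= H(X, Y) - I(X, Y) \\
        &= 2H(X, Y) - H(X) - H(Y) \\
        &= H(X \mid Y) + H(Y \mid X). \\[6pt]
H(X \mid Y) &\leq H(X, Z \mid Y)
             = H(Z \mid Y) + H(X \mid Y, Z)
             \leq H(Z \mid Y) + H(X \mid Z), \\
H(Y \mid X) &\leq H(Y, Z \mid X)
             = H(Z \mid X) + H(Y \mid X, Z)
             \leq H(Z \mid X) + H(Y \mid Z). \\[6pt]
D(X, Y) &\leq \bigl[ H(X \mid Z) + H(Z \mid X) \bigr]
            + \bigl[ H(Z \mid Y) + H(Y \mid Z) \bigr] \\
        &= D(X, Z) + D(Z, Y).
\end{align*}
```

The first block rewrites the distance as the sum of the two "non-shared" pieces of the Venn diagram; the last block adds the two middle inequalities and regroups the terms.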

This visual demonstration relies on identifying the circles with the Shannon entropies of each variable, and the intersecting areas with the corresponding mutual informations. Do you think this identification is valid? Send me an email if you have some thoughts about these assumptions, or any other comments!

Check out this essay by Jim Bumgardner on Information Theory and Art, published in Issue #3 of the Mungbeing online magazine.

And another good example of combining online text with interactive illustrations of statistical concepts, in this case on the topic of P-hacking.

# Implementation details

I used Processing and Miralib to generate the videos and images in the previous post, p5.js for the interactive snippets embedded in the blogpost, and MathJax for the mathematical formulas.

# Search processes in correlation space

This new post is the continuation of a series of writings (1, 2) on discovering correlations in complex datasets. Some of the ideas I discussed so far have made their way into Mirador, a tool for visual exploratory analysis developed in collaboration with the Sabeti Lab at Harvard University and the Broad Institute. By visualizing “information distance” to construct a geometric representation of statistical correlation, I will describe the usage patterns within the interface of Mirador. Keep reading for the details!

As part of the development process of Mirador, last year we organized a contest where users were invited to submit their correlation findings in four public datasets: Behavioral Risk Factor Surveillance System, National Health and Nutrition Examination Survey, Lahman’s Baseball Database, and the World Bank Development Indicators.

One key goal of Mirador is to enable users to find new hypotheses without prior knowledge of where the interesting correlations might be. In order to reach this goal, we created an interface to simplify the process of searching through arbitrary combinations of variables and defining any subsamples of interest. The interface of Mirador can be seen as a probe to navigate a very large, virtually infinite space of possible correlations. A visualization of the trajectories of search processes could be very interesting in itself since it may reveal the breadth of data explored in Mirador, and perhaps elucidate the efficacy of the user interface.

Relying on Information Theory to quantify the separation between variables in a dataset, I carried out a few experiments in visualizing the search trajectories from submissions received during the Mirador Data Competition. I found the mathematical aspect of these experiments very interesting as well. It turns out we can actually formalize a notion of distance between random variables, and use it to construct a spatial representation of these trajectories (we will look into the details of this distance function in the next post).

In short, each trajectory is generated by taking the variables under inspection in Mirador, and placing springs between each pair of variables so that the rest lengths of the springs are equal to the “information distance” between the variables. In the video below, the paths of the variables in a selected pair are rendered in blue, while the text at the top indicates the subsamples set by the user during the exploration of the dataset:
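The spring idea can be sketched in a few lines. The code below is not Mirador's implementation (the originals were made with Processing and Miralib); it is a minimal illustration in which the function name, the step size, the starting layout, and the 3-4-5 example distances are all assumptions made for the example. Each pair of particles is joined by a spring whose rest length is the target distance, and the particles are nudged along the net spring force until they settle.

```python
import math


def relax_springs(rest, steps=2000, rate=0.05):
    # rest[i][j] is the rest length of the spring between particles i and j.
    n = len(rest)
    # Start on a unit circle so that no two particles coincide.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(steps):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9  # guard against division by zero
                stretch = dist - rest[i][j]        # > 0 pulls together, < 0 pushes apart
                fx = stretch * dx / dist
                fy = stretch * dy / dist
                forces[i][0] += fx
                forces[i][1] += fy
                forces[j][0] -= fx
                forces[j][1] -= fy
        for i in range(n):
            pos[i][0] += rate * forces[i][0]
            pos[i][1] += rate * forces[i][1]
    return pos


# Hypothetical pairwise distances between three variables (a 3-4-5 triangle).
rest = [[0, 3, 4],
        [3, 0, 5],
        [4, 5, 0]]

positions = relax_springs(rest)
for i in range(3):
    for j in range(i + 1, 3):
        d = math.hypot(positions[i][0] - positions[j][0],
                       positions[i][1] - positions[j][1])
        print(i, j, round(d, 3))
```

Three distances that satisfy the triangle inequality can always be laid out exactly in the plane, so here the springs reach their rest lengths. With many variables an exact 2-D layout generally does not exist, and the springs settle into a compromise instead.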

The full videos for each trajectory are available following these links: Behavioural Risk Factors, Lahman’s Baseball Database, and World Bank Development Indicators. In these videos, the path of each particle representing a variable in the dataset disappears once the user removes that variable from the current view in Mirador. The static images below correspond to the same trajectories, but all the paths are aggregated so that the final result gives an overall representation of the entire search process:

The representation essentially lets us compare how a user’s perception of statistical relatedness varies with actual information distance. What I found striking in all these trajectories is that the selected correlations are never the closest by distance. A reason for this could be that variables in very close proximity typically correspond to “trivial” associations (for example: age and education level). Potentially meaningful correlations exist in an intermediate range of distances that would make entirely automated search difficult, and thus require manual inspection from the user based on her expert knowledge or intuition.

While working on these visualizations, I kept thinking about the collision trajectories of atomic particles, as can be seen through a Bubble chamber:

Although the images and videos came out differently (more random than I would have preferred, but still with each dataset having unique patterns), I think this is a relevant visual reference for the idea of the trajectories in correlation space. One could think of the variables in a dataset as some kind of elementary “data particles,” with the search process being the laws that define the movement in this space.

In the next blogpost, I will go over the mathematics behind the information distance, with the aid of some interactive visual demonstrations implemented with p5.js!

# Recent talks

Though we’ve been busy with client projects in the office, we’ve also participated in a flurry of speaking events and conferences in the last month.

Mark and I recently spoke at Visualized in New York, which is a conference that brings together designers, storytellers, and technologists to explore the future of information communication. The conference organizers asked us to speak about our work on Scaled in Miles, which looks at the career and collaborations of Miles Davis. Keeping the talk to a single project was a great way for us to outline our initial passion for the topic, and describe the process of applying a single dataset to multiple mediums ranging from an interactive web app to a printed poster. A video of the talk will be posted soon, so stay tuned.

Last month, I spoke at the Themed Entertainment Association (TEA) SATE Conference at Carnegie Mellon University. It was fascinating to learn about the future of experience design within theme parks from speakers from Disney World and Universal. I was able to speak about how we design with data across diverse audiences, and how it can be applied across all scales, from a mobile device to an architectural installation. This talk will also be available soon.

And this Thursday Ben will be speaking at PopTech in Camden, Maine, which looks to be a really exciting conference. With this year’s theme of “hybrid,” the conference will explore how people, projects, and ideas bring about thoughtful and unexpected solutions that combine art and science, among other disciplines.

Stay on the lookout for upcoming events, or check out some of our past talks here.

# Clinton Global Initiative 2015

A few weeks ago, we released No Ceilings 2.0 in conjunction with the annual Clinton Global Initiative (CGI) meeting. Along with refurbishing the design on the landing page, we created a new visualization optimized for an installation setting. The visualization measures the change—or lack thereof—of the gender gap in labor force participation over the last twenty years. In addition, we released country snapshots, which provide an overview of the status of girls and women in each country.

With a large installation set up, we were excited to share the latest site with meeting attendees and other passersby. The new visualization enables users to see both women’s and men’s labor force participation organized by the size of the gender gap, or by participation levels of either gender. Clicking play at the bottom shows how participation has changed over time. There’s been notable progress in, say, the Maldives, where the gap has more than halved from 46% to 21% in the last two decades. Conversely, Afghanistan’s gap has crawled from 66% to a meager 64% in the last twenty years.

The country snapshots enable users to get a glance of the state of girls and women in areas of health, education, economic participation, security, and leadership. In addition to showing evident gains and setbacks, the snapshots also show areas where there is missing data. For the indicators surrounding women’s security, only a third of countries actually have information available.

We made a number of design decisions to accommodate the site for an installation setting. Understanding how the site would scale on a 70 inch touch display, function without hover, and cater to an audience ranging from four to seven feet tall (even former NBA star Dikembe Mutombo, at 7’2″, stopped by to give the site a whirl) were just a few of the many considerations we had to make when tweaking the design for the installation.

Overall CGI made for a wonderful opportunity to bring No Ceilings to a diverse audience, and a more tangible interactive setting. Stay tuned for more updates!

# Time and Place

We’ve had an ongoing interest in activity data, from projects with the Nike FuelBand (Year in NikeFuel and NikeFuel Weather Activity) to, more recently, Fathom Watch Faces for Android Wear. This work inspired me to use the Moves app to track every place I’ve been and how I’ve moved between locations. With about twenty months of data on my hands, I began parsing, analyzing, and creating sketches.

This project was a great way to put my preliminary knowledge of D3.js into practice. I’d also recommend the book Interactive Data Visualization for the Web by Scott Murray which was helpful for grasping the fundamentals of D3.js.

Check out the project here.

# Girls from low-income households have the least access to primary school

In honor of Women’s Equality Day, we released a new No Ceilings visualization exploring how disparities in wealth engender gaps in primary school completion. Girls from low-income households are often at the greatest disadvantage in their access to basic education, most notably in Middle Eastern and African countries. For all of the inequalities that exist in the U.S. school systems, there are millions of girls around the world who don’t have the opportunity to attend elementary school, let alone graduate from it.

In Yemen, a country with one of the greatest educational disparities between genders and income levels, just one girl for every five boys in the lowest income group finishes primary school. The gap widens when comparing between Yemeni girls: seven girls from the highest income group complete primary school for every one girl at the lowest. The academic outlooks, career goals, and familial health of girls from low-income households will all suffer as a result.

While yes, the disparities in primary school enrollment are closing, many schools experience a large drop-off by the time students enter secondary school. Gender bias and cultural norms, in addition to economic disparities, often hinder students from making the transition, regardless of the policy protections in place.

Take a look and spread the word. We can’t change the numbers unless we know them.

# Place Poetry

As anyone who has recently taken a road trip can attest, there are a lot of places in the United States with very distinctive names. Many of us at Fathom are fascinated by geography and the subtle oddities around us, so it seemed only natural we create Place Poetry. The playful mobile application enables people to arrange strangely named cities into poems, while simultaneously plotting the location and distance of their journey.

The United States has the third-largest area of any country (behind only Russia and Canada), and as pioneers expanded outwards, many places were in need of names upon settlement. Some places were named after preexisting locations in Europe and around the world. Others were named after powerful leaders – kings and presidents, local mayors, and often the founders themselves. Many place names across the United States are vestiges of the native cultures stamped out or displaced during colonization. Towns were also named after nature, colors, weather, famous authors, and relative geography. Yet with all these common themes to pull from, some town founders went above and beyond.

When Ben created his original zipdecode project, he worked with a list of every uniquely named town in the United States. This list contains 19,053 names, and provided the framework for Place Poetry. Many names on the list are highly inventive. Some names tempt fate (Waterproof, Smackover, Tornado) while others are steeped in vanity (Superior, Radiant, Cashtown). Some places reflect a founder’s disregard for the town (Blowing Rock, Hurdle Mills, Idiotville) and others leave you wondering how it was that such a name came into use (Bitely, Peculiar, Cut and Shoot, Oblong).

We narrowed down the original list of names into a select set of favorites, and sorted them into thematic categories. As these words are dragged from different bins into the composition space, the city’s location is plotted on a map. When the poem is complete, the author can see how many miles long his or her poem runs, and can also share it on social media or via email. Take a look, share a poem, and may your inner poet flippin zap peculiar aromas.

Note: Place Poetry is built for mobile devices, so please visit placepoetry.us on your phone or tablet!

# Fathom Watch Faces: an Android Experiment

We’re excited to announce the launch of the Fathom Watch Faces, a collection of interactive watch face designs for the Android Wear collection, which is part of Google’s Android Experiments. The experiments are designed to bring developers together on a common platform to push the capabilities of Android tablets, phones, and watches. We focused on using the internal components of the watches, such as their accelerometers and pedometers, to create delightful user feedback at every glance, and to really explore the information people can gain from a wearable device attached to their wrist.

Given our previous work with activity tracking, we wanted to use the project to explore the role of wearables—specifically watches—in giving people useful information throughout the day. We played with the balance of information a user can gain from a watch, and looked at the stories they might learn from quantifying elements of their day-to-day. The designs we landed on experiment with ideas of activity tracking, play, and self-awareness.

Coubertin Rings
The Coubertin Rings watch face motivates users to improve their daily activity. The model uses built-in sensors on the watch to display playful interactive rings that represent daily step counts. As users hit step milestones, the rings get bigger, change color, and scale up to quantify higher step counts (e.g. 10 steps, 100 steps, 1,000 steps). Splashes of color reward users for achieving certain levels, and motivate them to get to the next ring.

Paying homage to Pierre de Coubertin, the father of the modern Olympic Games, the watch references the rings of the Olympic logo, which was designed in 1912 by Coubertin himself. The Olympics are a symbol of goals and achievements, and the Coubertin Rings watch face is meant to promote daily activity.

Bouncing Isaac
Inspired by light and physics, the Bouncing Isaac uses the watch’s built-in sensors to display playful, interactive, geometric patterns and colors that change throughout the day. As users move their wrists, various color patterns and forms emerge. The background color changes every hour, and the triangles are based on a sliding spectrum of highly saturated colors. The colors overlap one another as the leading point of the triangle hits one of the walls of the watch face.

The watch design pays tribute to Isaac Newton’s laws of motion and his experiments using a prism to refract white light to create a spectrum of color.

Gaze Effect
The Gaze Effect watch face displays mysterious eyes that gaze back at users when they look at their watch. The more they check their watch, the more the eyes look back. In the later hours of the day, the eyes grow tired, and they move and blink less. At special times throughout the day, the eyes get especially eerie.

The design alludes to Jacques Lacan’s psychological term “Gaze,” or the realization that we are objects being looked at just as an inanimate object can be looked at. The Gaze Effect watch face is a quirky reference to self-awareness, and our relationship to personal devices and time.

The Android Experiments are open source, and the code for our watch collection is available here. We hope the source code makes it a little easier for others to get up and running with watch development, and acts as an example for using watch sensors.

Hardware and software are changing so quickly for wearables, which made the project feel like an introductory test-run. At this early phase of wearables, we designed to the limitations of the hardware. A year from now, the design considerations will change drastically. With the potential for improved battery life, faster processing power, and new user interfaces, this project will have an entirely new set of design guidelines. Regardless, working on the experiments at such an early phase enabled us to see what the watches are capable of, the subtle differences in each model, and the ways users can gain personal insight from wearable devices.

# Writing in Code

I found Fathom through a data visualization course at college that was taught by a statistics professor, so my first exposure to information design was through the lens of statistical analysis. I spent most of my time in that class making sure the data was not misrepresented, and working through particularly challenging pieces of code. The more complicated the analysis or the code, the better I felt about the project, and I wanted that complexity to show in my end product. If the code worked and did something cool, then I was happy.

Contrary to what Twitter suggests, I learned more this summer than just how to use a hammer drill. A couple weeks ago, I showed James some slides from a presentation of my work this summer. One of his initial comments was, “This needs some design love.” I had sort of known that, but it hadn’t felt like a priority to me. I was just trying to produce the necessary content clearly. For James, design is an integral part of any work. As a newbie to design thinking, this was one of the moments that started changing how I evaluate my work.

When I started work at Fathom in June, I was immediately struck by how different the work flow is from the environments I was used to. Instead of focusing on the technical or analytical side of information design, the big questions were always design ones: Who is the audience? What are they most interested in? How can this tool be more intuitive for them to explore? And most importantly, what is the story in the data? I had never thought about conveying data as storytelling before. Handling data was always fairly black and white to me; you look for an answer in the form of evidence to prove your hypothesis true or false. Information design at Fathom isn’t like that. The process obviously still involves rigorous exploration and analysis of the data, but it’s focused on allowing humans to navigate the data with ease.

Thinking of information design as storytelling was huge for me. I was originally drawn to computer science because it felt similar to creative writing; it’s the same sort of creation, just writing in a different language. Code produces apps in the same way that sentences conjure images. I have often thought that I should apply the same rigorous logic used in coding to writing essays, and now I am realizing that the creativity and insight involved with writing should also transfer to my code. Just as nobody wants to read a book with non-stop action and no plot, nobody gets anything out of a bouncy technicolor bar chart.

Growing up, my English teachers told me not to use contractions in my writing, so I was surprised when my English professor last semester went through my first short story and contracted words wherever appropriate. I had been blinded by what I thought was a “rule,” when in reality the lack of contractions sounded unnatural for the characters and was distracting from the story. Similarly, I realized that my instincts to follow statistical “rules” would overwhelm many users with distracting or unnecessary information, instead of empowering them to explore.

In school, my computer science classes are focused on code. The work we submit is source code, and evaluations are based more on how that code is written than on what it produces. Entrenched in the source code, the focus is on theory and algorithms. At Fathom it’s not enough to stop there. Code is a tool to create, but particularly cool or complex code should not be driving the process. The driving force is the vision of the end product, an insightful tool.

# Fathom goes to Salem

Last Thursday, we joined the studios of Design I/O and Sosolimited for a second annual gathering. All three teams met in Salem at the historic House of the Seven Gables, and spent the day sharing recent projects and discussing topics centered around the convergence of art and technology.

I worked on a visual identity for the conference, and wanted to develop something that combined technology and design with our location in Salem, Massachusetts. I started with sketches of bitmapped black cats, which evolved into a typeface and gifs of a cat that coughs out data hairballs and binary code. The event schedules were delivered in customized black envelopes, and the bitmapped cats made their way on to people’s name tags. This year the conference was entitled ‘DeSoFa,’ which combines the first two characters of the names of the studios.

It was really interesting to see what everyone has been working on. SOSO shared their Innovation Clock with us, which sparked an interesting discussion about sifting through the noise of Twitter for meaningful dialogue.

Design I/O spent an hour walking us through their interactive installation, Connected Worlds, which was recently unveiled at the New York Hall of Science. Design I/O and their collaborators created an entire digital ecosystem of an astonishing scale and scope.

I also really enjoyed hearing more about the Global Animal Trade interactive piece our team did for National Geographic. Though I had some familiarity with this project, it had wrapped shortly before I joined the office, and it was fun to hear the early stages of the project discussed in greater detail. A lot of the animal trade statistics were in completely different metrics, and there is no easy conversion between “centimeters of whale,” and “metric tons of caviar” – so a lot of work was needed at the outset to sift and organize the information. (Alex now has a fascinating reservoir of information on what strange and obscure animal parts are traded for even more strange and obscure reasons, which made for a very lively discussion at lunch.)

One of the highlights of the day was the Fast Ferry from Salem back to Boston, which was true to its name. It raced back to Long Wharf at such a fast clip that many of us were nearly blown overboard. Regardless, we clung to the rails of the prow to see the expansive views of the coastline and glowing summer sky.

With one more design gathering for the books, I’m looking forward to next year’s adventure!