Posts by Andrés
Metric spaces and information distance

In my last blogpost, I showed some visualizations generated by usage data from our tool Mirador. These visualizations rely on the calculation of a “distance” between variables in a dataset, and Information Theory allows us to define such distance, as we will see below.

The notion of distance is essential to most visual representations of data, and we are intuitively, possibly innately, familiar with it. If we are in two- or three-dimensional space, we can use the Euclidean distance. In two dimensions, for two points `p_{1}=(x_{1}, y_{1})` and `p_{2}=(x_{2}, y_{2})`, it is defined as `d(p_{1}, p_{2}) = sqrt((x_{1} - x_{2})^2 + (y_{1} - y_{2})^2)`, and it gives the distance between any pair of points in the space.

But how do we define “distance” between more abstract entities, such as random variables? Mathematically, a distance function in an arbitrary set is a function that gives a real number for any pair of objects from the set, and satisfies the following “metric” properties:

  • `d(p, p) = 0` for any `p`. The distance of any element with itself is always zero.
  • `d(p, q) = 0` if and only if `p = q`. The distance between two objects can only be zero when the two objects are identical, and vice versa.
  • `d(p, q) \leq d(p, w) + d(w, q)`. This last property is called the triangle inequality, and geometrically it means that going directly from `p` to `q` is never longer than going through an intermediate object `w`:


Any function that satisfies these three properties is called a distance. The Euclidean distance discussed before is one such function, but there are other distance functions in 2-D or 3-D space that are not Euclidean, for example the Manhattan and Chebyshev distances.
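As a quick illustration of the three distance functions just mentioned, here is a minimal Python sketch. The function names and the random test points are my own choices for the example, and the triangle check is only a numerical spot check on a handful of points, not a proof.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def satisfies_triangle(d, points, tol=1e-12):
    """Check d(p, q) <= d(p, w) + d(w, q) for every triple of points."""
    return all(d(p, q) <= d(p, w) + d(w, q) + tol
               for p in points for q in points for w in points)

random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
for d in (euclidean, manhattan, chebyshev):
    # Distance from (0, 0) to (3, 4), then the triangle check on random points.
    print(d.__name__, d((0, 0), (3, 4)), satisfies_triangle(d, points))
```

For the pair `(0, 0)` and `(3, 4)` the three functions give 5, 7, and 4 respectively, which shows how much the choice of distance can change the answer for the same two points.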

Thus, in 2-D or 3-D space there are several distance functions we can use to quantify how far apart pairs of elements are. However, if we are working with sets of elements that are not 2-D or 3-D vectors, it can be harder to get a sense of “distance” between “points” in the space. I found it very interesting that we can actually define a proper distance function between arbitrary random variables. In a previous post, I gave an informal introduction to the Shannon entropy `H(X)`, a mathematical measure of the amount of “surprise” received upon measuring a random variable `X`. This definition led us to the concept of mutual information `I(X, Y)`, which quantifies the level of statistical dependency between two variables `X` and `Y`.

We concluded that `I(X, Y) = H(X) + H(Y) - H(X, Y)`, which we can visualize as the area shared between the marginal entropies `H(X)` and `H(Y)`, as depicted in this diagram.

The mutual information varies between `0`, when the two variables are independent, and `H(X, Y)`, when they are statistically identical. So what about the remainder of subtracting `I(X, Y)` from the joint entropy `H(X, Y)`? It is `0` when the variables are identical, and takes the maximum value `H(X, Y)` when they are totally unrelated. Could it be then that the following quantity:

`D(X, Y) = H(X, Y) - I(X, Y)`

is our distance function? We can use a simple Venn diagram to represent this function graphically:


The smaller the intersection is (the less correlated the variables are), the larger the area of the disjoint pieces, and so the larger the distance `D(X, Y)`. When the variables are independent, the intersection is empty and the distance reaches its maximum value `H(X, Y)`.
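To make the two extreme cases concrete, here is a small Python sketch that computes `D(X, Y) = H(X, Y) - I(X, Y)` from a joint probability table. The table format (a dictionary keyed by `(x, y)` pairs) and the function names are assumptions made for this example, and entropies are measured in bits.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_distance(joint):
    """D(X, Y) = H(X, Y) - I(X, Y) for a joint table {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal distribution of X
        py[y] = py.get(y, 0.0) + p   # marginal distribution of Y
    h_xy = entropy(joint.values())
    mutual = entropy(px.values()) + entropy(py.values()) - h_xy
    return h_xy - mutual

# Two fair coins that always agree, and two fair coins tossed independently.
identical = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(information_distance(identical))    # 0.0
print(information_distance(independent))  # 2.0
```

In the independent case the result equals the joint entropy of two fair coins (2 bits), matching the maximum value described above.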

To find out whether `D` really is a distance, we need to prove that it satisfies the three metric properties. From the Venn diagram itself we can quickly verify the first two: when the two circles overlap completely, the difference between the area of the union and the area of the intersection is exactly `0`, which means that `D(X, X) = 0`. We already discussed that if two variables are statistically identical then the mutual information is equal to the joint entropy, and so `D(X, Y) = 0`. For the converse, we just need to note that if the area of the intersection is the same as that of the union, then the only possibility is that the two circles coincide, hence `X = Y`.

The final part is to check the triangle inequality, meaning that we need to verify that:

`D(X, Y) \leq D(X, Z) + D(Z, Y)`

This looks like the most challenging step! However, we can put together a simple graphical proof inspired by the previous pictorial representation of our candidate distance function. Since this “informational distance” is precisely the portion of the joint entropy that is not shared between the two variables, we could represent the situation with three variables also via a Venn diagram as follows:


By hovering over the elements of the inequality, we can see that the sum `D(X, Z) + D(Z, Y)` is greater than or equal to `D(X, Y)`, since it covers the entire area of the union of the three circles, with the exception of the intersection between all of them.
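The graphical argument can be complemented with a numerical spot check. The sketch below draws random joint distributions for three discrete variables and counts how often `D(X, Y) > D(X, Z) + D(Z, Y)`. This is an experiment on random tables, not a proof, and the helper names, the number of levels, and the number of trials are arbitrary choices for the example. It uses the equivalent form `D(A, B) = 2 H(A, B) - H(A) - H(B)`.

```python
import itertools
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, keep):
    """Sum a joint table {(x, y, z): p} down to the variable positions in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def distance(joint, i, j):
    """D = H(A, B) - I(A, B) = 2 H(A, B) - H(A) - H(B) for the variables at positions i, j."""
    h_ij = entropy(marginal(joint, (i, j)).values())
    h_i = entropy(marginal(joint, (i,)).values())
    h_j = entropy(marginal(joint, (j,)).values())
    return 2 * h_ij - h_i - h_j

def random_joint(rng, levels=3):
    """Random joint distribution over three variables with `levels` values each."""
    outcomes = list(itertools.product(range(levels), repeat=3))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

rng = random.Random(1)
violations = 0
for _ in range(1000):
    joint = random_joint(rng)
    # Positions: 0 -> X, 1 -> Y, 2 -> Z. The small tolerance absorbs rounding error.
    if distance(joint, 0, 1) > distance(joint, 0, 2) + distance(joint, 2, 1) + 1e-9:
        violations += 1
print("violations:", violations)
```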

This visual demonstration relies on identifying the circles with the Shannon entropies of each variable, and the intersecting areas with the corresponding mutual informations. Do you think this identification is valid? Send me an email if you have some thoughts about these assumptions, or any other comments!

Additional reading
Check out this essay by Jim Bumgardner on Information Theory and Art, published in Issue #3 of the Mungbeing online magazine.

And another good example of combining online text with interactive illustrations of statistical concepts, in this case on the topic of P-hacking.

Implementation details
I used Processing and Miralib to generate the videos and images in the previous post, p5.js for the interactive snippets embedded in the blogpost, and MathJax for the mathematical formulas.

Search processes in correlation space

This new post is the continuation of a series of writings (1, 2) on discovering correlations in complex datasets. Some of the ideas I discussed so far have made their way into Mirador, a tool for visual exploratory analysis developed in collaboration with the Sabeti Lab at Harvard University and the Broad Institute. By visualizing “information distance” to construct a geometric representation of statistical correlation, I will describe the usage patterns within the interface of Mirador. Keep reading for the details!

As part of the development process of Mirador, last year we organized a contest where users were invited to submit their correlation findings in four public datasets: Behavioral Risk Factor Surveillance System, National Health and Nutrition Examination Survey, Lahman’s Baseball Database, and the World Bank Development Indicators.

One key goal of Mirador is to enable users to find new hypotheses without prior knowledge of where the interesting correlations might be. In order to reach this goal, we created an interface to simplify the process of searching through arbitrary combinations of variables and defining any subsamples of interest. The interface of Mirador can be seen as a probe to navigate a very large, virtually infinite space of possible correlations. A visualization of the trajectories of search processes could be very interesting in itself since it may reveal the breadth of data explored in Mirador, and perhaps elucidate the efficacy of the user interface.

Relying on Information Theory to quantify the separation between variables in a dataset, I carried out a few experiments in visualizing the search trajectories from submissions received during the Mirador Data Competition. I found the mathematical aspect of these experiments very interesting as well. It turns out we can actually formalize a notion of distance between random variables, and use it to construct a spatial representation of these trajectories (we will look into the details of this distance function in the next post).

In short, each trajectory is generated by taking the variables under inspection in Mirador, and placing springs between each pair of variables so that the rest lengths of the springs are equal to the “information distance” between the variables. In the video below, the paths of the variables in a selected pair are rendered in blue, while the text at the top indicates the subsamples set by the user during the exploration of the dataset:
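The spring idea can be sketched in a few lines of Python. This is a generic spring relaxation written for illustration, not the Processing and Miralib code behind the videos: the function name, the step size, and the number of iterations are all assumptions, and the rest lengths in the example are made-up numbers rather than information distances from a real dataset.

```python
import math
import random

def spring_layout(rest, steps=5000, rate=0.05, seed=0):
    """Relax 2-D positions so that pairwise distances approach the rest lengths.

    rest[i][j] is the rest length of the spring between particles i and j.
    """
    rng = random.Random(seed)
    n = len(rest)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                # Positive stretch pulls i toward j, negative stretch pushes it away.
                stretch = dist - rest[i][j]
                fx += stretch * dx / dist
                fy += stretch * dy / dist
            pos[i][0] += rate * fx
            pos[i][1] += rate * fy
    return pos

# Three hypothetical variables whose pairwise distances form a 3-4-5 triangle.
rest = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
pos = spring_layout(rest)
print(round(math.dist(pos[0], pos[1]), 2), round(math.dist(pos[1], pos[2]), 2))
```

With more than three variables the rest lengths usually cannot all be satisfied in the plane, so the layout settles into a compromise, which is part of why the trajectories move as variables are added and removed.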

The full videos for each trajectory are available following these links: Behavioural Risk Factors, Lahman’s Baseball Database, and World Bank Development Indicators. In these videos, the path of each particle representing a variable in the dataset disappears once the user removes that variable from the current view in Mirador. The static images below correspond to the same trajectories, but all the paths are aggregated so that the final result gives an overall representation of the entire search process:

The representation essentially lets us compare how a user’s perception of statistical relatedness varies with actual information distance. What I found striking in all these trajectories is that selected correlations are never the closest by distance. A reason for this could be that variables in very close proximity typically correspond to “trivial” associations (for example: age and education level). Potentially meaningful correlations exist in an intermediate range of distances that would make entirely automated search difficult, and thus require manual inspection from the user based on her expert knowledge or intuition.

While working on these visualizations, I kept thinking about the collision trajectories of atomic particles, as can be seen through a bubble chamber:

fermilab3

Although the images and videos came out differently (more random than I would have preferred, but still with each dataset having unique patterns), I think this is a relevant visual reference for the idea of the trajectories in correlation space. One could think of the variables in a dataset as some kind of elementary “data particles,” with the search process being the laws that define the movement in this space.

In the next blogpost, I will go over the mathematics behind the information distance, with the aid of some interactive visual demonstrations implemented with p5.js!

Mirador Data Competition: the winning entries

We recently organized the Mirador Data Competition, where participants were invited to explore public datasets in health, sports, and global development using the Mirador tool, submit their findings, and have a chance to win prizes. With the assistance of experts in the areas covered by the competition, we chose three winning entries, and today we have the pleasure to announce them.

The winners of the Mirador Data Competition are:

We believe these three correlations were worth choosing because they give us a glimpse of complex socio-economic processes, and highlight the potential of tools such as Mirador for generating tentative new hypotheses, as well as pointing to their limitations and possible improvements.

Findings were submitted as eikosogram plots, a representation that Mirador uses to explore many variables at once. I’ve gone into more detail about eikosograms in an earlier post. In order to elaborate on the winning entries I created three custom interactive versions that can be explored below. I tried to re-interpret them with a visualization that is better suited for each particular dataset.

First prize: María Fernanda Gándara. Outliers in Research and Development Expenditure

Although it might be somewhat expected, the more resources a country invests in R&D, the more people become researchers. But this submission reveals complex patterns of R&D investment and “resulting” number of researchers that vary widely across time and between countries. The visualization below shows a plot of the ratio of number of researchers per GDP percentage invested in R&D next to the original scatter plot. Are some countries more effective at training researchers for a given percentage of GDP investment?

Interact with the charts below to explore this question.


This visualization includes all countries in the Europe, Central Asia, and North America regions, between 2000 and 2013. However, María Fernanda also considered the percentage of secondary female teachers as a covariate in her analysis. You can explore the effect of this variable by clicking the following links to update the plot above. Show countries where the percentage of female teachers is less than 50%, more than 50%, or without constraint. According to María Fernanda “I explored for possible covariates that could strengthen the relationships, and that is how I chose the additional covariate of percentage of female teachers in secondary education. Therefore, the constrain of “percentage of female secondary teachers > 50%” was statistically based (in an exploratory fashion). I do think it can be interpreted, though. Since researchers are typically men, the more women working in secondary education, the more men “available” to become researchers.”

Second prize: Yuliia Khodakivska. The Boys of Mid-Summer?

The second prize entry is a very interesting correlation pointing to the fact that player salaries in baseball are influenced by many artificial effects, such as fixed pay scales and team caps. The data shows that players born in July have the highest median salaries in the league.

Somewhat related, an article from a few years ago shows that the month of birth for American League players peaks in August. However, this happens only for U.S.-born players; non-U.S. players don’t seem to be born in August in significantly higher proportions.

In order to visualize all these patterns, I added the birth counts for U.S. and non-U.S. born players for each month of the year, alongside the median salary as a function of month of birth. All these numbers were derived from the 2013 release of Lahman’s database.
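The aggregation behind a chart like this one is simple to sketch. The Python snippet below is not the script used for the post: the record layout is hypothetical and the six records are toy values, not rows from Lahman’s database.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy records: (birth_month, born_in_us, salary). Not real Lahman data.
players = [
    (7, True, 900_000), (7, False, 1_200_000), (8, True, 500_000),
    (8, True, 700_000), (1, False, 450_000), (7, True, 1_500_000),
]

def summarize(records):
    births = Counter()              # (month, born_in_us) -> number of players
    salaries = defaultdict(list)    # month -> list of salaries
    for month, born_in_us, salary in records:
        births[(month, born_in_us)] += 1
        salaries[month].append(salary)
    medians = {month: median(values) for month, values in salaries.items()}
    return births, medians

births, medians = summarize(players)
print(births[(8, True)], medians[7])  # 2 1200000
```

The median is used instead of the mean because a few very large contracts would otherwise dominate the monthly figure.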

Interact with the chart below to explore further.


Strangely enough, the peak in the median salary occurs in July, not August as one would expect following the argument in the article. Is this a real effect, or the result of a bug in the code or data? You can download our scripts to check for yourself. Our winner told us about her initial motivation to look at this correlation. Yuliia writes, “an article called How Common Is Your Birthday? (and data source) […] It contains infographics showing that July, August, and September seem to have more births, comparing to winter times.”

Third prize: Ching-Hsing Wang. Exercise and Health

A correlation between exercise and health could also be considered “expected”, as people who exercise regularly are probably in better health than those who do not (although the link to specific causal factors is less straightforward). The source is the CDC’s Behavioral Risk Factor Surveillance System, which is a phone-based health survey conducted every year to collect data on a variety of factors (demographics, alcohol consumption, employment, etc.).

Here, I thought it would be useful to take Ching-Hsing’s original submission and make it more general in this visualization, allowing readers to explore different association patterns. In order to control, at least to some extent, for confounding effects that might be influencing both general health and exercise, I restricted the visualization to respondents younger than 50 years who don’t report activity limitation due to health problems. It is interesting to see how the proportions of health levels change between males and females and across income groups.

Interact with the chart below and toggle between sex and income.


For instance, the proportion of exercising females who report excellent health is 4% higher than for exercising males in the top earning group. However, this difference reverses for lower income respondents: exercising males report excellent health at a higher rate than females. Are these patterns simply the result of random fluctuations in the sample data or due to real effects?

Representing correlations using eikosograms

In Mirador, correlations are primarily visualized with eikosograms. I’ve gone into more detail about them in an earlier post and online documentation. The figure below summarizes how to interpret an eikosogram plot:

The eikosogram on the left represents the correlation between two categorical variables. The height of the vertical columns indicates conditional probability. The eikosogram on the right depicts two numerical variables, in which case the vertical elements are boxplots for each bin in X

The eikosogram is constructed differently depending on whether the Y variable is nominal or numerical. In the former case, the vertical columns represent the conditional probability of each value of the Y variable given the X category. In the latter, a boxplot is constructed for each value of X, and the entire eikosogram is formed by all these boxplots placed next to each other. The dark blue box contains values one standard deviation around the mean, while the light blue extends up to two standard deviations.

Although the interpretation of the eikosogram is different depending on the variable type, in all cases it is easy to visually identify a correlation: unrelated variables have a flat eikosogram, because either the conditional probabilities or the boxplots are independent of X. It is also important to note that the scale of the X variable is not linear: the width of each X bin is proportional to the number of samples falling within that bin.
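For the categorical case, the numbers behind an eikosogram can be sketched as follows. This is an illustration of the description above rather than Mirador’s own code: the function name, the return format, and the toy observations are assumptions made for the example.

```python
from collections import Counter, defaultdict

def eikosogram(pairs):
    """Return {x: (width, {y: P(y | x)})} for a list of (x, y) observations.

    The width of each X column is the share of samples falling in that X
    category, and the heights inside the column are conditional probabilities.
    """
    x_counts = Counter(x for x, _ in pairs)
    xy_counts = defaultdict(Counter)
    for x, y in pairs:
        xy_counts[x][y] += 1
    total = len(pairs)
    return {
        x: (n / total, {y: c / n for y, c in xy_counts[x].items()})
        for x, n in x_counts.items()
    }

# Toy data in which Y does not depend on X, so the eikosogram is flat.
pairs = ([("a", "yes")] * 3 + [("a", "no")] * 1 +
         [("b", "yes")] * 3 + [("b", "no")] * 1)
for x, (width, heights) in eikosogram(pairs).items():
    print(x, width, heights)
```

Both columns come out with the same heights (0.75 for “yes” and 0.25 for “no”), which is the flat pattern expected for unrelated variables.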

Users in Mirador can also define arbitrary data ranges in order to control for various covariates or to stratify the sample into subpopulations of interest. Some of the submissions used covariates, while others were reported on the entire sample. The next gallery shows the three winning correlations as they were displayed in Mirador:

Some concluding thoughts

More rigorous analysis would need to be conducted in order to interpret these correlations, but the goal of exploratory tools such as Mirador is to reveal plausible patterns of association, and let users quickly visualize hypotheses based on their intuition and prior knowledge. Any correlation discovered with these tools should be regarded only as a suggestion for further analysis, which is also contingent on the context in which the tools are used: whether in education, research, or applied practice.

It could be argued that one can play with covariates in Mirador until finding a statistically significant association, but as contestant María Fernanda pointed out, this is valid practice in “data-driven” exploratory analysis: the interpretation stage comes later, at which point one can discard the correlation altogether, or conduct further analysis using more powerful tools or better datasets. The feedback received from users so far has been very positive, highlighting both Mirador’s advantages (free availability, ease of use, interactive correlation analysis) and the areas where it could be improved (inclusion of other datasets, better scatter plot functionality, more advanced statistical analysis).

Data and code availability

The code that generates the data files used in this blog post and the JavaScript visualizations is hosted on this repository. Follow the next links to download the individual data files for the first, second, and third entries.

Acknowledgments

We would like to recognize all the participants of the Mirador Data Competition for their submissions, and Gregory Piatetsky-Shapiro for helping us announce the competition. I would also like to thank Tariq Khokhar for his feedback on the World Bank submissions, Sean Lahman for his feedback on the baseball correlations, and Pearly Dhingra for the insightful discussions about associations in health data and confounding effects. Finally, many thanks to Lauren McCarthy and the rest of the p5.js team. All the interactive visualizations were created with Processing and ported to p5.js.

Marriage, Health, and Jobs

Public data is increasingly available from multiple sources: governments, economists, and research communities, to name a few. Open access is a fundamental prerequisite for civic participation and transparency, but freely-available and intuitive tools that allow users to extract meaningful narratives from the data are also crucial. That was our central motivation to develop the visualization tool Mirador, and also for the Mirador Data Competition we launched last month. The richness of public datasets is often extraordinary, and many of them are the result of the continued efforts of data collection teams, statisticians, and researchers over several years, sometimes decades. In this post, I would like to share some associations I found using Mirador on a large dataset of behavioral risk factors. These associations stand here simply as suggestive hints or directions that one can use to delve further into the data using more rigorous statistical analyses. This highlights the main purpose of Mirador as a visual exploratory tool.

Many others have recognized the importance of open data and public participation, and have organized similar data challenges or competitions in the past to spur the interest of various audiences in data visualization and analysis. Around the time we launched our own competition, I came across the HHS VizRisk, an event organized by the U.S. Department of Health & Human Services that seeks visualizations of behavioral data to inform personal and policy decisions. The main dataset in VizRisk comes from the Behavioral Risk Factor Surveillance System (BRFSS), a nation-wide phone survey that collects information about health risk behaviors.

I compiled the BRFSS data made available for VizRisk into Mirador’s format (which is basically a CSV table plus some additional metadata) and did some quick explorations of my own. The screen capture below shows Mirador after loading the 2011 BRFSS data, comprising around 500,000 respondents:

mirador-mhj

The happiness of the self-employed

A variable in the BRFSS dataset that I believe is reasonable to choose as a global indicator of well-being is “General Health”. Respondents are asked to characterize their health status using 5 options: poor, fair, good, very good, and excellent. So, it would be interesting to look at association patterns between General Health and other socio-behavioral indicators. One association that stood out for me is between General Health and Employment status. This other variable records if the respondent is employed for wages, self-employed, student, unemployed, etc. I call this association the “happiness of the self-employed”, because for the entire sample of 500,000 respondents you can see that self-employment is associated with a slight increase in reported excellent general health:

GENHLTH-EMPLOY-UI

We can conclude that self-employed people are more likely to respond that they have excellent General Health, although the difference is only a few percentage points. Before going any further, let’s first make clear what this plot (called an eikosogram) means: the percentage 25.43% is highlighted for the “self-employed” category in the column (corresponding to the Employment Status variable), and the “excellent” category in the row (corresponding to the General Health variable). This means that 25.43% of the self-employed respondents answer that they have excellent general health. In other words, it is a conditional probability that can be denoted in mathematical notation as:

P(excellent health|self-employed) = 0.2543

Mirador is designed for interactive visualization, so only the labels of the hovered items are shown in the interface. For clarity, I have saved the health-employment eikosogram and added the labels for some of the categories (employed for wages, self-employed, homemaker, and student):

GENHLTH-EMPLOY-labels

Next, we can explore what factors might affect this association. Variables such as sex, age, income and ethnic group would probably have an impact on it. It is easy to check with Mirador the effect of any of these factors. For example, the percentage of women reporting excellent health when they are self-employed in relation to employed for wages is higher than for men: 27% versus 24%, while both report similar levels of excellent health for the employed status:

employment-gender

Of course, correlation does not imply causation, but it is worth noting nonetheless. Since we can easily adjust for other socioeconomic factors, I searched for a combination of factors that maximizes the “happiness” among self-employed respondents.

Age and income have a large effect, with middle-aged individuals in higher income brackets being the ones who most often report excellent health among the self-employed. After fixing the covariates in those ranges (age: 35-54, income 35k+), I started looking at the association among different ethnicities, as classified in the BRFSS data: white, black, asian, hispanic, pacific islander, native american, and other/multiracial. What I found is that the highest levels of excellent health for self-employed respondents occur for the Asian group. The difference between employed for wages and self-employed is quite substantial for this group, approximately 25% versus 44%:

employment-ethnicity

What can we conclude from this pattern in the data? Again, correlation is not causation, but we can wonder if this pattern is due to cultural or economic factors. It is not possible to say from the data, but at least we have a tentative hypothesis we can test further. We also have to be careful with the fact that when we control for several factors (age, income, ethnicity), the sample size decreases dramatically, which makes our conclusions weaker. For example, the number of respondents in the subgroup of Asian respondents, 35-54 years of age, with income higher than 35k, is only 2,086. For a visual illustration of the so-called “curse of dimensionality,” check this interactive web app.
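The shrinking of the sample under successive conditions is easy to reproduce with synthetic data. In the Python sketch below everything is made up for illustration: the respondents are random, and the field names and category shares are assumptions that do not come from the BRFSS.

```python
import random

rng = random.Random(0)
# Hypothetical synthetic respondents; field names and category shares are assumptions.
respondents = [
    {"age": rng.randint(18, 90),
     "income": rng.choice([10, 20, 35, 50, 75]),
     "group": rng.randrange(7)}
    for _ in range(10_000)
]

filters = [
    ("age 35-54", lambda r: 35 <= r["age"] <= 54),
    ("income 35k+", lambda r: r["income"] >= 35),
    ("one of seven groups", lambda r: r["group"] == 0),
]

sizes = [len(respondents)]
selected = respondents
for label, keep in filters:
    # Each extra condition can only keep or shrink the current subsample.
    selected = [r for r in selected if keep(r)]
    sizes.append(len(selected))
    print(f"{label:>20}: {len(selected)} respondents left")
```

Each condition multiplies the remaining sample by the fraction of respondents that satisfy it, so three or four conditions are enough to turn a very large survey into a subgroup of a few thousand people or fewer.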

Better growing old alone… if we have enough money?

Another factor that clearly affects “happiness” is the relationship status of individuals. BRFSS includes a Marital Status variable with several categories, but in order to keep the plots simple I restricted the visualization to Married, Divorced, Widowed, and Never Married. The eikosogram between General Health and Marital Status looks as follows:

MARITAL-GENHLTH-labels

The health levels are suspiciously high among the never married category. However, this plot was generated using the entire population sample, which covers all ages starting at 18. By differentiating between age groups and also gender we get a better representation of the change in health patterns among subpopulations with different marital status:

marriage-age

Some of the patterns are expected or known, for example health levels decrease as people age, and the fraction of married women up to 34 years of age is higher than that for men. In addition to that, the fraction of men in the 25-34 age bracket reporting excellent health among non-married individuals is higher (in fact, similar to those married) than for women. Is this a manifestation of the social pressure acting on women to get married before their mid-thirties? Again, we cannot draw these causal conclusions from the data, but at least we can use the visual patterns as a guide for more detailed analyses.

It is also not surprising to find that income levels have a strong correlation with health. But perhaps more interesting is to see how the association between health and marital status dramatically changes its direction when discriminating between high and low earners. When we aggregated all the data for people 55 years or older, we saw in the previous animation a marked decrease in health among individuals that ended up single, either due to divorce, death of partner, or simply by not getting married. But if we now restrict the analysis to people with income levels above $50,000, then there is no longer a decrease, especially among divorced individuals:

marriage-income

Conclusions

I think that these “non-rigorous” findings are a good illustration of the usefulness of exploratory data analysis as a first step. By quickly defining cross-sections of the data and controlling for multiple factors (always within the limits of what the sample size allows) we can use interactive visualization to guide our intuition and find new tentative hypotheses.

If you are interested in exploring the BRFSS and other similar datasets with Mirador, remember that the Mirador Data Competition is still open until next week, and you can win some prizes by submitting your findings!

Finally, over the past months I have been compiling publicly available datasets in this public list. Feel free to add more links to it.

Launching the Mirador Data Competition!

Today we are announcing the Mirador Data Competition, the goal of which is to make discoveries in large and complex public datasets. The good news is that we have been developing a program to help you make these discoveries: it’s called Mirador.

The competition runs from September 28th to October 28th, so mark your calendars. During this time you can continue to upload findings to your user account from the app. Visit the competition page for complete instructions on how to get started.

The Sabeti Lab is offering cash prizes for the top three findings, which will be chosen by a jury of experts in the respective domains of each dataset.

Check out the official Mirador Data Competition video below:

We have chosen four public datasets in the areas of health, sports, and global development:

Each one of these datasets is very rich in complex relationships between literally thousands of variables, and even though some of them have been extensively studied by specialists, there is more to be discovered. We also want to highlight the importance of open data as an enabler for transparency and public participation in research, governance, journalism, and economics, just to name a few areas. Please visit the competition website, create an account, and start exploring correlations to win cash prizes!

Last, but not least, we would like to thank our summer interns at the Broad Institute for their work: Mahan Nekoui, who implemented the user submission system, and Tom Silver, who created the intro video.

Founded in 2010 by Ben Fry, Fathom Information Design works with clients to understand complex data through interactive tools and software for mobile devices, the web, and large format installations. Out of its studio in Boston, Fathom partners with Fortune 500s and non-profit organizations across sectors, including health care, education, financial services, media, technology, and consumer products.
