- About
- Work
- Notebook
- Careers

Mirador Data Competition: the winning entries

We recently organized the Mirador Data Competition, where participants were invited to explore public datasets in health, sports, and global development using the Mirador tool, submit their findings, and have a chance to win prizes. With the assistance of experts in the areas covered by the competition, we chose three winning entries, and today we have the pleasure to announce them.

The winners of the Mirador Data Competition are:

**First prize:**Maria Fernanda Gándara. Correlation between Researchers in R&D and Research and development expenditure, from the World Bank Development Indicators.**Second prize:**Yuliia Khodakivska. Correlation between Salary and Month of Birth in the 2013 Lahman's Baseball Database.**Third prize:**Ching-Hsing Wang. Correlation between General Health and Exercise in the past 30 days, from the 2012 Behavioral Risk Factor Surveillance System.

Findings were submitted as *eikosogram* plots, a representation that Mirador uses to explore many variables at once. I've gone into more detail about eikosograms in an earlier post. In order to elaborate on the winning entries I created three custom interactive versions that can be explored below. I tried to re-interpret them with a visualization that is better suited for each particular dataset.

*Interact with the charts below to explore this question.*

This visualization includes all countries in the Europe, Central Asia, and North America regions, between 2000 and 2013. However, María Fernanda also considered the percentage of secondary female teachers as a covariate in her analysis. You can explore the effect of this variable by clicking the following links to update the plot above. Show countries where the percentage of female teachers is less than 50%, more than 50%, or without constraint. According to María Fernanda "I explored for possible covariates that could strengthen the relationships, and that is how I chose the additional covariate of percentage of female teachers in secondary education. Therefore, the constrain of “percentage of female secondary teachers > 50%” was statistically based (in an exploratory fashion). I do think it can be interpreted, though. Since researchers are typically men, the more women working in secondary education, the more men "available" to become researchers."

Somewhat related, an article from a few years ago shows that the month of birth for an American League player peaks on August, however this happens only for U.S. born players, non-U.S. players don't seem to be born in August on significantly higher proportions.

In order to visualize all these patterns, I added the birth counts for U.S. and non-U.S. born players for each month of the year, alongside the median salary as a function of month of birth. All these numbers were derived from the 2013 release of the Lahman's database.

*Interact with the chart below to explore further.*

Strangely enough, the peak in the median salary occurs in July, not August as one would expect following the argument in the article. Is this a real effect, or the result of a bug in the code or data? You can download our scripts to check for yourself. Our winner told us about her initial motivation to look at this correlation. Yuliia writes, "an article called How Common Is Your Birthday? (and data source) [...] It contains infographics showing that July, August, and September seem to have more births, comparing to winter times."

Here, I thought it would be useful to take Ching-Hsing's original submission and make it more general in this visualization, allowing readers to explore different association patterns. In order to control, at least to some extent, for confounding effects that might be influencing both general health and exercise, I restricted the visualization to respondents younger than 50 years who don't report activity limitation due to health problems. It is interesting to see how the proportions of health levels change between males and females and across income groups.

*Interact with the chart below and toggle between sex and income.*

For instance, the proportion of exercising females who report excellent health is 4% higher than for exercising males in the top earning group. However, this difference reverts for lower income respondents: exercising males report higher excellent health than females. Are these patterns simply the result of random fluctuations in the sample data or due to real effects?

The eikosogram is constructed differently depending on whether the Y variable is nominal or numerical. In the former case, the vertical columns represent the conditional probability of each value of the Y variable given the X category. In the former, a boxplot is constructed for each value of X, and the entire eikosogram is formed by all these boxplots placed next to each other. The dark blue box contains values one standard deviation around the mean, while the light blue extends up to two standard deviations.

Although the interpretation of the eikosogram is different depending on the variable type, in all cases it is easy to visually identify a correlation: unrelated variables have a flat eikosogram, because either the conditional probabilities or the boxplots are independent of X. It is also important to note that the scale of the X variable is not linear: the width of each X bin is proportional to the number of samples falling within that bin.

Users in Mirador can also define arbitrary data ranges in order to control by various covariates or to stratify the sample into subpopulations of interest. Some of the submissions used covariates, while others were reported on the entire sample. The next gallery shows the three winning correlations as they were displayed in Mirador:

It could be argued that one can play with covariates in Mirador until finding a statistically significant association, but as contestant María Fernanda pointed out, this is valid practice in "data-driven" exploratory analysis: the interpretation stage comes later, at which point one can discard the correlation altogether, or conduct further analysis using more powerful tools or better datasets. The feedback received from users so far has been very positive, highlighting both Mirador's advantages (free availability, ease of use, interactive correlation analysis) and the areas where it could be improved (inclusions of other datasets, better scatter plot functionality, more advanced statistical analysis).