In my last blogpost, I showed some visualizations generated by usage data from our tool Mirador. These visualizations rely on the calculation of a “distance” between variables in a dataset, and Information Theory allows us to define such distance, as we will see below.

The notion of distance is essential to most visual representations of data, and we are intuitively– possibly innately– familiar with it. If we are in two or three dimensional space, we can use the Euclidean distance between two points `p_{1}=(x_{1}, y_{1})` and `p2=(x_{2}, y_{2})`, defined as `d(p_{1}, p_{2}) = sqrt((x_{1} – x_{2})^2 + (y_{1} – y_{2})^2)`, to determine the distance between any pair of points in the space.

But how do we define “distance” between more abstract entities, such as random variables? Mathematically, a distance function in an arbitrary set is a function that gives a real number for any pair of objects from the set, and satisfies the following “metric” properties:

- `d(p, p) = 0` for any `p`. The distance of any element with itself is always zero.
- `d(p, q) = 0` if and only if `p = q`. The distance between two objects can only be zero when the two objets are identical, and vice versa.
- `d(p, q) \leq d(p, w) + d(w, q)`. This last property is called the
*triangle inequality*, and geometrically means that the distance traversed between two objects `p` and `q` is always less than traversing through an intermediate object `w`:

Any function that satisfies these three properties is called a distance. The Euclidean distance discussed before is one such function, but there are other distance functions in 2-D or 3-D space that are not Euclidean, for example the Manhattan and Chebyshev distances.

Thus, if we are in the 2-D or 3-D spaces there are several distance functions we can use to quantify how far apart are pairs of elements from each other. However, if we are working with sets of elements that are not 2-D or 3-D vectors, it can be harder to get a sense of “distance” between “points” in the space. I found it very interesting that we can actually define a proper distance function between arbitrary random variables. In a previous post, I did an informal introduction of the Shannon Entropy `H(X)`, a mathematical measure of the amount of “surprise” received upon measuring a random variable `X`. This definition led us to the concept of mutual information `I(X, Y)`, which quantifies the level of statistical dependency between two variables `X` and `Y`.

We concluded that `I(X, Y) = H(X) + H(Y) – H(X, Y)`, which we can visualize as the area shared between the marginal entropies `H(X)` and `H(Y)`, as depicted in this diagram.

The mutual information varies between `0`, when the two variables are independent, and `H(X, Y)`, when they are statistically identical. So what about the remainder of subtracting `I(X, Y)` from the joint entropy `H(X, Y)`? It is `0` when the variables are identical, and takes the maximum value `H(X, Y)` when they are totally unrelated. Could it be then that the following quantity:

`D(X, Y) = H(X, Y) – I(X, Y)`

is our distance function? We can use a simple Venn diagram to represent this function graphically:

The smaller the intersection is (the less correlated the variables are) then the larger the area of the disjoint pieces will be, and so the distance `D(X, Y)`. When the variables are entirely uncorrelated, then the intersection is empty and the distance reaches its maximum value `H(X, Y)`.

In order to find out, we need to prove that this function does indeed satisfy the three metric properties. From the Venn diagram itself we can quickly verify the first two: when the two circles are completely overlapping, then the difference between area of the intersection and the area of the union is exactly `0`, which means that `D(X, X) = 0`. We already discussed that if two variables are statistically identical then the mutual information is equal to the joint entropy and so `D(X, Y) = 0`. For the converse, we just need to note that if the area of the intersection is the same as that of the union, then the only possibility is that the two ellipses are coincident, hence `X = Y`.

The final part is to check the triangle inequality, meaning that we need to verify that:

`D(X, Y) \leq D(X, Z) + D(Z, Y)`

This looks like the most challenging step! However, we can put together a simple graphical proof inspired by the previous pictorial representation of our candidate distance function. Since this “informational distance” is precisely the portion of the joint entropy that is not shared between the two variables, we could represent the situation with three variables also via a Venn diagram as follows:

By hovering over the elements of the inequality, we can see that the sum `D(X, Z) + D(Z, Y)` is greater or equal than `D(X, Y)` since it covers the entire area of the union of the three circles, with the exception of the intersection between all of them.

This visual demonstration relies on identifying the circles with the Shannon entropies of each variable, and the intersecting areas with the corresponding mutual informations. Do you think this identification is valid? Send me an email if you have some thoughts about these assumptions, or any other comments!

# Additional reading

Check out this essay by Jim Bumgardner on Information Theory and Art, published on the Issue #3 of the Mungbeing online magazine.

And another good example of combining online text with interactive illustrations of statistical concepts, in this case on the topic of P-hacking.

# Implementation details

I used Processing and Miralib to generate the videos and images in the previous post, p5.js for the interactive snippets embedded in the blogpost, and MathJax for the mathematical formulas.