Entropy

by JS

Entropy has always struck me as a somewhat puzzling concept. The definition is simple enough

 H(X) = - \sum_i p_i \log p_i

but raises all kinds of puzzling questions. What is with the minus sign? Why use the logarithm? Imagine for a moment that you do not know anything about logarithms, and you were trying to understand the definition above

 H(X) = - \sum_i p_i f(p_i)

where  f \equiv \log . Of course, we still have this mysterious minus sign, so lets fold that into our mystery function  g \equiv - f . Now

 H(X) = \sum_i p_i g(p_i)

looks like a weighted average of some mystery function  g .

Let’s go one step further and rewrite this definition in terms of outcomes where P(X = w_i) = p_i and h(w_i) \equiv g(P(X=w_i)) is a mystery function of the outcomes w_i of X. Then we have

 H(X) = \sum_i P(X = w_i) h(w_i).

The weights, quite naturally, are the probabilities of various outcomes. Compare this, for instance, with the expected value

E(X) = \sum_i P(X = w_i) w_i .

Now we need to introduce some intuition for entropy. We want to capture a certain common sense notion of “disorder.” In the language of probability theory, a random variable is maximally disordered if all outcomes are equally likely.

But if what if all outcomes are not equally likely? Some outcomes may be surprising (because they have low probability). Some outcomes may be unsurprising (because they have high probability). This notion of “surprise” is precisely what the mystery function is trying to measure. And entropy is the weighted average of surprise.

So, here are some properties we’d like our surprise function to have:

  • h(w_i) should be large when p_i (the probability of w_i) is small.
  • h(w_i) should be small when p_i (the probability of w_i) is large.

We can probably think of a number of functions  h that have these two properties. It turns out that we need to satisfy another property as well.

  • Suppose w_i is an event that is composed of two mutual independent events a_i and b_i. Then we’d like h(w_i) = h(a_i) + h(b_i).

Intuitively, an event should be as surprising as the sum of the surprise of any sub events. Consider the following function of an event w_i

h(w_i) = \frac{1}{P(X = w_i)}.

For large  P(X = w_i) , h(w_i) is small. Similarly, for small  P(X = w_i) , h(w_i) is large. Unfortunately, this function does not satisfy property 3 since for mutually independent sub-events, we have

h(w_i) =  \frac{1}{P(X = a_i)P(X=b_i)} \ne   \frac{1}{P(X = a_i)} + \frac{1}{P(X=b_i)}.

By using the logarithm we can define a function h that satisfies all the properties we want, including the summation property:

h(w_i) =  \log(\frac{1}{P(X = a_i)P(X=b_i)}) =   \log(\frac{1}{P(X = a_i)}) + \log(\frac{1}{P(X=b_i)}).

Now observe that

h(w_i) = \log(\frac{1}{P(X = w_i)}) = - \log(P(X = w_i)) = -\log(p_i)

leading to the definition of entropy that we started with!