
Jensen's Inequality

   Feb 14, 2024     5 min read

This post is all about Jensen’s inequality, an elementary but powerful result about convex functions. It comes up all over the place in mathematics, probably because the concept of convexity is itself so fundamental.

Convex functions

Before even stating Jensen’s inequality, we need to go over the basics of the theory of convex functions. A function is said to be convex if the set of points in the plane which lie above its graph forms a convex set. More formally:

Let $f \colon I \to \R$ be a function defined on an interval $I \subseteq \R$. $f$ is _convex_ if $f(t a + (1-t) b) \leq t f(a) + (1-t) f(b)$ for all $a, b \in I$ and all $t \in [0,1]$.

The left-hand side of the inequality represents the values of $f$ at points between $a$ and $b$, while the right-hand side represents the values of the chord joining the points $(a, f(a))$ and $(b, f(b))$; thus the inequality says that between $a$ and $b$ the graph of $f$ lies below the chord joining these two points.
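
Readers who like to poke at definitions numerically can spot-check the inequality directly; in the sketch below the choice of $f(x) = e^x$ and the use of NumPy are incidental assumptions, not part of the theory:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # e^x is convex, so the defining inequality should hold for it

a, b = rng.uniform(-3.0, 3.0, size=2)
for t in np.linspace(0.0, 1.0, 11):
    point = t * a + (1 - t) * b            # a point between a and b
    lhs = f(point)                         # value of f at that point
    rhs = t * f(a) + (1 - t) * f(b)        # value of the chord at that point
    assert lhs <= rhs + 1e-12              # the graph lies below the chord
```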

Jensen’s inequality

We are now ready to formulate and prove Jensen’s inequality. It is an assertion about how convex functions interact with expected values of random variables, and we will formulate it on an abstract probability space $(\Omega, \Sigma, P)$ where $\Omega$ is a set, $\Sigma$ is a $\sigma$-algebra of subsets of $\Omega$, and $P$ is a probability measure on $(\Omega, \Sigma)$. Readers uncomfortable with this formalism are welcome to restrict to the case where $\Omega = \br{\omega_1, \ldots, \omega_n}$ is a finite set to which we assign probabilities $P(\omega_i) = p_i$, where the $p_i$’s are nonnegative and sum to $1$. This is enough for the applications in this post!

Given a random variable $X \colon \Omega \to \R$, recall that the expectation of $X$ is by definition:

\[E(X) = \int_\Omega X\, dP\]

In the case where $\Omega$ is finite, this just reduces to $E(X) = \sum_{i=1}^n p_i X(\omega_i)$. We will only really use three properties of the expectation operator:

  • $E$ is monotone: if $X \leq Y$ almost everywhere with respect to $P$ then $E(X) \leq E(Y)$.
  • $E$ is linear: if $X$ and $Y$ are random variables and $a, b \in \R$ then $E(a X + b Y) = a E(X) + b E(Y)$.
  • $E$ is positive definite: if $X \geq 0$ and $E(X) = 0$ then $X = 0$ almost everywhere with respect to $P$.

Finally, given a random variable $X \colon \Omega \to \R$ and a function $f \colon \R \to \R$, recall that $f(X)$ is by definition the random variable $f \circ X \colon \Omega \to \R$. This still makes sense even if $f$ is defined only on the range of $X$.
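
To make the finite picture concrete, here is a minimal sketch with made-up probabilities and values, computing $E(X)$ as a weighted sum and $f(X)$ as the composition $f \circ X$ (the specific numbers and the use of NumPy are illustrative assumptions only):

```python
import numpy as np

# A finite probability space Omega = {omega_1, omega_2, omega_3}.
p = np.array([0.2, 0.5, 0.3])      # probabilities: nonnegative and summing to 1
x = np.array([-1.0, 0.5, 2.0])     # the values X(omega_i)

E_X = np.dot(p, x)                 # E(X) = sum_i p_i X(omega_i)

f = np.exp                         # any function defined on the range of X
E_fX = np.dot(p, f(x))             # E(f(X)) = sum_i p_i f(X(omega_i))

print(E_X, E_fX)
```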

Let $X$ be a random variable defined on a probability space $(\Omega, \Sigma, P)$ and let $f \colon I \to \R$ be a convex function defined on an interval $I \subseteq \R$ containing the range of $X$. Then: $$f(E(X)) \leq E(f(X))$$ with equality if and only if either $X$ is constant or $f$ coincides with a line on an interval which contains the range of $X$ (up to sets of measure zero).
Since $f$ is convex it admits a supporting line $\ell$ at the point $x_0 = E(X)$. This means $\ell(E(X)) = f(E(X))$ and $\ell(x) \leq f(x)$ for all $x \in I$; in particular, $\ell(X) \leq f(X)$. By monotonicity and linearity of expectation, we get: $$E(f(X)) \geq E(\ell(X)) = \ell(E(X)) = f(E(X))$$

To handle the equality case, observe that: $$E(f(X)) - E(\ell(X)) = E((f-\ell)(X)) = \int_\Omega (f - \ell) \circ X\, dP$$ But the integrand is nonnegative since $\ell(x) \leq f(x)$, so by positive definiteness the equality case is equivalent to $(f - \ell) \circ X = 0$ almost everywhere, or in other words $f = \ell$ almost everywhere on the range of $X$.

Let $R$ denote the set of all points in the range of $X$ at which $f = \ell$. If $R$ contains only one point then $X$ is constant almost everywhere; otherwise let $a < b$ be two points in $R$. Any point between $a$ and $b$ can be expressed as $x_t = t a + (1-t) b$ for some $t \in [0,1]$, and we have: $$\ell(x_t) \leq f(x_t) \leq t f(a) + (1-t) f(b) = t \ell(a) + (1-t) \ell(b) = \ell(t a + (1-t) b) = \ell(x_t)$$ Plainly, this forces $f(x_t) = \ell(x_t)$. It follows that $f = \ell$ on the interval $[a,b]$, and by taking the union over all such pairs we see that $f = \ell$ on the interval $(\inf R, \sup R)$, which contains the range of $X$ up to a set of measure zero.
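
The proof is the real content here, but it is also easy to spot-check the inequality by brute force on random finite distributions; in the sketch below the convex function $f(x) = e^x$ and the random data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.exp  # convex on all of R

for _ in range(1000):
    n = int(rng.integers(2, 10))
    p = rng.random(n)
    p /= p.sum()                      # a random probability vector
    x = rng.normal(size=n)            # the values of X
    lhs = f(np.dot(p, x))             # f(E(X))
    rhs = np.dot(p, f(x))             # E(f(X))
    assert lhs <= rhs + 1e-9          # Jensen's inequality
```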

The proof of this result was relatively straightforward, but it depended crucially on all of the setup work that we did involving convex functions and supporting lines. We’ll see this pay off in the next section.

Applications

Let $\mu$ denote the arithmetic mean of the numbers $x_1, \ldots, x_n$, let $m$ denote their median, and let $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$ denote their standard deviation. Then: $$\abs{\mu - m} \leq \sigma$$ with equality if and only if the $x_i$'s are all equal.
Let $X$ denote a random variable which takes each of the values $x_1, \ldots, x_n$ with probability $1/n$, so that $E(X) = \mu$ and $\sigma = \sqrt{E((X - \mu)^2)}$. Clearly $\mu - m = E(X) - m = E(X - m)$. The function $x \mapsto \abs{x}$ is convex ($\ell(x) = x$ is a supporting line at all $x_0 \geq 0$, and $\ell(x) = -x$ is a supporting line at all $x_0 \leq 0$), so by Jensen's inequality we have: $$\abs{\mu - m} = \abs{E(X - m)} \leq E(\abs{X - m})$$ According to the previous proposition $E(\abs{X - m})$ is the minimum value of the function $t \mapsto E(\abs{X - t})$, so plugging in $t = \mu$ we get: $$E(\abs{X - m}) \leq E(\abs{X - \mu})$$

The function $x \mapsto x^2$ is convex (its second derivative is $2$ at all points), so by Jensen's inequality again we have: $$\left( E(\abs{X - \mu}) \right)^2 \leq E\left( \abs{X - \mu}^2 \right) = E((X - \mu)^2)$$ Taking square roots and putting it all together, we get: $$\abs{\mu - m} \leq E(\abs{X - \mu}) \leq \sqrt{E((X - \mu)^2)} = \sigma$$

For the equality case, note that the second application of Jensen's inequality is an equality only if $\abs{X - \mu}$ is equal to some constant $c$, since $x^2$ is not linear on any interval. In that case $X$ takes only the values $\mu - c$ and $\mu + c$; if $c > 0$ then, since their mean is $\mu$, each value must occur with probability $1/2$, forcing $m = \mu$ and hence $\abs{\mu - m} = 0 < c = \sigma$. So equality in the proposition can only occur when $c = 0$, i.e. when the $x_i$'s are all equal.
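
As a sanity check, one can verify the proposition numerically on random samples; the skewed exponential data below is an arbitrary choice, made so that the mean and median actually differ:

```python
import numpy as np

rng = np.random.default_rng(2)

for _ in range(1000):
    n = int(rng.integers(2, 50))
    x = rng.exponential(size=n)      # skewed data, so mean and median differ
    mu = x.mean()
    m = np.median(x)
    sigma = x.std()                  # population convention: sqrt(E((X - mu)^2))
    assert abs(mu - m) <= sigma + 1e-12
```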

Both applications of Jensen’s inequality in this argument were a little cheap; the first one is really just the triangle inequality, and the second one is the Cauchy-Schwarz inequality. But expressing the argument in this way buys a lot of generality for free; for instance if $x_1, \ldots, x_n$ are points in $\R^d$ instead of $\R$ then one can define the spatial median to be any point which minimizes the absolute deviation function $t \mapsto E(\abs{X - t})$, and the argument above gives an inequality between the mean, spatial median, and standard deviation almost verbatim. The argument also has generalizations to weighted and continuous probability distributions, all of which come for free without having to fuss about what the standard inequalities should look like.
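
To illustrate that last point, here is a sketch of the $\R^d$ version in which the spatial median is found by handing the absolute deviation function to a generic optimizer; scipy.optimize.minimize is used only as a convenient stand-in for a purpose-built geometric-median algorithm, and the sample data is arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) ** 3                       # 200 skewed points in R^3

mu = X.mean(axis=0)                                      # the mean vector
sigma = np.sqrt(np.mean(np.sum((X - mu) ** 2, axis=1)))  # sqrt(E(|X - mu|^2))

def abs_dev(t):
    # mean Euclidean distance from the point t to the data points
    return np.mean(np.linalg.norm(X - t, axis=1))

m = minimize(abs_dev, x0=mu).x                           # the spatial median

print(np.linalg.norm(mu - m), "<=", sigma)
```

On any data set the printed gap should come out below $\sigma$, mirroring the scalar argument above.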