Jensen's Inequality
This post is all about Jensen’s inequality, an elementary but powerful result about convex functions. It comes up all over the place in mathematics, probably because the concept of convexity is itself so fundamental.
Convex functions
Before even stating Jensen’s inequality, we need to go over the basics of the theory of convex functions. A function is said to be convex if the set of points in the plane which lie above its graph forms a convex set. More formally, a function $f \colon \R \to \R$ is convex if for all $a, b \in \R$ and all $t \in [0, 1]$ we have:

\[f(ta + (1-t)b) \leq t f(a) + (1-t) f(b)\]

The left-hand side of the inequality represents the values of $f$ at points between $a$ and $b$, while the right-hand side represents the values of the line joining the points $(a, f(a))$ and $(b, f(b))$; thus the inequality says that between any two of its points, the graph lies on or below the chord joining them.
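For example, $f(x) = x^2$ is convex: for any $a, b \in \R$ and $t \in [0, 1]$ we have

\[t a^2 + (1-t) b^2 - (ta + (1-t)b)^2 = t(1-t)(a - b)^2 \geq 0\]

so the defining inequality holds. The function $\abs{x}$ is convex too, as the triangle inequality shows; both of these examples will reappear in the applications below.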
Jensen’s inequality
We are now ready to formulate and prove Jensen’s inequality. It is an assertion about how convex functions interact with expected values of random variables, and we will formulate it on an abstract measure space $(\Omega, \Sigma, P)$ where $\Omega$ is a set, $\Sigma$ is a $\sigma$-algebra of subsets of $\Omega$, and $P$ is a probability measure on $(\Omega, \Sigma)$. Readers uncomfortable with this formalism are welcome to restrict to the case where $\Omega = \br{\omega_1, \ldots, \omega_n}$ is a finite set to which we assign probabilities $P(\omega_i) = p_i$, where the $p_i$’s are nonnegative and sum to $1$. This is enough for the applications in this post!
Given a random variable $X \colon \Omega \to \R$, recall that the expectation of $X$ is by definition:

\[E(X) = \int_\Omega X\, dP\]

In the case where $\Omega$ is finite, this just reduces to $E(X) = \sum_{i=1}^n p_i X(\omega_i)$. We will only really use three properties of the expectation operator:
- $E$ is monotone: if $X \leq Y$ almost everywhere with respect to $P$ then $E(X) \leq E(Y)$
- $E$ is linear: if $X$ and $Y$ are random variables and $a, b \in \R$ then $E(a X + b Y) = a E(X) + b E(Y)$
- $E$ is positive definite: if $X \geq 0$ and $E(X) = 0$ then $X = 0$ almost everywhere with respect to $P$.
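To make the finite case concrete, here is a minimal sketch in Python; the outcomes, probabilities, and random variables are made up purely for illustration, and the check at the end is a numerical sanity check of linearity, not a proof.

```python
# A finite probability space: Omega = {omega_1, omega_2, omega_3} with
# probabilities p_i and two random variables, all made up for illustration.
p = [0.2, 0.5, 0.3]        # probabilities p_i (nonnegative, summing to 1)
X = [-1.0, 0.5, 2.0]       # X(omega_1), X(omega_2), X(omega_3)
Y = [0.0, 1.0, 4.0]        # another random variable on the same space

def expectation(values, probs):
    """E(X) = sum_i p_i * X(omega_i) on a finite probability space."""
    return sum(pi * xi for pi, xi in zip(probs, values))

# Linearity: E(aX + bY) = a E(X) + b E(Y), checked numerically.
a, b = 2.0, -3.0
lhs = expectation([a * xi + b * yi for xi, yi in zip(X, Y)], p)
rhs = a * expectation(X, p) + b * expectation(Y, p)
print(lhs, rhs)   # both are -3.8 (up to floating point error)
```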
Finally, given a random variable $X \colon \Omega \to \R$ and a function $f \colon \R \to \R$, recall that $f(X)$ is by definition the random variable $f \circ X \colon \Omega \to \R$. This still makes sense even if $f$ is defined only on the range of $X$.
Jensen’s inequality asserts that if $X$ is a random variable with finite expectation and $f \colon \R \to \R$ is convex, then:

\[f(E(X)) \leq E(f(X))\]

To prove it, take a supporting line for $f$ at the point $E(X)$: a slope $m$ with the property that $f(t) \geq f(E(X)) + m(t - E(X))$ for every $t$. Substituting $t = X(\omega)$ and taking expectations of both sides, monotonicity and linearity give $E(f(X)) \geq f(E(X)) + m(E(X) - E(X)) = f(E(X))$, which is exactly the claim.

The proof of this result was relatively straightforward, but it depended crucially on all of the setup work that we did involving convex functions and supporting lines. We’ll see this pay off in the next section.
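Here is a quick numerical sanity check of the inequality, again on a made-up finite probability space, using the convex function $f(x) = x^2$:

```python
# Check Jensen's inequality f(E(X)) <= E(f(X)) for the convex function f(x) = x^2
# on a made-up finite probability space.
p = [0.2, 0.5, 0.3]        # probabilities p_i
X = [-1.0, 0.5, 2.0]       # values X(omega_i)

def expectation(values, probs):
    return sum(pi * xi for pi, xi in zip(probs, values))

def f(x):
    return x ** 2                          # a convex function

EX = expectation(X, p)                     # E(X) = 0.65
EfX = expectation([f(x) for x in X], p)    # E(f(X)); f(X) is just f composed with X
print(f(EX), EfX)                          # 0.4225 <= 1.525
assert f(EX) <= EfX
```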
Applications
The classic application relates the mean, median, and standard deviation of a finite list of real numbers $x_1, \ldots, x_n$. Let $X$ be the random variable which takes the value $x_i$ with probability $1/n$, so that the mean of the data is $\E(X)$, a median is any number $m$ which minimizes the absolute deviation function $t \mapsto \E(\abs{X - t})$, and the standard deviation is $\sigma = \sqrt{\E((X - \E(X))^2)}$. Applying Jensen’s inequality to the convex function $\abs{x}$ and then to $x^2$, we get:

\[\abs{\E(X) - m} = \abs{\E(X - m)} \leq \E(\abs{X - m}) \leq \E(\abs{X - \E(X)}) \leq \sqrt{\E((X - \E(X))^2)} = \sigma\]

where the middle inequality holds because $m$ minimizes $t \mapsto \E(\abs{X - t})$. In other words, the mean and the median of a data set differ by at most one standard deviation.

Both applications of Jensen’s inequality in this argument were a little cheap; the first one is really just the triangle inequality, and the second one is the Cauchy-Schwarz inequality. But expressing the argument in this way buys a lot of generality for free; for instance, if $x_1, \ldots, x_n$ are points in $\R^d$ instead of $\R$ then one can define the spatial median to be any point which minimizes the absolute deviation function $t \mapsto \E(\abs{X - t})$, and the argument above gives an inequality between the mean, spatial median, and standard deviation almost verbatim. The argument also has generalizations to weighted and continuous probability distributions, all of which come for free without having to fuss about what the standard inequalities should look like.
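As a quick numerical sanity check of the one-dimensional inequality, here is a short sketch using numpy (assumed to be available); the sample is made up, and any skewed data would do just as well.

```python
# Sketch: check |mean - median| <= standard deviation for some made-up data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # a skewed sample, so mean != median

mean = x.mean()
median = np.median(x)                        # a minimizer of t -> mean(|x - t|)
sigma = x.std()                              # population standard deviation

print(abs(mean - median), sigma)
assert abs(mean - median) <= sigma
```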