Tail Bounds, Convergence and Limit Theorems¶

Tail Bounds : Inequalities¶

Why do we need inequalities?

Note

If we know the density directly, then, theoretically, we can compute exactly how much probability lie on the “tails”.
However, often we have to work with scenarios where either the density is not known or the exact calculation is cumbersome.
These inequalities bound the probability on the tails based on how much we know about the underlying distribution.

Mean known: Markov¶

Note

Let \(X\) be a non-negative rv with well defined \(\mathbb{E}[X]=\mu\).
Markov’s inequality states that the tail probability goes down inversely as we move further right.

\[\mathbb{P}(X\geq t)\leq \frac{\mu}{t}\]
Proof:

\[\mathbb{P}(X\geq t)=\int\limits_t^\infty f_X(x)\mathop{dx}=\frac{1}{t}\int\limits_t^\infty t\cdot f_X(x)\mathop{dx}\leq \frac{1}{t}\int\limits_t^\infty x\cdot f_X(x)\mathop{dx}\leq \frac{1}{t}\int\limits_{-\infty}^\infty x\cdot f_X(x)\mathop{dx}\]

Mean and variance known: Chebyshev¶

Note

Let \(X\) be any rv with well defined \(\mathbb{E}[X]=\mu\) and \(\mathbb{V}(X)=\sigma^2\).
Chebyshev’s inequality states that the tail probability of \(t\) being away from \(\mu\) goes down quadratically.

\[\mathbb{P}(|X-\mu|\geq t)\leq \frac{\sigma^2}{t^2}\]
Proof:

\[Y=(X-\mu)\implies\mathbb{E}[Y^2]=\sigma^2\implies\mathbb{P}(|Y|\geq t)=\mathbb{P}(Y^2\geq t^2)\leq\frac{\sigma^2}{t^2}\]
This is a tighter bound than Markov’s.

MGF known: Chernoff’s Bound¶

Note

Let \(X\) be any rv with well defined MGF, \(M_X(s)\).
Chernoff’s bound states that the probability of the tails goes down exponentially as we move further right. So, for any \(s\),

\[\mathbb{P}(X\geq t)\leq \frac{M_X(s)}{e^{st}}\]
We can recover Markov’s and Chebyshev’s from this one.

Distribution known (Gaussian): Mill¶

Note

Let \(Z\) be a standard normal rv.
Mills equality calculates directly from density how much probability lies on the tails.

\[\mathbb{P}(|Z|\geq t)\leq \frac{2e^{-t^2/2}}{t}\]

Convergence of sequences involving rvs¶

The concept of convergence of sequences involving rvs is more subtle than convergence of normal sequences. We consider a sequence of rvs, \(X_1,X_2,\cdots\) and \(X\) as the limiting rv. There are multiple notions of convergence for such sequences, some being stronger than others (as in, convergence in one sense might imply convergence in another sense, but not vice versa).

Convergence in distribution¶

Note

\((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in distribution, \(X_n\xrightarrow[]{D}X\), if

\[\lim\limits_{n\to\infty}\mathbb{P}(X_n\leq t)=\mathbb{P}(X\leq t)\]
If \(X_i\sim F_i\) and \(X\sim F\), then the above can be written in terms of CDF as

\[\lim\limits_{n\to\infty}F_n(t)=F(t)\]

Convergence in probability¶

Note

\((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in probability, \(X_n\xrightarrow[]{P}X\), if

\[\lim\limits_{n\to\infty}\mathbb{P}(|X_n-X|\geq\epsilon)=0\]
It can be restated using notions similar to convergence from calculus as follows: for a given accuracy level \(\epsilon>0\) and a given confidence level \(\delta>0\),

\[\exists N_{(\epsilon,\delta)} . n>N_{(\epsilon,\delta)}\implies\mathbb{P}(|X_n-X|\geq\epsilon)\leq\delta\]
Convergence in probability implies convergence in distribution.

Convergence in \(L_1\)¶

Note

\((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in \(L_1\), \(X_n\xrightarrow[]{L_1}X\), if

\[\lim\limits_{n\to\infty}\mathbb{E}[|X_n-X|]=0\]
Convergence in \(L_1\) implies convergence in probability.

Convergence in quadratic mean¶

Note

\((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in quadratic mean, \(X_n\xrightarrow[]{qm}X\), if

\[\lim\limits_{n\to\infty}\mathbb{E}[(X_n-X)^2]=0\]
Convergence in quadratic mean implies convergence in \(L_1\).

Almost surely convergence¶

Note

\((X_n)_{n=1}^\infty\) is said to be converging to \(X\) almost surely (with probability 1), \(X_n\xrightarrow[]{as}X\), if

\[\mathbb{P}(\lim\limits_{n\to\infty} X_n=X)=1\]
This can be restated as follows: for any \(\epsilon>0\)

\[\mathbb{P}(\lim\limits_{n\to\infty}|X_n-X|\geq\epsilon)=0\]
Interpretation:
- We note that the limit is inside. Hence it’s talking about probability about the convergence of the values of the rvs in standard calculus sense.
- We can think that the sample space is represented as the set of sequences \(\{(x_n)_{n=1}^\infty\}\).
- In this case, almost surely convergence would mean that there are only finite number of elements in this set where the limit doesn’t converge to the value of the rv \(X\).
Almost surely convergence implies convergence in quadratic mean.

Convergence of sequences involving parametric models¶

Convergence of functions¶

Let \(\left(f_n(x)\right)_{i=1}^n\) be a sequence of functions where \(f_n:E\to\mathbb{R}\). Let \(f:E\to\mathbb{R}\) is the limit function.

Point-wise convergence¶

\[\forall x\in E, \lim\limits_{n\to\infty}|f_n(x)-f(x)|=0\]

Note

Interpretation: For every \(\epsilon>0\), there is a \(N_{\epsilon,x}\) for each specific \(x\in E\), such that \(n> N_{\epsilon,x}\implies |f_n(x)-f(x)|<\epsilon\).
We note that the speed of convergence is dependent on the value of \(x\).

Uniform convergence¶

\[\lim\limits_{n\to\infty}\sup_{x\in E}|f_n(x)-f(x)|=0\]

Note

Interpretation: For every \(\epsilon>0\), there is a universal \(N_\epsilon\), such that \(n> N_\epsilon\implies |f_n(x)-f(x)|<\epsilon\) holds for any \(x\in E\).
We note that the speed of convergence is independent on the value of \(x\). This is more stricter.

Convergence of statistical functionals¶

Let \(\left(f_n(\theta)\right)_{i=1}^n\) be a sequence functions evaluated on observed data \(\left(X_i\right)_{i=1}^n\), where \(X_i\sim f_X(x_i;\theta)\) in a parametric model.

Point-wise convergence in probability¶

\[\forall \theta\in\Theta, |f_n(\theta)-f(\theta)|\xrightarrow[]{P}0\]

Uniform convergence in probability¶

\[\sup_{\theta\in\Theta} |f_n(\theta)-f(\theta)|\xrightarrow[]{P}0\]

Limit Theorems¶

Here we deal with rvs of 3 special kind for a given sequence of rvs \((X_n)_{n=1}^\infty\). Let the rvs be independent and have common, well defined mean \(\mu\) and variance \(\sigma^2\).

Note

Let the sum rv be \(S_n=\sum_{i=1}^n X_i\) for a given \(n\). We can think of a sequence of this as \((S_n)_{n=1}^\infty\).
- We note that \(\mathbb{E}[S_n]=n\mu\) \(\mathbb{V}(S_n)=n\sigma^2\).
Let the sample mean rv be \(M_n=\frac{S_n}{n}\) for a given \(n\). We can think of a sequence of this as \((M_n)_{n=1}^\infty\).
- We note that \(\mathbb{E}[M_n]=\mu\) and \(\mathbb{V}(M_n)=\sigma^2/n\).
Let the standardised rv be \(Z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}\) for a given \(n\). We can think of a sequence of this as \((Z_n)_{n=1}^\infty\).
- We note that \(\mathbb{E}[Z_n]=0\) and \(\mathbb{V}(M_n)=1\).

Weak Law of Large Number¶

Note

This talks about the convergence properties of \(M_n\).
Recall that \(\mathbb{E}[M_n]=\mu\) and \(\mathbb{V}(M_n)=\frac{\sigma^2}{n}\).
Applying Chebyshev’s inequality, we obtain \(\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq \frac{\sigma^2}{n\epsilon^2}\).
Therefore \(\lim\limits_{n\to\infty}\mathbb{P}(|M_n-\mu|\geq \epsilon)=0\).
WLLN: For a sequence of rvs \((X_n)_{n=1}^\infty\), independent with common, well defined mean and variance, \(M_n\xrightarrow[]{P}\mu\).

Attention

It doesn’t require the rvs to be identically distributed.

Warning

It doesn’t talk about how quickly the sample mean converges.

Special case: bounded rvs¶

Tail bounds from Chebyshev’s inequality¶

If we know that the rvs are bounded, i.e. \(\forall i, a\leq X_i\leq b\), then we know that \(\mathbb{V}(X_i)\leq \frac{(b-a)^2}{4}\) (see note in random variable chapter TODO add link).

Note

From Chebyshev’s inequality, we can obtain a bound which goes down inversely with \(n\).

\[\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq \frac{\sigma^2}{n\epsilon^2}\leq \frac{(b-a)^2}{4n\epsilon^2}\]

Tigher bounds from Hoeffding’s inequality¶

Attention

For bounded rvs, Hoeffding’s inequality gives an even tigher bound which goes down exponentially with \(n\).

\[\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)\]

Note

Concentration Inequality

Strong Law of Large Number¶

Note

SLLN: For a sequence of rvs \((X_n)_{n=1}^\infty\), iid with well defined moments till at least 4th moment, \(M_n\xrightarrow[]{as}\mu\).

Central Limit Theorem¶

Note

CLT: For a sequence of rvs \((X_n)_{n=1}^\infty\), iid with well defined mean and variance, \(Z_n\xrightarrow[]{D}\mathcal{N}(0,1)\).
Since \(S_n\) can be expressed as a linear transformation of \(Z_n\), it also converges to some normal distribution with ever increasing mean \(n\mu\) and variance \(\sigma\sqrt{n}\).

Warning

It doesn’t talk about how quickly the sum converges to normal.
The speed of this convergence depends on the actual underlying distribution.
- Uniform: very quickly resembles a normal.
- Exponential: takes a long time.

The Delta Method¶

Note

Let \(X_n\xrightarrow[]{D}\mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}})\)
Let \(g\) be a differentiable function.
Then \(g(X_n)\xrightarrow[]{D}\mathcal{N}(g(\mu),\frac{\sigma}{\sqrt{n}}\left(g'(\mu)^2\right))\).

Tip

A multivariate version can be obtained by observing that \(\sigma\left(g'(\mu)^2\right)\) becomes \(\nabla_g(\mu)^\top\Sigma\nabla_g(\mu)\).