Tail Bounds, Convergence and Limit Theorems

Tail Bounds : Inequalities

Why do we need inequalities?

Note

  • If we know the density directly, then, theoretically, we can compute exactly how much probability lie on the “tails”.

  • However, often we have to work with scenarios where either the density is not known or the exact calculation is cumbersome.

  • These inequalities bound the probability on the tails based on how much we know about the underlying distribution.

Mean known: Markov

Note

  • Let \(X\) be a non-negative rv with well defined \(\mathbb{E}[X]=\mu\).

  • Markov’s inequality states that the tail probability goes down inversely as we move further right.

    \[\mathbb{P}(X\geq t)\leq \frac{\mu}{t}\]
  • Proof:

    \[\mathbb{P}(X\geq t)=\int\limits_t^\infty f_X(x)\mathop{dx}=\frac{1}{t}\int\limits_t^\infty t\cdot f_X(x)\mathop{dx}\leq \frac{1}{t}\int\limits_t^\infty x\cdot f_X(x)\mathop{dx}\leq \frac{1}{t}\int\limits_{-\infty}^\infty x\cdot f_X(x)\mathop{dx}\]

Mean and variance known: Chebyshev

Note

  • Let \(X\) be any rv with well defined \(\mathbb{E}[X]=\mu\) and \(\mathbb{V}(X)=\sigma^2\).

  • Chebyshev’s inequality states that the tail probability of \(t\) being away from \(\mu\) goes down quadratically.

    \[\mathbb{P}(|X-\mu|\geq t)\leq \frac{\sigma^2}{t^2}\]
  • Proof:

    \[Y=(X-\mu)\implies\mathbb{E}[Y^2]=\sigma^2\implies\mathbb{P}(|Y|\geq t)=\mathbb{P}(Y^2\geq t^2)\leq\frac{\sigma^2}{t^2}\]
  • This is a tighter bound than Markov’s.

MGF known: Chernoff’s Bound

Note

  • Let \(X\) be any rv with well defined MGF, \(M_X(s)\).

  • Chernoff’s bound states that the probability of the tails goes down exponentially as we move further right. So, for any \(s\),

    \[\mathbb{P}(X\geq t)\leq \frac{M_X(s)}{e^{st}}\]
  • We can recover Markov’s and Chebyshev’s from this one.

Distribution known (Gaussian): Mill

Note

  • Let \(Z\) be a standard normal rv.

  • Mills equality calculates directly from density how much probability lies on the tails.

    \[\mathbb{P}(|Z|\geq t)\leq \frac{2e^{-t^2/2}}{t}\]

Convergence of sequences involving rvs

The concept of convergence of sequences involving rvs is more subtle than convergence of normal sequences. We consider a sequence of rvs, \(X_1,X_2,\cdots\) and \(X\) as the limiting rv. There are multiple notions of convergence for such sequences, some being stronger than others (as in, convergence in one sense might imply convergence in another sense, but not vice versa).

Convergence in distribution

Note

  • \((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in distribution, \(X_n\xrightarrow[]{D}X\), if

    \[\lim\limits_{n\to\infty}\mathbb{P}(X_n\leq t)=\mathbb{P}(X\leq t)\]
  • If \(X_i\sim F_i\) and \(X\sim F\), then the above can be written in terms of CDF as

    \[\lim\limits_{n\to\infty}F_n(t)=F(t)\]

Convergence in probability

Note

  • \((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in probability, \(X_n\xrightarrow[]{P}X\), if

    \[\lim\limits_{n\to\infty}\mathbb{P}(|X_n-X|\geq\epsilon)=0\]
  • It can be restated using notions similar to convergence from calculus as follows: for a given accuracy level \(\epsilon>0\) and a given confidence level \(\delta>0\),

    \[\exists N_{(\epsilon,\delta)} . n>N_{(\epsilon,\delta)}\implies\mathbb{P}(|X_n-X|\geq\epsilon)\leq\delta\]
  • Convergence in probability implies convergence in distribution.

Convergence in \(L_1\)

Note

  • \((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in \(L_1\), \(X_n\xrightarrow[]{L_1}X\), if

    \[\lim\limits_{n\to\infty}\mathbb{E}[|X_n-X|]=0\]
  • Convergence in \(L_1\) implies convergence in probability.

Convergence in quadratic mean

Note

  • \((X_n)_{n=1}^\infty\) is said to be converging to \(X\) in quadratic mean, \(X_n\xrightarrow[]{qm}X\), if

    \[\lim\limits_{n\to\infty}\mathbb{E}[(X_n-X)^2]=0\]
  • Convergence in quadratic mean implies convergence in \(L_1\).

Almost surely convergence

Note

  • \((X_n)_{n=1}^\infty\) is said to be converging to \(X\) almost surely (with probability 1), \(X_n\xrightarrow[]{as}X\), if

    \[\mathbb{P}(\lim\limits_{n\to\infty} X_n=X)=1\]
  • This can be restated as follows: for any \(\epsilon>0\)

    \[\mathbb{P}(\lim\limits_{n\to\infty}|X_n-X|\geq\epsilon)=0\]
  • Interpretation:

    • We note that the limit is inside. Hence it’s talking about probability about the convergence of the values of the rvs in standard calculus sense.

    • We can think that the sample space is represented as the set of sequences \(\{(x_n)_{n=1}^\infty\}\).

    • In this case, almost surely convergence would mean that there are only finite number of elements in this set where the limit doesn’t converge to the value of the rv \(X\).

  • Almost surely convergence implies convergence in quadratic mean.

Convergence of sequences involving parametric models

Convergence of functions

Let \(\left(f_n(x)\right)_{i=1}^n\) be a sequence of functions where \(f_n:E\to\mathbb{R}\). Let \(f:E\to\mathbb{R}\) is the limit function.

Point-wise convergence

\[\forall x\in E, \lim\limits_{n\to\infty}|f_n(x)-f(x)|=0\]

Note

  • Interpretation: For every \(\epsilon>0\), there is a \(N_{\epsilon,x}\) for each specific \(x\in E\), such that \(n> N_{\epsilon,x}\implies |f_n(x)-f(x)|<\epsilon\).

  • We note that the speed of convergence is dependent on the value of \(x\).

Uniform convergence

\[\lim\limits_{n\to\infty}\sup_{x\in E}|f_n(x)-f(x)|=0\]

Note

  • Interpretation: For every \(\epsilon>0\), there is a universal \(N_\epsilon\), such that \(n> N_\epsilon\implies |f_n(x)-f(x)|<\epsilon\) holds for any \(x\in E\).

  • We note that the speed of convergence is independent on the value of \(x\). This is more stricter.

Convergence of statistical functionals

Let \(\left(f_n(\theta)\right)_{i=1}^n\) be a sequence functions evaluated on observed data \(\left(X_i\right)_{i=1}^n\), where \(X_i\sim f_X(x_i;\theta)\) in a parametric model.

See also

For example, \(f_n(\theta)=\frac{1}{n}\sum_{i=1}^n X_i\) and \(f(\theta)=\mathbb{E}_\theta[X]\).

Point-wise convergence in probability

\[\forall \theta\in\Theta, |f_n(\theta)-f(\theta)|\xrightarrow[]{P}0\]

Uniform convergence in probability

\[\sup_{\theta\in\Theta} |f_n(\theta)-f(\theta)|\xrightarrow[]{P}0\]

Limit Theorems

Here we deal with rvs of 3 special kind for a given sequence of rvs \((X_n)_{n=1}^\infty\). Let the rvs be independent and have common, well defined mean \(\mu\) and variance \(\sigma^2\).

Note

  • Let the sum rv be \(S_n=\sum_{i=1}^n X_i\) for a given \(n\). We can think of a sequence of this as \((S_n)_{n=1}^\infty\).

    • We note that \(\mathbb{E}[S_n]=n\mu\) \(\mathbb{V}(S_n)=n\sigma^2\).

  • Let the sample mean rv be \(M_n=\frac{S_n}{n}\) for a given \(n\). We can think of a sequence of this as \((M_n)_{n=1}^\infty\).

    • We note that \(\mathbb{E}[M_n]=\mu\) and \(\mathbb{V}(M_n)=\sigma^2/n\).

  • Let the standardised rv be \(Z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}\) for a given \(n\). We can think of a sequence of this as \((Z_n)_{n=1}^\infty\).

    • We note that \(\mathbb{E}[Z_n]=0\) and \(\mathbb{V}(M_n)=1\).

Weak Law of Large Number

Note

  • This talks about the convergence properties of \(M_n\).

  • Recall that \(\mathbb{E}[M_n]=\mu\) and \(\mathbb{V}(M_n)=\frac{\sigma^2}{n}\).

  • Applying Chebyshev’s inequality, we obtain \(\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq \frac{\sigma^2}{n\epsilon^2}\).

  • Therefore \(\lim\limits_{n\to\infty}\mathbb{P}(|M_n-\mu|\geq \epsilon)=0\).

  • WLLN: For a sequence of rvs \((X_n)_{n=1}^\infty\), independent with common, well defined mean and variance, \(M_n\xrightarrow[]{P}\mu\).

Attention

It doesn’t require the rvs to be identically distributed.

Warning

It doesn’t talk about how quickly the sample mean converges.

Special case: bounded rvs

Tail bounds from Chebyshev’s inequality

If we know that the rvs are bounded, i.e. \(\forall i, a\leq X_i\leq b\), then we know that \(\mathbb{V}(X_i)\leq \frac{(b-a)^2}{4}\) (see note in random variable chapter TODO add link).

Note

  • From Chebyshev’s inequality, we can obtain a bound which goes down inversely with \(n\).

    \[\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq \frac{\sigma^2}{n\epsilon^2}\leq \frac{(b-a)^2}{4n\epsilon^2}\]
Tigher bounds from Hoeffding’s inequality

Attention

  • For bounded rvs, Hoeffding’s inequality gives an even tigher bound which goes down exponentially with \(n\).

    \[\mathbb{P}(|M_n-\mu|\geq \epsilon)\leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)\]

Strong Law of Large Number

Note

  • SLLN: For a sequence of rvs \((X_n)_{n=1}^\infty\), iid with well defined moments till at least 4th moment, \(M_n\xrightarrow[]{as}\mu\).

Central Limit Theorem

Note

  • CLT: For a sequence of rvs \((X_n)_{n=1}^\infty\), iid with well defined mean and variance, \(Z_n\xrightarrow[]{D}\mathcal{N}(0,1)\).

  • Since \(S_n\) can be expressed as a linear transformation of \(Z_n\), it also converges to some normal distribution with ever increasing mean \(n\mu\) and variance \(\sigma\sqrt{n}\).

Warning

  • It doesn’t talk about how quickly the sum converges to normal.

  • The speed of this convergence depends on the actual underlying distribution.

    • Uniform: very quickly resembles a normal.

    • Exponential: takes a long time.

The Delta Method

Note

  • Let \(X_n\xrightarrow[]{D}\mathcal{N}(\mu,\frac{\sigma}{\sqrt{n}})\)

  • Let \(g\) be a differentiable function.

  • Then \(g(X_n)\xrightarrow[]{D}\mathcal{N}(g(\mu),\frac{\sigma}{\sqrt{n}}\left(g'(\mu)^2\right))\).

Tip

A multivariate version can be obtained by observing that \(\sigma\left(g'(\mu)^2\right)\) becomes \(\nabla_g(\mu)^\top\Sigma\nabla_g(\mu)\).