Statistical Inference¶

Note

Statistical Functionals: Functions of a CDF \(F\), written \(T(F)\), such as

density: \(T(F)=f_F=\frac{\mathop{dF}}{\mathop{dx}}\)
expectation: \(T(F)=\mathbb{E}_F[X]=\int x \mathop{dF}\)
variance: \(T(F)=\mathbb{V}_F(X)=\int(x-\mathbb{E}_F[X])^2\mathop{dF}\)
moment generating function: \(T(F)=M_F(s)=\int e^{sx}\mathop{dF}\)
median: \(T(F)=F^{-1}(1/2)\)
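
As a rough illustration, a functional can be evaluated on data by substituting the empirical CDF \(F_n\) of a sample for \(F\). The empirical CDF is not introduced in these notes, so treat it as an assumption of this sketch; the function name and the data are hypothetical.

```python
import math

def plug_in_functionals(sample):
    """Evaluate a few functionals T(F) at the empirical CDF F_n of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n                           # integral of x dF_n
    var = sum((x - mean) ** 2 for x in xs) / n   # integral of (x - mean)^2 dF_n
    # F_n^{-1}(1/2): the smallest x with F_n(x) >= 1/2
    median = xs[math.ceil(n / 2) - 1]
    return {"mean": mean, "variance": var, "median": median}

estimates = plug_in_functionals([2.0, 4.0, 4.0, 5.0, 10.0])
print(estimates)
```

Each returned number is \(T(F_n)\) for one of the functionals listed above.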

Attention

We have a sample of size \(n\), \(X_1,\cdots,X_n\), from an unknown CDF \(F\).
The task of statistical inference is to infer \(F\), or some \(T(F)\), that best explains the data, for a criterion of "best" chosen beforehand.
The inferred values based on data are called estimates of the quantities of interest.
An estimate changes from sample to sample, so as a function of the random sample it is a rv.
This rv is called an estimator.

Attention

Statistic: Any function of the data \(g(X_1,\cdots,X_n)\) is called a statistic.
Any estimator is a statistic.
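
The distinction between an estimator and its estimates can be seen in a small simulation. This is only a sketch: the normal distribution, its parameters, and the sample size are assumptions chosen for illustration.

```python
import random

def sample_mean(data):
    """A statistic: a function g(X_1, ..., X_n) of the data only."""
    return sum(data) / len(data)

rng = random.Random(0)

# Five independent samples of size n = 50 from the same distribution
# (known here only because the data are simulated).
estimates = [
    sample_mean([rng.gauss(10.0, 2.0) for _ in range(50)])
    for _ in range(5)
]
print(estimates)  # five different estimates produced by one estimator
```

The function `sample_mean` plays the role of the estimator; each number in `estimates` is one realised estimate.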

Statistical Model¶

Note

A statistical model \(\mathcal{F}\) is a set of distributions or other statistical functionals of interest.

Types of Statistical Model¶

The following categories of models are based on the dimensionality of \(\mathcal{F}\).

Parametric Model¶

Note

\(\mathcal{F}\) can be spanned by finitely many parameters.

Non-parametric Model¶

Note

\(\mathcal{F}\) cannot be spanned by finitely many parameters.
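
To make the contrast concrete, the sketch below estimates \(F(t)\) in two ways: once under a normal model (two parameters), and once with the empirical CDF, which assumes no finite-parameter form. The normal family, the threshold `t`, and the simulated data are all assumptions of this example.

```python
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
t = 0.5

# Parametric: assume F is Normal(mu, sigma); two numbers pin down the model.
fitted = NormalDist(mu=fmean(data), sigma=pstdev(data))
p_parametric = fitted.cdf(t)

# Non-parametric: the empirical CDF, i.e. the fraction of observations <= t.
p_empirical = sum(x <= t for x in data) / len(data)

print(p_parametric, p_empirical)
```

Both numbers estimate the same quantity \(F(0.5)\); they differ in how much structure is assumed about \(\mathcal{F}\).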

Different Approaches to Inference¶

Bayesian Inference¶

Note

The quantity that we want to estimate is assumed to be a rv in its own right, \(\Theta\).
Before observing any data, we have a prior notion of its distribution. This is expressed as the prior distribution \(f_\Theta(\theta)\).
The likelihood is the PDF of the data conditioned on \(\Theta\), \(f_{X|\Theta}(x|\theta)\).
The posterior is obtained by applying Bayes' rule, which gives a single probability model for the quantity after observation: \(f_{\Theta|X}(\theta|x)\propto f_{X|\Theta}(x|\theta)f_\Theta(\theta)\).
We perform inference about \(\Theta\) based on this distribution directly.
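
A minimal sketch of this prior-to-posterior update, assuming Bernoulli data with a Beta prior on the success probability. That conjugate pair is chosen only because the posterior has a closed form; the function name and the data are hypothetical.

```python
def beta_bernoulli_posterior(a, b, data):
    """Posterior Beta(a', b') for a Bernoulli parameter under a Beta(a, b) prior."""
    successes = sum(data)
    failures = len(data) - successes
    return a + successes, b + failures

data = [1, 0, 1, 1, 0, 1, 1, 1]   # 6 successes, 2 failures (illustrative)

# Beta(1, 1) is the uniform prior on [0, 1].
a_post, b_post = beta_bernoulli_posterior(1, 1, data)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```

Any statement about \(\Theta\), such as its posterior mean, is then read off the posterior distribution.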

Frequentist (Classical) Inference¶

Note

The quantity that we want to estimate is assumed to be an unknown constant, \(\theta\).
No prior knowledge is assumed about it.
We assume that our underlying probability model is dependent on \(\theta\) in some way.
Therefore, the probability model here is the collection of PDFs \(f_X(x;\theta)\), one for each possible value of \(\theta\).
In order to perform inference, our statements must apply to all possible values of \(\theta\).

Types of Inference¶

Point Estimation¶

Note

A single best estimate (point) within the model for
- Classical: the unknown constant \(\theta\)
- Bayesian: the realised value \(\theta\) of the rv \(\Theta\)
This estimate of \(\theta\) is expressed as a statistic evaluated on the observed data, \(\widehat{\theta}_n=g(x_1,\cdots,x_n)\).
The estimator \(\widehat{\Theta}_n=g(X_1,\cdots,X_n)\) is always a rv as it evaluates to a different value with a different sample.
Examples:
1. a single distribution/density function (parameterised/non-parameterised)
2. a single regression function
3. a single value for expectation/variance/other moments
4. a single prediction for a dependent variable given an independent variable, etc.

Some useful terminology¶

Note

Sampling Distribution: The distribution of \(\widehat{\Theta}_n\) over different samples.
Estimation Error:
- Classical: \(\tilde{\Theta}_n=\widehat{\Theta}_n-\theta\)
- Bayesian: \(\tilde{\Theta}_n=\widehat{\Theta}_n-\Theta\)
Bias:
- Classical: \(\text{b}(\widehat{\Theta}_n)=\mathbb{E}_{\theta}[\tilde{\Theta}_n]=\mathbb{E}_{\theta}[\widehat{\Theta}_n]-\theta\)
- Bayesian: \(\text{b}(\widehat{\Theta}_n)=\mathbb{E}[\tilde{\Theta}_n]=\mathbb{E}[\widehat{\Theta}_n]-\mathbb{E}[\Theta]\)
Standard Error:
- Classical: \(\text{se}(\widehat{\Theta}_n)=\sqrt{\mathbb{V}_{\theta}(\widehat{\Theta}_n)}\)
- Bayesian: \(\text{se}(\widehat{\Theta}_n)=\sqrt{\mathbb{V}(\widehat{\Theta}_n)}\)
If the variance above is itself estimated (as it often is), then we estimate the SE as \(\widehat{\text{se}}=\widehat{\text{se}}(\widehat{\Theta}_n)=\sqrt{\widehat{\mathbb{V}}_{\theta}(\widehat{\Theta}_n)}\).
Mean-Squared Error:
- Classical: \(\text{mse}(\widehat{\Theta}_n)=\mathbb{E}_{\theta}[\tilde{\Theta}_n^2]=\mathbb{E}_{\theta}[(\widehat{\Theta}_n-\theta)^2]=\text{b}^2(\widehat{\Theta}_n)+\text{se}^2(\widehat{\Theta}_n)\)
- Bayesian: \(\text{mse}(\widehat{\Theta}_n)=\mathbb{E}[\tilde{\Theta}_n^2]=\mathbb{E}[(\widehat{\Theta}_n-\Theta)^2]=\mathbb{E}[\widehat{\Theta}_n^2]+\mathbb{E}[\Theta^2]-2\mathbb{E}[\widehat{\Theta}_n\Theta]\)
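
The classical quantities above can be approximated by simulation. The sketch below is an assumption-laden example, not part of the definitions: it uses normal data and the variance estimator that divides by \(n\), and checks the decomposition \(\text{mse}=\text{b}^2+\text{se}^2\) numerically.

```python
import random

def plug_in_variance(data):
    """Variance estimator that divides by n (biased for finite n)."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)

rng = random.Random(0)
true_var, n, reps = 4.0, 10, 20000

# Approximate the sampling distribution by drawing many samples.
draws = [
    plug_in_variance([rng.gauss(0.0, 2.0) for _ in range(n)])
    for _ in range(reps)
]

mean_draw = sum(draws) / reps
bias = mean_draw - true_var                                   # E[estimator] - theta
se = (sum((d - mean_draw) ** 2 for d in draws) / reps) ** 0.5  # sqrt(V(estimator))
mse = sum((d - true_var) ** 2 for d in draws) / reps           # E[(estimator - theta)^2]
print(bias, se, mse, bias ** 2 + se ** 2)
```

The last two printed values agree up to floating-point error, and the bias is negative because dividing by \(n\) underestimates the variance on average.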

Note

Unbiased Estimator: If \(\text{b}(\widehat{\Theta}_n)=0\).
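Asymptotically Unbiased Estimator: If \(\lim\limits_{n\to\infty}\text{b}(\widehat{\Theta}_n)=0\). Convergence \(\widehat{\Theta}_n\xrightarrow[]{L_1}\theta\) (or \(\Theta\)) is sufficient for this, but not necessary.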
Consistent Estimator: If \(\widehat{\Theta}_n\xrightarrow[]{P}\theta\) (or \(\Theta\)).
Asymptotically Normal Estimator:
- Classical: \(\frac{\widehat{\Theta}_n-\theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\).
- Bayesian: \(\frac{\widehat{\Theta}_n-\Theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\).

Attention

Theorem: If \(\lim\limits_{n\to\infty}\text{b}_\theta(\widehat{\Theta}_n)=0\) and \(\lim\limits_{n\to\infty}\text{se}(\widehat{\Theta}_n)=0\) then \(\widehat{\Theta}_n\) is consistent.
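
A short justification for the classical case, sketched with Chebyshev's inequality and the decomposition \(\text{mse}=\text{b}^2+\text{se}^2\): for any \(\epsilon>0\),

\[\mathbb{P}_\theta\left(|\widehat{\Theta}_n-\theta|>\epsilon\right)\le\frac{\mathbb{E}_\theta[(\widehat{\Theta}_n-\theta)^2]}{\epsilon^2}=\frac{\text{b}^2(\widehat{\Theta}_n)+\text{se}^2(\widehat{\Theta}_n)}{\epsilon^2}\xrightarrow[n\to\infty]{}0\]

so \(\widehat{\Theta}_n\xrightarrow[]{P}\theta\).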

Confidence Set Estimation¶

Attention

In a Bayesian setting, the point estimate is already associated with a probability distribution (the posterior), which conveys the degree of belief that the true quantity equals the estimated quantity.
On the other hand, confidence set estimation is a technique used in the classical setting. It makes a probabilistic statement about the estimated set, not about the quantity itself.

Note

An estimated set which traps the fixed, unknown value of our quantity of interest with a pre-determined probability.
A 95% confidence set means that if we repeatedly estimate it from multiple samples (works even if samples are from completely unrelated experiments), then around 95% of the time the estimated set contains the true quantity.

Attention

A \(1-\alpha\) confidence interval (CI) for a real quantity of interest \(\theta\) is defined as \(\widehat{C}_n=(a,b)\) where \(\mathbb{P}_\theta(\theta\in\widehat{C}_n)\ge 1-\alpha\) for all \(\theta\).
The task is to construct statistics \(\widehat{a}=a(X_1,\cdots,X_n)\) and \(\widehat{b}=b(X_1,\cdots,X_n)\) such that the above holds.
For vector quantities, this is expressed with sets instead of intervals.
In a regression setting, a confidence interval around the regression function can be thought of as the set of functions which contains the true function with a certain probability. However, this is rarely computed in practice.

Some useful terminology¶

Note

Pointwise Asymptotic CI: \(\forall\theta,\liminf\limits_{n\to\infty}\mathbb{P}_{\theta}(\theta\in\widehat{C}_n)\ge 1-\alpha\)
- For each fixed \(\theta\), the probability that \(\widehat{C}_n\) captures \(\theta\) is at least \(1-\alpha\) in the limit \(n\to\infty\).
- The rate of this convergence depends on the value of \(\theta\).
Uniform Asymptotic CI: \(\liminf\limits_{n\to\infty}\inf\limits_{\theta\in\Theta}\mathbb{P}_{\theta}(\theta\in\widehat{C}_n)\ge 1-\alpha\)
- For each \(n\), we consider the smallest probability that \(\widehat{C}_n\) captures \(\theta\), taken over all \(\theta\in\Theta\).
- This probability is at least \(1-\alpha\) asymptotically as \(n\to\infty\).
- A uniform asymptotic CI is the stricter notion: satisfying it automatically implies the pointwise condition.
Normal-based CI: If \(\widehat{\Theta}_n\) is an asymptotically normal estimator of \(\theta\), then a \(1-\alpha\) confidence interval is given by

\[(\widehat{\Theta}_n-z_{\alpha/2}\widehat{\text{se}},\widehat{\Theta}_n+z_{\alpha/2}\widehat{\text{se}})\]
- Here \(z_{\alpha/2}=\Phi^{-1}(1-\alpha/2)\), where \(\Phi\) is the standard normal CDF.
- The above is a pointwise asymptotic CI.
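
The normal-based interval and its repeated-sampling interpretation can be sketched as follows. The Bernoulli model, the estimated standard error \(\sqrt{\widehat{p}(1-\widehat{p})/n}\), and the simulation settings are assumptions of this example; the function names are hypothetical.

```python
import random
from statistics import NormalDist

def normal_ci(estimate, se_hat, alpha=0.05):
    """Normal-based 1 - alpha interval: estimate -/+ z_{alpha/2} * se_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se_hat, estimate + z * se_hat

def bernoulli_ci(data, alpha=0.05):
    """Interval for a Bernoulli p using p_hat and its estimated standard error."""
    n = len(data)
    p_hat = sum(data) / n
    se_hat = (p_hat * (1 - p_hat) / n) ** 0.5
    return normal_ci(p_hat, se_hat, alpha)

rng = random.Random(0)
true_p, n, reps = 0.3, 200, 2000

# Repeat the experiment and count how often the interval traps true_p.
hits = 0
for _ in range(reps):
    sample = [1 if rng.random() < true_p else 0 for _ in range(n)]
    lo, hi = bernoulli_ci(sample)
    hits += lo < true_p < hi
coverage = hits / reps
print(coverage)
```

The printed coverage should land near the nominal \(1-\alpha=0.95\), though not exactly on it, since the guarantee is only asymptotic.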

Hypothesis Testing¶

Note

We have two or more hypotheses about the probability model, such as \(H_0\) (null) and \(H_1\) (alternate), exactly one of which is true; which one is unknown.
- We might have 1 hypothesis which we can convert into 2 as \(H_1=\neg H_0\).
We assume that this unknown hypothesis determines the distribution of the data.
- Bayesian:
  * Here we assume that the hypotheses themselves are described by a Bernoulli rv, \(H_0=T\implies\Theta=1\).
  * We have some prior \(p_{\Theta}(\theta)\).
- Classical:
  * We assume that we have a different probability model under each hypothesis, \(f_X(x; H_0)\) and \(f_X(x; H_1)\).
  * No prior knowledge is assumed.
Inferring about \(H_0\) and \(H_1\) then becomes similar to point estimation.

Attention

We create a \(1-\alpha\) confidence set for the estimated quantity.

If the value of the quantity specified by the null hypothesis doesn’t fall within this set, then we reject the null hypothesis at significance level \(\alpha\).

If it does, then we fail to reject the null hypothesis.
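
The decision rule above can be sketched for a coin-tossing example. The Bernoulli model, the normal-based interval, and the made-up data are assumptions of this sketch; `reject_null` is a hypothetical helper.

```python
from statistics import NormalDist

def reject_null(data, p_null, alpha=0.05):
    """Reject H0: p = p_null when p_null lies outside the normal-based interval."""
    n = len(data)
    p_hat = sum(data) / n
    se_hat = (p_hat * (1 - p_hat) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = p_hat - z * se_hat, p_hat + z * se_hat
    return not (lo < p_null < hi)

flips = [1] * 70 + [0] * 30   # 70 heads in 100 tosses (made-up data)

fair_rejected = reject_null(flips, p_null=0.5)    # H0: the coin is fair
biased_rejected = reject_null(flips, p_null=0.7)  # H0: p = 0.7
print(fair_rejected, biased_rejected)
```

With these data the interval is roughly \((0.61, 0.79)\), so the fair-coin hypothesis is rejected while \(p=0.7\) is not.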

Note

TODO - write common definitions, significance level, rejection region, critical point, type-I type-II errors

Machine Learning as a Statistical Inference¶

Note

We have iid samples from an unknown joint CDF, e.g. \((X_i,Y_i)_{i=1}^n\sim F_{X,Y}\).
Model inference: Model inference means estimating the conditional expectation corresponding to \(F_{Y|X}\) with a regression function \(r(X)\) such that

\[T(F_{Y|X})=\mathbb{E}[Y|X]=r(X),\qquad Y=r(X)+\epsilon\]

where \(\mathbb{E}[\epsilon|X]=0\).
- This inference is known as learning in Machine Learning and curve estimation in statistics.
Variable inference: In the above case, variable inference means estimating an unseen \(Y|X=x\) by \(\widehat{Y}=\widehat{y}=\widehat{r}(x)\) for a given \(X=x\), where \(\widehat{r}\) is the estimated regression function.
- This is known as inference in Machine Learning and prediction in statistics.
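
The two kinds of inference can be separated in code. The sketch below assumes a straight-line function class fitted by least squares, which is only one possible choice of \(r\); the data values are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimate of r(x) = b0 + b1 * x; returns the fitted function."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return lambda x: b0 + b1 * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]   # illustrative values, roughly y = 1 + 2x

r_hat = fit_line(xs, ys)   # model inference ("learning")
y_hat = r_hat(5.0)         # variable inference ("prediction") at an unseen x
print(y_hat)
```

Fitting produces an estimate of the regression function; applying it to a new \(x\) produces a point estimate of the unseen \(Y\).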

Note

Dependent and Independent Variable:
- \(X\) is called the independent variable (features) and \(Y\) the dependent variable (target).
- Independent variables are often multidimensional vectors \(X=\mathbf{x}\in\mathbb{R}^d\) for some \(d>1\).

Attention

The process that decides the model, such as the choice of function class or number of parameters, is independent of the inference and is performed separately beforehand. In ML, these choices are called hyperparameters.
Since there are multiple items to choose before performing inference, it is useful to clarify the sequence:
1. A metric of goodness of an estimator is chosen first.
2. A model is chosen (for example, by fixing hyperparameters).
3. Inference is performed using computation involving the samples.
4. The quality of the model is judged by evaluating it on the inference data.
5. (Optional) A different model is chosen and the process repeats.
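
The sequence of steps above can be sketched end to end. Everything specific here is an assumption made for illustration: mean-squared error as the metric, k-nearest-neighbour regression as the model with `k` as its hyperparameter, synthetic data, and a separate held-out sample for the evaluation step.

```python
import random

def mse(y_true, y_pred):
    """Step 1: the goodness metric, fixed before anything else."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def knn_predict(train, x, k):
    """Steps 2-3: k is the hyperparameter; predict by averaging the k nearest targets."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

rng = random.Random(0)

def draw(n):
    # Synthetic data: Y = X^2 + noise (an assumption of this sketch).
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    return [(x, x * x + rng.gauss(0.0, 0.3)) for x in xs]

train, held_out = draw(200), draw(100)

# Steps 4-5: evaluate one candidate model, then choose another and repeat.
scores = {}
for k in (1, 5, 25, 200):
    preds = [knn_predict(train, x, k) for x, _ in held_out]
    scores[k] = mse([y for _, y in held_out], preds)
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With `k = 200` every prediction is the overall average of the training targets, so that model scores much worse than the smaller values of `k`.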