Statistical Inference

Note

Statistical Functionals: The functions of this form, \(T(F)\), such as

  • density \(T(F)=f_F=\frac{\mathop{dF}}{\mathop{dx}}\)

  • expectation \(T(F)=\mathbb{E}_F[X]=\int x \mathop{dF}\)

  • variance \(T(F)=\mathbb{V}_F(X)=\int(x-\mathbb{E}_F[X])^2\mathop{dF}\)

  • moment generating function: \(T(F)=M_F(s)=\int e^{sx}\mathop{dF}\)

  • median: \(T(F)=F^{-1}(1/2)\)
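As a minimal sketch of how such functionals are used in practice, one can evaluate \(T\) at the empirical CDF of a sample (the "plug-in" idea, which is an assumption added here and not part of the notes above). The helper name `plug_in_functionals` and the simulated normal sample are hypothetical.

```python
import random
import statistics

def plug_in_functionals(sample):
    """Evaluate T(F_hat) for a few functionals, where F_hat is the empirical CDF."""
    n = len(sample)
    mean = sum(sample) / n                                # integral of x dF_hat
    variance = sum((x - mean) ** 2 for x in sample) / n   # integral of (x - mean)^2 dF_hat
    median = statistics.median(sample)                    # conventional sample version of F_hat^{-1}(1/2)
    return {"mean": mean, "variance": variance, "median": median}

# Hypothetical data: 10,000 draws from a normal with mean 2 and standard deviation 1.
random.seed(0)
sample = [random.gauss(2.0, 1.0) for _ in range(10_000)]
print(plug_in_functionals(sample))
```

For an even sample size `statistics.median` averages the two middle values, which is one common convention for the sample median.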

Attention

  • We have a sample of size \(n\), \(X_1,\cdots,X_n\), from an unknown CDF \(F\).

  • The task of statistical inference is to infer \(F\), or some \(T(F)\), that best explains the data, for some criterion of "best" chosen beforehand.

  • The inferred values based on data are called estimates of the quantities of interest.

  • Estimates vary from sample to sample, so before the data are observed an estimate is a rv.

  • The rv associated with these estimates is called an estimator.

Attention

  • Statistic: Any function of the data \(g(X_1,\cdots,X_n)\) is called a statistic.

  • Any estimator is a statistic.

Statistical Model

Note

A statistical model \(\mathcal{F}\) is a set of distributions or other statistical functionals of interest.

Types of Statistical Model

The following categories of models are based on the dimensionality of \(\mathcal{F}\).

Parametric Model

Note

  • \(\mathcal{F}\) can be spanned by finitely many parameters.

See also

Example:

  • Let the parameter vector be \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_k)^\top\).

  • The model here is the set of distributions \(\mathcal{F}=\{F_\boldsymbol{\theta}\}=\{F_X(x;\theta_1,\cdots,\theta_k)\}\).

Non-parametric Model

Note

\(\mathcal{F}\) cannot be spanned by finitely many parameters.

See also

Example: Set of all possible CDFs.
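A common estimate of \(F\) within this model is the empirical CDF, which puts mass \(1/n\) on each observation. The sketch below is an illustration added here, not something the notes define; `empirical_cdf` is a hypothetical helper name.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F_hat, where F_hat(x) is the fraction of sample points <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    # bisect_right counts how many ordered values are <= x.
    return lambda x: bisect_right(ordered, x) / n

F_hat = empirical_cdf([3.0, 1.0, 2.0, 2.0])
print(F_hat(0.5), F_hat(2.0), F_hat(10.0))
```

No finite parameter vector is fitted here: the estimate is determined by the whole sample.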

Different Approaches to Inference

Bayesian Inference

Note

  • The quantity that we want to estimate is assumed to be a rv on its own, \(\Theta\).

  • Before observing any data, we have a prior notion of what its distribution is. This is expressed as the prior distribution \(f_\Theta(\theta)\).

  • The likelihood is the PDF of the data conditioned on \(\Theta\), \(f_{X|\Theta}(x|\theta)\).

  • The posterior is obtained by applying Bayes' rule, which gives a single probability model for the quantity after observation, \(f_{\Theta|X}(\theta|x)\).

  • We perform inference about \(\Theta\) based on this distribution directly.
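The prior-to-posterior step can be sketched with a conjugate pair, where Bayes' rule has a closed form. The choice of a Bernoulli likelihood with a Beta prior, the data, and the function names are all assumptions made for illustration.

```python
def beta_bernoulli_posterior(a, b, data):
    """Posterior Beta parameters after observing 0/1 data under a Beta(a, b) prior."""
    successes = sum(data)
    failures = len(data) - successes
    # For this conjugate pair, Bayes' rule reduces to adding counts to the prior parameters.
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

data = [1, 0, 1, 1, 0, 1, 1, 1]                          # hypothetical data: 6 successes, 2 failures
a_post, b_post = beta_bernoulli_posterior(1, 1, data)    # Beta(1, 1) is the uniform prior
print(a_post, b_post, beta_mean(a_post, b_post))
```

The posterior mean is one possible summary of \(f_{\Theta|X}\); the full posterior is what inference is based on.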

Frequentist (Classical) Inference

Note

  • The quantity that we want to estimate is assumed to be an unknown constant, \(\theta\).

  • No prior knowledge is assumed about this.

  • We assume that our underlying probability model is dependent on \(\theta\) in some way.

  • Therefore, the probability model here is the collection of PDFs \(f_X(x;\theta)\), one for each possible value of \(\theta\).

  • In order to perform inference, our statements must apply to all possible values of \(\theta\).

Types of Inference

Point Estimation

Note

  • A single best estimate (point) within the model for

    • Classical: the unknown constant \(\theta\)

    • Bayesian: the rv \(\Theta=\theta\)

  • This estimate of \(\theta\) is expressed as a statistic \(\widehat{\theta}_n=g(x_1,\cdots,x_n)\)

  • The estimator \(\widehat{\Theta}_n\) is always a rv as it evaluates to a different value with a different sample.

  • Examples:

    1. a single distribution/density function (parameterised/non-parameterised)

    2. a single regression function

    3. a single value for expectation/variance/other moments

    4. a single prediction for a dependent variable given an independent variable, etc.

Some useful terminology

Note

  • Sampling Distribution: The distribution of \(\widehat{\Theta}_n\) over different samples.

  • Estimation Error:

    • Classical: \(\tilde{\Theta}_n=\widehat{\Theta}_n-\theta\)

    • Bayesian: \(\tilde{\Theta}_n=\widehat{\Theta}_n-\Theta\)

  • Bias:

    • Classical: \(\text{b}(\widehat{\Theta}_n)=\mathbb{E}_{\theta}[\tilde{\Theta}_n]=\mathbb{E}_{\theta}[\widehat{\Theta}_n]-\theta\)

    • Bayesian: \(\text{b}(\widehat{\Theta}_n)=\mathbb{E}[\tilde{\Theta}_n]=\mathbb{E}[\widehat{\Theta}_n]-\mathbb{E}[\Theta]\)

  • Standard Error:

    • Classical: \(\text{se}(\widehat{\Theta}_n)=\sqrt{\mathbb{V}_{\theta}(\widehat{\Theta}_n)}\)

    • Bayesian: \(\text{se}(\widehat{\Theta}_n)=\sqrt{\mathbb{V}(\widehat{\Theta}_n)}\)

  • If the variance above is itself an estimate (as it often is), then we estimate SE as \(\widehat{\text{se}}=\widehat{\text{se}}(\widehat{\Theta}_n)=\sqrt{\widehat{\mathbb{V}}_{\theta}(\widehat{\Theta}_n)}\).

  • Mean-Squared Error:

    • Classical: \(\text{mse}(\widehat{\Theta}_n)=\mathbb{E}_{\theta}[\tilde{\Theta}_n^2]=\mathbb{E}_{\theta}[(\widehat{\Theta}_n-\theta)^2]=\text{b}^2(\widehat{\Theta}_n)+\text{se}^2(\widehat{\Theta}_n)\)

    • Bayesian: \(\text{mse}(\widehat{\Theta}_n)=\mathbb{E}[\tilde{\Theta}_n^2]=\mathbb{E}[(\widehat{\Theta}_n-\Theta)^2]=\mathbb{E}[\widehat{\Theta}_n^2]+\mathbb{E}[\Theta^2]-2\mathbb{E}[\widehat{\Theta}_n\Theta]\)
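The classical quantities above can be approximated by simulating the sampling distribution. The sketch below does this for the divide-by-\(n\) variance estimator on normal data. The estimator, the distribution, and the helper names are chosen here for illustration and are not from the notes.

```python
import random

def simulate_estimator(estimator, theta, sampler, n, trials, seed=0):
    """Approximate bias, standard error and MSE of an estimator by simulation."""
    rng = random.Random(seed)
    # Each trial draws a fresh sample of size n and records the estimate.
    estimates = [estimator([sampler(rng) for _ in range(n)]) for _ in range(trials)]
    mean_est = sum(estimates) / trials
    bias = mean_est - theta
    var = sum((e - mean_est) ** 2 for e in estimates) / trials
    mse = sum((e - theta) ** 2 for e in estimates) / trials
    return bias, var ** 0.5, mse

def variance_mle(sample):
    """Divide-by-n variance estimator; its bias is -sigma^2 / n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

bias, se, mse = simulate_estimator(
    variance_mle, theta=1.0, sampler=lambda rng: rng.gauss(0.0, 1.0), n=10, trials=20_000
)
print(bias, se, mse)
print(abs(mse - (bias ** 2 + se ** 2)))   # decomposition holds up to rounding
```

With \(n=10\) and \(\sigma^2=1\) the simulated bias should be close to \(-0.1\), and the simulated MSE matches \(\text{b}^2+\text{se}^2\) because the same empirical moments are used on both sides.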

Note

  • Unbiased Estimator: If \(\text{b}(\widehat{\Theta}_n)=0\).

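  • Asymptotically Unbiased Estimator: If \(\lim\limits_{n\to\infty}\text{b}(\widehat{\Theta}_n)=0\). Convergence in mean, \(\widehat{\Theta}_n\xrightarrow[]{L_1}\theta\) (or \(\Theta\)), is sufficient for this.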

  • Consistent Estimator: If \(\widehat{\Theta}_n\xrightarrow[]{P}\theta\) (or \(\Theta\)).

  • Asymptotically Normal Estimator:

    • Classical: \(\frac{\widehat{\Theta}_n-\theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\).

    • Bayesian: \(\frac{\widehat{\Theta}_n-\Theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\).

Attention

Theorem: If \(\lim\limits_{n\to\infty}\text{b}_\theta(\widehat{\Theta}_n)=0\) and \(\lim\limits_{n\to\infty}\text{se}(\widehat{\Theta}_n)=0\) then \(\widehat{\Theta}_n\) is consistent.
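A short sketch of why this holds in the classical case, using Markov's inequality on the squared error together with the decomposition \(\text{mse}=\text{b}^2+\text{se}^2\) given above: for any \(\epsilon>0\),

\[\mathbb{P}_{\theta}(|\widehat{\Theta}_n-\theta|>\epsilon)\le\frac{\mathbb{E}_{\theta}[(\widehat{\Theta}_n-\theta)^2]}{\epsilon^2}=\frac{\text{b}^2(\widehat{\Theta}_n)+\text{se}^2(\widehat{\Theta}_n)}{\epsilon^2}\to 0\quad\text{as }n\to\infty\]

so \(\widehat{\Theta}_n\xrightarrow[]{P}\theta\).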

Confidence Set Estimation

Attention

  • In a Bayesian setting, the point estimate is already associated with a probability distribution, which conveys the degree of belief that the true quantity equals the estimated quantity.

  • On the other hand, confidence set estimation is a technique used in a classical setting. However, it makes a probabilistic statement about the estimated set, not the quantity itself.

Note

  • An estimated set which traps the fixed, unknown value of our quantity of interest with a pre-determined probability.

  • A 95% confidence set means that if we repeatedly construct it from multiple samples (this works even if the samples come from completely unrelated experiments), then around 95% of the time the estimated set contains the true quantity.

Attention

  1. A \(1-\alpha\) confidence interval (CI) for a real quantity of interest \(\theta\) is defined as \(\widehat{C}_n=(a,b)\) where \(\mathbb{P}(\theta\in\widehat{C}_n)\ge 1-\alpha\).

  2. The task is to estimate \(\widehat{a}=a(X_1,\cdots,X_n)\) and \(\widehat{b}=b(X_1,\cdots,X_n)\) such that the above holds.

  3. For vector quantities, this is expressed with sets instead of intervals.

  4. In a regression setting, a confidence interval around the regression function can be thought of as the set of functions which contains the true function with a certain probability. However, this is rarely computed in practice.

Some useful terminology

Note

  • Pointwise Asymptotic CI: \(\forall\theta,\liminf\limits_{n\to\infty}\mathbb{P}_{\theta}(\theta\in\widehat{C}_n)\ge 1-\alpha\)

    • Given any \(\theta\), the smallest probability that \(\widehat{C}_n\) captures \(\theta\) is at least \(1-\alpha\) asymptotically as \(n\to\infty\).

    • The rate of this convergence depends on the value of \(\theta\).

  • Uniform Asymptotic CI: \(\liminf\limits_{n\to\infty}\inf\limits_{\theta\in\Theta}\mathbb{P}_{\theta}(\theta\in\widehat{C}_n)\ge 1-\alpha\)

    • Given any \(n\), we consider the smallest probability that \(\widehat{C}_n\) captures \(\theta\) for any \(\theta\in\Theta\).

    • This probability is at least \(1-\alpha\) asymptotically as \(n\to\infty\).

    • Uniform Asymptotic CI is stricter, as in, satisfying this condition automatically implies the former.

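  • Normal-based CI: If \(\widehat{\Theta}_n\) is an asymptotically normal estimator of \(\theta\), then a \(1-\alpha\) confidence interval is given by

    \[(\widehat{\Theta}_n-z_{\alpha/2}\widehat{\text{se}},\widehat{\Theta}_n+z_{\alpha/2}\widehat{\text{se}})\]

    where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of \(\mathcal{N}(0,1)\), i.e. \(\mathbb{P}(Z>z_{\alpha/2})=\alpha/2\) for \(Z\sim\mathcal{N}(0,1)\).

    • The above is a pointwise asymptotic CI.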
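The normal-based interval and its repeated-sampling interpretation can be checked with a small simulation. This is a sketch under assumptions made here: the quantity of interest is a mean, the estimator is the sample mean, the data are normal with unit variance, and `normal_ci` and `coverage` are hypothetical helper names.

```python
import random
from statistics import NormalDist

def normal_ci(sample, alpha=0.05):
    """Normal-based 1 - alpha interval for the mean, using the estimated standard error."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se_hat = (var / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 quantile, about 1.96 for alpha = 0.05
    return mean - z * se_hat, mean + z * se_hat

def coverage(theta, n, trials, alpha=0.05, seed=0):
    """Fraction of simulated samples whose interval contains the true mean theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lo, hi = normal_ci([rng.gauss(theta, 1.0) for _ in range(n)], alpha)
        hits += lo < theta < hi
    return hits / trials

print(coverage(theta=3.0, n=100, trials=2_000))
```

The printed fraction should be near 0.95. It is the interval that varies across trials, while \(\theta\) stays fixed.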

Hypothesis Testing

Note

  • We have two or more hypotheses about the probability model, \(H_0\) (null) and \(H_1\) (alternate), exactly one of which is true; which one is unknown.

    • We might have one hypothesis, which we can convert into two by taking \(H_1=\neg H_0\).

  • We assume that the unknown true hypothesis determines the distribution of the data.

    • Bayesian:

      • Here we assume that the truth of the hypothesis is itself a Bernoulli rv \(\Theta\), with \(H_0=T\implies\Theta=1\).

      • We have some prior \(p_{\Theta}(\theta)\).

    • Classical:

      • We assume that we have a different probability model under each hypothesis, \(f_X(x; H_0)\) and \(f_X(x; H_1)\).

      • No prior knowledge is assumed.

  • Inferring about \(H_0\) and \(H_1\) then becomes similar to point estimation.

Attention

We create a \(1-\alpha\) confidence set for the estimated quantity.

  • If the value of the quantity specified by the null hypothesis doesn’t fall within this set, then we reject the null hypothesis at significance level \(\alpha\).

  • If it does, then we fail to reject the null hypothesis.
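The decision rule above can be sketched for a hypothesis about a mean, \(H_0:\theta=\theta_0\), using a normal-based interval. The function name `wald_test` and the data are hypothetical, and the sample is far too small for the normal approximation to be trusted; it only illustrates the mechanics.

```python
from statistics import NormalDist

def wald_test(sample, theta0, alpha=0.05):
    """Reject H0: theta = theta0 when theta0 falls outside the normal-based CI."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se_hat = (var / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = mean - z * se_hat, mean + z * se_hat
    return not (lo < theta0 < hi)     # True means "reject H0"

sample = [4.9, 5.3, 5.1, 4.8, 5.4, 5.0, 5.2, 5.1]   # hypothetical measurements
print(wald_test(sample, theta0=5.0))   # theta0 inside the interval: fail to reject
print(wald_test(sample, theta0=6.0))   # theta0 outside the interval: reject
```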

Note

  • TODO - write common definitions, significance level, rejection region, critical point, type-I type-II errors

Machine Learning as a Statistical Inference

Note

  • We have iid samples from an unknown joint CDF, e.g. \((X_i,Y_i)_{i=1}^n\sim F_{X,Y}\).

  • Model inference: Model inference means estimating the conditional expectation corresponding to \(F_{Y|X}\) with a regression function \(r(X)\) such that

    \[T(F_{Y|X})=\mathbb{E}[Y|X]=r(X),\qquad Y=r(X)+\epsilon\]

    where \(\mathbb{E}[\epsilon]=0\).

    • This inference is known as learning in Machine Learning and curve estimation in statistics.

  • Variable inference: In the above case, a variable inference means estimating an unseen \(Y|X=x\) by \(\widehat{Y}=\widehat{y}=r(x)\) for a given \(X=x\).

    • This is known as inference in Machine Learning and prediction in statistics.
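The two steps can be sketched with the simplest parametric choice, a straight line \(r(x)=a+bx\) fitted by least squares. The function class, the fitting criterion, the data, and the names `fit_line` and `predict` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Model inference: least-squares estimate of r(x) = a + b * x from paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

def predict(params, x):
    """Variable inference: predicted value of Y at a given X = x."""
    a, b = params
    return a + b * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]    # hypothetical features
ys = [1.1, 2.9, 5.2, 7.1, 8.8]    # hypothetical targets
params = fit_line(xs, ys)
print(params, predict(params, 5.0))
```

Here `fit_line` plays the role of learning (curve estimation) and `predict` the role of ML inference (prediction).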

Note

Dependent and Independent Variable:

Attention

  • The process that decides the model, such as the choice of function class or number of parameters, is independent of the inference and is performed separately beforehand. In ML, these choices are called hyperparameters.

  • Since there are multiple items to choose before performing inference, it is useful to clarify the sequence:

    1. A metric of goodness of an estimator is chosen first.

    2. A model is chosen (for example, by fixing the hyperparameters).

    3. Inference is performed using computation involving the samples.

    4. The quality of the model is judged by evaluating the model on the inference data.

    5. (Optional) A different model is chosen and the process repeats.

  • \(X\) is called the independent variable (features) and \(Y\) is called the dependent variable (target).

  • Independent variables are often multidimensional vectors \(X=\mathbf{x}\in\mathbb{R}^d\) for some \(d>1\).