Parametric Point Estimation¶

Classical Inference¶

Method of Moments Estimator (MOM)¶

Note

Let the parameter vector be \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_k)\).
Let the estimator be \(\widehat{\Theta}_n=(\widehat{\theta_1},\cdots,\widehat{\theta_k})\).
Let \(\alpha_j=\alpha_j({\boldsymbol{\theta}})=\mathbb{E}_{\boldsymbol{\theta}}[X^j]=\int x^j\mathop{dF_{\boldsymbol{\theta}}}(x)\) for \(1\leq j\leq k\).
We assume that the moments exist and can be expressed in closed form as equations involving the parameters \(\theta_j\).
Estimate the moments with the sample moments:

\[\alpha_j(\widehat{\Theta}_n)=\widehat{\alpha}_j=\frac{1}{n}\sum_{i=1}^n X_i^j\]
This gives a system of \(k\) equations in the \(k\) unknowns \((\widehat{\theta}_j)_{j=1}^k\).

Properties¶

Common Estimators¶

Bernoulli¶

Note

We have samples \(X=(X_1,\cdots,X_n)\) from a Bernoulli with unknown \(p\).
\(\alpha_1=p\), so \(\widehat{p}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).

Normal¶

Note

We have samples \(X=(X_1,\cdots,X_n)\) from a Normal with unknown \(\mu,\sigma\).
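\(\alpha_1=\mu\), so \(\widehat{\mu}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).
\(\alpha_2=\mu^2+\sigma^2\), so \(\widehat{\mu}^2+\widehat{\sigma}^2=\widehat{\alpha}_2=\frac{1}{n}\sum_{i=1}^n X^2_i\), which gives \(\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n X_i^2-\widehat{\mu}^2\).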
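The Normal case can be sketched in a few lines of Python: compute the first two sample moments, then solve the two moment equations for \(\mu\) and \(\sigma\). The function name `mom_normal`, the seed, and the simulated parameters (2.0 and 3.0) are illustrative assumptions, not part of the original notes.

```python
import random

def mom_normal(xs):
    """Method-of-moments estimates (mu_hat, sigma_hat) for a Normal sample.

    Hypothetical helper written for illustration only.
    """
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    mu_hat = m1                      # solves alpha_1 = mu
    var_hat = m2 - m1 ** 2           # solves alpha_2 = mu^2 + sigma^2
    # Guard against a tiny negative value caused by floating-point rounding.
    return mu_hat, max(var_hat, 0.0) ** 0.5

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
sample = [rng.gauss(2.0, 3.0) for _ in range(20000)]  # assumed mu=2, sigma=3
mu_hat, sigma_hat = mom_normal(sample)
print(mu_hat, sigma_hat)
```

With a large sample the two estimates should land close to the values used in the simulation; the variance estimate uses the divisor \(n\), not \(n-1\), because it comes straight from the moment equations.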

Maximum Likelihood Estimator (MLE)¶

Likelihood function¶

Note

We assume that we have a sample of size \(n\), \(X=(X_1,\cdots,X_n)\), such that \(X_i\sim f_{X_i}(x_i; \theta)\).
The likelihood function is defined as \(\mathcal{L}_n(\theta)=f_X(x; \theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)\).

Warning

Given a particular observation \(X=x=(x_1,\cdots,x_n)\), the function \(\mathcal{L}_n(\theta)=f_X(x; \theta)\) is no longer a density, but just a function of \(\theta\).
For the discrete case, \(\mathcal{L}_n(\theta)=p_X(x; \theta)=\mathbb{P}(X_1=x_1,\cdots,X_n=x_n;\theta)\).
- This is the probability of observing exactly the current data under a particular \(\theta\).
- If this probability is higher at \(\theta=\theta_i\) than at \(\theta=\theta_j\), the data support \(\theta_i\) more strongly than \(\theta_j\) as the value of the underlying parameter.

Note

We estimate \(\widehat{\Theta}_n=\widehat{\Theta}_{\text{ML}}=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\).

Log likelihood¶

Independence assumption:

\[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_{X_i}(x_i;\theta)\]

Identical distribution assumption:

\[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_X(x_i;\theta)\]

The log likelihood is defined as

\[l_n(\theta)=\log{\mathcal{L}_n(\theta)}=\sum_{i=1}^n \log(f_X(x_i;\theta))\]

Since log is a monotonically increasing function,

\[\mathop{\underset{\theta}{\mathrm{argmax}}}l_n(\theta)=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\]
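As a small illustration of this equivalence, the sketch below evaluates both the likelihood and the log likelihood of an iid Bernoulli sample on a grid of candidate values of \(p\) and checks that they peak at the same point. The data, the grid resolution, and the function names are assumptions made for the example.

```python
import math

def likelihood(p, xs):
    """Product of Bernoulli pmfs: L_n(p) under the iid assumption."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else (1.0 - p)
    return out

def log_likelihood(p, xs):
    """Sum of log pmfs: l_n(p) = log L_n(p)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in xs)

# Hypothetical data: 7 successes out of 10 trials.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate values strictly inside (0, 1) so that log(0) never occurs.
grid = [i / 100 for i in range(1, 100)]

p_from_L = max(grid, key=lambda p: likelihood(p, data))
p_from_l = max(grid, key=lambda p: log_likelihood(p, data))
print(p_from_L, p_from_l)
```

Both maximisers coincide with the sample proportion here. In practice the log form is preferred because a product of many small densities underflows quickly, while the sum of logs does not.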

Properties¶

Note

Consistent: \(\widehat{\Theta}_{\text{ML}}\xrightarrow[]{P}\theta\).
- Proof hint: use the KL divergence between the true value \(\theta_{\text{true}}\) and an arbitrary \(\theta\).
  - The log likelihood at the true value, \(l_n(\theta_{\text{true}})\), does not depend on \(\theta\), so it is a constant in the maximisation.
  - Maximising \(l_n(\theta)\) is therefore the same as maximising
    
    \[M_n(\theta)=\frac{1}{n}\left(l_n(\theta)-l_n(\theta_{\text{true}})\right)=\frac{1}{n}\sum_{i=1}^n\log\left(\frac{f_X(x_i;\theta)}{f_X(x_i;\theta_{\text{true}})}\right)\]
  - Let \(M(\theta)\) be the expectation of each summand under \(\theta_{\text{true}}\)
    
    \[M(\theta)=\mathbb{E}_{\theta_\text{true}}\left[\log\left(\frac{f_X(X;\theta)}{f_X(X;\theta_{\text{true}})}\right)\right]=\int\log\left(\frac{f_X(x;\theta)}{f_X(x;\theta_{\text{true}})}\right)f_X(x;\theta_{\text{true}})\mathop{dx}=-D_{KL}(\theta_{\text{true}},\theta)\]
  - Since the KL divergence is non-negative, the maximum value of \(M(\theta)\) is 0, attained at \(\theta=\theta_{\text{true}}\).
  - By the law of large numbers, \(M_n(\theta)\xrightarrow[]{P}M(\theta)\) for every fixed \(\theta\).
  - Pointwise convergence alone is not enough: a formal proof needs \(M_n\) to converge to \(M\) uniformly in \(\theta\).
Equivariant: If \(\widehat{\Theta}_{\text{ML}}\) is the MLE for \(\theta\), then \(g(\widehat{\Theta}_{\text{ML}})\) is the MLE for \(g(\theta)\).
- TODO proof
Asymptotically normal: \(\frac{\widehat{\Theta}_{\text{ML}}-\theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\)
- Score function: \(s(X;\theta)=\frac{\partial}{\partial\theta}\log f(X;\theta)\)
- Fisher information: \(I_n(\theta)=\mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i;\theta)\right)=\sum_{i=1}^n\mathbb{V}_\theta\left(s(X_i;\theta)\right)\), where the second equality uses the independence of the \(X_i\).
Asymptotically optimal: Among well-behaved estimators, the MLE has the smallest variance, at least for large sample sizes.
- TODO proof
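The asymptotic normality property can be checked by simulation. The sketch below assumes a Bernoulli model, for which the MLE is the sample mean and the Fisher information of \(n\) iid draws is \(n/(p(1-p))\); it also assumes the usual plug-in choice \(\widehat{\text{se}}=1/\sqrt{I_n(\widehat{p})}\), which the notes above do not spell out. All names, the seed, and the simulation sizes are illustrative.

```python
import math
import random
import statistics

def bernoulli_mle(xs):
    """MLE of p for an iid Bernoulli sample: the sample mean."""
    return sum(xs) / len(xs)

def estimated_se(p_hat, n):
    """Plug-in standard error, assuming se = 1 / sqrt(I_n(p_hat)).

    For n iid Bernoulli(p) draws, I_n(p) = n / (p * (1 - p)).
    """
    return math.sqrt(p_hat * (1.0 - p_hat) / n)

rng = random.Random(1)          # arbitrary fixed seed
p_true, n, reps = 0.3, 400, 2000  # assumed simulation settings

z_scores = []
for _ in range(reps):
    xs = [1 if rng.random() < p_true else 0 for _ in range(n)]
    p_hat = bernoulli_mle(xs)
    z_scores.append((p_hat - p_true) / estimated_se(p_hat, n))

z_mean = statistics.mean(z_scores)
z_std = statistics.pstdev(z_scores)
print(z_mean, z_std)
```

If the property holds, the standardised estimates should have a mean near 0 and a standard deviation near 1. This is a sanity check under the stated assumptions, not a proof.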

Computing CI for MLE¶

Common Estimators¶

Bernoulli¶

Uniform¶

Binomial¶

Geometric¶

Multinomial¶

Exponential¶

Normal¶

Iterative Method of Computation¶

Note

For complicated or composite rvs, maximising the likelihood in closed form might be challenging.
We can approximate the MLE with iterative methods.

Newton Raphson¶

Note

We take an initial estimate \(\theta'\) as a starting point.
- MOM can give us a good starting point.
We assume that the true optimal value \(\theta^*\) lies in the vicinity of this initial guess.
We apply a first-order Taylor approximation
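A minimal sketch of one common form of this iteration, assuming a scalar parameter and a twice-differentiable log likelihood: expanding the score \(l_n'(\theta)\) to first order around \(\theta'\) and setting it to zero gives the update \(\theta\leftarrow\theta'-l_n'(\theta')/l_n''(\theta')\). The example model (a logistic distribution with unknown location and unit scale, which has no closed-form MLE), the function names, and the simulation settings are all assumptions chosen for illustration.

```python
import math
import random

def score(theta, xs):
    """l_n'(theta) for the logistic location model with unit scale."""
    return sum(math.tanh((x - theta) / 2.0) for x in xs)

def score_derivative(theta, xs):
    """l_n''(theta); strictly negative, so the log likelihood is concave."""
    return -0.5 * sum(1.0 - math.tanh((x - theta) / 2.0) ** 2 for x in xs)

def newton_raphson(xs, theta0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, xs) / score_derivative(theta, xs)
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = random.Random(2)  # arbitrary fixed seed
theta_true = 1.5        # assumed true location for the simulation
# Inverse-CDF sampling of the logistic distribution; u is kept away from 0 and 1.
xs = [theta_true + math.log(u / (1.0 - u))
      for u in (rng.uniform(1e-12, 1.0 - 1e-12) for _ in range(5000))]

theta_start = sum(xs) / len(xs)  # MOM start: the logistic mean equals theta
theta_ml = newton_raphson(xs, theta_start)
print(theta_start, theta_ml)
```

Starting from the MOM estimate keeps the initial guess close to the maximiser, which is where the Taylor approximation is most accurate. For models whose log likelihood is not concave, the plain update can diverge, and a damped step or a different optimiser may be needed.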

The EM Algorithm¶

Bayesian Inference¶

Maximum A Posteriori Estimator (MAP)¶

Common Estimators¶

Bernoulli¶

Normal¶

Minimum Mean Squared Error Estimator (MMSE)¶