Parametric Point Estimation

Classical Inference

Method of Moments Estimator (MOM)

Note

  • Let the parameter vector be \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_k)\).

  • Let the estimator be \(\widehat{\Theta}_n=(\widehat{\theta_1},\cdots,\widehat{\theta_k})\).

  • Let \(\alpha_j=\alpha_j({\boldsymbol{\theta}})=\mathbb{E}_{\boldsymbol{\theta}}[X^j]=\int x^j\mathop{dF_{\boldsymbol{\theta}}}(x)\) for \(1\leq j\leq k\).

  • We assume that the moments exist and can be expressed in closed form as equations involving the parameters \(\theta_j\).

  • Estimate each moment by the corresponding sample moment:

    \[\widehat{\alpha}_j=\alpha_j(\widehat{\Theta}_n)=\frac{1}{n}\sum_{i=1}^n X_i^j\]
  • This gives a system of \(k\) equations in the \(k\) unknowns \((\widehat{\theta}_j)_{j=1}^k\).

Properties

See also

  • Consistent: \(\widehat{\Theta}_n\xrightarrow[]{P}\boldsymbol{\theta}\)

  • Asymptotically normal:

    • TODO write expression

Common Estimators

Bernoulli

Note

  • We have samples \(X=(X_1,\cdots,X_n)\) from a Bernoulli with unknown \(p\).

  • The first moment is \(\alpha_1=p\), so \(\widehat{p}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).

Normal

Note

  • We have samples \(X=(X_1,\cdots,X_n)\) from a Normal with unknown \(\mu,\sigma\).

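  • The first moment is \(\alpha_1=\mu\), so \(\widehat{\mu}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).

  • The second moment is \(\alpha_2=\mu^2+\sigma^2\), so \(\widehat{\mu}^2+\widehat{\sigma}^2=\widehat{\alpha}_2=\frac{1}{n}\sum_{i=1}^n X^2_i\), which gives \(\widehat{\sigma}^2=\widehat{\alpha}_2-\widehat{\alpha}_1^2\).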
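The two moment equations for the Normal case can be solved directly. The following is a minimal sketch, not a reference implementation; the function name `normal_mom` and the sample values are made up for illustration.

```python
import math

def normal_mom(samples):
    """Method of moments estimates (mu_hat, sigma_hat) for Normal data."""
    n = len(samples)
    m1 = sum(samples) / n                  # first sample moment
    m2 = sum(x * x for x in samples) / n   # second sample moment
    mu_hat = m1                            # from alpha_1 = mu
    var_hat = m2 - m1 * m1                 # from alpha_2 = mu^2 + sigma^2
    # Guard against a tiny negative value caused by rounding.
    return mu_hat, math.sqrt(max(var_hat, 0.0))

# Illustrative data only.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = normal_mom(samples)
print(mu_hat, sigma_hat)
```

Note that the resulting variance estimate divides by \(n\), not \(n-1\).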

Maximum Likelihood Estimator (MLE)

Likelihood function

Note

  • We assume a sample of size \(n\), \(X=(X_1,\cdots,X_n)\), such that \(X_i\sim f_{X_i}(x_i; \theta)\).

  • Likelihood function is defined as \(\mathcal{L}_n(\theta)=f_X(x; \theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)\).

Warning

  • Given a particular observation \(X=x=(x_1,\cdots,x_n)\), the function \(\mathcal{L}_n(\theta)=f_X(x; \theta)\) is no longer a density, but just a function of \(\theta\).

  • For discrete case, \(\mathcal{L}_n(\theta)=p_X(x; \theta)=\mathbb{P}(X_1=x_1,\cdots,X_n=x_n;\theta)\).

    • This is the probability that the observation would match current data under a particular \(\theta\).

    • If this probability is higher when \(\theta=\theta_i\) compared to \(\theta=\theta_j\), it is more likely that the underlying parameter has value \(\theta_i\).
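As a small worked example (the data are made up for illustration, and the observations are assumed independent), take three Bernoulli observations \(x=(1,1,0)\) with unknown \(p\):

    \[\mathcal{L}_3(p)=\mathbb{P}(X_1=1,X_2=1,X_3=0;p)=p^2(1-p)\]
Then \(\mathcal{L}_3(0.7)=0.147\) and \(\mathcal{L}_3(0.3)=0.063\), so the observed data are more probable under \(p=0.7\) than under \(p=0.3\).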

Note

We estimate \(\widehat{\Theta}_n=\widehat{\Theta}_{\text{ML}}=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\).

Log likelihood

  • Independence assumption:

    \[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_{X_i}(x_i;\theta)\]
  • Identical distribution assumption:

    \[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_X(x_i;\theta)\]
  • The log likelihood is defined as

    \[l_n(\theta)=\log{\mathcal{L}_n(\theta)}=\sum_{i=1}^n \log(f_X(x_i;\theta))\]
  • Since \(\log\) is monotonically increasing,

    \[\mathop{\underset{\theta}{\mathrm{argmax}}}l_n(\theta)=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\]
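The equality of the two maximisers can be checked numerically. The sketch below assumes iid Bernoulli data and searches a coarse grid of candidate values. The data, the grid and the function names are illustrative choices, and a grid search is used here only because it is easy to read.

```python
import math

def likelihood(p, xs):
    """Product of Bernoulli probabilities (iid assumption)."""
    out = 1.0
    for x in xs:
        out *= p if x == 1 else (1.0 - p)
    return out

def log_likelihood(p, xs):
    """Sum of log Bernoulli probabilities (iid assumption)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in xs)

# Illustrative data: 7 successes in 10 trials.
xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Candidate values 0.01, 0.02, ..., 0.99 (endpoints excluded so log is defined).
grid = [i / 100 for i in range(1, 100)]

p_lik = max(grid, key=lambda p: likelihood(p, xs))
p_log = max(grid, key=lambda p: log_likelihood(p, xs))
print(p_lik, p_log)
```

Both searches pick the same grid point, which here coincides with the sample mean. In practice the log form is preferred because a product of many small probabilities underflows quickly.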

Properties

Note

  • Consistent: \(\widehat{\Theta}_{\text{ML}}\xrightarrow[]{P}\theta\).

    • Proof hint: use the KL divergence between the distribution at the true value \(\theta_{\text{true}}\) and the distribution at an arbitrary \(\theta\).

      • The log likelihood at the true value, \(l_n(\theta_{\text{true}})\), does not depend on \(\theta\), so it acts as a constant in the maximisation.

      • Maximising \(l_n(\theta)\) is the same as maximising

        \[M_n(\theta)=\frac{1}{n}\left(l_n(\theta)-l_n(\theta_{\text{true}})\right)=\frac{1}{n}\sum_{i=1}^n\log\left(\frac{f_X(x_i;\theta)}{f_X(x_i;\theta_{\text{true}})}\right)\]
      • Let \(M(\theta)\) be the expectation of each summand under \(\theta_{\text{true}}\):

        \[M(\theta)=\mathbb{E}_{\theta_\text{true}}\left[\log\left(\frac{f_X(X;\theta)}{f_X(X;\theta_{\text{true}})}\right)\right]=\int\log\left(\frac{f_X(x;\theta)}{f_X(x;\theta_{\text{true}})}\right)f_X(x;\theta_{\text{true}})\mathop{dx}=-D_{KL}(\theta_{\text{true}},\theta)\]
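      • Since \(D_{KL}\geq 0\), with equality when the two distributions coincide, the maximum value of \(M(\theta)\) is 0, attained at \(\theta=\theta_{\text{true}}\).

      • For each fixed \(\theta\), \(M_n(\theta)\xrightarrow[]{P}M(\theta)\) by the law of large numbers.

      • A formal proof needs this convergence to hold uniformly in \(\theta\), so that the maximiser of \(M_n\) converges to the maximiser of \(M\).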

  • Equivariant: If \(\widehat{\Theta}_{\text{ML}}\) is the MLE for \(\theta\), then \(g(\widehat{\Theta}_{\text{ML}})\) is the MLE for \(g(\theta)\).

    • TODO proof

  • Asymptotically normal: \(\frac{\widehat{\Theta}_{\text{ML}}-\theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\)

    • Score function: \(s(X;\theta)=\frac{\partial}{\partial\theta}\log f(X;\theta)\)

    • Fisher information: \(I_n(\theta)=\mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i;\theta)\right)=\sum_{i=1}^n\mathbb{V}_\theta\left(s(X_i;\theta)\right)\), where the second equality uses the independence of the \(X_i\).

  • Asymptotically optimal: Among well-behaved estimators, the MLE has the smallest variance for large sample sizes.

    • TODO proof
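To make the score and Fisher information concrete, consider the Bernoulli case. For a single observation the score is \(s(x;p)=\frac{x}{p}-\frac{1-x}{1-p}\), its variance is \(\frac{1}{p(1-p)}\), and so \(I_n(p)=\frac{n}{p(1-p)}\). The sketch below plugs the MLE into the Fisher information and uses the standard plug-in estimate \(\widehat{\text{se}}=\sqrt{1/I_n(\widehat{p})}\). That formula is an assumption brought in here, since the notes above do not define \(\widehat{\text{se}}\). The data and function names are illustrative.

```python
import math

def bernoulli_mle(xs):
    """MLE of p for 0/1 data: the sample mean."""
    return sum(xs) / len(xs)

def bernoulli_fisher_information(p, n):
    """I_n(p) = n / (p (1 - p)) for n iid Bernoulli(p) observations."""
    return n / (p * (1.0 - p))

# Illustrative data: 7 successes in 10 trials.
xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

p_hat = bernoulli_mle(xs)
info = bernoulli_fisher_information(p_hat, len(xs))

# Plug-in standard error (assumed standard form, see the lead-in).
se_hat = math.sqrt(1.0 / info)
print(p_hat, info, se_hat)
```

The estimated standard error shrinks at rate \(1/\sqrt{n}\), which matches the scaling in the asymptotic normality statement.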

Computing CI for MLE

Common Estimators

Bernoulli
Uniform
Binomial
Geometric
Multinomial
Exponential
Normal

Iterative Method of Computation

Note

  • For complicated or composite rvs, working with the likelihood in closed form can be difficult.

  • We can approximate the MLE by iterative methods.

Newton Raphson

Note

  • We gather an initial estimate as a starting point, \(\theta'\).

    • MOM can give us a good starting point.

  • We assume that the true optimal value \(\theta^*\) lies in the vicinity of this initial guess.

  • We apply a first-order Taylor approximation to the derivative of the log likelihood around \(\theta'\).
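Setting the linearised derivative to zero, \(0\approx l_n'(\theta')+(\theta^*-\theta')\,l_n''(\theta')\), suggests the update \(\theta'\leftarrow\theta'-l_n'(\theta')/l_n''(\theta')\), repeated until the step is negligible. The sketch below applies this to an Exponential model with rate \(\lambda\), chosen only because its closed-form MLE \(1/\bar{x}\) makes the result easy to check. The function names, data, starting point and tolerance are all illustrative assumptions.

```python
def newton_raphson_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - score(theta) / hessian(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return theta

# Illustrative data, treated as iid Exponential(rate) observations.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
n = len(data)
total = sum(data)

def score(lam):
    # First derivative of l_n(lam) = n log(lam) - lam * sum(x_i).
    return n / lam - total

def hessian(lam):
    # Second derivative of the same log likelihood.
    return -n / lam ** 2

# Start away from the answer to show the iteration at work.
lam_hat = newton_raphson_mle(score, hessian, theta0=0.5)
print(lam_hat, n / total)
```

Convergence depends on the starting point. For this model the iteration diverges if the initial rate is too large, which is one reason a reasonable initial estimate matters.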

The EM Algorithm

Note

  • TODO add more details

  • Assume hidden (latent) variables: computing the likelihood is easier for the joint distribution of the observed and hidden variables.

Bayesian Inference

Maximum A Posteriori Estimator (MAP)

Common Estimators

Bernoulli
Normal

Minimum Mean Squared Error Estimator (MMSE)