Parametric Point Estimation¶
Classical Inference¶
Method of Moments Estimator (MOM)¶
Note
Let the parameter vector be \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_k)\).
Let the estimator be \(\widehat{\Theta}_n=(\widehat{\theta_1},\cdots,\widehat{\theta_k})\).
Let \(\alpha_j=\alpha_j({\boldsymbol{\theta}})=\mathbb{E}_{\boldsymbol{\theta}}[X^j]=\int x^j\mathop{dF_{\boldsymbol{\theta}}}(x)\) for \(1\leq j\leq k\).
We assume that the moments exist and can be expressed in closed form as equations involving the parameters \(\theta_j\).
Estimate moments with sample moments as
\[\widehat{\alpha}_j=\frac{1}{n}\sum_{i=1}^n X_i^j\]
Setting \(\alpha_j(\widehat{\Theta}_n)=\widehat{\alpha}_j\) for \(1\leq j\leq k\) gives a system of \(k\) equations in the \(k\) unknowns \((\widehat{\theta}_j)_{j=1}^k\). The MOM estimator \(\widehat{\Theta}_n\) is the solution of this system.
Properties¶
See also
Consistent: \(\widehat{\Theta}_n\xrightarrow[]{P}\boldsymbol{\theta}\)
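Asymptotically normal: \(\sqrt{n}(\widehat{\Theta}_n-\boldsymbol{\theta})\xrightarrow[]{D}\mathcal{N}(0,\Sigma)\), where the covariance matrix \(\Sigma\) follows from the delta method applied to the map from the moments \((\alpha_1,\cdots,\alpha_k)\) to the parameters.
TODO write the expression for \(\Sigma\)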
Common Estimators¶
Bernoulli¶
Note
We have samples \(X=(X_1,\cdots,X_n)\) from a Bernoulli with unknown \(p\).
\(\alpha_1=\mathbb{E}_p[X]=p\), so \(\widehat{p}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).
Normal¶
Note
We have samples \(X=(X_1,\cdots,X_n)\) from a Normal with unknown \(\mu,\sigma\).
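\(\alpha_1=\mu\), so \(\widehat{\mu}=\widehat{\alpha}_1=\frac{1}{n}\sum_{i=1}^n X_i\).
\(\alpha_2=\mu^2+\sigma^2\), so \(\widehat{\mu}^2+\widehat{\sigma}^2=\widehat{\alpha}_2=\frac{1}{n}\sum_{i=1}^n X^2_i\), which gives \(\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n X_i^2-\widehat{\mu}^2\).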
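The Normal case can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy is available. The function name `mom_normal` and the sample values are made up for the example. It matches the first two sample moments to \(\mu\) and \(\mu^2+\sigma^2\) and solves for the parameters.

```python
import numpy as np

def mom_normal(x):
    """Method-of-moments estimates (mu_hat, sigma_hat) for a Normal sample."""
    x = np.asarray(x, dtype=float)
    a1 = np.mean(x)        # first sample moment, estimates mu
    a2 = np.mean(x ** 2)   # second sample moment, estimates mu^2 + sigma^2
    mu_hat = a1
    sigma2_hat = a2 - a1 ** 2          # solve mu^2 + sigma^2 = a2 for sigma^2
    # Guard against a tiny negative value caused by floating-point rounding.
    return mu_hat, np.sqrt(max(sigma2_hat, 0.0))

# Illustrative data only.
mu_hat, sigma_hat = mom_normal([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Note that the resulting variance estimate divides by \(n\), not \(n-1\), so it is not the unbiased sample variance.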
Maximum Likelihood Estimator (MLE)¶
Likelihood function¶
Note
We assume that we have a sample of size \(n\), \(X=(X_1,\cdots,X_n)\), such that \(X_i\sim f_{X_i}(x_i; \theta)\).
The likelihood function is defined as \(\mathcal{L}_n(\theta)=f_X(x; \theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)\).
Warning
Given a particular observation \(X=x=(x_1,\cdots,x_n)\), the function \(\mathcal{L}_n(\theta)=f_X(x; \theta)\) is no longer a density, but just a function of \(\theta\).
In the discrete case, \(\mathcal{L}_n(\theta)=p_X(x; \theta)=\mathbb{P}(X_1=x_1,\cdots,X_n=x_n;\theta)\).
This is the probability of observing exactly the current data under a particular \(\theta\).
If this probability is higher when \(\theta=\theta_i\) than when \(\theta=\theta_j\), then \(\theta_i\) is a more plausible value of the underlying parameter than \(\theta_j\).
Note
We estimate \(\widehat{\Theta}_n=\widehat{\Theta}_{\text{ML}}=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\).
Log likelihood¶
Independence assumption:
\[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_{X_i}(x_i;\theta)\]
Identical distribution assumption:
\[\mathcal{L}_n(\theta)=f_{X_1,\cdots,X_n}(x_1,\cdots,x_n;\theta)=\prod_{i=1}^n f_X(x_i;\theta)\]
The log likelihood is defined as
\[l_n(\theta)=\log{\mathcal{L}_n(\theta)}=\sum_{i=1}^n \log(f_X(x_i;\theta))\]
As log is a monotonically increasing function,
\[\mathop{\underset{\theta}{\mathrm{argmax}}}l_n(\theta)=\mathop{\underset{\theta}{\mathrm{argmax}}}\mathcal{L}_n(\theta)\]
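As a quick numerical check of the argmax identity, the sketch below evaluates the Bernoulli likelihood and log likelihood on a grid of candidate values of \(p\). The data, the grid, and the function name `bernoulli_log_likelihood` are illustrative assumptions. A grid search is used only for clarity and is not how one would compute the MLE in practice.

```python
import numpy as np

def bernoulli_log_likelihood(p, x):
    """Log likelihood of i.i.d. Bernoulli(p) data x, vectorised over p."""
    x = np.asarray(x)
    k = x.sum()    # number of successes
    n = x.size     # sample size
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Illustrative sample: 7 successes out of 10 trials.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Candidate values of p, kept away from 0 and 1 to avoid log(0).
grid = np.linspace(0.01, 0.99, 99)

log_lik = bernoulli_log_likelihood(grid, x)
lik = np.exp(log_lik)

p_hat_log = grid[np.argmax(log_lik)]   # maximiser of the log likelihood
p_hat_lik = grid[np.argmax(lik)]       # maximiser of the likelihood
```

Both maximisers land on the same grid point, which here coincides with the sample mean.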
Properties¶
Note
Consistent: \(\widehat{\Theta}_{\text{ML}}\xrightarrow[]{P}\theta\).
Proof Hint: Use the KL divergence between the true value \(\theta_{\text{true}}\) and an arbitrary \(\theta\).
The log likelihood at the true value, \(l_n(\theta_{\text{true}})\), does not depend on \(\theta\).
Hence maximising \(l_n(\theta)\) is the same as maximising
\[M_n(\theta)=\frac{1}{n}\left(l_n(\theta)-l_n(\theta_{\text{true}})\right)=\frac{1}{n}\sum_{i=1}^n\log\left(\frac{f_X(x_i;\theta)}{f_X(x_i;\theta_{\text{true}})}\right)\]
Let \(M(\theta)\) be the expectation of each summand under \(\theta_{\text{true}}\):
\[M(\theta)=\mathbb{E}_{\theta_\text{true}}\left[\log\left(\frac{f_X(X;\theta)}{f_X(X;\theta_{\text{true}})}\right)\right]=\int\log\left(\frac{f_X(x;\theta)}{f_X(x;\theta_{\text{true}})}\right)f_X(x;\theta_{\text{true}})\mathop{dx}=-D_{KL}(\theta_{\text{true}},\theta)\]
Since the KL divergence is non-negative, the maximum value of \(M(\theta)\) is 0, attained at \(\theta=\theta_{\text{true}}\) (uniquely, if the model is identifiable).
For all \(\theta\), \(M_n(\theta)\xrightarrow[]{P}M(\theta)\) by the law of large numbers.
So the maximiser of \(M_n\), which is \(\widehat{\Theta}_{\text{ML}}\), should approach the maximiser of \(M\), which is \(\theta_{\text{true}}\).
Technically, pointwise convergence is not enough: we need uniform convergence over \(\theta\) to prove this formally.
Equivariant: If \(\widehat{\Theta}_{\text{ML}}\) is the MLE for \(\theta\), then \(g(\widehat{\Theta}_{\text{ML}})\) is the MLE for \(g(\theta)\).
TODO proof
Asymptotically normal: \(\frac{\widehat{\Theta}_{\text{ML}}-\theta}{\widehat{\text{se}}}\xrightarrow[]{D}\mathcal{N}(0,1)\)
Score function: \(s(X;\theta)=\frac{\partial}{\partial\theta}\log f(X;\theta)\)
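Fisher information: \(I_n(\theta)=\mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i;\theta)\right)=\sum_{i=1}^n\mathbb{V}_\theta\left(s(X_i;\theta)\right)\), where the second equality uses the independence of the \(X_i\).
The standard error is \(\text{se}=\sqrt{1/I_n(\theta)}\), estimated by \(\widehat{\text{se}}=\sqrt{1/I_n(\widehat{\Theta}_{\text{ML}})}\).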
Asymptotically optimal (efficient): among well-behaved estimators, the MLE has the smallest variance, at least for large sample sizes.
TODO proof
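The consistency argument can be checked numerically. The sketch below simulates Bernoulli data and compares the sample quantity \(M_n\) with its population counterpart \(M\), which is the negative KL divergence. The true parameter, seed, sample size, and grid are arbitrary choices made for illustration, and the helper names `m_n` and `m` are not from the notes. It is a sanity check and not a proof.

```python
import numpy as np

def m_n(p, x, p_true):
    """Sample average of log f(x_i; p) / f(x_i; p_true) for Bernoulli data."""
    log_ratio = x * np.log(p / p_true) + (1 - x) * np.log((1 - p) / (1 - p_true))
    return log_ratio.mean()

def m(p, p_true):
    """Population version: minus the KL divergence between p_true and p."""
    return p_true * np.log(p / p_true) + (1 - p_true) * np.log((1 - p) / (1 - p_true))

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
p_true = 0.3                     # assumed true parameter for the simulation
x = rng.binomial(1, p_true, size=100_000)

grid = np.linspace(0.05, 0.95, 91)
m_n_values = np.array([m_n(p, x, p_true) for p in grid])
m_values = m(grid, p_true)

# Largest discrepancy between M_n and M over the grid.
max_gap = np.max(np.abs(m_n_values - m_values))
# Grid maximiser of M_n, a stand-in for the MLE.
p_hat = grid[np.argmax(m_n_values)]
```

With a sample this large, `max_gap` should be small and `p_hat` should sit close to `p_true`, in line with the argument above.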
Computing CI for MLE¶
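The notes leave this section empty. One common construction, assumed here and not taken from the notes, is the Wald interval \(\widehat{\Theta}_{\text{ML}}\pm z_{\alpha/2}\,\widehat{\text{se}}\), which follows from the asymptotic normality of the MLE. The sketch below applies it to a Bernoulli sample, where \(I_n(p)=\frac{n}{p(1-p)}\) and so \(\widehat{\text{se}}=\sqrt{\widehat{p}(1-\widehat{p})/n}\). The function name `bernoulli_wald_ci` and the counts are illustrative.

```python
import math

def bernoulli_wald_ci(successes, n, z=1.96):
    """Approximate Wald confidence interval for a Bernoulli parameter.

    z=1.96 corresponds to a nominal 95% level under the normal approximation.
    """
    p_hat = successes / n                          # MLE of p
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)    # sqrt(1 / I_n(p_hat))
    return p_hat - z * se_hat, p_hat + z * se_hat

# Illustrative counts: 70 successes in 100 trials.
lower, upper = bernoulli_wald_ci(successes=70, n=100)
```

The approximation relies on a large sample and can behave poorly when \(\widehat{p}\) is close to 0 or 1.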
Iterative Method of Computation¶
Note
For complicated or composite rvs, maximising the likelihood in closed form might be challenging.
We can approximate the MLE by iterative methods.
Newton Raphson¶
Note
We gather an initial estimate as a starting point, \(\theta'\).
MOM can give us a good starting point.
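We assume that the true optimal value \(\theta^*\) lies in the vicinity of this initial guess.
We apply a first-order Taylor approximation to the derivative of the log likelihood around \(\theta'\): \(0=l_n'(\theta^*)\approx l_n'(\theta')+(\theta^*-\theta')\,l_n''(\theta')\).
Solving for \(\theta^*\) gives the update \(\theta^*\approx\theta'-\frac{l_n'(\theta')}{l_n''(\theta')}\), which is repeated until the estimate stops changing.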
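A minimal sketch of the resulting iteration \(\theta\leftarrow\theta-l_n'(\theta)/l_n''(\theta)\) is given below. The example model is an assumption chosen for illustration: a logistic distribution with unknown location and scale fixed at 1, whose MLE has no closed form. Its score is \(\sum_i\tanh((x_i-\theta)/2)\), and its mean equals the location, so the sample mean serves as the MOM starting point. The data and the function names are made up.

```python
import numpy as np

def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Find a root of the score function by Newton-Raphson iteration."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta = theta - step          # theta <- theta - l'(theta) / l''(theta)
        if abs(step) < tol:           # stop once the update is negligible
            break
    return theta

# Illustrative data, assumed to come from a logistic distribution with scale 1.
x = np.array([-1.2, 0.3, 0.8, 1.1, 2.5, 3.0, 4.7])

def score(theta):
    """First derivative of the logistic log likelihood in the location."""
    return np.sum(np.tanh((x - theta) / 2))

def hessian(theta):
    """Second derivative of the logistic log likelihood in the location."""
    return -0.5 * np.sum(1 - np.tanh((x - theta) / 2) ** 2)

theta_start = x.mean()                # MOM estimate as the starting point
theta_hat = newton_raphson(score, hessian, theta_start)
```

The iteration has no safeguard against a poor starting point or a near-zero second derivative, so a practical implementation would likely add step control.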
The EM Algorithm¶
Note
TODO add more details
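Assume there are hidden (latent) variables such that the joint likelihood of the observed and hidden variables is easier to compute than the likelihood of the observed data alone.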
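To make the idea concrete, here is a hedged sketch of EM on a textbook example that is not taken from these notes: a mixture of two Normals with known unit variances, unknown means, and an unknown mixing weight. The hidden variable is the component label of each observation. Given the labels, the joint likelihood is easy to maximise. EM alternates between computing the posterior label probabilities (E-step) and re-estimating the parameters with those probabilities as weights (M-step). All names, the simulated data, and the initialisation are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma=1.0):
    """Density of a Normal(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, mu_init, n_iter=200):
    """EM for a two-component Normal mixture with unit variances (assumed known)."""
    mu0, mu1 = mu_init
    w = 0.5                                   # initial weight of component 1
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        p1 = w * normal_pdf(x, mu1)
        p0 = (1 - w) * normal_pdf(x, mu0)
        r = p1 / (p0 + p1)
        # M-step: weighted updates of the mixing weight and the two means.
        w = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu0 = np.sum((1 - r) * x) / np.sum(1 - r)
    return w, mu0, mu1

# Simulated data: 40% from Normal(4, 1), 60% from Normal(0, 1).
rng = np.random.default_rng(0)                # fixed seed, arbitrary choice
z = rng.random(2000) < 0.4                    # hidden labels, unused by EM
x = np.where(z, rng.normal(4.0, 1.0, 2000), rng.normal(0.0, 1.0, 2000))

w_hat, mu0_hat, mu1_hat = em_two_gaussians(x, mu_init=(x.min(), x.max()))
```

EM only guarantees convergence to a local optimum, so the result can depend on the initial values.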