Functions of Random Variables¶
Density of a function of a rv¶
Let \(Y=g(X)\) be a function of an rv \(X\).
Note
If \(X\) is discrete, this is discussed in the random variable section (TODO: add hyperlink)
If \(X\) is continuous with a PDF \(f_X(x)\), then the process for finding \(f_Y(y)\) is as follows:
Compute the CDF as
\[F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\int\limits_{\{x|g(x)\leq y\}}f_X(x) \mathop{dx}\]Compute the PDF as \(f_Y(y)=F'_Y(y)\).
Tip
Some effort is required to compute the set \(\{x|g(x)\leq y\}\).
Find \(f_Y(y)\) where \(Y=X^2\).
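As a sketch of the two-step recipe for this exercise, assume (purely for illustration) that \(X\) is standard normal. The set \(\{x|x^2\leq y\}\) is \([-\sqrt{y},\sqrt{y}]\) for \(y\geq 0\), so \(F_Y(y)=F_X(\sqrt{y})-F_X(-\sqrt{y})\), and differentiating gives \(f_Y(y)=\frac{1}{2\sqrt{y}}\left(f_X(\sqrt{y})+f_X(-\sqrt{y})\right)\) for \(y>0\). The helper names `cdf_y`, `pdf_y` and `empirical_cdf` are hypothetical, and the simulation is only a sanity check.

```python
import math
import random
from statistics import NormalDist

X = NormalDist(0.0, 1.0)  # assumed distribution of X, for illustration only

def cdf_y(y: float) -> float:
    """F_Y(y) = P(X^2 <= y) = F_X(sqrt(y)) - F_X(-sqrt(y)) for y >= 0."""
    if y < 0:
        return 0.0
    r = math.sqrt(y)
    return X.cdf(r) - X.cdf(-r)

def pdf_y(y: float) -> float:
    """f_Y(y) = F_Y'(y) = (f_X(sqrt(y)) + f_X(-sqrt(y))) / (2 sqrt(y)) for y > 0."""
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return (X.pdf(r) + X.pdf(-r)) / (2.0 * r)

# Monte Carlo check: square simulated draws of X and compare empirical CDFs.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]

def empirical_cdf(y: float) -> float:
    return sum(s <= y for s in samples) / len(samples)

for y in (0.5, 1.0, 2.0):
    print(y, round(cdf_y(y), 4), round(empirical_cdf(y), 4))
```

The printed pairs should agree to roughly two decimal places; the agreement is statistical, not exact.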
Special cases¶
Linear functions¶
Let \(Y=g(X)=aX+b\).
Tip
If \(a=0\), then \(Y=b\) with probability 1 and it’s no longer a continuous rv.
If \(a\neq 0\), then we have
\[\begin{split}F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\mathbb{P}(aX+b\leq y)=\begin{cases}\mathbb{P}\left(X\leq\frac{y-b}{a}\right) & \text{if $a>0$} \\ \mathbb{P}\left(X\geq\frac{y-b}{a}\right) & \text{if $a<0$}\end{cases}=\begin{cases}F_X(\frac{y-b}{a}) & \text{if $a>0$} \\ 1-F_X(\frac{y-b}{a}) & \text{if $a<0$}\end{cases}\end{split}\]We can recover the PDF in both cases as
\[\begin{split}f_Y(y)=\begin{cases}\frac{1}{a}f_X(\frac{y-b}{a}) & \text{if $a>0$} \\ -\frac{1}{a}f_X(\frac{y-b}{a}) & \text{if $a<0$}\end{cases}=\frac{1}{\left| a \right|}f_X(\frac{y-b}{a})\end{split}\]
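The linear rule can be checked numerically. The sketch below assumes \(X\) is normal (an illustrative choice, since \(aX+b\) is then also normal with known parameters) and uses a hypothetical helper `linear_transform_pdf`.

```python
from statistics import NormalDist

def linear_transform_pdf(pdf_x, a: float, b: float):
    """Return f_Y for Y = a*X + b, a != 0, via f_Y(y) = f_X((y - b)/a) / |a|."""
    if a == 0:
        raise ValueError("a must be non-zero; otherwise Y is a constant")
    return lambda y: pdf_x((y - b) / a) / abs(a)

# Assumed example: X ~ N(1, 2^2), so Y = -3X + 4 ~ N(1, 6^2).
X = NormalDist(mu=1.0, sigma=2.0)
a, b = -3.0, 4.0
pdf_y = linear_transform_pdf(X.pdf, a, b)

# Reference density built from the known mean and standard deviation of Y.
Y = NormalDist(mu=a * X.mean + b, sigma=abs(a) * X.stdev)

for y in (-5.0, 1.0, 7.0):
    print(y, pdf_y(y), Y.pdf(y))
```

A negative \(a\) is used on purpose: without the absolute value the formula would return a negative density.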
Monotonic functions¶
Note
If \(y=g(x)\) is a strictly monotonic function, then it has an inverse, \(x=g^{-1}(y)\).
Therefore, we have
\[\begin{split}F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\begin{cases}\mathbb{P}(X\leq g^{-1}(y)) & \text{if $g(X)$ is monotonic increasing}\\\mathbb{P}(X\geq g^{-1}(y)) & \text{if $g(X)$ is monotonic decreasing}\end{cases}=\begin{cases}F_X(g^{-1}(y)) & \text{if $g(X)$ is monotonic increasing}\\1-F_X(g^{-1}(y)) & \text{if $g(X)$ is monotonic decreasing}\end{cases}\end{split}\]We can recover the PDF in both cases as
\[f_Y(y)=f_X(g^{-1}(y))\cdot\left|\frac{\mathop{d}}{\mathop{dy}}\left[g^{-1}(y)\right]\right|\]We note that the linear case is a special case of monotonic functions.
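A minimal sketch of the change-of-variables rule \(f_Y(y)=f_X(g^{-1}(y))\cdot\left|\frac{d}{dy}g^{-1}(y)\right|\), using an assumed example: \(X\sim\text{Exponential}(1)\) and the decreasing map \(g(x)=e^{-x}\), for which \(Y\) is uniform on \((0,1)\). The function `monotone_transform_pdf` is a hypothetical helper.

```python
import math

def monotone_transform_pdf(pdf_x, g_inv, g_inv_prime):
    """f_Y(y) = f_X(g^{-1}(y)) * |d/dy g^{-1}(y)| for strictly monotonic g."""
    return lambda y: pdf_x(g_inv(y)) * abs(g_inv_prime(y))

# Assumed example: X ~ Exponential(1) and the decreasing map g(x) = exp(-x).
pdf_x = lambda x: math.exp(-x) if x >= 0 else 0.0
g_inv = lambda y: -math.log(y)       # inverse of g, defined for y > 0
g_inv_prime = lambda y: -1.0 / y     # negative because g is decreasing

pdf_y = monotone_transform_pdf(pdf_x, g_inv, g_inv_prime)

for y in (0.1, 0.5, 0.9):
    print(y, pdf_y(y))   # close to 1: Y = exp(-X) is Uniform(0, 1)
```

Because the derivative of the inverse is negative here, the absolute value is what keeps the density non-negative.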
Density of a function of multiple jointly distributed rvs¶
Let \(Z=g(X,Y)\) be a function of 2 jointly distributed rvs, \(X\) and \(Y\). In this case, we follow the same process as before.
Tip
Compute the CDF as
\[F_Z(z)=\mathbb{P}(Z\leq z)=\mathbb{P}(g(X,Y)\leq z)=\iint\limits_{\{(x,y)|g(x,y)\leq z\}}f_{X,Y}(x,y)\mathop{dx}\mathop{dy}\]Compute the PDF as \(f_Z(z)=F'_Z(z)\).
Extends naturally for more than 2 rvs.
See also
Find the PDF of \(Z=X/Y\), where \(X\) and \(Y\) are independent and uniformly distributed in \([0,1]\).
Two people join a call but they are late by an amount, independent of the other, that follows an exponential distribution with parameter \(\lambda\). Find the PDF of the difference in their joining time.
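For the first exercise, integrating the uniform joint density over \(\{(x,y)|x/y\leq z\}\) inside the unit square gives \(F_Z(z)=z/2\) for \(0<z\leq 1\) and \(F_Z(z)=1-\frac{1}{2z}\) for \(z>1\). The sketch below compares this against simulation; the function names are hypothetical.

```python
import random

def cdf_ratio(z: float) -> float:
    """F_Z(z) for Z = X/Y with X, Y independent Uniform(0, 1)."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z / 2.0                # area of {x <= z*y} inside the unit square
    return 1.0 - 1.0 / (2.0 * z)      # one minus the area of {y < x/z}

def pdf_ratio(z: float) -> float:
    """f_Z(z) = F_Z'(z): 1/2 on (0, 1] and 1/(2 z^2) for z > 1."""
    if z <= 0:
        return 0.0
    return 0.5 if z <= 1 else 1.0 / (2.0 * z * z)

rng = random.Random(1)
n = 200_000
# 1 - rng.random() lies in (0, 1], which avoids dividing by zero.
ratios = [rng.random() / (1.0 - rng.random()) for _ in range(n)]

def empirical_cdf(z: float) -> float:
    return sum(r <= z for r in ratios) / n

for z in (0.5, 1.0, 3.0):
    print(z, cdf_ratio(z), round(empirical_cdf(z), 4))
```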
Special cases¶
Sum of independent rvs: Convolution¶
We want the PDF (or PMF) of the sum of two independent rvs, \(X\) and \(Y\), \(Z=X+Y\).
Discrete case¶
Tip
We note that
\[p_{Z|X}(z|x)=\mathbb{P}(Z=z|X=x)=\mathbb{P}(X+Y=z|X=x)=\mathbb{P}(x+Y=z)=\mathbb{P}(Y=z-x)=p_{Y}(z-x)\]Therefore, the joint mass between \(X\) and \(Z\) factorises as
\[p_{X,Z}(x,z)=p_X(x)p_{Z|X}(z|x)=p_X(x)p_{Y}(z-x)\]Marginalising, we obtain
\[p_Z(z)=\sum_x p_{X,Z}(x,z)=\sum_x p_X(x)p_{Y}(z-x)=(p_X \ast p_Y)[z]\]
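The discrete convolution can be sketched directly from the sum above. The example assumes two fair six-sided dice, which is not part of the text, and `convolve_pmf` is a hypothetical helper.

```python
from collections import defaultdict
from fractions import Fraction

def convolve_pmf(p_x: dict, p_y: dict) -> dict:
    """PMF of Z = X + Y for independent X, Y: p_Z(z) = sum_x p_X(x) p_Y(z - x)."""
    p_z = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] += px * py
    return dict(p_z)

# Assumed example: the sum of two fair dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmf(die, die)

for z in sorted(two_dice):
    print(z, two_dice[z])
```

Exact fractions are used so that the result can be compared with hand calculation without rounding.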
Continuous case¶
Tip
We note that
\[F_{Z|X}(z|x)=\mathbb{P}(Z\leq z|X=x)=\mathbb{P}(X+Y\leq z|X=x)=\mathbb{P}(x+Y\leq z)=\mathbb{P}(Y\leq z-x)=F_{Y}(z-x)\]Differentiating both sides, \(f_{Z|X}(z|x)=f_{Y}(z-x)\).
Therefore, the joint density between \(X\) and \(Z\) factorises as
\[f_{X,Z}(x,z)=f_X(x)f_{Z|X}(z|x)=f_X(x)f_{Y}(z-x)\]Marginalising, we obtain
\[f_Z(z)=\int\limits_{-\infty}^\infty f_{X,Z}(x,z)\mathop{dx}=\int\limits_{-\infty}^\infty f_X(x)f_{Y}(z-x)\mathop{dx}=(f_X \ast f_Y)[z]\]
See also
Find the PDF of the sum of two independent normals.
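For this exercise, the known answer is that the sum of independent \(\mathcal{N}(\mu_1,\sigma_1^2)\) and \(\mathcal{N}(\mu_2,\sigma_2^2)\) rvs is \(\mathcal{N}(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)\). The sketch below approximates the convolution integral with a midpoint rule and compares it with that density; the parameters and the helper `convolve_pdf` are illustrative assumptions.

```python
from statistics import NormalDist

def convolve_pdf(pdf_x, pdf_y, z, lo, hi, steps=4000):
    """Approximate int f_X(x) f_Y(z - x) dx over [lo, hi] with the midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += pdf_x(x) * pdf_y(z - x)
    return total * dx

# Assumed parameters, for illustration only.
X = NormalDist(1.0, 2.0)
Y = NormalDist(-0.5, 1.5)
Z = NormalDist(X.mean + Y.mean, (X.variance + Y.variance) ** 0.5)

# Integrate over ten standard deviations of X on either side of its mean.
lo, hi = X.mean - 10 * X.stdev, X.mean + 10 * X.stdev
approx = {z: convolve_pdf(X.pdf, Y.pdf, z, lo, hi) for z in (-2.0, 0.5, 4.0)}

for z, value in approx.items():
    print(z, value, Z.pdf(z))
```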
Covariance and correlation¶
Scalar valued rvs¶
Covariance is defined between two scalar valued rvs as \(\sigma_{X,Y}=\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\).
Note
\(\mathrm{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]\).
The proof follows from expanding the product in the definition.
\(\mathrm{Cov}(X,X)=\mathbb{V}(X)\).
\(\mathrm{Cov}(X,aY+b)=a\cdot\mathrm{Cov}(X,Y)\).
\(\mathrm{Cov}(X,Y+Z)=\mathrm{Cov}(X,Y)+\mathrm{Cov}(X,Z)\).
\(\mathbb{V}(X+Y)=\mathbb{V}(X)+\mathbb{V}(Y)+2\cdot\mathrm{Cov}(X,Y)\).
In general
\[\mathbb{V}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \mathbb{V}(X_i)+\sum_{i=1}^n\sum_{j=1, j\neq i}^n\mathrm{Cov}(X_i,X_j)\]
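The covariance identities can be checked on data, since they hold exactly for empirical (population-style) moments. In particular, \(\mathbb{V}(X+Y)=\mathbb{V}(X)+\mathbb{V}(Y)+2\cdot\mathrm{Cov}(X,Y)\): the double sum over \(i\neq j\) counts each pair twice. The simulated data and the helper `cov` below are illustrative assumptions.

```python
import random
from statistics import fmean

def cov(xs, ys):
    """Population covariance E[(X - E[X])(Y - E[Y])] of paired samples."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

rng = random.Random(2)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [0.6 * x + rng.gauss(0, 1) for x in xs]  # correlated with xs by construction
sums = [x + y for x, y in zip(xs, ys)]

var_of_sum = cov(sums, sums)                                  # Cov(Z, Z) = V(Z)
decomposed = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)

# Cov(X, aY + b) = a * Cov(X, Y): the shift b drops out.
scaled = cov(xs, [3.0 * y + 5.0 for y in ys])

print(var_of_sum, decomposed)
print(scaled, 3.0 * cov(xs, ys))
```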
Note
Correlation coefficient is defined as the normalised version of covariance
\[\rho(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathbb{V}(X)\mathbb{V}(Y)}}.\]We have \(|\rho(X,Y)|\leq 1\).
Let \(\tilde{X}=X-\mathbb{E}[X]\) and \(\tilde{Y}=Y-\mathbb{E}[Y]\) be the centered rvs.
The correlation coefficient then becomes
\[\rho(X,Y)=\frac{\mathbb{E}[\tilde{X}\tilde{Y}]}{\sqrt{\mathbb{E}[\tilde{X}^2]\cdot \mathbb{E}[\tilde{Y}^2]}}\]The proof follows from Cauchy-Schwarz inequality.
Equality, \(|\rho(X,Y)|=1\), holds only when \(\tilde{X}=c\cdot \tilde{Y}\) for some constant \(c\).
See also
We can solve the hat problem using covariance.
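A sketch, assuming the hat problem refers to the classic setup where \(n\) people each pick one of their \(n\) hats uniformly at random and \(X\) counts how many get their own hat back. Writing \(X=\sum_i X_i\) with indicator rvs \(X_i\), the variance-of-a-sum formula gives \(\mathbb{V}(X)\); the code compares that with brute-force enumeration. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def hat_moments(n: int):
    """Exact mean and variance of the number of fixed points of a uniform random permutation."""
    counts = [sum(p[i] == i for i in range(n)) for p in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    var = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, var

def hat_variance_via_covariance(n: int) -> Fraction:
    """V(X) = sum_i V(X_i) + sum_{i != j} Cov(X_i, X_j) with indicator rvs X_i."""
    p = Fraction(1, n)                           # P(person i gets own hat)
    var_i = p * (1 - p)                          # variance of a Bernoulli(p) indicator
    cov_ij = Fraction(1, n * (n - 1)) - p * p    # E[X_i X_j] - E[X_i] E[X_j]
    return n * var_i + n * (n - 1) * cov_ij

print(hat_moments(5))
print(hat_variance_via_covariance(5))
```

The enumeration is factorial in \(n\), so it is only meant as a check for small \(n\).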
Vector valued rvs¶
Let us consider vector-valued rvs \(\mathbf{X}\in\mathbb{R}^n\) and \(\mathbf{Y}\in\mathbb{R}^m\), which take values \(\mathbf{x}=(x_1,\cdots,x_n)^\top\) and \(\mathbf{y}=(y_1,\cdots,y_m)^\top\) respectively.
Attention
Expectation: \(\mathbb{E}[\mathbf{X}]=(\mathbb{E}[X_1],\cdots,\mathbb{E}[X_n])^\top\in\mathbb{R}^n\) (similarly for \(\mathbf{Y}\)).
Note
Auto-covariance matrix: \(\mathbb{V}(\mathbf{X})=\mathrm{Cov}(\mathbf{X},\mathbf{X})=\mathbf{K}_{\mathbf{X,X}}=\mathbb{E}\left[\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)^\top\right]\).
This is also known as just variance matrix or variance-covariance matrix.
\(\mathbf{K}_{\mathbf{X,X}}\in\mathbb{R}^{n\times n}\).
The entries of this matrix are \(\mathrm{Cov}(X_i,X_j)=\sigma_{X_i,X_j}\).
We note that when \(n=1\) this reduces to the single rv case.
\(\mathbf{K}_{\mathbf{X,X}}\) is positive-semidefinite and symmetric.
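Linearity: For a constant matrix \(\mathbf{A}\) and a constant vector \(\mathbf{b}\) of appropriate dimension, \(\mathbb{E}[\mathbf{A}\mathbf{X}+\mathbf{b}]=\mathbf{A}\mathbb{E}[\mathbf{X}]+\mathbf{b}\) and \(\mathbb{V}(\mathbf{A}\mathbf{X}+\mathbf{b})=\mathbf{A}\mathbf{K}_{\mathbf{X,X}}\mathbf{A}^\top\).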
Auto-correlation matrix: \(\mathbf{R}_{\mathbf{X,X}}=\mathbb{E}[\mathbf{X}\mathbf{X}^\top]\).
It is connected with auto-covariance as \(\mathbf{K}_{\mathbf{X,X}}=\mathbf{R}_{\mathbf{X,X}}-\mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]^\top\).
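The entries of this matrix are \(\mathbb{E}[X_iX_j]\); they coincide with the correlation coefficients \(\rho(X_i,X_j)=\frac{\sigma_{X_i,X_j}}{\sigma_{X_i}\sigma_{X_j}}\) only when every \(X_i\) is standardised (zero mean and unit variance).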
Let \(\bar{\mathbf{X}}=\mathbf{X}-\mathbb{E}[\mathbf{X}]\) be the centered rv.
We note that in this case: \(\mathbf{K}_{\mathbf{X,X}}=\mathbf{K}_{\bar{\mathbf{X}},\bar{\mathbf{X}}}=\mathbf{R}_{\bar{\mathbf{X}},\bar{\mathbf{X}}}\).
Precision matrix: If it exists, \(\mathbf{K}_{\mathbf{X,X}}^{-1}\) is known as the precision matrix.
Cross-covariance matrix: \(\mathrm{Cov}(\mathbf{X},\mathbf{Y})=\mathbf{K}_{\mathbf{X,Y}}=\mathbb{E}\left[\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)\left(\mathbf{Y}-\mathbb{E}[\mathbf{Y}]\right)^\top\right]\).
\(\mathbf{K}_{\mathbf{X,Y}}\in\mathbb{R}^{n\times m}\).
The entries of this matrix are \(\mathrm{Cov}(X_i,Y_j)=\sigma_{X_i,Y_j}\).
If \(\mathbf{X}\) and \(\mathbf{Y}\) are of same dimension
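Correlation matrix: \(\mathrm{\rho}(\mathbf{X},\mathbf{Y})\) is the matrix whose entries are the correlation coefficients \(\rho(X_i,Y_j)=\frac{\sigma_{X_i,Y_j}}{\sigma_{X_i}\sigma_{Y_j}}\).
It equals \(\mathbb{E}[\mathbf{X}\mathbf{Y}^\top]\) only when every component of \(\mathbf{X}\) and \(\mathbf{Y}\) is standardised (zero mean and unit variance).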
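The matrix definitions above can be sketched with plain Python lists. The example estimates \(\mathbf{K}_{\mathbf{X,X}}\) from simulated data and checks the standard identity \(\mathbb{V}(\mathbf{A}\mathbf{X}+\mathbf{b})=\mathbf{A}\mathbf{K}_{\mathbf{X,X}}\mathbf{A}^\top\), which holds exactly for empirical moments. The data, the matrix `A`, the vector `b` and all helper names are hypothetical.

```python
import random

def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cross_cov(xs, ys):
    """K_{X,Y} = E[(X - E[X])(Y - E[Y])^T], computed with population moments."""
    mx, my, n = mean_vec(xs), mean_vec(ys), len(xs)
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

rng = random.Random(3)
xs = [[rng.gauss(0, 1), rng.gauss(2, 3), rng.random()] for _ in range(500)]
A = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]   # hypothetical 2x3 constant matrix
b = [4.0, -2.0]                           # hypothetical constant vector
ys = [[sum(A[i][k] * x[k] for k in range(3)) + b[i] for i in range(2)] for x in xs]

K_xx = cross_cov(xs, xs)                  # 3x3 auto-covariance of X
K_yy = cross_cov(ys, ys)                  # 2x2 auto-covariance of Y = AX + b
K_xy = cross_cov(xs, ys)                  # 3x2 cross-covariance
A_K_At = matmul(matmul(A, K_xx), transpose(A))

print(K_yy)
print(A_K_At)
```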
Fundamentals of Point Estimation¶
Note
Estimate: If we do not know the exact value of a rv \(Y\), or an unknown, constant, parameter \(\theta\), we can use a guess (estimate).
The guess is a rv which can be observed or calculated based on other rvs.
Estimator: The rv which takes estimates as values is known as the estimator.
Estimator for \(Y\) is usually written as \(\hat{Y}\).
Estimates are the values that this rv can take, \(\hat{Y}=\hat{y}\).
Standard error: \(\text{se}(\hat{Y})=\sqrt{\mathbb{V}_Y(\hat{Y})}\).
Estimation error: \(\tilde{Y}=\hat{Y}-Y\).
Bias of an estimator: \(\text{bias}(\hat{Y})=\mathbb{E}_Y[\tilde{Y}]\).
Mean squared error: \(\text{mse}(\hat{Y})=\mathbb{E}_Y[\tilde{Y}^2]\).
We note that \(\mathbb{V}_Y(\tilde{Y})=\mathbb{E}_Y[\tilde{Y}^2]-\left(\mathbb{E}_Y[\tilde{Y}]\right)^2=\text{mse}(\hat{Y})-\text{bias}(\hat{Y})^2\).
This can be rewritten as \(\text{mse}(\hat{Y})=\text{bias}(\hat{Y})^2+\mathbb{V}_Y(\tilde{Y})\).
If the quantity we’re estimating is an unknown constant \(\theta\) instead of being a rv (as in classical statistical estimation of an unknown parameter),
\[\text{mse}(\hat{\theta})=\text{bias}(\hat{\theta})^2+\mathbb{V}_\theta(\hat{\theta}-\theta)=\text{bias}(\hat{\theta})^2+\mathbb{V}_\theta(\hat{\theta})=\text{bias}(\hat{\theta})^2+\text{se}(\hat{\theta})^2\]
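The decomposition \(\text{mse}=\text{bias}^2+\text{se}^2\) can be illustrated by simulation. The sketch assumes normal data with true mean \(\theta=2\) and a deliberately biased estimator (0.8 times the sample mean); both choices are hypothetical and only serve to make the bias visible.

```python
import random
from statistics import fmean, pvariance

theta = 2.0            # assumed true parameter (mean of the data)
n, reps = 10, 20_000   # sample size and number of simulated datasets
rng = random.Random(4)

def shrunk_mean(sample, c=0.8):
    """A deliberately biased estimator: c times the sample mean (hypothetical choice)."""
    return c * fmean(sample)

estimates = [shrunk_mean([rng.gauss(theta, 1.0) for _ in range(n)]) for _ in range(reps)]

bias = fmean(estimates) - theta                     # E[theta_hat] - theta
variance = pvariance(estimates)                     # se(theta_hat)^2
mse = fmean((e - theta) ** 2 for e in estimates)    # E[(theta_hat - theta)^2]

print(bias, variance, mse, bias ** 2 + variance)
```

With these assumptions the theoretical values are a bias of \(-0.4\) and a variance of \(0.064\), so the simulated numbers should land close to them.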
Bayesian point estimation using conditional expectation¶
Note
We assume that knowing \(X\), we can infer something about an rv \(Y\) (or, equivalently, an unknown constant \(\theta\)).
We assume that the conditional density \(f_{Y|X}(y|x)\) is known.
We might have access to the conditional density directly.
We might have access to a prior \(f_Y(y)\) and the likelihood \(f_{X|Y}(x|y)\), and we can compute the posterior with Bayes’ theorem.
From the law of iterated expectation, we have \(\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y|X]]\).
This motivates \(\mathbb{E}[Y|X]\) as a Bayesian estimator for \(Y\).
Therefore
Estimator: \(\hat{Y}=\mathbb{E}[Y|X]\) can be thought of as an estimator of \(Y\), as their expected values are the same.
For a given value of \(X=x\), the estimation is \(\hat{y}=\mathbb{E}[Y|X=x]=r(x)\).
The function \(r(x)\) is called the regression function.
Bias: Since \(\tilde{Y}\) is expected to be 0
\[\text{bias}(\hat{Y})=\mathbb{E}[\tilde{Y}]=\mathbb{E}[\mathbb{E}[Y|X]]-\mathbb{E}[Y]=0\implies\text{mse}(\hat{Y})=\mathbb{V}(\tilde{Y})\]MMSE: It can be shown that the conditional expectation estimator minimises the MSE. It is therefore also known as the Minimum Mean Square Error (MMSE) estimator.
Orthogonality Principle: The estimation error \(\tilde{Y}\) is uncorrelated with the estimator \(\hat{Y}\).
We note that
\[\mathrm{Cov}(\hat{Y},\tilde{Y})=\mathbb{E}[\hat{Y}\tilde{Y}]-\mathbb{E}[\hat{Y}]\mathbb{E}[\tilde{Y}]=\mathbb{E}[\hat{Y}\tilde{Y}]\]Invoking law of iterated expectation
\[\mathbb{E}[\hat{Y}\tilde{Y}]=\mathbb{E}[\mathbb{E}[\hat{Y}\tilde{Y}|X]]\]Given \(X\), \(\hat{Y}\) is constant.
\[\mathbb{E}[\mathbb{E}[\hat{Y}\tilde{Y}|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[\tilde{Y}|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[(\hat{Y}-Y)|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[\hat{Y}|X]]-\mathbb{E}[\hat{Y}\cdot\mathbb{E}[Y|X]]=\mathbb{E}[\hat{Y}^2]-\mathbb{E}[\hat{Y}^2]=0\]
Therefore, we have \(\mathbb{V}(Y)=\mathbb{V}(\hat{Y})+\mathbb{V}(\tilde{Y})=\text{se}(\hat{Y})^2+\text{mse}(\hat{Y})\).
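The properties of \(\hat{Y}=\mathbb{E}[Y|X]\) can be verified exactly on a small discrete example. The joint PMF below is a hypothetical choice made only for illustration, and the helper names are not from the text.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p(x, y) on a small grid, chosen only for illustration.
joint = {(0, 0): F(1, 8), (0, 2): F(3, 8), (1, 1): F(1, 4), (1, 5): F(1, 4)}

def expect(fn):
    """E[fn(X, Y)] under the joint PMF."""
    return sum(p * fn(x, y) for (x, y), p in joint.items())

def r(x0):
    """Regression function r(x) = E[Y | X = x]."""
    px = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(p * y for (x, y), p in joint.items() if x == x0) / px

def var(fn):
    return expect(lambda x, y: fn(x, y) ** 2) - expect(fn) ** 2

y_hat = lambda x, y: r(x)        # estimator Y_hat = E[Y | X]
err = lambda x, y: r(x) - y      # estimation error Y_tilde = Y_hat - Y

bias = expect(err)
cov_hat_err = expect(lambda x, y: y_hat(x, y) * err(x, y)) - expect(y_hat) * expect(err)

print("bias:", bias)
print("Cov(Y_hat, Y_tilde):", cov_hat_err)
print("V(Y):", var(lambda x, y: y), "=", var(y_hat), "+", var(err))
```

Using exact fractions means the zero bias, zero covariance and variance decomposition hold with equality instead of up to rounding.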
Conditional variance¶
Note
We can define conditional variance as \(\mathbb{V}(X|Y)=\mathbb{E}[(X-\mathbb{E}[X|Y])^2|Y]\) such that
\[\mathbb{E}[\mathbb{V}(X|Y)]=\mathbb{E}[\mathbb{E}[(X-\mathbb{E}[X|Y])^2|Y]]=\mathbb{E}[(X-\mathbb{E}[X|Y])^2]=\mathbb{E}[\tilde{X}^2]=\mathbb{V}(\tilde{X})\]Here \(\tilde{X}=\mathbb{E}[X|Y]-X\) is the estimation error, which has zero mean.
Law of iterated variance¶
Note
We can rewrite the variance relation using this new notation
\[\mathbb{V}(X)=\mathbb{V}(\mathbb{E}[X|Y])+\mathbb{E}[\mathbb{V}(X|Y)]\]
Tip
The laws of iterated expectation and iterated variance allow us to tackle complicated cases by conditioning on a suitable rv.
See also
A coin with unknown probability of heads is tossed \(n\) times. The probability is known to be uniformly distributed in \([0,1]\). Let \(X\) be the total number of heads. Find \(\mathbb{E}[X]\) and \(\mathbb{V}(X)\).
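For this exercise, conditioning on the head probability \(P\) gives \(\mathbb{E}[X|P]=nP\) and \(\mathbb{V}(X|P)=nP(1-P)\), so \(\mathbb{E}[X]=n/2\) and \(\mathbb{V}(X)=\mathbb{V}(nP)+\mathbb{E}[nP(1-P)]=\frac{n^2}{12}+\frac{n}{6}\). The sketch below checks this against the marginal PMF using exact arithmetic; the function names are hypothetical.

```python
from fractions import Fraction as F
from math import comb, factorial

def beta_integral(a: int, b: int) -> F:
    """Exact value of int_0^1 p^a (1 - p)^b dp = a! b! / (a + b + 1)!."""
    return F(factorial(a) * factorial(b), factorial(a + b + 1))

def moments_by_conditioning(n: int):
    """E[X] = E[nP] and V(X) = V(nP) + E[nP(1 - P)] for P ~ Uniform(0, 1)."""
    e_p, e_p2 = beta_integral(1, 0), beta_integral(2, 0)   # E[P] = 1/2, E[P^2] = 1/3
    mean = n * e_p
    var = n * n * (e_p2 - e_p ** 2) + n * (e_p - e_p2)
    return mean, var

def moments_from_pmf(n: int):
    """Same moments from the marginal PMF p_X(k) = C(n, k) int p^k (1 - p)^(n - k) dp."""
    pmf = [comb(n, k) * beta_integral(k, n - k) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum(k * k * p for k, p in enumerate(pmf)) - mean ** 2
    return mean, var

print(moments_by_conditioning(10))
print(moments_from_pmf(10))
```

The marginal PMF turns out to be uniform on \(\{0,\cdots,n\}\), which is why both routes agree.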
Transforms of rv¶
Moment Generating Function¶
Note
Moment generating function (MGF) of a rv is defined as a function of another parameter \(s\)
\[M_X(s)=\mathbb{E}[e^{sX}]\]This closely relates to the Laplace Transform (see stat stackexchange post here)
We note that
\[M_X(s)=\mathbb{E}[e^{sX}]=\int\left(1+sx+\frac{s^2x^2}{2!}+\cdots\right)f_X(x)\mathop{dx}=1+s\cdot\mathbb{E}[X]+\frac{s^2}{2!}\cdot\mathbb{E}[X^2]+\cdots\]From this, we establish that \(\frac{\mathop{d}^n}{\mathop{ds}^n}\left(M_X(s)\right)|_{s=0}=\mathbb{E}[X^n]\).
Extends to the multivariate case as
\[M_{X_1,X_2,\cdots,X_n}(s_1,s_2,\cdots,s_n)=\mathbb{E}[e^{\sum_{i=1}^n s_i X_i}]\]For two independent rvs \(X\) and \(Y\), the MGF of their sum \(Z=X+Y\) is given by
\[M_{Z}(s)=\mathbb{E}[e^{sX+sY}]=\mathbb{E}[e^{sX}e^{sY}]=\mathbb{E}[e^{sX}]\mathbb{E}[e^{sY}]=M_{X}(s)\cdot M_{Y}(s)\]The above extends for multiple independent rvs.
Attention
The MGF, when it is finite in a neighbourhood of \(s=0\), completely determines the CDF and the density/mass function.
Tip
Knowing the MGF often lets us find the moments more easily than the direct approach.
Find the expectation and variance of the exponential distribution, both directly and using the MGF.
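For the exponential distribution with rate \(\lambda\), the MGF is \(M_X(s)=\frac{\lambda}{\lambda-s}\) for \(s<\lambda\), so \(M'_X(0)=\frac{1}{\lambda}\) and \(M''_X(0)=\frac{2}{\lambda^2}\), giving \(\mathbb{E}[X]=\frac{1}{\lambda}\) and \(\mathbb{V}(X)=\frac{1}{\lambda^2}\). The sketch below recovers these with finite differences at \(s=0\); the rate `lam` and the step `h` are arbitrary illustrative choices.

```python
lam = 2.0  # assumed rate parameter, for illustration

def mgf_exponential(s: float) -> float:
    """M_X(s) = lam / (lam - s), finite only for s < lam."""
    if s >= lam:
        raise ValueError("the MGF of Exponential(lam) is finite only for s < lam")
    return lam / (lam - s)

h = 1e-4
# Central differences approximate the first and second derivatives at s = 0.
m1 = (mgf_exponential(h) - mgf_exponential(-h)) / (2 * h)                           # ~ E[X]
m2 = (mgf_exponential(h) - 2 * mgf_exponential(0.0) + mgf_exponential(-h)) / h ** 2  # ~ E[X^2]

mean, variance = m1, m2 - m1 ** 2
print(mean, 1 / lam)
print(variance, 1 / lam ** 2)
```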
See also
Find the expectation, variance and the transform of the sum of independent rvs where the number of terms is also a rv.
Integral Transforms¶
Let \(p\) and \(q\) be two densities over rv \(x\in\mathcal{X}\) with finite Borel measure.
KL Divergence¶
Note
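The KL divergence of \(q\) from \(p\) is defined as \(D_{KL}(p\parallel q)=\mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right]\).
\(D_{KL}(p\parallel q)\geq 0\) (the proof follows from Jensen’s inequality since \(-\log\) is a convex function).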
\(D_{KL}(p\parallel q)=0\) if and only if \(p=q\) almost everywhere.
This is not a metric, as in general \(D_{KL}(p\parallel q)\neq D_{KL}(q\parallel p)\).
See also
Note that entropy \(H(p)\) and cross-entropy \(H(p\parallel q)\) can be defined as
\(H(p)=-\mathbb{E}_{x\sim p}[\log p(x)]\)
\(H(p\parallel q)=-\mathbb{E}_{x\sim p}[\log q(x)]\)
Therefore \(D_{KL}(p\parallel q)=H(p\parallel q)-H(p)\)
[Gibbs’ inequality] \(D_{KL}(p\parallel q)\geq 0\implies H(p\parallel q)\ge H(p)\)
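A minimal sketch of these quantities for discrete distributions (an assumption; the text works with densities, where the sums become integrals). The two distributions `p` and `q` are hypothetical.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p || q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]   # hypothetical distributions on three outcomes
q = [0.2, 0.2, 0.6]

print("D_KL(p||q):", kl_divergence(p, q))
print("D_KL(q||p):", kl_divergence(q, p))           # differs: KL is not symmetric
print("H(p||q) - H(p):", cross_entropy(p, q) - entropy(p))
```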
Attention
Say \(x\sim p\), where \(p\) is unknown, and we approximate \(p\) with some \(q^*\in\mathcal{Q}\) such that
\[q^*=\underset{q\in\mathcal{Q}}{\arg\min}\left(D_{KL}(p\parallel q)\right)\]We can disregard the inherent randomness associated with \(p\) itself (i.e. \(H(p)\)), since it does not depend on \(q\).
Minimising \(H(p\parallel q)\) over \(q\) is therefore the same as minimising \(D_{KL}(p\parallel q)\).
Finite sample case:
We use the empirical distribution \(\hat{p}\) from an iid sample \(\{x_i\}_{i=1}^N\).
By the WLLN, as \(N\to\infty\), \(H(\hat{p}\parallel q)\overset{P}\to H(p\parallel q)\).
\(H(p\parallel q)\) is then approximated by the average negative log-likelihood (NLL)
\[H(p\parallel q)\approx H(\hat{p}\parallel q)=-\mathbb{E}_{x\sim \hat{p}}[\log q(x)]=-\frac{1}{N}\sum_{i=1}^N\log q(x_i)\]
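The finite-sample approximation can be sketched as follows, assuming a discrete data-generating distribution `p` (known here only so that the true cross-entropy can be computed for comparison) and a hypothetical model `q`.

```python
import math
import random

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # data-generating distribution (unknown in practice)
q = {"a": 0.4, "b": 0.4, "c": 0.2}   # hypothetical model

rng = random.Random(5)
sample = rng.choices(list(p), weights=list(p.values()), k=50_000)

def average_nll(model, xs):
    """-(1/N) sum_i log q(x_i): cross-entropy of the empirical distribution against the model."""
    return -sum(math.log(model[x]) for x in xs) / len(xs)

cross_entropy = -sum(p[x] * math.log(q[x]) for x in p)   # H(p || q), needs the true p

print("average NLL under q:", average_nll(q, sample))
print("true cross-entropy: ", cross_entropy)
print("average NLL under p:", average_nll(p, sample))    # smaller, as Gibbs' inequality suggests
```

The average NLL uses only the sample and the model, which is why it is the quantity minimised in practice.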
For jointly distributed rvs \(x,y\sim p(x,y)\), we define
Note
Conditional entropy
\[H(X\mid Y)=-\mathbb{E}_{x,y\sim p(x,y)}[\log p(x|y)]=-\mathbb{E}_{y\sim p(y)}\left[\mathbb{E}_{x\sim p(x|y)}\left[\log p(x|y)\right]\right]\]Mutual information
\[I(X;Y)=H(X)-H(X\mid Y)\]
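A sketch of both definitions for a small discrete joint distribution (the joint PMF is hypothetical). It also checks the standard equivalent form of mutual information as the KL divergence between the joint and the product of the marginals.

```python
import math

# Hypothetical joint PMF p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y) obtained by summing out the other variable.
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

# H(X | Y) = -E_{x,y ~ p(x,y)}[log p(x | y)], with p(x | y) = p(x, y) / p(y).
h_x_given_y = -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items() if p > 0)
mutual_info = entropy(p_x) - h_x_given_y

# Equivalent form: KL divergence between the joint and the product of marginals.
mi_as_kl = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

print("H(X):", entropy(p_x))
print("H(X|Y):", h_x_given_y)
print("I(X;Y):", mutual_info, mi_as_kl)
```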
Integral Probability Metric: Wasserstein Distance¶
Integral Probability Metric: Maximum Mean Discrepancy (MMD)¶
See also