Functions of Random Variables

Density of a function of a rv

Let \(Y=g(X)\) be a function of an rv \(X\).

Note

  • If \(X\) is discrete, this is discussed in the random variable section (TODO: add hyperlink)

  • If \(X\) is continuous with a PDF \(f_X(x)\), then the process for finding \(f_Y(y)\) is as follows:

    • Compute the CDF as

      \[F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\int\limits_{\{x|g(x)\leq y\}}f_X(x) \mathop{dx}\]
    • Compute the PDF as \(f_Y(y)=F'_Y(y)\).

Tip

  • Some effort is required to compute the set \(\{x|g(x)\leq y\}\).

  • Find \(f_Y(y)\) where \(Y=X^2\).
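For \(Y=X^2\) and \(y>0\), the set \(\{x|x^2\leq y\}\) is the interval \([-\sqrt{y},\sqrt{y}]\), so \(F_Y(y)=F_X(\sqrt{y})-F_X(-\sqrt{y})\) and differentiating gives \(f_Y(y)=\frac{f_X(\sqrt{y})+f_X(-\sqrt{y})}{2\sqrt{y}}\). The sketch below evaluates this formula under the assumption (made only for illustration) that \(X\) is standard normal, in which case \(Y\) should follow the chi-squared distribution with one degree of freedom. The function names are hypothetical.

```python
import math


def standard_normal_pdf(x):
    """Density of X ~ N(0, 1); this choice of X is an assumption for illustration."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def pdf_of_square(f_x, y):
    """Density of Y = X^2, from differentiating F_X(sqrt(y)) - F_X(-sqrt(y))."""
    if y <= 0.0:
        return 0.0  # Y = X^2 has no density on y <= 0
    root = math.sqrt(y)
    return (f_x(root) + f_x(-root)) / (2.0 * root)


def chi_squared_1_pdf(y):
    """Closed-form chi-squared density with one degree of freedom, for comparison."""
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y)


for y in (0.25, 1.0, 4.0):
    # The two columns should agree up to floating-point error.
    print(y, pdf_of_square(standard_normal_pdf, y), chi_squared_1_pdf(y))
```

Passing a different `f_x` reuses the same formula for any continuous \(X\).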

Special cases

Linear functions

Let \(Y=g(X)=aX+b\).

Tip

  • If \(a=0\), then \(Y=b\) with probability 1 and it’s no longer a continuous rv.

  • If \(a\neq 0\), then we have

    \[\begin{split}F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\mathbb{P}(aX+b\leq y)=\begin{cases}\mathbb{P}\left(X\leq\frac{y-b}{a}\right) & \text{if $a>0$} \\ \mathbb{P}\left(X\geq\frac{y-b}{a}\right) & \text{if $a<0$}\end{cases}=\begin{cases}F_X(\frac{y-b}{a}) & \text{if $a>0$} \\ 1-F_X(\frac{y-b}{a}) & \text{if $a<0$}\end{cases}\end{split}\]
  • We can recover the PDF in both cases as

    \[\begin{split}f_Y(y)=\begin{cases}\frac{1}{a}f_X(\frac{y-b}{a}) & \text{if $a>0$} \\ -\frac{1}{a}f_X(\frac{y-b}{a}) & \text{if $a<0$}\end{cases}=\frac{1}{\left| a \right|}f_X(\frac{y-b}{a})\end{split}\]

Monotonic functions

Note

  • If \(y=g(x)\) is a strictly monotonic function, then it has an inverse, \(x=g^{-1}(y)\).

  • Therefore, we have

    \[\begin{split}F_Y(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\begin{cases}\mathbb{P}(X\leq g^{-1}(y)) & \text{if $g(X)$ is monotonic increasing}\\\mathbb{P}(X\geq g^{-1}(y)) & \text{if $g(X)$ is monotonic decreasing}\end{cases}=\begin{cases}F_X(g^{-1}(y)) & \text{if $g(X)$ is monotonic increasing}\\1-F_X(g^{-1}(y)) & \text{if $g(X)$ is monotonic decreasing}\end{cases}\end{split}\]
  • We can recover the PDF in both cases as

    \[f_Y(y)=f_X(g^{-1}(y))\cdot\left|\frac{\mathop{d}}{\mathop{dy}}\left[g^{-1}(y)\right]\right|\]
  • We note that the linear case is a special case of monotonic functions.
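As a quick numerical illustration of the monotonic change-of-variables rule, the sketch below multiplies \(f_X(g^{-1}(y))\) by the absolute value of the derivative of \(g^{-1}\). It assumes \(X\sim\text{Exponential}(1)\) purely for illustration, and all function names are hypothetical. For the decreasing map \(Y=e^{-X}\) the result should be the uniform density on \((0,1]\); the linear map with \(a<0\) should reproduce \(\frac{1}{|a|}f_X(\frac{y-b}{a})\).

```python
import math


def pdf_monotonic(f_x, g_inv, g_inv_prime, y):
    """Density of Y = g(X) for strictly monotonic g: f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|."""
    return f_x(g_inv(y)) * abs(g_inv_prime(y))


def exp1_pdf(x):
    """Density of X ~ Exponential(1); an assumption for illustration."""
    return math.exp(-x) if x >= 0.0 else 0.0


def f_y_decreasing(y):
    """Y = exp(-X): g^{-1}(y) = -log(y), defined for y > 0."""
    return pdf_monotonic(exp1_pdf, lambda t: -math.log(t), lambda t: -1.0 / t, y)


a, b = -2.0, 3.0


def f_y_linear(y):
    """Y = aX + b with a < 0: g^{-1}(y) = (y - b) / a."""
    return pdf_monotonic(exp1_pdf, lambda t: (t - b) / a, lambda t: 1.0 / a, y)


print(f_y_decreasing(0.3))                                # close to 1 on (0, 1]
print(f_y_linear(1.0), exp1_pdf((1.0 - b) / a) / abs(a))  # the two values agree
```

The caller supplies both the inverse and its derivative, which keeps the sketch free of numerical differentiation.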

Density of a function of multiple jointly distributed rvs

Let \(Z=g(X,Y)\) be a function of 2 jointly distributed rvs, \(X\) and \(Y\). In this case, we follow the same process as before.

Tip

  • Compute the CDF as

    \[F_Z(z)=\mathbb{P}(Z\leq z)=\mathbb{P}(g(X,Y)\leq z)=\iint\limits_{\{(x,y)|g(x,y)\leq z\}}f_{X,Y}(x,y)\mathop{dx}\mathop{dy}\]
  • Compute the PDF as \(f_Z(z)=F'_Z(z)\).

  • Extends naturally for more than 2 rvs.

See also

  • Find the PDF of \(Z=X/Y\), where \(X\) and \(Y\) are independent and uniformly distributed in \([0,1]\).

  • Two people join a call but they are late by an amount, independent of the other, that follows an exponential distribution with parameter \(\lambda\). Find the PDF of the difference in their joining time.
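For the ratio exercise, the region \(\{(x,y)|x/y\leq z\}\) inside the unit square has area \(z/2\) when \(0<z\leq 1\) and \(1-\frac{1}{2z}\) when \(z>1\), which gives \(f_Z(z)=\frac{1}{2}\) and \(f_Z(z)=\frac{1}{2z^2}\) respectively. The sketch below compares that CDF with a seeded Monte Carlo estimate. The sample size, the seed and the function names are arbitrary choices made for illustration.

```python
import random


def ratio_cdf(z):
    """CDF of Z = X/Y for independent X, Y ~ Uniform[0, 1] (area of {x <= z*y})."""
    if z <= 0.0:
        return 0.0
    if z <= 1.0:
        return z / 2.0
    return 1.0 - 1.0 / (2.0 * z)


def ratio_pdf(z):
    """Derivative of ratio_cdf."""
    if z <= 0.0:
        return 0.0
    if z <= 1.0:
        return 0.5
    return 1.0 / (2.0 * z * z)


def empirical_cdf(z, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(X/Y <= z), written as X <= z*Y to avoid dividing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x <= z * y:
            hits += 1
    return hits / n_samples


for z in (0.5, 2.0):
    print(z, ratio_cdf(z), empirical_cdf(z))
```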

Special cases

Sum of independent rvs: Convolution

We want the PDF (or PMF) of the sum of two independent rvs, \(X\) and \(Y\), \(Z=X+Y\).

Discrete case

Tip

  • We note that

    \[p_{Z|X}(z|x)=\mathbb{P}(Z=z|X=x)=\mathbb{P}(X+Y=z|X=x)=\mathbb{P}(x+Y=z)=\mathbb{P}(Y=z-x)=p_{Y}(z-x)\]
  • Therefore, the joint mass between \(X\) and \(Z\) factorises as

    \[p_{X,Z}(x,z)=p_X(x)p_{Z|X}(z|x)=p_X(x)p_{Y}(z-x)\]
  • Marginalising, we obtain

    \[p_Z(z)=\sum_x p_{X,Z}(x,z)=\sum_x p_X(x)p_{Y}(z-x)=(p_X \ast p_Y)[z]\]
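
The discrete convolution can be sketched directly from the sum above. The example assumes two fair six-sided dice (an illustrative choice, not part of the notes) and uses exact fractions so the result can be read off without rounding. `convolve_pmfs` is a hypothetical helper name.

```python
from fractions import Fraction


def convolve_pmfs(p_x, p_y):
    """PMF of Z = X + Y for independent X and Y, given as {value: probability} dicts."""
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            # The pair (x, y) contributes p_X(x) * p_Y(z - x) to z = x + y.
            p_z[x + y] = p_z.get(x + y, 0) + px * py
    return p_z


die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])             # 1/6
print(sum(two_dice.values()))  # 1
```
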
Continuous case

Tip

  • We note that

    \[F_{Z|X}(z|x)=\mathbb{P}(Z\leq z|X=x)=\mathbb{P}(X+Y\leq z|X=x)=\mathbb{P}(x+Y\leq z)=\mathbb{P}(Y\leq z-x)=F_{Y}(z-x)\]
  • Differentiating both sides, \(f_{Z|X}(z|x)=f_{Y}(z-x)\).

  • Therefore, the joint density between \(X\) and \(Z\) factorises as

    \[f_{X,Z}(x,z)=f_X(x)f_{Z|X}(z|x)=f_X(x)f_{Y}(z-x)\]
  • Marginalising, we obtain

    \[f_Z(z)=\int\limits_{-\infty}^\infty f_{X,Z}(x,z)\mathop{dx}=\int\limits_{-\infty}^\infty f_X(x)f_{Y}(z-x)\mathop{dx}=(f_X \ast f_Y)[z]\]

See also

  • Find the PDF of the sum of two independent normals.
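A numerical check of the convolution integral for two independent normals is sketched below. It approximates \(\int f_X(x)f_Y(z-x)\mathop{dx}\) with a midpoint rule and compares the result with the normal density of mean \(\mu_X+\mu_Y\) and variance \(\sigma_X^2+\sigma_Y^2\). The parameter values, the integration limits and the step count are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))


def convolve_at(f_x, f_y, z, lo=-30.0, hi=30.0, steps=6000):
    """Midpoint-rule approximation of the convolution of f_x and f_y at z."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f_x(x) * f_y(z - x)
    return total * h


mu_x, sigma_x = 1.0, 1.0
mu_y, sigma_y = -2.0, 2.0


def f_x(x):
    return normal_pdf(x, mu_x, sigma_x)


def f_y(y):
    return normal_pdf(y, mu_y, sigma_y)


z = 0.5
numeric = convolve_at(f_x, f_y, z)
closed_form = normal_pdf(z, mu_x + mu_y, math.sqrt(sigma_x**2 + sigma_y**2))
print(numeric, closed_form)  # the two values should agree closely
```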

Covariance and correlation

Scalar valued rvs

Covariance is defined between two scalar valued rvs as \(\sigma_{X,Y}=\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\).

Note

  • \(\mathrm{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]\).

    • The proof follows from expanding the expression in the definition.

  • \(\mathrm{Cov}(X,X)=\mathbb{V}(X)\).

  • \(\mathrm{Cov}(X,aY+b)=a\cdot\mathrm{Cov}(X,Y)\).

  • \(\mathrm{Cov}(X,Y+Z)=\mathrm{Cov}(X,Y)+\mathrm{Cov}(X,Z)\).

  • \(\mathbb{V}(X+Y)=\mathbb{V}(X)+\mathbb{V}(Y)+2\cdot\mathrm{Cov}(X,Y)\).

  • In general

    \[\mathbb{V}\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n \mathbb{V}(X_i)+\sum_{i=1}^n\sum_{j=1, j\neq i}^n\mathrm{Cov}(X_i,X_j)\]
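
These identities hold exactly for the empirical distribution of any paired sample, which makes them easy to check. The sketch below verifies \(\mathbb{V}(X+Y)=\mathbb{V}(X)+\mathbb{V}(Y)+2\cdot\mathrm{Cov}(X,Y)\) and \(\mathrm{Cov}(X,aY+b)=a\cdot\mathrm{Cov}(X,Y)\) on made-up data. The numbers and helper names are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """E[(X - E[X])(Y - E[Y])] under the empirical distribution (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Made-up paired observations, used only to exercise the identities.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.0, 1.0, 5.0, 2.0, 8.0]
sums = [x + y for x, y in zip(xs, ys)]

var_of_sum = cov(sums, sums)                                # V(X + Y)
expanded = cov(xs, xs) + cov(ys, ys) + 2.0 * cov(xs, ys)    # right-hand side
shifted = cov(xs, [3.0 * y + 10.0 for y in ys])             # Cov(X, 3Y + 10)

print(var_of_sum, expanded)
print(shifted, 3.0 * cov(xs, ys))
```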

Note

  • Correlation coefficient is defined as the normalised version of covariance

    \[\rho(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathbb{V}(X)\mathbb{V}(Y)}}.\]
  • We have \(|\rho(X,Y)|\leq 1\).

    • Let \(\tilde{X}=X-\mathbb{E}[X]\) and \(\tilde{Y}=Y-\mathbb{E}[Y]\) be the centered rvs.

    • The correlation coefficient then becomes

      \[\rho(X,Y)=\frac{\mathbb{E}[\tilde{X}\tilde{Y}]}{\sqrt{\mathbb{E}[\tilde{X}^2]\cdot \mathbb{E}[\tilde{Y}^2]}}\]
    • The proof follows from Cauchy-Schwarz inequality.

  • The equality holds only when \(\tilde{X}=c\cdot \tilde{Y}\) for some \(c\).
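The bound \(|\rho(X,Y)|\leq 1\) and its equality case can be illustrated on a small sample. The sketch below computes the correlation coefficient from centered values, following the expression above, and checks that an exact linear relation with negative slope gives \(\rho=-1\). The data and the function name are assumptions made for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient computed from the centered values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    numerator = sum(u * v for u, v in zip(cx, cy))
    denominator = math.sqrt(sum(u * u for u in cx) * sum(v * v for v in cy))
    return numerator / denominator


xs = [1.0, 2.0, 4.0, 7.0, 11.0]
noisy = [3.0, 1.0, 5.0, 2.0, 8.0]       # no exact linear relation with xs
linear = [-2.0 * x + 5.0 for x in xs]   # exact linear relation, negative slope

print(correlation(xs, noisy))   # strictly between -1 and 1
print(correlation(xs, linear))  # -1 up to floating-point error
```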

See also

  • We can solve the hat problem using covariance.

Vector valued rvs

Let us consider vector-valued rvs \(\mathbf{X}\in\mathbb{R}^n\) and \(\mathbf{Y}\in\mathbb{R}^m\), which take values \(\mathbf{X}=\mathbf{x}\), i.e. \((X_1=x_1,\cdots,X_n=x_n)^\top\), and \(\mathbf{Y}=\mathbf{y}\), i.e. \((Y_1=y_1,\cdots,Y_m=y_m)^\top\).

Attention

  • Expectation: \(\mathbb{E}[\mathbf{X}]=(\mathbb{E}[X_1],\cdots,\mathbb{E}[X_n])^\top\in\mathbb{R}^n\) (similarly for \(\mathbf{Y}\)).

Note

  • Auto-covariance matrix: \(\mathbb{V}(\mathbf{X})=\mathrm{Cov}(\mathbf{X},\mathbf{X})=\mathbf{K}_{\mathbf{X,X}}=\mathbb{E}\left[\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)^\top\right]\).

    • This is also known as just variance matrix or variance-covariance matrix.

    • \(\mathbf{K}_{\mathbf{X,X}}\in\mathbb{R}^{n\times n}\).

    • The entries of this matrix are \(\mathrm{Cov}(X_i,X_j)=\sigma_{X_i,X_j}\).

    • We note that when \(n=1\) this reduces to the single rv case.

    • \(\mathbf{K}_{\mathbf{X,X}}\) is positive-semidefinite and symmetric.

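    • Linearity: For a constant matrix \(\mathbf{A}\) and a constant vector \(\mathbf{b}\) of appropriate dimension, \(\mathbb{V}(\mathbf{A}\mathbf{X}+\mathbf{b})=\mathbf{A}\mathbf{K}_{\mathbf{X,X}}\mathbf{A}^\top\).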

  • Auto-correlation matrix: \(\mathbf{R}_{\mathbf{X,X}}=\mathbb{E}[\mathbf{X}\mathbf{X}^\top]\).

    • It is connected with auto-covariance as \(\mathbf{K}_{\mathbf{X,X}}=\mathbf{R}_{\mathbf{X,X}}-\mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]^\top\).

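    • The entries of this matrix are \(\mathbb{E}[X_iX_j]\). They coincide with the correlation coefficients \(\rho(X_i,X_j)=\frac{\sigma_{X_i,X_j}}{\sigma_{X_i}\sigma_{X_j}}\) only when each component is centered and scaled to unit variance.

    • Let \(\bar{\mathbf{X}}=\mathbf{X}-\mathbb{E}[\mathbf{X}]\) be the centered rv.

      • We note that in this case: \(\mathbf{K}_{\mathbf{X,X}}=\mathbf{R}_{\bar{\mathbf{X}},\bar{\mathbf{X}}}\).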

    • Precision matrix: If it exists, \(\mathbf{K}_{\mathbf{X,X}}^{-1}\) is known as precision matrix.

  • Cross-covariance matrix: \(\mathrm{Cov}(\mathbf{X},\mathbf{Y})=\mathbf{K}_{\mathbf{X,Y}}=\mathbb{E}\left[\left(\mathbf{X}-\mathbb{E}[\mathbf{X}]\right)\left(\mathbf{Y}-\mathbb{E}[\mathbf{Y}]\right)^\top\right]\).

    • \(\mathbf{K}_{\mathbf{X,Y}}\in\mathbb{R}^{n\times m}\).

    • The entries of this matrix are \(\mathrm{Cov}(X_i,Y_j)=\sigma_{X_i,Y_j}\).

    • If \(\mathbf{X}\) and \(\mathbf{Y}\) are of same dimension

  • Correlation matrix: \(\mathrm{\rho}(\mathbf{X},\mathbf{Y})=\mathbb{E}[\mathbf{X}\mathbf{Y}^\top]\), for \(\mathbf{X}\) and \(\mathbf{Y}\) standardised componentwise (zero mean, unit variance).

    • The entries of this matrix are \(\rho(X_i,Y_j)=\frac{\sigma_{X_i,Y_j}}{\sigma_{X_i}\sigma_{Y_j}}\).

Fundamentals of Point Estimation

Note

  • Estimate: If we do not know the exact value of a rv \(Y\), or an unknown, constant, parameter \(\theta\), we can use a guess (estimate).

    • The guess is a rv which can be observed or calculated based on other rvs.

  • Estimator: The rv which takes estimates as values is known as the estimator.

    • Estimator for \(Y\) is usually written as \(\hat{Y}\).

    • Estimates are the values that this rv can take, \(\hat{Y}=\hat{y}\).

    • Standard error: \(\text{se}(\hat{Y})=\sqrt{\mathbb{V}_Y(\hat{Y})}\).

  • Estimation error: \(\tilde{Y}=\hat{Y}-Y\).

    • Bias of an estimator: \(\text{bias}(\hat{Y})=\mathbb{E}_Y[\tilde{Y}]\).

    • Mean squared error: \(\text{mse}(\hat{Y})=\mathbb{E}_Y[\tilde{Y}^2]\).

      • We note that \(\mathbb{V}_Y(\tilde{Y})=\mathbb{E}_Y[\tilde{Y}^2]-\left(\mathbb{E}_Y[\tilde{Y}]\right)^2=\text{mse}(\hat{Y})-\text{bias}(\hat{Y})^2\).

      • This can be rewritten as \(\text{mse}(\hat{Y})=\text{bias}(\hat{Y})^2+\mathbb{V}_Y(\tilde{Y})\).

      • If the quantity we’re estimating is an unknown constant \(\theta\) instead of being a rv (as in classical statistical estimation of an unknown parameter),

        \[\text{mse}(\hat{\theta})=\text{bias}(\hat{\theta})^2+\mathbb{V}_\theta(\hat{\theta}-\theta)=\text{bias}(\hat{\theta})^2+\mathbb{V}_\theta(\hat{\theta})=\text{bias}(\hat{\theta})^2+\text{se}(\hat{\theta})^2\]
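
The decomposition \(\text{mse}(\hat{\theta})=\text{bias}(\hat{\theta})^2+\text{se}(\hat{\theta})^2\) can be checked exactly when the sampling distribution is finite. The sketch below assumes \(X\sim\text{Binomial}(n,\theta)\) and compares the sample proportion \(X/n\) with a hypothetical shrunk estimator \((X+1)/(n+2)\). Both estimators and the parameter values are illustrative choices, not taken from the notes.

```python
import math


def binomial_pmf(k, n, theta):
    return math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def estimator_summary(estimator, n, theta):
    """Exact bias, variance and MSE of estimator(X, n) when X ~ Binomial(n, theta)."""
    pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]
    values = [estimator(k, n) for k in range(n + 1)]
    expected = sum(p * v for p, v in zip(pmf, values))
    bias = expected - theta
    variance = sum(p * (v - expected) ** 2 for p, v in zip(pmf, values))
    mse = sum(p * (v - theta) ** 2 for p, v in zip(pmf, values))
    return bias, variance, mse


def sample_proportion(k, n):
    return k / n


def shrunk(k, n):
    """Hypothetical biased alternative that pulls the estimate towards 1/2."""
    return (k + 1) / (n + 2)


for name, est in (("k/n", sample_proportion), ("(k+1)/(n+2)", shrunk)):
    bias, variance, mse = estimator_summary(est, n=10, theta=0.3)
    # The last two printed numbers should agree: mse = bias^2 + variance.
    print(name, bias, variance, mse, bias**2 + variance)
```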

Bayesian point estimation using conditional expectation

Note

  • We assume that knowing \(X\), we can infer about an rv \(Y\) (or, equivalently, an unknown constant \(\theta\)).

    • We assume that conditional density \(f_{Y|X}(y|x)\) is known.

      • We might have access to the conditional density directly.

      • We might have access to a prior \(f_Y(y)\) and the likelihood \(f_{X|Y}(x|y)\) and we can compute the posterior with Bayes theorem.

  • From law of iterated expectation, we have \(\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y|X]]\).

    • This motivates \(\mathbb{E}[Y|X]\) as a Bayesian estimator for \(Y\).

  • Therefore

    • Estimator: \(\hat{Y}=\mathbb{E}[Y|X]\) can be thought of as an estimator of \(Y\) as their expected values are the same.

      • For a given value of \(X=x\), the estimation is \(\hat{y}=\mathbb{E}[Y|X=x]=r(x)\).

      • The function \(r(x)\) is called the regression function.

    • Bias: Since \(\tilde{Y}\) is expected to be 0

      \[\text{bias}(\hat{Y})=\mathbb{E}[\tilde{Y}]=\mathbb{E}[\mathbb{E}[Y|X]]-\mathbb{E}[Y]=0\implies\text{mse}(\hat{Y})=\mathbb{V}(\tilde{Y})\]
    • MMSE: It can be shown that the conditional expectation estimator minimises the MSE. This is also known as a Minimum Mean Square Error Estimator (MMSE).

    • Orthogonality Principle: The estimation error \(\tilde{Y}\) is uncorrelated with the estimator \(\hat{Y}\).

      • We note that

        \[\mathrm{Cov}(\hat{Y},\tilde{Y})=\mathbb{E}[\hat{Y}\tilde{Y}]-\mathbb{E}[\hat{Y}]\mathbb{E}[\tilde{Y}]=\mathbb{E}[\hat{Y}\tilde{Y}]\]
      • Invoking law of iterated expectation

        \[\mathbb{E}[\hat{Y}\tilde{Y}]=\mathbb{E}[\mathbb{E}[\hat{Y}\tilde{Y}|X]]\]
      • Given \(X\), \(\hat{Y}\) is constant.

        \[\mathbb{E}[\mathbb{E}[\hat{Y}\tilde{Y}|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[\tilde{Y}|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[(\hat{Y}-Y)|X]]=\mathbb{E}[\hat{Y}\cdot\mathbb{E}[\hat{Y}|X]]-\mathbb{E}[\hat{Y}\cdot\mathbb{E}[Y|X]]=\mathbb{E}[\hat{Y}^2]-\mathbb{E}[\hat{Y}^2]=0\]
    • Therefore, we have \(\mathbb{V}(Y)=\mathbb{V}(\hat{Y})+\mathbb{V}(\tilde{Y})=\text{se}(\hat{Y})^2+\text{mse}(\hat{Y})\).
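
The properties of the conditional-expectation estimator can be verified exactly on a small discrete example. The sketch below uses a hypothetical joint PMF (chosen only for illustration) and exact fractions to check that the bias is zero, that \(\mathrm{Cov}(\hat{Y},\tilde{Y})=0\), and that \(\mathbb{V}(Y)=\mathbb{V}(\hat{Y})+\mathbb{V}(\tilde{Y})\).

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y), chosen only for illustration.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8), (0, 2): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}


def expect(fn):
    """E[fn(X, Y)] under the joint PMF."""
    return sum(p * fn(x, y) for (x, y), p in joint.items())


def regression(x0):
    """r(x0) = E[Y | X = x0]."""
    p_x0 = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(p * y for (x, y), p in joint.items() if x == x0) / p_x0


def y_true(x, y):
    return y


def y_hat(x, y):
    return regression(x)       # estimator E[Y | X]


def err(x, y):
    return regression(x) - y   # estimation error, estimator minus Y


def var(fn):
    return expect(lambda x, y: fn(x, y) ** 2) - expect(fn) ** 2


bias = expect(err)
cov_hat_err = expect(lambda x, y: y_hat(x, y) * err(x, y)) - expect(y_hat) * expect(err)
print(bias, cov_hat_err)                     # 0 0
print(var(y_true), var(y_hat) + var(err))    # 3/4 3/4
```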

Conditional variance

Note

We can define conditional variance as \(\mathbb{V}(X|Y)=\mathbb{E}[(X-\mathbb{E}[X|Y])^2|Y]\) such that, writing \(\tilde{X}=\mathbb{E}[X|Y]-X\) for the estimation error (which has zero mean),

\[\mathbb{E}[\mathbb{V}(X|Y)]=\mathbb{E}[\mathbb{E}[(X-\mathbb{E}[X|Y])^2|Y]]=\mathbb{E}[(X-\mathbb{E}[X|Y])^2]=\mathbb{E}[\tilde{X}^2]=\mathbb{V}(\tilde{X})\]

Law of iterated variance

Note

We can rewrite the variance relation using this new notation

\[\mathbb{V}(X)=\mathbb{V}(\mathbb{E}[X|Y])+\mathbb{E}[\mathbb{V}(X|Y)]\]

Tip

The laws of iterated expectation and iterated variance allow us to tackle complicated cases by conditioning.

See also

  • A coin with unknown probability of heads is tossed \(n\) times. The probability is known to be uniformly distributed in \([0,1]\). Let \(X\) be the total number of heads. Find \(\mathbb{E}[X]\) and \(\mathbb{V}(X)\).
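
For the coin exercise, conditioning on the head probability \(P\) gives \(\mathbb{E}[X|P]=nP\) and \(\mathbb{V}(X|P)=nP(1-P)\), so the two laws yield \(\mathbb{E}[X]=n/2\) and \(\mathbb{V}(X)=n/6+n^2/12\). The sketch below cross-checks this against the exact marginal PMF of \(X\), computed from the Beta integral. The helper names are hypothetical and \(n=10\) is an arbitrary choice.

```python
from fractions import Fraction
from math import comb, factorial


def marginal_pmf(n):
    """P(X = k) = C(n, k) * integral over [0, 1] of p^k (1 - p)^(n - k) dp."""
    # The integral is the Beta function B(k + 1, n - k + 1) = k! (n - k)! / (n + 1)!.
    return [
        Fraction(comb(n, k) * factorial(k) * factorial(n - k), factorial(n + 1))
        for k in range(n + 1)
    ]


def moments_direct(n):
    """Mean and variance of X computed from its marginal PMF."""
    pmf = marginal_pmf(n)
    mean = sum(k * p for k, p in enumerate(pmf))
    second = sum(k * k * p for k, p in enumerate(pmf))
    return mean, second - mean**2


def moments_by_conditioning(n):
    """Mean and variance of X from the iterated expectation and variance laws."""
    # For P ~ Uniform[0, 1]: E[P] = 1/2, E[P^2] = 1/3, V(P) = 1/12.
    e_p, e_p2, v_p = Fraction(1, 2), Fraction(1, 3), Fraction(1, 12)
    mean = n * e_p                          # E[E[X | P]] = E[nP]
    expected_cond_var = n * (e_p - e_p2)    # E[V(X | P)] = E[nP(1 - P)]
    var_cond_mean = n * n * v_p             # V(E[X | P]) = V(nP)
    return mean, expected_cond_var + var_cond_mean


print(moments_direct(10), moments_by_conditioning(10))
```

Both routes return the same mean and variance; the direct route also shows that \(X\) is uniform on \(\{0,\cdots,n\}\).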

Transforms of rv

Moment Generating Function

Note

  • Moment generating function (MGF) of a rv is defined as a function of another parameter \(s\)

    \[M_X(s)=\mathbb{E}[e^{sX}]\]
  • This closely relates to the Laplace Transform (see stat stackexchange post here)

  • We note that

    \[M_X(s)=\mathbb{E}[e^{sX}]=\int\left(1+sx+\frac{s^2x^2}{2!}+\cdots\right)f_X(x)\mathop{dx}=1+s\cdot\mathbb{E}[X]+\frac{s^2}{2!}\cdot\mathbb{E}[X^2]+\cdots\]
    • From this, we establish that \(\frac{\mathop{d}^n}{\mathop{ds}^n}\left(M_X(s)\right)|_{s=0}=\mathbb{E}[X^n]\).

  • Extends to the multivariate case as

    \[M_{X_1,X_2,\cdots,X_n}(s_1,s_2,\cdots,s_n)=\mathbb{E}[e^{\sum_{i=1}^n s_i X_i}]\]
  • For two independent rvs \(X\) and \(Y\), the MGF of their sum \(Z=X+Y\) is given by

    \[M_{Z}(s)=\mathbb{E}[e^{sX+sY}]=\mathbb{E}[e^{sX}e^{sY}]=\mathbb{E}[e^{sX}]\mathbb{E}[e^{sY}]=M_{X}(s)\cdot M_{Y}(s)\]
  • The above extends for multiple independent rvs.

Attention

The MGF, when it is finite in a neighbourhood of \(s=0\), completely determines the CDF and the density/mass function.

Tip

  • Knowing the MGF often lets us find the moments more easily than the direct approach.

  • Find the expectation and variance of the exponential distribution both directly and using the MGF.
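
For the exponential distribution with rate \(\lambda\), the MGF is \(M_X(s)=\frac{\lambda}{\lambda-s}\) for \(s<\lambda\), so its derivatives at \(s=0\) give \(\mathbb{E}[X]=1/\lambda\) and \(\mathbb{E}[X^2]=2/\lambda^2\). The sketch below recovers these moments with central finite differences instead of symbolic differentiation. The step size `h` and the value of `lam` are assumptions chosen for illustration.

```python
def exponential_mgf(s, lam):
    """MGF of X ~ Exponential(lam): lam / (lam - s), finite only for s < lam."""
    if s >= lam:
        raise ValueError("the MGF is finite only for s < lam")
    return lam / (lam - s)


def derivative_at_zero(mgf, order, h=1e-3):
    """Central finite-difference estimate of the first or second derivative at s = 0."""
    if order == 1:
        return (mgf(h) - mgf(-h)) / (2.0 * h)
    if order == 2:
        return (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    raise ValueError("only orders 1 and 2 are implemented in this sketch")


lam = 2.0


def mgf(s):
    return exponential_mgf(s, lam)


mean = derivative_at_zero(mgf, 1)            # approximates E[X] = 1 / lam
second_moment = derivative_at_zero(mgf, 2)   # approximates E[X^2] = 2 / lam^2
variance = second_moment - mean**2           # approximates 1 / lam^2
print(mean, variance)
```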

See also

Find the expectation, variance and the transform of the sum of independent rvs where the number of terms is also a rv.

Integral Transforms

Let \(p\) and \(q\) be two densities over rv \(x\in\mathcal{X}\) with finite Borel measure.

KL Divergence

\[D_{KL}(p\parallel q)=\mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right]\]

Note

  • \(D_{KL}(p\parallel q)\geq 0\) (proof follows from Jensen’s inequality since \(-\log\) is a convex function).

  • \(D_{KL}(p\parallel q)=0\) if and only if \(p=q\) almost everywhere.

  • This is not a metric as, in general, \(D_{KL}(p\parallel q)\neq D_{KL}(q\parallel p)\).

See also

  • Note that entropy \(H(p)\) and cross-entropy \(H(p\parallel q)\) can be defined as

    • \(H(p)=-\mathbb{E}_{x\sim p}[\log p(x)]\)

    • \(H(p\parallel q)=-\mathbb{E}_{x\sim p}[\log q(x)]\)

  • Therefore \(D_{KL}(p\parallel q)=H(p\parallel q)-H(p)\)

  • [Gibbs' inequality] \(D_{KL}(p\parallel q)\geq 0\implies H(p\parallel q)\ge H(p)\)
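
The relations among entropy, cross-entropy and KL divergence can be checked numerically. The sketch below uses the discrete analogue of the definitions above (sums in place of integrals), two made-up PMFs on three outcomes, and natural logarithms. It assumes \(q(x)>0\) wherever \(p(x)>0\).

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """H(p || q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up PMFs on three outcomes, for illustration only.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))  # equal
print(kl_divergence(p, q), kl_divergence(q, p))               # not symmetric
print(kl_divergence(p, p))                                    # 0.0
```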

Attention

  • Say \(x\sim p\) but unknown, and we approximate \(p\) with some \(q^*\in\mathcal{Q}\) such that

    \[q^*=\underset{q\in\mathcal{Q}}{\arg\min}\left(D_{KL}(p\parallel q)\right)\]
  • We disregard the inherent randomness associated with \(p\) itself (i.e. \(H(p)\)).

  • Minimising \(H(p\parallel q)\) is the same as minimising \(D_{KL}(p\parallel q)\).

  • Finite sample case:

    • We use the empirical distribution \(\hat{p}\) from an iid sample \(\{x_i\}_{i=1}^N\).

    • Using WLLN, as \(N\to\infty\), \(H(\hat{p}\parallel q)\overset{P}{\to} H(p\parallel q)\).

    • \(H(p\parallel q)\) is then approximated by the average negative log-likelihood (NLL)

      \[H(p\parallel q)\approx H(\hat{p}\parallel q)=-\mathbb{E}_{x\sim \hat{p}}[\log q(x)]=-\frac{1}{N}\sum_{i=1}^N\log q(x_i)\]
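
The finite-sample approximation can be sketched with a seeded simulation: draw an iid sample from a known discrete \(p\), compute the average negative log-likelihood under a candidate \(q\), and compare it with the exact cross-entropy. The distributions, the sample size and the seed are assumptions made for illustration.

```python
import math
import random


def cross_entropy(p, q):
    """Exact cross-entropy -sum_x p(x) log q(x) for discrete p and q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def average_nll(sample, q):
    """-(1/N) sum_i log q(x_i): cross-entropy with p replaced by the empirical distribution."""
    return -sum(math.log(q[x]) for x in sample) / len(sample)


p = [0.5, 0.3, 0.2]   # data distribution, treated as unknown in practice
q = [0.4, 0.4, 0.2]   # candidate model

rng = random.Random(0)
sample = rng.choices(range(len(p)), weights=p, k=100_000)

print(average_nll(sample, q), cross_entropy(p, q))  # close for large N
```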

For jointly distributed rvs \(x,y\sim p(x,y)\), we define

Note

  • Conditional entropy

    \[H(X|Y)=-\mathbb{E}_{x,y\sim p(x,y)}[\log p(x|y)]=-\mathbb{E}_{y\sim p(y)}\left[\mathbb{E}_{x\sim p(x|y)}\left[\log p(x|y)\right]\right]\]
  • Mutual information

    \[I(X;Y)=H(X)-H(X|Y)\]
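
Conditional entropy and mutual information can be computed directly from a discrete joint PMF. The sketch below uses a hypothetical \(2\times 2\) joint distribution (chosen only for illustration) and natural logarithms, and also evaluates the equivalent form \(H(X)+H(Y)-H(X,Y)\) as a consistency check.

```python
import math

# Hypothetical joint PMF p(x, y), chosen only for illustration.
joint = {
    ("a", 0): 0.30, ("a", 1): 0.10,
    ("b", 0): 0.15, ("b", 1): 0.45,
}


def marginal(joint_pmf, axis):
    """Marginal PMF over the given coordinate (0 for X, 1 for Y)."""
    out = {}
    for key, p in joint_pmf.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(pmf):
    """Entropy of a PMF stored as a {outcome: probability} dict."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0.0)


def conditional_entropy_x_given_y(joint_pmf):
    """H(X | Y) = -sum_{x,y} p(x, y) log p(x | y), with p(x | y) = p(x, y) / p(y)."""
    p_y = marginal(joint_pmf, 1)
    return -sum(
        p * math.log(p / p_y[y]) for (_, y), p in joint_pmf.items() if p > 0.0
    )


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
h_x = entropy(p_x)
h_x_given_y = conditional_entropy_x_given_y(joint)
mutual_information = h_x - h_x_given_y

print(h_x, h_x_given_y, mutual_information)
print(entropy(p_x) + entropy(p_y) - entropy(joint))  # same mutual information
```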

Integral Probability Metric: Wasserstein Distance

Integral Probability Metric: Maximum Mean Discrepancy (MMD)