Random Variable¶

Random variables (rvs) are real-valued functions of the outcome of an experiment.

Note

A function of a rv is another rv.
We can associate certain summary measures (such as mean and variance) with each rv, subject to certain conditions on their existence.
We can condition an rv on an event or another rv.
We can define the notion of independence of an rv w.r.t. an event or another rv.

Discrete Random Variable¶

Discrete = values are from a finite/countably infinite subset of $\mathbb{R}$.

Probability mass function¶

Note

For a discrete rv, $X$:

We can define a probability mass function (PMF), $p_X(x)$, associated with $X$, as follows: For each value $x$ of $X$
1. Collect all possible outcomes that give rise to the event $\{X=x\}$.
2. Add their probabilities to obtain the mass $p_X(x)=\mathbb{P}(\{X=x\})$.
A function $g(X)$ of $X$ is another rv, $Y$, whose PMF can be obtained as follows: For each value $y$ of $Y$
1. Collect all values $x$ of $X$ for which $g(x)=y$, i.e. the set $\{x | g(x)=y\}$.
2. Utilize axiom 3 to obtain $p_Y(y)=\sum_{\{x | g(x)=y\}} p_X(x)$
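The two-step recipe above can be sketched in a few lines of Python. The PMF is stored as a dictionary, and the helper name `pmf_of_function` and the example distribution are illustrative assumptions, not part of the notes.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(p_x, g):
    """Return the PMF of Y = g(X), given the PMF of X as {x: p_X(x)}."""
    p_y = defaultdict(Fraction)  # missing keys start at mass 0
    for x, p in p_x.items():
        # Every x with g(x) = y contributes its mass to p_Y(y).
        p_y[g(x)] += p
    return dict(p_y)


# Assumed example: X uniform on {-2, -1, 0, 1, 2} and Y = X^2.
p_x = {x: Fraction(1, 5) for x in range(-2, 3)}
p_y = pmf_of_function(p_x, lambda x: x * x)
print(p_y)
```

Here the values $x=-2$ and $x=2$ both map to $y=4$, so their masses are added.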

Expectation and Variance¶

Note

We can define Expectation of $X$ as $\mathbb{E}[X]=\sum_x x p_X(x)$ (assuming that the sum exists).
Elementary properties of expectation:
- If $X>0$, then $\mathbb{E}[X]>0$.
- If $a\leq X\leq b$, then $a\leq \mathbb{E}[X]\leq b$.
- If $X=c$, then $\mathbb{E}[X]=c$.
We can define Variance of $X$ as $\mathbb{V}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]$.

Law of The Unconscious Statistician (LOTUS)¶

Tip

For expectation of $Y=g(X)$, we can get away without having to compute PMF explicitly for $Y$, as it can be shown that

\[\mathbb{E}[g(X)]=\sum_x g(x)p_X(x)\]
With the help of LOTUS, $\mathbb{V}(X)=\sum_x (x-\mathbb{E}[X])^2 p_X(x)$.
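As a small check of LOTUS, the sketch below computes $\mathbb{E}[g(X)]$ twice: once through the PMF of $Y=g(X)$ and once directly against $p_X$. The distribution and the choice $g(x)=(x-1)^2$ are assumptions made only for illustration.

```python
from fractions import Fraction

# Assumed example PMF of X on {0, 1, 2}.
p_x = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}


def g(x):
    return (x - 1) ** 2


# Route 1: build the PMF of Y = g(X) first, then average over y.
p_y = {}
for x, p in p_x.items():
    p_y[g(x)] = p_y.get(g(x), 0) + p
e_via_pmf = sum(y * p for y, p in p_y.items())

# Route 2 (LOTUS): average g(x) directly against p_X.
e_via_lotus = sum(g(x) * p for x, p in p_x.items())

# Variance via LOTUS with g(x) = (x - E[X])^2.
mean = sum(x * p for x, p in p_x.items())
variance = sum((x - mean) ** 2 * p for x, p in p_x.items())
print(e_via_pmf, e_via_lotus, mean, variance)
```

Both routes agree, and the second never needs $p_Y$.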

Moments of a rv¶

Note

The n-th moment of $X$ is defined as $\mathbb{E}[X^n]$.
Variance in terms of moments: $\mathbb{V}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2$.

Expectations of linear functions of rv¶

Note

For linear functions of $X$, $g(X)=aX+b$

$\mathbb{E}[aX+b]=a\mathbb{E}[X]+b$.
$\mathbb{V}(aX+b)=a^2\mathbb{V}(X)$.

Variance of bounded rv¶

Tip

For bounded rvs $a\leq X\leq b$, the variance is bounded as

\[\mathbb{V}(X)\leq\frac{(b-a)^2}{4}\]
Proof Hint:
- The expression $\mathbb{E}[(X-\gamma)^2]$ is minimised when $\gamma=\mathbb{E}[X]$
- Therefore, $\mathbb{V}(X)\leq\mathbb{E}\left[\left(X-\frac{a+b}{2}\right)^2\right]=\mathbb{E}[(X-a)(X-b)]+\frac{(b-a)^2}{4}$
- $a\leq X\leq b\implies (X-a)(X-b)\leq 0\implies \mathbb{E}[(X-a)(X-b)]\leq 0$

Expectations of general functions of rv¶

Warning

For non-linear functions, it is generally not true that $\mathbb{E}[g(X)]=g(\mathbb{E}[X])$.

Jensen’s inequality¶

Attention

If $g$ is convex, $\mathbb{E}[g(X)]\geq g(\mathbb{E}[X])$.
If $g$ is concave, $\mathbb{E}[g(X)]\leq g(\mathbb{E}[X])$.
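A quick numerical illustration of both directions, using the convex function $x^2$ and the concave function $\log x$. The PMF is an assumed example with positive support so that the logarithm is defined.

```python
import math
from fractions import Fraction

# Assumed example PMF of X on {1, 2, 4}.
p_x = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
mean = sum(x * p for x, p in p_x.items())

# Convex case: g(x) = x^2.
e_square = sum(x * x * p for x, p in p_x.items())
print(e_square, ">=", mean ** 2)

# Concave case: g(x) = log(x).
e_log = sum(math.log(x) * float(p) for x, p in p_x.items())
print(e_log, "<=", math.log(mean))
```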

Multiple discrete random variables¶

Note

We can define the joint-probability mass function for 2 rvs as

\[p_{X,Y}(x,y)=\mathbb{P}(\{X=x\}\cap\{Y=y\})=\mathbb{P}(X=x,Y=y).\]
The marginal probability is defined as $p_X(x)=\sum_y p_{X,Y}(x,y)$ (similarly for $p_Y(y)$).
LOTUS holds, i.e. for $g(X,Y)$, $\mathbb{E}[g(X,Y)]=\sum_{x,y} g(x,y) p_{X,Y}(x,y)$.
Linearity of expectation holds, i.e. $\mathbb{E}[aX+bY+c]=a\mathbb{E}[X]+b\mathbb{E}[Y]+c$.
Extends naturally for more than 2 rvs.

Conditioning¶

Note

A discrete rv can be conditioned on an event $A$ (when $\mathbb{P}(A)>0$) and its conditional PMF is defined as

\[p_{X|A}(x)=\mathbb{P}(X=x|A).\]
Extends to the case when the event is defined in terms of another discrete rv, i.e. $A=\{Y=y\}$ with $p_Y(y)>0$ and is written as

\[p_{X|Y}(x|y)=\mathbb{P}(X=x|Y=y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}\]
Connects to the joint PMF as $p_{X,Y}(x,y)=p_Y(y)p_{X|Y}(x|y)$

Bayes theorem¶

Tip

For $p_X(x)>0$, $p_{Y|X}(y|x)=\frac{p_Y(y)p_{X|Y}(x|y)}{\sum_{y'} p_Y(y')p_{X|Y}(x|y')}$, where the sum runs over all $y'$ with $p_Y(y')>0$.
$p_Y(y)$ is known as the prior, $p_{Y|X}(y|x)$ is called the posterior, and $p_{X|Y}(x|y)$ is known as the likelihood.
The denominator $Z=\sum_{y'} p_Y(y')p_{X|Y}(x|y')=p_X(x)$ is the normalisation factor (i.e. it ensures that the posterior sums to 1 over $y$).
We can often work with unnormalised probabilities when exact values are not required, as $p_{Y|X}(y|x)\propto p_Y(y)p_{X|Y}(x|y)$.
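A minimal sketch of this computation: multiply prior by likelihood, then divide by the normalisation factor. The two-coin setup, the function name `posterior` and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction


def posterior(prior, likelihood, x):
    """prior: {y: p_Y(y)}, likelihood: {y: {x: p_{X|Y}(x|y)}}."""
    unnormalised = {y: prior[y] * likelihood[y][x] for y in prior}
    z = sum(unnormalised.values())  # normalisation factor Z = p_X(x)
    return {y: u / z for y, u in unnormalised.items()}


# Hypothetical setup: Y is which coin was picked, X is the toss result.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
likelihood = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(9, 10), "T": Fraction(1, 10)},
}
post = posterior(prior, likelihood, "H")
print(post)
```

The unnormalised values $\frac{1}{4}$ and $\frac{9}{20}$ already show which hypothesis is favoured; dividing by $Z=\frac{7}{10}$ only rescales them.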

Total probability theorem¶

Tip

Let $A_1,A_2,\cdots,A_n$ be disjoint events such that $\bigcup_{i=1}^n A_i=\Omega$ (i.e. they define a partition).
If $\mathbb{P}(A_i)>0$ for all $i$, then

\[p_X(x)=\sum_{i=1}^n\mathbb{P}(A_i)p_{X|A_i}(x)\]
This also works if the events $A_i$ are defined in terms of another discrete rv (i.e. $A_i=\{Y=y\}$)

\[p_{X}(x)=\sum_y p_Y(y)p_{X|Y}(x|y)\]
- Note: This extends the result from the finite case to the countably infinite case.
This allows us to compute the probability of events in a complicated probability model by utilising events from a simpler model, i.e. it lets us use the divide-and-conquer technique. We just need to ensure that the events from the simpler model in fact exhaust the entire sample space of the original probability model.
For any other event $B$ where $\mathbb{P}(A_i\cap B)>0$ for all $i$

\[p_{X|B}(x)=\sum_{i=1}^n\mathbb{P}(A_i|B)p_{X|A_i\cap B}(x)\]

Conditional expectation¶

Note

Defined in terms of the conditional PMF, such as
- $\mathbb{E}[X|A]=\sum_x x p_{X|A}(x)$ and
- $\mathbb{E}[X|Y=y]=\sum_x x p_{X|Y}(x|y)$
LOTUS holds, i.e. $\mathbb{E}[g(X)|A]=\sum_x g(x)p_{X|A}(x)$.

Tip

Since we can have multivariable functions and joint distributions, we explicitly subscript the expectation with the variables of the PMF it is taken over.
This means
- $\mathbb{E}_{X,Y}[f(X,Y)]=\sum_x\sum_yf(x,y)p_{X,Y}(x,y)$
- $\mathbb{E}_{X|Y}[f(X,Y)|Y=y]=\sum_xf(x,y)p_{X|Y}(x|y)$

Attention

$\mathbb{E}[X]$ is a constant.
We note that $\mathbb{E}_{X|Y}[X|Y=y]$ is just a function (not a rv) of a simple variable $y\in\mathbb{R}$.
- $g(y)=\mathbb{E}_{X|Y}[X|Y=y]$
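On the other hand, $\mathbb{E}_{X|Y}[X|Y]$ is a rv: it is the function $g(Y)$ of $Y$, so it takes the value $g(y)$ with probability $p_Y(y)$ (adding up the masses of all $y$ that give the same value).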

Tip

From the total probability theorem:

For partitions $A_1,A_2,\cdots,A_n$

\[\mathbb{E}[X]=\sum_x x p_X(x)=\sum_{i=1}^n \mathbb{P}(A_i)\sum_x x p_{X|A_i}(x)=\sum_{i=1}^n \mathbb{P}(A_i)\mathbb{E}[X|A_i]\]
For any other event $B$ where $\mathbb{P}(A_i\cap B)>0$ for all $i$

\[\mathbb{E}[X|B]=\sum_{i=1}^n \mathbb{P}(A_i|B)\mathbb{E}[X|A_i\cap B]\]
If the events, $A_i$, are represented by another discrete rv such that $A_i=\{Y=y\}$

\[\mathbb{E}[X]=\sum_y p_Y(y)\mathbb{E}[X|Y=y]=\sum_y g(y)p_Y(y)=\mathbb{E}[g(Y)]=\mathbb{E}\left[\mathbb{E}[X|Y]\right] \text{, where $g(Y)=\mathbb{E}[X|Y]$.}\]

Law of iterated expectation¶

Attention

For a function $f(X)$ of a single rv, we have

\[\mathbb{E}_X[f(X)]=\mathbb{E}_Y\left[\mathbb{E}_{X|Y}[f(X)|Y]\right]\]
For a multivariable function $f(X,Y)$ of two jointly distributed rvs, we have

\[\mathbb{E}_{X,Y}[f(X,Y)]=\mathbb{E}_X\left[\mathbb{E}_{Y|X}[f(X,Y)|X]\right]=\mathbb{E}_Y\left[\mathbb{E}_{X|Y}[f(X,Y)|Y]\right]\]
Proof (first):

\[\mathbb{E}_X[f(X)]=\sum_xf(x)p_X(x)=\sum_xf(x)\left(\sum_yp_{X,Y}(x,y)\right)=\sum_xf(x)\left(\sum_yp_{Y}(y)p_{X|Y}(x|y)\right)=\sum_yp_{Y}(y)\left(\sum_xf(x)p_{X|Y}(x|y)\right)=\sum_yp_{Y}(y)\mathbb{E}_{X|Y}[f(X)|Y=y]=\mathbb{E}_Y\left[\mathbb{E}_{X|Y}[f(X)|Y]\right]\]
Proof (second):

\[\mathbb{E}_{X,Y}[f(X,Y)]=\sum_x\sum_yf(x,y)p_{X,Y}(x,y)=\sum_x\sum_yf(x,y)p_X(x)p_{Y|X}(y|x)=\sum_xp_X(x)\left(\sum_yf(x,y)p_{Y|X}(y|x)\right)=\sum_xp_X(x)\mathbb{E}_{Y|X}[f(X,Y)|X=x]=\mathbb{E}_X\left[\mathbb{E}_{Y|X}[f(X,Y)|X]\right]\]
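The second identity can be checked numerically on a small joint PMF. The table `joint` and the function `f` below are assumed examples; the inner expectation is computed from the conditional PMF $p_{X|Y}(x|y)=p_{X,Y}(x,y)/p_Y(y)$ and then averaged over $p_Y$.

```python
from fractions import Fraction

# Assumed joint PMF p_{X,Y}(x, y) on {0, 1} x {0, 1}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}


def f(x, y):
    return x + 2 * x * y


# Marginal p_Y(y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p


def inner(y):
    """E[f(X, Y) | Y = y], using p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y)."""
    return sum(f(x, v) * p / p_y[y] for (x, v), p in joint.items() if v == y)


lhs = sum(f(x, y) * p for (x, y), p in joint.items())  # E_{X,Y}[f(X,Y)]
rhs = sum(inner(y) * p for y, p in p_y.items())        # E_Y[E_{X|Y}[f(X,Y)|Y]]
print(lhs, rhs)
```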

Notion of Independence¶

Note

$X$ is independent of an event $A$ iff $p_{X|A}(x)=p_X(x)$ for all $x$.
Two rvs are independent when $p_X(x)=p_{X|Y}(x|y)$ and $p_Y(y)=p_{Y|X}(y|x)$ hold for all values of $x$ and $y$.
Two independent rvs are written with the notation $X\perp\!\!\!\perp Y$.
If $X\perp\!\!\!\perp Y$, $p_{X,Y}(x,y)=p_X(x)p_Y(y)$ for all $x$ and $y$.

Expectation and variance for independent rvs¶

Note

$\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y]$
$\mathbb{V}(X+Y)=\mathbb{V}(X)+\mathbb{V}(Y)$
Extends naturally to more than 2 rvs.

Mean and variance of sample mean¶

Attention

Let $X_1,\cdots,X_n$ be a sample of size $n$.
We assume that these rvs
- are independent.
- have common mean $\mu$ and common variance $\sigma^2$.
The sample mean is the rv $M_n=\frac{1}{n}\sum_{i=1}^n X_i$.
Mean of $M_n$:

\[\mathbb{E}[M_n]=\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]=\frac{1}{n}\sum_{i=1}^n\mathbb{E}[X_i]=\frac{1}{n}\sum_{i=1}^n\mu=\mu\]
- Note: We’ve only used linearity of expectation.
Variance of $M_n$:

\[\mathbb{V}(M_n)=\mathbb{V}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\sum_{i=1}^n\mathbb{V}(X_i)=\frac{1}{n^2}\sum_{i=1}^n\sigma^2=\frac{\sigma^2}{n}\]
- Note: We’ve used the independence assumption here.
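Both results can be verified exactly for a small case by enumerating every outcome. The sketch below assumes $n=3$ independent rolls of a fair six-sided die, which is an illustrative choice rather than anything fixed by the notes.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # one fair die
mu = Fraction(sum(faces), 6)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

n = 3
# All 6^n outcomes of n independent rolls are equally likely.
means = [Fraction(sum(rolls), n) for rolls in product(faces, repeat=n)]
e_mn = sum(means) / len(means)
v_mn = sum((m - e_mn) ** 2 for m in means) / len(means)
print(e_mn, v_mn, sigma2 / n)
```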

Note

Note: We don’t require them to be identically distributed.
This is useful in establishing the weak law of large numbers (WLLN) along with Chebyshev’s inequality.

Some discrete random variables¶

Bernoulli¶

Any experiment that deals with a binary outcome (e.g. success or failure) can be represented by a Bernoulli rv.

Note

We can define a rv $X$ with $X=1$ representing success and $X=0$ representing failure.
We only need to know about one of the probability values, $\mathbb{P}(X=1)=p$, as $\mathbb{P}(X=0)=1-p$.
Therefore, a Bernoulli rv is parameterised with just 1 parameter, $p$.
[Derive] For $X\sim\mathrm{Ber}(p)$, $\mathbb{E}[X]=p$ and $\mathbb{V}(X)=p(1-p)$.

Tip

For any set of events $A_1,A_2,\cdots,A_n$, we can use indicator functions to represent them.
Indicator functions are Bernoulli rvs which are defined as

\[\begin{split}X_i = \begin{cases} 1 & \text{if $A_i$ occurs} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
Under this setup, $\mathbb{P}(A_i)=\mathbb{E}[X_i]$.

Multinoulli¶

Any experiment that deals with a categorical outcome can be represented by a Multinoulli rv.

Note

If the rv $X$ takes the values from the set $\{x_1,\cdots,x_k\}$, then $X\sim\mathrm{Multinoulli}(p_1,\cdots,p_k)$.
We can make do with $k-1$ parameters instead of $k$, as $\sum_{i=1}^k p_i=1$.
Bernoulli is a special case of Multinoulli where $k=2$.

Uniform¶

TODO

Binomial¶

In $n$ independent repetitions of a Bernoulli trial with parameter $p$, let $X$ denote the total number of successes. Then $X\sim\mathrm{Bin}(n,p)$ and the PMF is given by

\[p_X(x)={n \choose x} p^x(1-p)^{n-x}\]

Attention

Prove that $\sum_{x=0}^n p_X(x)=1$.

Note

We can write a Binomially distributed rv as a sum of independent, Bernoulli rvs.

Let’s denote each of the trials with a different Bernoulli rv, $X_i\sim\mathrm{Ber}(p)$ for $i$-th trial.
Then $Y=X_1+\cdots+X_n$ is the total number of successes, where $X_i\perp\!\!\!\perp X_j$ for $i\neq j$, so $Y\sim\mathrm{Bin}(n,p)$.
[Derive] For $X\sim\mathrm{Bin}(n,p)$, $\mathbb{E}[X]=np$ and $\mathbb{V}(X)=np(1-p)$.
Hint:
- For mean, utilise the linearity of expectation (does not require independence).
- For variance, utilise independence in the sum of rvs.
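The claimed normalisation, mean and variance can be checked with exact arithmetic for particular parameters. The values $n=10$ and $p=\frac{3}{10}$ below are arbitrary assumptions; this is a sanity check, not a proof.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """PMF of Bin(n, p) as a dict {x: p_X(x)}."""
    return {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}


n, p = 10, Fraction(3, 10)
pmf = binomial_pmf(n, p)
total = sum(pmf.values())
mean = sum(x * q for x, q in pmf.items())
variance = sum((x - mean) ** 2 * q for x, q in pmf.items())
print(total, mean, variance)
```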

Tip

Solving a problem with an existing framework often requires us to think of a process with which the experiment takes place. With the right process description, seemingly difficult problems often become easy.

The Birthday Problem¶

Attention

In a party of $500$ guests, what is the probability that you share your birthday with $5$ other people?

All birthdays are equally likely (assumption of the underlying probability model).
Person A’s birthday is independent of person B’s birthday.
[The process] To find out the number of people who share their birthday with me, I can
- pick a person at random and ask their birthday
- I consider it a success if their birthday is the same as mine, failure otherwise
- repeat for each of the $n$ other guests
Total number of successes represents the total number of people who share their birthday with me.

The Hat Problem¶

Attention

There are $n$ people with numbered hats. They throw all their hats into a basket and then, one by one, each picks up a hat at random. What is the expected number of people who get their own hat back? What is the variance of this?

Let $X_i=1$ if the $i$-th person gets their hat back in the process, and $X_i=0$ otherwise.
Total number of people who get their own hat back is given by $Y=X_1+X_2+\cdots+X_n$.
This looks like the case for Binomial distribution but it’s not.
[IMPORTANT] In this case, the rvs are not independent.
- To see why, let’s take $n=2$.
- The unconditional probabilities $\mathbb{P}(X_1=1)=\mathbb{P}(X_2=1)=\frac{1}{2}$.
- But, if $X_1=1$, then $\mathbb{P}(X_2=1|X_1=1)=1$. If $X_1=0$, then $\mathbb{P}(X_2=1|X_1=0)=0$.
However, each person is equally likely to get their own hat back if they’re the first to pick.
[IMPORTANT] Therefore, for the unconditional probability, for any $i$, $\mathbb{P}(X_i=1)=\mathbb{P}(X_1=1)=\frac{1}{n}$.
The expectation can therefore be calculated by

\[\mathbb{E}[Y]=\mathbb{E}[X_1+\cdots+X_n]=\sum_{i=1}^n\mathbb{E}[X_i]=\sum_{i=1}^n\mathbb{E}[X_1]=n\cdot\frac{1}{n}=1\]
For the variance, we calculate $\mathbb{E}[Y^2]$ as follows:

\[\mathbb{E}[Y^2]=\mathbb{E}[(X_1+\cdots+X_n)^2]=\underbrace{\sum_{i=1}^n\mathbb{E}[X_i^2]}_\text{$n$ terms} + \underbrace{\sum_{i=1}^n\sum_{j=1,j\neq i}^n\mathbb{E}[X_i X_j]}_\text{$n^2-n$ terms}=\sum_{i=1}^n\sum_{x\in\{0,1\}} x^2\,\mathbb{P}(X_i=x)+\sum_{i=1}^n\sum_{j=1,j\neq i}^n\sum_{x,x'\in\{0,1\}} x x'\,\mathbb{P}(X_i=x,X_j=x')\]
For the first term:
- We can ignore the case where $X_i=0$ as $X_i^2=0$ as well.
- Also, $X_i^2=1$ when $X_i=1$.
- The first term becomes $\sum_{i=1}^n 1\cdot\mathbb{P}(X_1=1)=n\cdot\frac{1}{n}=1$.
For the second term:
- We can ignore the cases where either $X_i$ or $X_j$ is 0, as then $X_iX_j=0$.
- [IMPORTANT] For $X_i=1,X_j=1$, by symmetry argument similar to above, we can conclude that for any $i\neq j$
\[\mathbb{P}(X_i=1,X_j=1)=\mathbb{P}(X_1=1,X_2=1)=\mathbb{P}(X_1=1)\mathbb{P}(X_2=1|X_1=1)=\frac{1}{n}\cdot\frac{1}{n-1}\]
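Putting the two terms together for $n\geq 2$: the second term is $(n^2-n)\cdot\frac{1}{n(n-1)}=1$, so $\mathbb{E}[Y^2]=1+1=2$ and $\mathbb{V}(Y)=\mathbb{E}[Y^2]-(\mathbb{E}[Y])^2=2-1=1$. The sketch below checks this by brute force, assuming the hats end up as a uniformly random permutation; the helper name `hat_moments` is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def hat_moments(n):
    """Exact mean and variance of the number of people who get their own hat."""
    counts = [
        sum(1 for person, hat in enumerate(perm) if person == hat)
        for perm in permutations(range(n))
    ]
    mean = Fraction(sum(counts), len(counts))
    second_moment = Fraction(sum(c * c for c in counts), len(counts))
    return mean, second_moment - mean**2


results = {n: hat_moments(n) for n in range(2, 7)}
print(results)
```

Both the mean and the variance stay at 1 regardless of $n$, even though the $X_i$ are dependent.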

Geometric¶

The number of repeated Bernoulli trials we need until we get a success can be modelled using a Geometric distribution. Let the Bernoulli trials have parameter $p$. Then $X\sim\mathrm{Geom}(p)$ and the PMF for $x=1,2,\cdots$ is given by

\[p_X(x)=(1-p)^{x-1} p\]

Attention

Prove that $\sum_{x=1}^\infty p_X(x)=1$.

Note

Geometric rvs have a memorylessness property. Even if we know that the first trial was a failure, it doesn’t tell us anything about the remaining number of trials required to get a success.
The remaining number of trials follows the same geometric distribution.
This fact is useful for obtaining the mean and variance of geometric rvs.
- Suppose the first trial was a failure. This is represented by the conditional rv $X|X>1$.
- Let the remaining number of trials until the first success be represented by $Y$. Clearly, $X|X>1=Y+1$ and $\mathbb{E}[X|X>1]=\mathbb{E}[Y]+1$.
- By the memorylessness property, $Y\sim\mathrm{Geom}(p)$ as well. Therefore, $\mathbb{E}[Y]=\mathbb{E}[X]$.
- We use the fact to compute the conditional expectation, $\mathbb{E}[X|X>1]=1+\mathbb{E}[X]$.
[Derive] For $X\sim\mathrm{Geom}(p)$, $\mathbb{E}[X]=\frac{1}{p}$ and $\mathbb{V}(X)=\frac{1-p}{p^2}$.
Hint:
- Use divide-and-conquer by splitting the case where $X=1$ and $X>1$.
- Utilise the total expectation law as $\mathbb{E}[X]=\mathbb{P}(X=1)\mathbb{E}[X|X=1]+\mathbb{P}(X>1)\mathbb{E}[X|X>1]$
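One way to carry the hint through (a sketch, using $\mathbb{P}(X=1)=p$, $\mathbb{E}[X|X=1]=1$ and the memorylessness argument above):

\[\mathbb{E}[X]=p\cdot 1+(1-p)\left(1+\mathbb{E}[X]\right)\implies p\,\mathbb{E}[X]=1\implies\mathbb{E}[X]=\frac{1}{p}\]
The same split applied to $X^2$, using $\mathbb{E}[X^2|X>1]=\mathbb{E}[(1+X)^2]$, gives

\[\mathbb{E}[X^2]=p\cdot 1+(1-p)\left(1+2\mathbb{E}[X]+\mathbb{E}[X^2]\right)\implies\mathbb{E}[X^2]=\frac{2-p}{p^2}\implies\mathbb{V}(X)=\frac{2-p}{p^2}-\frac{1}{p^2}=\frac{1-p}{p^2}\]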

Multinomial¶

Like Binomial, Multinomial describes the joint distribution of the counts of the different possible values across $n$ repeated Multinoulli trials.

Note

Let $Y\sim\mathrm{Multinoulli}(p_1,\cdots,p_k)$ where $Y\in\{y_1,\cdots,y_k\}$.
Let $X_i$ be the rv representing the number of times $y_i$ occurs in the $n$ trials.
These rvs are not independent.
The joint PMF for all such rvs is given by the Multinomial distribution, i.e. $X_1,\cdots,X_k\sim\mathrm{Multinomial}(n,p_1,\cdots,p_k)$

\[p_{X_1,\cdots,X_k}(x_1,\cdots,x_k)={n \choose {x_1,\cdots,x_k}} p_1^{x_1}\cdots p_k^{x_k}\]
Note that the individual rvs have a Binomial distribution, $X_i\sim\mathrm{Bin}(n, p_i)$.
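The sketch below evaluates the Multinomial PMF with exact arithmetic and checks, for one assumed example ($n=4$ trials, $k=3$ categories), that the joint PMF sums to 1 and that summing out the other counts leaves a Binomial marginal. The function name and parameter values are illustrative.

```python
from fractions import Fraction
from math import factorial


def multinomial_pmf(counts, probs):
    """Joint PMF at (x_1, ..., x_k), with n = x_1 + ... + x_k trials."""
    coeff = factorial(sum(counts))
    for x in counts:
        coeff //= factorial(x)  # exact: builds the multinomial coefficient
    value = Fraction(coeff)
    for x, p in zip(counts, probs):
        value *= p**x
    return value


probs = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
n = 4

# Sum over all count vectors (a, b, c) with a + b + c = n.
total = sum(
    multinomial_pmf((a, b, n - a - b), probs)
    for a in range(n + 1)
    for b in range(n - a + 1)
)

# Marginal P(X_1 = 2): sum out the other two counts.
marginal = sum(multinomial_pmf((2, b, n - 2 - b), probs) for b in range(n - 1))
print(total, marginal)
```

The marginal equals ${4 \choose 2}\left(\frac{1}{2}\right)^2\left(\frac{1}{2}\right)^2=\frac{3}{8}$, which is the $\mathrm{Bin}(4,\frac{1}{2})$ mass at 2.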

Poisson¶

If a Binomial rv has large $n$ and small $p$ (formally, $n\to\infty$ and $p\to 0$ with $n\cdot p$ held fixed), we can approximate it using another rv with an easier-to-manipulate distribution. For $\lambda=n\cdot p$, $X\sim\mathrm{Poisson}(\lambda)$ ($\lambda>0$), the PMF is given by

\[p_X(x)=e^{-\lambda}\frac{\lambda^x}{x!}\]

Attention

Prove that $\sum_{x=0}^\infty p_X(x)=1$.

Tip

It is useful for modelling the number of times an event occurs in a fixed interval of time when only the average rate is known.
[Derive] For $X\sim\mathrm{Poisson}(\lambda)$, $\mathbb{E}[X]=\lambda$ and $\mathbb{V}(X)=\lambda$.
Hint:
- For mean, reindex the terms in the sum.
- For the variance, reindex terms in $\mathbb{E}[X^2]$ to evaluate $\lambda\mathbb{E}[X+1]$.

Attention

[The Birthday Problem] As the value of $p$ is quite low and $n$ is quite high, we can model this as a Poisson rv as well.
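A rough numerical comparison of the two models for the birthday question. It assumes $n=499$ other guests (you are one of the 500), 365 equally likely birthdays, and reads the question as exactly 5 matches; these are interpretations, not facts stated in the problem.

```python
from math import comb, exp, factorial

n, p = 499, 1 / 365  # assumed: 499 other guests, 365 equally likely days
k = 5                # exactly 5 guests share my birthday

# Binomial model: n independent trials with success probability p.
binomial = comb(n, k) * p**k * (1 - p) ** (n - k)

# Poisson approximation with lambda = n * p.
lam = n * p
poisson = exp(-lam) * lam**k / factorial(k)

print(binomial, poisson)
```

The two values agree closely, which is what the approximation predicts when $n$ is large and $p$ is small.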

Continuous Random Variable¶

Continuous = values are from an uncountable subset of $\mathbb{R}$.

Probability density function¶

Note

When the set is uncountable, the probability $\mathbb{P}(X=x)$ of each individual value $x$ is 0.
Therefore, the probabilistic interpretation has to work with a subset of the real line $B\subset\mathbb{R}$.
We define a probability density function (PDF), $f_X(x)\geq 0$, such that

\[\mathbb{P}(X\in B)=\int\limits_{B} f_X(x)\mathop{dx}.\]
This term is well defined when
- $B$ can be represented as the union of a countable collection of intervals.
- $f_X$ is a continuous or piecewise continuous function with at most a countable number of points of discontinuity.
We say a rv is continuous if such a PDF can be defined for it.

Tip

For the simplest case when $B$ is an interval, $[a,b]$, then $\mathbb{P}(a\leq X\leq b)=\int\limits_a^b f_X(x)\mathop{dx}$.
Since individual points have 0 probability

\[\mathbb{P}(a\leq X\leq b)=\mathbb{P}(a\leq X< b)=\mathbb{P}(a< X\leq b)=\mathbb{P}(a< X< b).\]
Normalisation property holds, i.e.

\[\mathbb{P}(-\infty< X<\infty)=\int\limits_{-\infty}^\infty f_X(x)\mathop{dx}=1.\]

Probabilistic interpretation¶

Note

To understand why it is called a density

We consider an interval $[x,x+\delta]$, for some small $\delta>0$.

Assuming that $f_X(x)$ is “well behaved” (its values don’t jump around erratically), we assume that it stays (almost) constant for this entire interval.

Therefore, $\mathbb{P}(X\in[x,x+\delta])=\int\limits_x^{x+\delta} f_X(t)\mathop{dt}\approx f_X(x)\cdot\delta$.

Hence, $f_X(x)$ can be thought of as “probability per unit length”.

Attention

A PDF can take arbitrarily large values as long as the normalisation property holds, e.g.

\[\begin{split}f_X(x) = \begin{cases} \frac{1}{2\sqrt{x}} & \text{if $0 < x \leq 1$} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
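To see that this density is valid even though it is unbounded near 0, we can check the normalisation directly:

\[\int\limits_{-\infty}^\infty f_X(x)\mathop{dx}=\int\limits_0^1 \frac{1}{2\sqrt{x}}\mathop{dx}=\left[\sqrt{x}\right]_0^1=1, \quad\text{while}\quad \lim\limits_{x\to 0^+} f_X(x)=\infty.\]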

Expectation and Variance¶

We can define Expectation of $X$ as $\mathbb{E}[X]=\int\limits_{-\infty}^\infty x f_X(x) \mathop{dx}$ (assuming that the integral exists and is finite).

Attention

Expectation is well-defined when $\int\limits_{-\infty}^\infty \left|x \right| f_X(x) \mathop{dx} < \infty$.
Example where the expectation isn’t defined

\[f_X(x)=\frac{c}{1+x^2}\]

where $c$ is a normalisation constant to make it a valid PDF.
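A short check of both claims for this density (it is the standard Cauchy density). Since $\int\limits_{-\infty}^\infty \frac{1}{1+x^2}\mathop{dx}=\left[\arctan x\right]_{-\infty}^{\infty}=\pi$, we need $c=\frac{1}{\pi}$. The absolute-value integral then diverges:

\[\int\limits_{-\infty}^\infty \left|x\right| \frac{c}{1+x^2}\mathop{dx}=\frac{2}{\pi}\lim\limits_{M\to\infty}\int\limits_0^M \frac{x}{1+x^2}\mathop{dx}=\frac{1}{\pi}\lim\limits_{M\to\infty}\ln(1+M^2)=\infty.\]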

Tip

LOTUS holds, even when $g(X)$ is a discrete-valued function.
Variance can be defined as usual.

Centerisation, standardisation, skewness and kurtosis¶

Attention

We denote $\tilde{X}=X-\mathbb{E}[X]$ as the centered version of $X$.
- We also have $\mathbb{E}[\tilde{X}]=\mathbb{E}[X-\mathbb{E}[X]]=0$.
Variance is the 2nd moment of centered rv $\mathbb{V}(X)=\mathbb{E}[\tilde{X}^2]$.
We denote $Z=\frac{X-\mathbb{E}[X]}{\sqrt{\mathbb{V}(X)}}=\frac{\tilde{X}}{\sqrt{\mathbb{E}[\tilde{X}^2]}}$ as the standardised version of $X$.
- We note that $\mathbb{E}[Z]=0$ and $\mathbb{E}[Z^2]=\mathbb{E}\left[\left(\frac{\tilde{X}}{\sqrt{\mathbb{E}[\tilde{X}^2]}}\right)^2\right]=\frac{\mathbb{E}[\tilde{X}^2]}{\mathbb{E}[\tilde{X}^2]}=1$.
Skewness is the 3rd moment of standardised rv, $\mathrm{skew}(X)=\mathbb{E}[Z^3]$.
- Skewness is a way to describe the shape of a probability distribution. It tells us if the distribution is lopsided.
  - If the skewness is positive, the distribution has a longer tail on the right.
  - If it’s negative, the distribution has a longer tail on the left.
Kurtosis is the 4th moment of standardised rv, $\mathrm{kurt}(X)=\mathbb{E}[Z^4]$.
- Kurtosis comes from the Greek word for bulging.
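- Kurtosis describes how a probability distribution is shaped. It tells us about the distribution’s tails and its peak.
  - Since $\mathbb{E}[Z^4]\geq 1$ for every rv, kurtosis is never negative. Comparisons are usually made against the Gaussian, whose kurtosis is 3, via the excess kurtosis $\mathrm{kurt}(X)-3$.
  - If the excess kurtosis is positive, the distribution has heavy tails and a sharp peak.
  - If it’s negative, the distribution has light tails and a flat peak.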
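A small sketch that computes skewness and kurtosis from a PMF by standardising first. The Bernoulli examples are assumptions chosen because their moments are easy to verify by hand; the helper name is illustrative.

```python
from math import sqrt


def standardised_moments(pmf):
    """Return (skewness, kurtosis) of a discrete rv given as {x: p_X(x)}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    sd = sqrt(var)

    def moment(k):
        # k-th moment of the standardised rv Z = (X - mean) / sd
        return sum(((x - mean) / sd) ** k * p for x, p in pmf.items())

    return moment(3), moment(4)


skew_rare, kurt_rare = standardised_moments({0: 0.9, 1: 0.1})  # Ber(0.1)
skew_fair, kurt_fair = standardised_moments({0: 0.5, 1: 0.5})  # Ber(0.5)
print(skew_rare, kurt_rare, skew_fair, kurt_fair)
```

For $\mathrm{Ber}(0.1)$ the rare value 1 sits far to the right of the mean, giving positive skewness. For $\mathrm{Ber}(0.5)$ the standardised rv is $\pm 1$, so the kurtosis attains its minimum value of 1.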

Tip

Note that $\mathbb{E}[X^2]=0$ signifies that $X=0$ with probability 1. This is a useful trick in many calculations.

Cauchy-Schwarz inequality¶

Note

We define the inner product between two rvs $X$ and $Y$ as $\langle X,Y\rangle=\mathbb{E}[XY]$.
- TODO: Understand why this is a valid definition for an inner product.
We can define the norm induced by this inner product as $\left\| \cdot \right\|_{\text{norm}}$, such that

\[\langle X,X\rangle=\left\| X \right\|_{\text{norm}}^2=\mathbb{E}[X^2]\]
Then Cauchy-Schwarz inequality becomes

\[|\langle X,Y\rangle|^2\leq \left\| X \right\|_{\text{norm}}^2\cdot\left\| Y \right\|_{\text{norm}}^2\implies \left(\mathbb{E}[XY]\right)^2\leq\mathbb{E}[X^2]\cdot\mathbb{E}[Y^2]\]
Direct proof without involving Cauchy-Schwarz:
- For $\mathbb{E}[Y^2]=0$, we have $\mathbb{P}(Y=0)=1$. In that case the above is satisfied.
- For $\mathbb{E}[Y^2]\neq 0$, the proof follows from the observation that
  
  \[\mathbb{E}\left[\left(X-\frac{\mathbb{E}[XY]}{\mathbb{E}[Y^2]}Y\right)^2\right]\geq 0\]
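Expanding the square with $c=\frac{\mathbb{E}[XY]}{\mathbb{E}[Y^2]}$ and using linearity of expectation makes the last step explicit:

\[\mathbb{E}\left[(X-cY)^2\right]=\mathbb{E}[X^2]-2c\,\mathbb{E}[XY]+c^2\,\mathbb{E}[Y^2]=\mathbb{E}[X^2]-\frac{\left(\mathbb{E}[XY]\right)^2}{\mathbb{E}[Y^2]}\geq 0\implies\left(\mathbb{E}[XY]\right)^2\leq\mathbb{E}[X^2]\cdot\mathbb{E}[Y^2]\]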

Cumulative distribution function¶

Regardless of whether a rv is discrete or continuous, the event $\{X\leq x\}$ has a well defined probability.

Note

We can define a cumulative distribution function (CDF) for any rv as

\[\begin{split}F_X(x)=\mathbb{P}(X\leq x)=\begin{cases} \sum_{k\leq x} p_X(k), & \text{if $X$ is discrete} \\ \int\limits_{-\infty}^x f_X(t) \mathop{dt}, & \text{if $X$ is continuous} \end{cases}\end{split}\]

Properties of CDF¶

Attention

Monotonic: The CDF $F_X(x)$ is non-decreasing. If $x_1<x_2$, then $F_X(x_1)\leq F_X(x_2)$.
Normalised: We have $\lim\limits_{x\to -\infty} F_X(x)=0$ and $\lim\limits_{x\to \infty} F_X(x)=1$.
Right-continuous: We have $F_X(x)=F_X(x^+)$ for all $x$, where

\[F_X(x^+)=\lim\limits_{y\to x, y > x} F_X(y)\]
Let $X\sim F_X$ and $Y\sim G_Y$. We have

\[\forall x\in\mathbb{R}. F_X(x)=G_Y(x)\implies \forall \text{ (measurable) } B\subset\mathbb{R}. \mathbb{P}(X\in B)=\mathbb{P}(Y\in B)\]

Multiple continuous random variables¶

Similar to the single continuous variable case, we say that two rvs, $X$ and $Y$, are jointly continuous if we can define an associated joint PDF $f_{X,Y}(x,y)\geq 0$ such that, for any (measurable) subset $B\subset\mathbb{R}^2$, $\mathbb{P}((X,Y)\in B)=\iint\limits_{(x,y)\in B} f_{X,Y}(x,y) d(x,y)$.

Tip

For the simple case when $B=[a,b]\times [c,d]$, and when Fubini’s theorem applies, then

\[\mathbb{P}(a\leq X\leq b, c\leq Y\leq d)=\int\limits_a^b\int\limits_c^d f_{X,Y}(x,y) \mathop{dy} \mathop{dx}=\int\limits_c^d\int\limits_a^b f_{X,Y}(x,y) \mathop{dx} \mathop{dy}\]
Normalisation property holds.

\[\int\limits_{-\infty}^\infty\int\limits_{-\infty}^\infty f_{X,Y}(x,y)\mathop{dx} \mathop{dy}=1\]

Probabilistic interpretation¶

Note

For some small $\delta>0$ and $\epsilon>0$, we consider the rectangular area $[x,x+\delta]\times[y,y+\epsilon]$.
Assuming that $f_{X,Y}$ is “well behaved”, we can assume that it stays (almost) constant within this small rectangular region.
Therefore

\[\mathbb{P}(x\leq X\leq x+\delta, y\leq Y\leq y+\epsilon)=\int\limits_x^{x+\delta}\int\limits_y^{y+\epsilon}f_{X,Y}(t,v)\mathop{dv} \mathop{dt}\approx f_{X,Y}(x,y)\cdot\delta\cdot \epsilon.\]
Hence $f_{X,Y}(x,y)$ can be thought of as the joint probability per unit area.

Warning

If $X=g(Y)$, then all of the probability is concentrated on the curve $x=g(y)$, which has an area of 0 in the $\mathbb{R}^2$ plane. Therefore, we cannot define a PDF which can represent probability per unit area. So $X$ and $Y$ cannot be jointly continuous even if they are marginally continuous (i.e. their marginal PDFs are well defined).

Note

The marginal probability is defined as $f_X(x)=\int\limits_{-\infty}^\infty f_{X,Y}(x,y)\mathop{dy}$ (similarly for $f_Y(y)$).
We can define joint CDF as

\[F_{X,Y}(x,y)=\mathbb{P}(X\leq x, Y\leq y)=\int\limits_{-\infty}^x \int\limits_{-\infty}^y f_{X,Y}(s,t) \mathop{dt} \mathop{ds}\]
- PDF can be recovered from CDF as
  
  \[f_{X,Y}(x,y)=\frac{\partial^2 F_{X,Y}}{\partial x\partial y}(x,y).\]
Extends naturally for more than 2 rvs.
All the properties of expectation hold as usual.

Conditioning¶

A continuous rv can be conditioned on an event, or another rv, discrete or continuous.

Conditioning on an event¶

A continuous rv can be conditioned on an event $A$ with $\mathbb{P}(A)>0$ and we can define a conditional PDF $f_{X|A}(x)$ such that for any (measurable) subset $B\subset\mathbb{R}$

\[\mathbb{P}(X\in B|A)=\int\limits_B f_{X|A}(x) \mathop{dx}\]

Note

Normalisation property holds like normal PDFs, i.e. $\int\limits_{-\infty}^\infty f_{X|A}(x) \mathop{dx}=1$.
When the event is defined with the same rv such as $X\in A$, then

\[\begin{split}f_{X|X\in A}(x)=\begin{cases} \frac{f_{X}(x)}{\mathbb{P}(X\in A)}, & \text{if $x\in A$} \\ 0, & \text{otherwise} \end{cases}\end{split}\]

Probabilistic interpretation¶

Note

We can think of a small interval around $X=x$ of width $\delta$, so that $X\approx x$.
Assuming that $f_{X|A}(x)$ stays the same within this interval

\[\begin{split}\mathbb{P}(x\leq X\leq x+\delta|A)=\frac{\mathbb{P}(x\leq X\leq x+\delta,A)}{\mathbb{P}(A)}=\frac{\int\limits_{\{x\leq t\leq x+\delta\}\cap A} f_X(t)\mathop{dt}}{\mathbb{P}(A)}\approx\begin{cases}\frac{f_X(x)}{\mathbb{P}(A)}\int\limits_x^{x+\delta} \mathop{dt}= f_{X|A}(x)\cdot\delta & \text{if $[x,x+\delta]\subset A$}\\ 0 & \text{if $[x,x+\delta]\cap A=\emptyset$}\end{cases}\end{split}\]
So, the conditional PDF represents conditional probability per unit length.
Conditional CDF can be defined as $F_{X|A}(x)=\int\limits_{-\infty}^x f_{X|A}(t) \mathop{dt}$.
Jointly continuous rvs can be conditioned on an event $C=\{(X,Y)\in A\}$ with $\mathbb{P}(C)>0$ exactly as above.

Total probability theorem¶

Tip

For a partition of the sample space $A_1,\cdots,A_n$, with $\mathbb{P}(A_i)>0$ for all $i$

\[F_X(x)=\sum_{i=1}^n \mathbb{P}(A_i) F_{X|A_i}(x)\]
Differentiating both sides, we can recover a formula involving PDFs as $f_X(x)=\sum_{i=1}^n \mathbb{P}(A_i) f_{X|A_i}(x)$.

Conditioning on a rv¶

Conditioning on a continuous rv¶

A continuous rv $X$ can be conditioned on another continuous rv $Y$, assuming that they are jointly continuous with joint PDF $f_{X,Y}(x,y)$, as long as $f_Y(y)>0$.

Note

The conditional PDF is defined as $f_{X|Y}(x|y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}$.

Probabilistic interpretation¶

Note

We can think of a small interval around $X=x$ of width $\delta$, so that $X\approx x$.
However, we cannot take the conditioning event as $Y=y$ as it has 0 probability.
Therefore, we must consider a small interval around $Y=y$ of width $\epsilon$ such that $Y\approx y$.
Assuming that the joint and the marginal PDFs stay the same within this rectangular region, we have

\[\mathbb{P}(x\leq X\leq x+\delta|y\leq Y\leq y+\epsilon)=\frac{\mathbb{P}(x\leq X\leq x+\delta,y\leq Y\leq y+\epsilon)}{\mathbb{P}(y\leq Y\leq y+\epsilon)}\approx\frac{f_{X,Y}(x,y)\cdot\delta\cdot\epsilon}{f_Y(y)\cdot\epsilon}=\frac{f_{X,Y}(x,y)}{f_Y(y)}\cdot\delta=f_{X|Y}(x|y)\cdot\delta\]
The above doesn’t depend on $\epsilon$ at all, and remains well defined in the limit $\epsilon\to 0$.
The interpretation then works as conditional probability per unit length of the rv $X$.

Definition of probability conditioned on an event with 0 probability¶

Tip

Using the above, we can define the conditional probability for any (measurable) subset $B\subset\mathbb{R}$ as

\[\mathbb{P}(X\in B|Y=y)=\int\limits_B f_{X|Y}(x|y) \mathop{dx}\]

Conditioning on a discrete rv¶

If we have a mixed distribution with one discrete rv, $K$ and one continuous rv $Y$, then we can define conditional PMF $p_{K|Y}(k|y)$ and conditional PDF $f_{Y|K}(y|k)$.

Probabilistic interpretation¶

Note

We can think of a small interval around $Y=y$ of width $\delta$, so that $Y\approx y$.
Assuming that $f_Y(y)$ and $f_{Y|K}(y|k)$ stay the same within this interval

\[p_{K|Y}(k|y)\approx\mathbb{P}(K=k|y\leq Y\leq y+\delta)=\frac{\mathbb{P}(K=k,y\leq Y\leq y+\delta)}{\mathbb{P}(y\leq Y\leq y+\delta)}=\frac{\mathbb{P}(K=k)\mathbb{P}(y\leq Y\leq y+\delta|K=k)}{\mathbb{P}(y\leq Y\leq y+\delta)}\approx\frac{p_K(k)f_{Y|K}(y|k)\cdot\delta}{f_Y(y)\cdot\delta}=\frac{p_K(k)f_{Y|K}(y|k)}{f_Y(y)}\]

Total probability theorem¶

Note

We recover the marginals as
- $f_Y(y)=\sum_{k}p_K(k)f_{Y|K}(y|k)$ and
- $p_K(k)=\int\limits_{-\infty}^\infty f_Y(y)p_{K|Y}(k|y) \mathop{dy}$.

Bayes theorem¶

There are 4 versions of Bayes theorem.

Tip

Discrete-discrete: Already discussed in the context of discrete rv.
Discrete-continuous: $p_{K|Y}(k|y)=\frac{p_K(k)f_{Y|K}(y|k)}{f_Y(y)}$.
- Example: detection of digital signal transmission with noise
Continuous-discrete: $f_{X|K}(x|k)=\frac{f_X(x)p_{K|X}(k|x)}{p_K(k)}$.
- Example: inference about Bernoulli parameter
Continuous-continuous: $f_{X|Y}(x|y)=\frac{f_X(x)f_{Y|X}(y|x)}{f_Y(y)}$.
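The discrete-continuous version can be sketched for the signal detection example. The model below is an assumption made for illustration: a transmitted symbol $K\in\{-1,+1\}$ is observed as $Y=K+N$ with Gaussian noise $N$ of standard deviation `sd`. The function names are hypothetical.

```python
from math import exp, pi, sqrt


def normal_pdf(y, mean, sd):
    return exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def posterior_k(y, prior, sd):
    """p_{K|Y}(k|y) when Y = K + N and N is Gaussian with std dev sd."""
    # Numerators p_K(k) * f_{Y|K}(y|k).
    joint = {k: p * normal_pdf(y, k, sd) for k, p in prior.items()}
    f_y = sum(joint.values())  # total probability theorem gives f_Y(y)
    return {k: v / f_y for k, v in joint.items()}


prior = {-1: 0.5, +1: 0.5}
post_ambiguous = posterior_k(0.0, prior, sd=1.0)  # observation midway
post_positive = posterior_k(0.8, prior, sd=1.0)   # observation nearer +1
print(post_ambiguous, post_positive)
```

With equal priors and unit noise, the posterior for $K=+1$ reduces to $\frac{1}{1+e^{-2y}}$, so an observation at $y=0$ is uninformative and a positive $y$ favours $+1$.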

Conditional expectation¶

Note

Conditional expectation and LOTUS with conditional PDFs work the same as the discrete case.

Notion of Independence¶

Note

Two jointly continuous rvs are considered independent ($X\perp\!\!\!\perp Y$) if $f_{X|Y}(x|y)=f_X(x)$ for all $x$ and for all $y$ with $f_Y(y)>0$.
If $X\perp\!\!\!\perp Y$, $f_{X,Y}(x,y)=f_X(x)f_Y(y)$ and $F_{X,Y}(x,y)=F_X(x)F_Y(y)$ for all $x$ and $y$.

Some continuous random variables¶

Uniform¶

Exponential¶

TODO: explain the memorylessness property of the exponential and connection with geometric

Laplace¶

TODO

Gaussian¶

Note

We write $X\sim\mathcal{N}(\mu,\sigma^2)$ to represent an rv $X\in\mathbb{R}$ with Gaussian density, which is given by

\[f_X(x;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\]
$\mu\in\mathbb{R}$ is the mean and $\sigma>0$ is the standard deviation.
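A numerical sanity check of the density: integrating it with a simple midpoint rule should give total mass 1, mean $\mu$ and variance $\sigma^2$. The parameter values, the integration range of $\pm 10\sigma$ and the helper `integrate` are assumptions of this sketch.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


mu, sigma = 1.5, 2.0
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # tails beyond this are negligible

total = integrate(lambda x: gaussian_pdf(x, mu, sigma), lo, hi)
mean = integrate(lambda x: x * gaussian_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mu) ** 2 * gaussian_pdf(x, mu, sigma), lo, hi)
print(total, mean, var)
```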

Multivariate Gaussian¶

Note

We write $X\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ to represent an rv $X\in\mathbb{R}^d$ with Gaussian density, which is given by

\[f_X(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})=(2\pi)^{-\frac{d}{2}}|\boldsymbol{\Sigma}|^{-\frac{1}{2}}\exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}))\]
$\boldsymbol{\mu}\in\mathbb{R}^d$ is the mean and $\boldsymbol{\Sigma}\in\mathbb{R}^{d\times d}$ is the (symmetric, positive-definite) covariance matrix.

Important Properties of Gaussian Densities¶

Note

explain the shape of the 2d normal density
independent case - contours are axis-aligned ellipses (circles when the variances are equal)
dependent case - contours are rotated ellipses
Gaussians are closed under linear transforms
Gaussians are closed under conditioning