Updates to prob_dist lecture #289

Merged
merged 4 commits into from
Dec 14, 2023
93 changes: 82 additions & 11 deletions lectures/prob_dist.md
Expand Up @@ -23,7 +23,7 @@ kernelspec:

## Outline

In this lecture we give a quick introduction to data and probability distributions using Python
In this lecture we give a quick introduction to data and probability distributions using Python.

```{code-cell} ipython3
:tags: [hide-output]
Expand All @@ -42,7 +42,7 @@ import seaborn as sns

## Common distributions

In this section we recall the definitions of some well-known distributions and show how to manipulate them with SciPy.
In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.

### Discrete distributions

Expand All @@ -61,7 +61,7 @@ $$ \mathbb P\{X = x_i\} = p(x_i) \quad \text{for } i= 1, \ldots, n $$
The **mean** or **expected value** of a random variable $X$ with distribution $p$ is

$$
\mathbb E X = \sum_{i=1}^n x_i p(x_i)
\mathbb{E}[X] = \sum_{i=1}^n x_i p(x_i)
$$

Expectation is also called the *first moment* of the distribution.
Expand All @@ -71,15 +71,15 @@ We also refer to this number as the mean of the distribution (represented by) $p
The **variance** of $X$ is defined as

$$
\mathbb V X = \sum_{i=1}^n (x_i - \mathbb E X)^2 p(x_i)
\mathbb{V}[X] = \sum_{i=1}^n (x_i - \mathbb{E}[X])^2 p(x_i)
$$

Variance is also called the *second central moment* of the distribution.
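
As a minimal sketch of the two definitions above, the following computes the mean and variance directly from a PMF. The support `xs` and the probabilities `ps` are hypothetical values chosen for illustration; they do not come from the lecture.

```python
# Hypothetical discrete distribution (illustrative values only)
xs = [1, 2, 3, 4]            # support x_1, ..., x_n
ps = [0.1, 0.2, 0.3, 0.4]    # p(x_1), ..., p(x_n), which sum to one

# E[X] = sum_i x_i p(x_i)
mean = sum(x * p for x, p in zip(xs, ps))

# V[X] = sum_i (x_i - E[X])^2 p(x_i)
var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))

print(mean, var)
```

The same two sums apply to any finite support, so the lists can be swapped for the PMF of whichever distribution is being studied.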

The **cumulative distribution function** (CDF) of $X$ is defined by

$$
F(x) = \mathbb P\{X \leq x\}
F(x) = \mathbb{P}\{X \leq x\}
= \sum_{i=1}^n \mathbb 1\{x_i \leq x\} p(x_i)
$$
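
The indicator-sum form of the CDF can be sketched in a few lines of plain Python. The support `xs` and probabilities `ps` below are hypothetical, and `cdf` is an illustrative helper rather than a SciPy function.

```python
# Hypothetical discrete distribution (illustrative values only)
xs = [1, 2, 3, 4]
ps = [0.1, 0.2, 0.3, 0.4]

def cdf(x):
    """F(x) = sum_i 1{x_i <= x} p(x_i): add up the mass at or below x."""
    return sum(p for x_i, p in zip(xs, ps) if x_i <= x)

# F is zero to the left of the support, flat between support points,
# and equal to one from the largest support point onwards
print(cdf(0), cdf(2), cdf(2.5), cdf(4))
```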

Expand Down Expand Up @@ -157,6 +157,75 @@ Check that your answers agree with `u.mean()` and `u.var()`.
```


#### Bernoulli distribution

Another useful (and more interesting) distribution is the Bernoulli distribution.

We can import the uniform distribution on $S = \{1, \ldots, n\}$ from SciPy like so:

```{code-cell} ipython3
n = 10
u = scipy.stats.randint(1, n+1)
```


Here's the mean and variance:

```{code-cell} ipython3
u.mean(), u.var()
```

The formula for the mean is $(n+1)/2$, and the formula for the variance is $(n^2 - 1)/12$.


Now let's evaluate the PMF:

```{code-cell} ipython3
u.pmf(1)
```

```{code-cell} ipython3
u.pmf(2)
```


Here's a plot of the probability mass function:

```{code-cell} ipython3
fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
ax.vlines(S, 0, u.pmf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
```


Here's a plot of the CDF:

```{code-cell} ipython3
fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.step(S, u.cdf(S))
ax.vlines(S, 0, u.cdf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
```


The CDF jumps up by $p(x_i)$ at $x_i$.
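
The jump property can be checked for the uniform distribution on $\{1, \ldots, n\}$ with a short standard-library sketch. The helper `cdf` and the offset `eps` are illustrative assumptions; exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction
import math

n = 10

def cdf(x):
    """Uniform CDF on {1, ..., n}: the share of support points that are <= x."""
    k = min(max(math.floor(x), 0), n)
    return Fraction(k, n)

# The jump at x_i is F(x_i) minus the value of F just to the left of x_i
eps = Fraction(1, 1000)
jumps = [cdf(x) - cdf(x - eps) for x in range(1, n + 1)]

# Every jump equals the PMF value p(x_i) = 1/n
print(jumps[0], all(j == Fraction(1, n) for j in jumps))
```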


```{exercise}
:label: prob_ex2

Calculate the mean and variance for this parameterization (i.e., $n=10$)
directly from the PMF, using the expressions given above.

Check that your answers agree with `u.mean()` and `u.var()`.
```



#### Binomial distribution

Expand All @@ -170,7 +239,7 @@ Here $\theta \in [0,1]$ is a parameter.

The interpretation of $p(i)$ is: the probability of $i$ successes in $n$ independent trials with success probability $\theta$.

(If $\theta=0.5$, this is "how many heads in $n$ flips of a fair coin")
(If $\theta=0.5$, $p(i)$ is the probability of $i$ heads in $n$ flips of a fair coin.)
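
The coin-flip interpretation can be verified by brute force for a small number of flips. This sketch assumes the standard binomial PMF $p(i) = \binom{n}{i} \theta^i (1-\theta)^{n-i}$; the choice $n = 4$ and the names `by_counting` and `pmf` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

n = 4                    # number of flips (illustrative choice)
theta = Fraction(1, 2)   # fair coin

# Enumerate all 2^n equally likely flip sequences and count the heads in each
counts = [0] * (n + 1)
for flips in product([0, 1], repeat=n):
    counts[sum(flips)] += 1
by_counting = [Fraction(c, 2**n) for c in counts]

# Standard binomial PMF evaluated at i = 0, ..., n
pmf = [comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(n + 1)]

# The two ways of computing the probability of i heads agree exactly
assert by_counting == pmf
print(pmf)
```

Enumeration only works for small $n$, since the number of sequences grows as $2^n$; it is meant as a check on the interpretation, not as a way to compute the PMF.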

The mean and variance are

Expand Down Expand Up @@ -215,12 +284,12 @@ plt.show()


```{exercise}
:label: prob_ex2
:label: prob_ex3

Using `u.pmf`, check that our definition of the CDF given above calculates the same function as `u.cdf`.
```

```{solution-start} prob_ex2
```{solution-start} prob_ex3
:class: dropdown
```

Expand Down Expand Up @@ -304,7 +373,7 @@ The definition of the mean and variance of a random variable $X$ with distributi
For example, the mean of $X$ is

$$
\mathbb E X = \int_{-\infty}^\infty x p(x) dx
\mathbb{E}[X] = \int_{-\infty}^\infty x p(x) dx
$$

The **cumulative distribution function** (CDF) of $X$ is defined by
Expand All @@ -328,7 +397,7 @@ This distribution has two parameters, $\mu$ and $\sigma$.

It can be shown that, for this distribution, the mean is $\mu$ and the variance is $\sigma^2$.
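
The claim about the mean and variance can be checked numerically. The following is a rough sketch that uses the standard library's `statistics.NormalDist` for the density and a midpoint Riemann sum for the integrals; the parameter values, the integration range and the grid size are all illustrative assumptions.

```python
from statistics import NormalDist

mu, sigma = 1.0, 2.0             # illustrative parameter values
dist = NormalDist(mu, sigma)

# Midpoint rule on [mu - 10*sigma, mu + 10*sigma]; the mass outside is negligible
a, b, m = mu - 10 * sigma, mu + 10 * sigma, 20_000
h = (b - a) / m
grid = [a + (k + 0.5) * h for k in range(m)]

# Mean: integral of x p(x) dx
mean = sum(x * dist.pdf(x) for x in grid) * h

# Variance: integral of (x - mean)^2 p(x) dx
var = sum((x - mean) ** 2 * dist.pdf(x) for x in grid) * h

print(mean, var)   # close to mu and sigma**2, up to discretization error
```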

We can obtain the moments, PDF, and CDF of the normal density as follows:
We can obtain the moments, PDF and CDF of the normal density as follows:

```{code-cell} ipython3
μ, σ = 0.0, 1.0
Expand Down Expand Up @@ -659,7 +728,7 @@ x.mean(), x.var()


```{exercise}
:label: prob_ex3
:label: prob_ex4

Check that the formulas given above produce the same numbers.
```
Expand Down Expand Up @@ -700,6 +769,7 @@ The monthly return is calculated as the percent change in the share price over e
So we will have one observation for each month.
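
To make the percent-change calculation concrete, here is a small sketch on made-up numbers. The list `prices` is hypothetical and not market data; the arithmetic mirrors what `pct_change()` followed by `* 100` does, with the undefined first observation dropped.

```python
# Hypothetical month-end prices (not real market data)
prices = [100.0, 110.0, 99.0, 108.9]

# Percent change from each month to the next: (p_t / p_{t-1} - 1) * 100
returns = [(curr / prev - 1) * 100 for prev, curr in zip(prices, prices[1:])]

# There is one fewer return than prices, because the first month has no predecessor
print(returns)
```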

```{code-cell} ipython3
:tags: [hide-output]
df = yf.download('AMZN', '2000-1-1', '2023-1-1', interval='1mo' )
prices = df['Adj Close']
data = prices.pct_change()[1:] * 100
Expand Down Expand Up @@ -777,6 +847,7 @@ Violin plots are particularly useful when we want to compare different distribut
For example, let's compare the monthly returns on Amazon shares with the monthly return on Apple shares.

```{code-cell} ipython3
:tags: [hide-output]
df = yf.download('AAPL', '2000-1-1', '2023-1-1', interval='1mo' )
prices = df['Adj Close']
data = prices.pct_change()[1:] * 100
Expand Down