
Edits to prob and dist lecture #262

Merged 1 commit on Jul 31, 2023

Changes from all commits

58 changes: 22 additions & 36 deletions lectures/prob_dist.md
@@ -39,7 +39,6 @@ import scipy.stats
import seaborn as sns
```


## Common distributions

@@ -100,7 +99,6 @@ n = 10
u = scipy.stats.randint(1, n+1)
```


Here are the mean and variance:

@@ -110,7 +108,6 @@ u.mean(), u.var()

The formula for the mean is $(n+1)/2$, and the formula for the variance is $(n^2 - 1)/12$.
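
As a quick check of these formulas (an illustrative sketch reusing the `n` and `u` defined above), we can compare them with the values SciPy reports:

```{code-cell} ipython3
import scipy.stats

n = 10
u = scipy.stats.randint(1, n+1)  # discrete uniform on {1, ..., n}

# Closed-form expressions from the text
mean_formula = (n + 1) / 2
var_formula = (n**2 - 1) / 12

(mean_formula, u.mean()), (var_formula, u.var())
```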


Now let's evaluate the PMF

@@ -122,7 +119,6 @@ u.pmf(1)
u.pmf(2)
```


Here's a plot of the probability mass function:

@@ -135,7 +131,6 @@ ax.set_xticks(S)
plt.show()
```


Here's a plot of the CDF:

@@ -148,20 +143,21 @@ ax.set_xticks(S)
plt.show()
```

The CDF jumps up by $p(x_i)$ at $x_i$.
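
To see this concretely, here is a small sketch that re-creates the same distribution and checks that the CDF at an integer $k$ is just the running sum of the PMF up to $k$:

```{code-cell} ipython3
import scipy.stats

u = scipy.stats.randint(1, 11)  # the same uniform distribution on {1, ..., 10} as above

# F(k) should equal p(1) + ... + p(k); any tiny discrepancy is floating-point noise
k = 3
sum(u.pmf(i) for i in range(1, k+1)), u.cdf(k)
```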

```{exercise}
:label: prob_ex1

Calculate the mean and variance for this parameterization (i.e., $n=10$)
directly from the PMF, using the expressions given above.

Check that your answers agree with `u.mean()` and `u.var()`.
```



#### Binomial distribution

Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF
@@ -205,7 +201,6 @@ ax.set_xticks(S)
plt.show()
```


Here's the CDF

@@ -218,7 +213,6 @@ ax.set_xticks(S)
plt.show()
```


```{exercise}
:label: prob_ex2
@@ -289,11 +283,9 @@ ax.set_xticks(S)
plt.show()
```


### Continuous distributions


Continuous distributions are represented by a **density function**, which is a function $p$ over $\mathbb R$ (the set of all real numbers) such that $p(x) \geq 0$ for all $x$ and

@@ -322,7 +314,6 @@ $$
= \int_{-\infty}^x p(t) \, dt
$$
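
To make the relationship between the density and the CDF concrete, here is a small numerical sketch (using a standard normal purely as an example distribution) comparing the CDF at a point with the integral of the density up to that point:

```{code-cell} ipython3
import numpy as np
import scipy.stats
from scipy.integrate import quad

dist = scipy.stats.norm()  # standard normal, used only as an example
x = 1.0

# Numerically integrate the density from -infinity up to x
integral, _ = quad(dist.pdf, -np.inf, x)

integral, dist.cdf(x)
```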


#### Normal distribution

@@ -366,7 +357,6 @@ plt.legend()
plt.show()
```


Here's a plot of the CDF:

@@ -382,7 +372,6 @@ plt.legend()
plt.show()
```


#### Lognormal distribution

@@ -495,22 +484,19 @@ plt.show()

#### Beta distribution

The **beta distribution** is a distribution on $(0, 1)$ with density

$$
p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}
x^{\alpha - 1} (1 - x)^{\beta - 1}
$$

where $\Gamma$ is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function).

(The role of the gamma function is just to normalize the density, so that it
integrates to one.)

This distribution has two parameters, $\alpha > 0$ and $\beta > 0$.

It can be shown that, for this distribution, the mean is $\alpha / (\alpha + \beta)$ and
the variance is $\alpha \beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$.
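
As a quick check of these formulas (an illustrative sketch with hypothetical parameter values $\alpha = 3$ and $\beta = 2$), we can compare them with SciPy's answers:

```{code-cell} ipython3
import scipy.stats

α, β = 3, 2  # hypothetical parameter values, chosen only for this check
dist = scipy.stats.beta(α, β)

mean_formula = α / (α + β)
var_formula = α * β / ((α + β)**2 * (α + β + 1))

(mean_formula, dist.mean()), (var_formula, dist.var())
```
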
@@ -553,6 +539,7 @@ plt.legend()
plt.show()
```


#### Gamma distribution

The **gamma distribution** is a distribution on $(0, \infty)$ with density
@@ -562,11 +549,14 @@ $$
p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x)
$$

This distribution has two parameters, $\alpha > 0$ and $\beta > 0$.

It can be shown that, for this distribution, the mean is $\alpha / \beta$ and
the variance is $\alpha / \beta^2$.

One interpretation is that if $X$ is gamma distributed and $\alpha$ is an
integer, then $X$ is the sum of $\alpha$ independent exponentially distributed
random variables with mean $1/\beta$.
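
Here is a small Monte Carlo sketch of that interpretation (the values $\alpha = 3$, $\beta = 2$ and the sample size are hypothetical choices for illustration): summing $\alpha$ independent exponentials with mean $1/\beta$ gives draws whose sample mean and variance are close to $\alpha/\beta$ and $\alpha/\beta^2$.

```{code-cell} ipython3
import numpy as np

α, β = 3, 2.0  # hypothetical values for this illustration
rng = np.random.default_rng(1234)

# Each row holds α independent exponential draws with mean 1/β; sum each row
draws = rng.exponential(scale=1/β, size=(100_000, α)).sum(axis=1)

# Sample moments versus the gamma formulas α/β and α/β²
(draws.mean(), α / β), (draws.var(), α / β**2)
```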

We can obtain the moments, PDF, and CDF of the gamma distribution as follows:

@@ -608,7 +598,6 @@ plt.show()

## Observed distributions


Sometimes we refer to observed data or measurements as "distributions".

@@ -630,17 +619,17 @@ df = pd.DataFrame(data, columns=['name', 'income'])
df
```


In this situation, we might refer to the set of their incomes as the "income distribution."

The terminology is confusing because this set is not a probability distribution
--- it's just a collection of numbers.

However, as we will see, there are connections between observed distributions (i.e., sets of
numbers like the income distribution above) and probability distributions.

Below we explore some observed distributions.


### Summary statistics

@@ -658,8 +647,6 @@ $$
\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2
$$


For the income distribution given above, we can calculate these numbers via

```{code-cell} ipython3
@@ -670,7 +657,6 @@ x = np.asarray(df['income'])
x.mean(), x.var()
```
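
For comparison, here is a by-hand version of the same two formulas (a sketch reusing the array `x` from the cell above); NumPy's default `var` also divides by $n$, so the numbers should agree:

```{code-cell} ipython3
# Apply the sample mean and sample variance formulas directly
n = len(x)
x_bar = sum(x) / n
sample_var = sum((x_i - x_bar)**2 for x_i in x) / n

x_bar, sample_var
```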


```{exercise}
:label: prob_ex3