Skip to content

Commit bdd8624

Browse files
committed
Renamed notebooks
1 parent c54adc4 commit bdd8624

38 files changed

+85579
-0
lines changed

examples/statistical_rethinking_lectures/02-The_Garden_of_Forking_Data.ipynb

Lines changed: 1262 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/02-The_Garden_of_Forking_Data.myst.md

Lines changed: 588 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/03-Geocentric_Models.ipynb

Lines changed: 1645 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/03-Geocentric_Models.myst.md

Lines changed: 577 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/04-Categories_&_Curves.ipynb

Lines changed: 4948 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/04-Categories_&_Curves.myst.md

Lines changed: 1272 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/05-Elemental_Confounds.ipynb

Lines changed: 4054 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/05-Elemental_Confounds.myst.md

Lines changed: 1387 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/06-Good_&_Bad_Controls.ipynb

Lines changed: 5146 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/06-Good_&_Bad_Controls.myst.md

Lines changed: 1312 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/07-Fitting_Over_&_Under.ipynb

Lines changed: 2155 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/07-Fitting_Over_&_Under.myst.md

Lines changed: 698 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/08-Markov_Chain_Monte_Carlo.ipynb

Lines changed: 3598 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/08-Markov_Chain_Monte_Carlo.myst.md

Lines changed: 1073 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/09-Modeling_Events.ipynb

Lines changed: 4680 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/09-Modeling_Events.myst.md

Lines changed: 1226 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/10-Counts_&_Hidden_Confounds.ipynb

Lines changed: 5721 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/10-Counts_&_Hidden_Confounds.myst.md

Lines changed: 1391 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/11-Ordered_Categories.ipynb

Lines changed: 4223 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/11-Ordered_Categories.myst.md

Lines changed: 1027 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/12-Multilevel_Models.ipynb

Lines changed: 3337 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/12-Multilevel_Models.myst.md

Lines changed: 1142 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/13-Multilevel_Adventures.ipynb

Lines changed: 1753 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/13-Multilevel_Adventures.myst.md

Lines changed: 575 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/14-Correlated_Features.ipynb

Lines changed: 2007 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/14-Correlated_Features.myst.md

Lines changed: 851 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/15-Social_Networks.ipynb

Lines changed: 4089 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/15-Social_Networks.myst.md

Lines changed: 1293 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/16-Gaussian_Processes.ipynb

Lines changed: 5790 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/16-Gaussian_Processes.myst.md

Lines changed: 1553 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/17-Measurement_and_Misclassification.ipynb

Lines changed: 3579 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/17-Measurement_and_Misclassification.myst.md

Lines changed: 1082 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/18-Missing_Data.ipynb

Lines changed: 5216 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/18-Missing_Data.myst.md

Lines changed: 1420 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/19-Generalized_Linear_Madness.ipynb

Lines changed: 2292 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/19-Generalized_Linear_Madness.myst.md

Lines changed: 986 additions & 0 deletions
Large diffs are not rendered by default.

examples/statistical_rethinking_lectures/20-Horoscopes.ipynb

Lines changed: 403 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 228 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,228 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
extension: .md
5+
format_name: myst
6+
format_version: 0.13
7+
kernelspec:
8+
display_name: Python 3 (ipykernel)
9+
language: python
10+
name: python3
11+
---
12+
13+
(lecture_20)=
14+
# Horoscopes
15+
:::{post} Jan 7, 2024
16+
:tags: statistical rethinking, bayesian inference, scientific workflow
17+
:category: intermediate
18+
:author: Dustin Stansbury
19+
:::
20+
21+
This notebook is part of the PyMC port of the [Statistical Rethinking 2023](https://github.com/rmcelreath/stat_rethinking_2023) lecture series by Richard McElreath.
22+
23+
[Video - Lecture 20 - Horoscopes](https://youtu.be/qwF-st2NGTU)# [Lecture 20 - Horoscopes](https://www.youtube.com/watch?v=qwF-st2NGTU)
24+
25+
```{code-cell} ipython3
26+
# Ignore warnings
27+
import warnings
28+
29+
import arviz as az
30+
import numpy as np
31+
import pandas as pd
32+
import pymc as pm
33+
import statsmodels.formula.api as smf
34+
import utils as utils
35+
import xarray as xr
36+
37+
from matplotlib import pyplot as plt
38+
from matplotlib import style
39+
from scipy import stats as stats
40+
41+
warnings.filterwarnings("ignore")
42+
43+
# Set matplotlib style
44+
STYLE = "statistical-rethinking-2023.mplstyle"
45+
style.use(STYLE)
46+
```
47+
48+
# Horoscopes
49+
50+
This lecture mostly outlines a set of high-level heuristics and workflows to improve the quality of scientific research. Therefore there's not a lot of implementation details in the lecture to cover. I won't go through copying the content from each slide, but I cover some highlights (mostly for my own benefit) below:
51+
52+
## Statistics is like fortune telling
53+
54+
- Vague facts lead to vague advice
55+
- Reading tea leaves is like following common flow charts for statistical analysis
56+
- There's little scientific inputs, therefore little scientific interpretation
57+
- That's often the feature and the bug of fortune telling, and statistics:
58+
- by providing vague interpretations (e.g. horoscope predictions) from vague inputs (e.g. birthday), they can "explain" any number of outcomes
59+
- just like vague horoscopes can "explain" any number of possible future events
60+
- Exaggerated importance
61+
- no one wants to hear evil portents in their tea leaves, just as no one wants to hear about NULL or negative statistical results
62+
- there's often incentive to use statistics to find the positive result
63+
- It's often easier to offload subjective scientific responsibility onto objective statistical procedures
64+
65+
## Three pillars of scientific workflow
66+
**1. Planning**
67+
- Goal setting
68+
- estimands
69+
- Theory building
70+
- assumptions
71+
- 4 types of theory building, increasing in specificity
72+
1. Heuristic (DAGs)
73+
- allows us to deduce a lot from establishing causal structure
74+
3. Structural
75+
- moves beyond DAGs by establishing specific functional forms of causes
76+
5. Dynamical models
77+
- usually work over spatial/temporal grid
78+
- tend to collapse large number of micro-states into macro interpretation
79+
7. Agent-based
80+
- focuses on individual micro states
81+
- Justified sampling
82+
- Which data do we use, and what's it's structure
83+
- Verify with simulation
84+
- Justified analysis
85+
- Which golems?
86+
- Can we recover estimands from simulations?
87+
- Documentation
88+
- How did it happen?
89+
- Help others and your future self
90+
- Scripting is self-documenting
91+
- Comments are important
92+
- Don't be clever, be explicit
93+
- Avoid clever one-liners
94+
- I find Python PEP useful here
95+
- Sharing
96+
- open source code and data formats
97+
- proprietary software does not facilitate shareing, and is bad scientific ethics
98+
- the irony here, is that MATLAB is so common in academic setting, particularly engineering 🙄
99+
- proprietery data formats can shoot you in the foot when you (or others) can no longer open them
100+
- Preregistration isn't a silver bullet
101+
- Pre-allocating expectations on a bad analysis approach (e.g. causal salad) doesn't fix the bad approach
102+
103+
**2. Working**
104+
105+
- Research engineering
106+
- Treat research more like software enginnering
107+
- standardized, battle-tested procedures that make software dependable and repeatable
108+
- version control (git)
109+
- testing
110+
- unit testing
111+
- integration testing
112+
- build up tests incrementally, validating each part of the workflow before proceeding to the next
113+
- documentation
114+
- review
115+
- 👀, 👀 have at least one other person review your analysis code and docs, and provide feedback
116+
- will often point out bugs, optimizations, or shortcomings in documentation
117+
- Look at good examples
118+
- e.g. on of [McElreath's Consulting Projects](https://github.com/rmcelreath/CES_rater_2021/tree/main)
119+
- [Data Carpentry](https://datacarpentry.org/)
120+
121+
**3. Reporting**
122+
123+
- Sharing materials
124+
- by following code-based Working flow, sharing is pretty much done for you
125+
- [Nice breakdown and example of Describing Methods](https://youtu.be/qwF-st2NGTU?si=3I7CMalLXv3pIQhr&t=3742)
126+
- Justify priors
127+
- Justify methods, and dealing with reviewers
128+
- Common fallacy: "good scientific design doesn't require complex statistics"
129+
- valid causal modeling requires complexity
130+
- don't try to convince Reviewer 3 to accept your methods, write to editor
131+
- move the convo from statistical to causal modeling
132+
- Describe data
133+
- structure
134+
- missing values: justify imputation if any
135+
- Describe results
136+
- aim to report contrasts and marginal effects
137+
- use densities over intervals
138+
- avoid interpeting coefficients as causal effects
139+
- Making decisions
140+
- this is often the goal (particularly in industry)
141+
- embrace uncertainty
142+
- uncertainty is not admission of weakness
143+
- Bayesian decision theory
144+
- use the posterior to simulate various policy interventions
145+
- can be used to provide posteriors to costs/benefits due to those interventions
146+
147+
+++
148+
149+
## Scientific Reform
150+
151+
- many of the metrics for good science are counterproductive
152+
- e.g. papers that are least replicated continue to have higher citation count
153+
- META POINT: this result in publishing be explained using a causal modeling and colider bias
154+
155+
### Collider bias in scientific publishing
156+
157+
#### Causal model of collider bias
158+
159+
```{code-cell} ipython3
160+
---
161+
jupyter:
162+
source_hidden: true
163+
---
164+
utils.draw_causal_graph(
165+
edge_list=[("newsworhiness, N", "published, P"), ("trustworthiness, T", "published, P")],
166+
)
167+
```
168+
169+
#### Simulating data from collider causal model
170+
171+
```{code-cell} ipython3
172+
np.random.seed(123)
173+
n_samples = 200
174+
175+
# N and T are independent
176+
N = stats.norm.rvs(size=n_samples)
177+
T = stats.norm.rvs(size=n_samples)
178+
179+
# Award criterion; either are large enough to threshold
180+
A = np.where(N + T > 2, 1, 0)
181+
182+
for awarded in [0, 1]:
183+
color = "gray" if not awarded else "C0"
184+
N_A = N[A == awarded]
185+
T_A = T[A == awarded]
186+
utils.plot_scatter(N_A, T_A, color=color)
187+
188+
fit_data = pd.DataFrame({"N": N_A, "T": T_A})
189+
cc = np.corrcoef(fit_data.T, fit_data.N)[0][1]
190+
awarded_model = smf.ols("T ~ N", data=fit_data).fit()
191+
utils.plot_line(
192+
N_A, awarded_model.predict(), color="C0", label=f"Post-selection\nTrend\ncorrelation={cc:0.2}"
193+
)
194+
plt.xlabel("Newsworthiness")
195+
plt.ylabel("Trustworthiness")
196+
plt.axis("square")
197+
plt.legend();
198+
```
199+
200+
By selecting at papers that are published based on a threshold that combines either newsworthiness--i.e. "sexy papers" that get cited a lot--or trustworthiness--i.e. boring papers that are replicable--we end up with highly-cited papers that tend to be less replicable.
201+
202+
+++ {"jp-MarkdownHeadingCollapsed": true}
203+
204+
## Horoscopes of research
205+
206+
- Many things that are "bad" about science (e.g. impact factor) are once well-intentioned reforms
207+
- Some potential fixes are available:
208+
1. No stats before transparently-communicated causal model
209+
- avoid causal salad
210+
2. Prove your code/analysis works within the scope of your project and assumptions
211+
3. Share as much as possible
212+
- sometimes data is not shareable
213+
- but you can create partial, anonomized, or synthetic datasets
214+
4. Beware proxies for research quality (e.g. citation count, impact factor)
215+
216+
+++
217+
218+
## Authors
219+
* Ported to PyMC by Dustin Stansbury (2024)
220+
* Based on Statistical Rethinking (2023) lectures by Richard McElreath
221+
222+
```{code-cell} ipython3
223+
%load_ext watermark
224+
%watermark -n -u -v -iv -w -p pytensor,xarray
225+
```
226+
227+
:::{include} ../page_footer.md
228+
:::

0 commit comments

Comments
 (0)