
Commit 6669723

committed
first draft sarimax
1 parent 221a412 commit 6669723


47 files changed

+3804
-35
lines changed

Lines changed: 321 additions & 0 deletions
@@ -0,0 +1,321 @@
1+
---
2+
title: "SARIMAX Model Analysis of Apple Stock with Exogenous Variables"
3+
date: 2024-07-06T00:00:00+01:00
4+
description: "A short stock price analysis of AAPL, followed by a forecast tested with a SARIMAX model"
5+
menu:
6+
sidebar:
7+
name: SARIMAX
8+
identifier: SARIMAX
9+
parent: stock_prediction
10+
weight: 12
11+
hero: sarimax_example_files/sarimax_example_6_0.png
12+
tags: ["Finance", "Statistics", "Forecasting"]
13+
categories: ["Finance"]
14+
---
15+
16+
17+
## Introduction to Exogenous Variables in Time Series Models
18+
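In the previous articles ([ARIMA](/posts/finance/stock_prediction/arima) and [SARIMA](/posts/finance/stock_prediction/sarima)), the forecasts relied only on the past values of the price series itself. This article extends that approach with exogenous variables.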
19+
20+
Exogenous variables, also known as external regressors, are independent variables that are not part of the main time series but can influence it. In the context of stock price prediction, exogenous variables might include:
21+
22+
1. Market indices (e.g., S&P 500)
23+
2. Economic indicators (e.g., GDP growth, unemployment rate)
24+
3. Company-specific metrics (e.g., revenue, earnings per share)
25+
4. Sentiment indicators (e.g., social media sentiment)
26+
27+
## Mathematical Formulation of SARIMAX
28+
29+
The SARIMAX model extends the SARIMA model by including exogenous variables. The mathematical representation is:
30+
31+
$$ \phi(B) \Phi(B^m) (1-B)^d (1-B^m)^D \left(Y_t - \beta_1 X_{1,t} - \beta_2 X_{2,t} - \dots - \beta_k X_{k,t}\right) = \theta(B) \Theta(B^m) \varepsilon_t $$
32+
33+
Where:
34+
- $Y_t$ is the dependent variable (in our case, Apple stock price)
35+
- $X_{1,t}, X_{2,t}, \dots, X_{k,t}$ are the exogenous variables
36+
- $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients of the exogenous variables
37+
- All other terms are as defined in the SARIMA model
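
To make the left-hand side of the formula concrete, the sketch below applies its two steps to toy numbers: first subtract the regression part $\beta X_t$ from $Y_t$, then apply the seasonal and regular differencing operators with $D = 1$ and $d = 1$. The series, the coefficient `beta`, the period `m = 4`, and the helper `difference` are illustrative assumptions, not values from the Apple data.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): subtract the value observed `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy data (hypothetical): y is 0.5 * x plus a pattern of period m=4 and a linear trend
m = 4
beta = 0.5
x = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
seasonal = [1, -1, 2, -2]
y = [beta * x[t] + seasonal[t % m] + 0.1 * t for t in range(len(x))]

# Step 1: remove the exogenous contribution, Y_t - beta * X_t
residual = [y[t] - beta * x[t] for t in range(len(y))]

# Step 2: apply (1 - B^m) once (D=1), then (1 - B) once (d=1)
seasonal_diff = difference(residual, lag=m)
stationary = difference(seasonal_diff, lag=1)

print([round(v, 6) for v in seasonal_diff])
print([round(v, 6) for v in stationary])
```

In this toy case the seasonal difference removes the repeating pattern and leaves only the trend accumulated over four steps, and the regular difference then removes that constant. The ARMA terms on both sides of the equation model whatever structure remains after these two steps.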
38+
39+
## Implementing SARIMAX for Apple Stock
40+
41+
Let's implement a SARIMAX model for Apple stock, using the S&P 500 index as an exogenous variable:
42+
43+
44+
```python
45+
import pandas as pd
46+
import numpy as np
47+
import matplotlib.pyplot as plt
48+
import yfinance as yf
49+
from statsmodels.tsa.statespace.sarimax import SARIMAX
50+
from pmdarima import auto_arima
51+
52+
# Download Apple stock data and S&P 500 data
53+
start_date = "2021-01-01"
54+
end_date = "2024-06-24"
55+
aapl = yf.download("AAPL", start=start_date, end=end_date)['Close']
56+
sp500 = yf.download("^GSPC", start=start_date, end=end_date)['Close']
57+
58+
# Align the data and remove any missing values
59+
data = pd.concat([aapl, sp500], axis=1).dropna()
60+
data.columns = ['AAPL', 'SP500']
61+
62+
# Split the data into train and test sets
63+
train_size = int(len(data) * 0.8)
64+
train, test = data[:train_size], data[train_size:]
65+
test_size = len(test)
66+
67+
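# Determine the best SARIMAX order.
# Caution: pmdarima 2.x expects `X=` and `out_of_sample_size=`; the `exogenous=` and
# `out_of_sample=` keywords below seem to be silently ignored there, which would
# explain why the search summary printed further down has no SP500 term.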
68+
exog = train['SP500']
69+
endog = train['AAPL']
70+
71+
model = auto_arima(endog, exogenous=exog, seasonal=True, m=12,
72+
start_p=1, start_q=1, start_P=1, start_Q=1,
73+
max_p=3, max_q=3, max_P=2, max_Q=2, d=1, D=1,
74+
trace=True, error_action='ignore', suppress_warnings=True,
75+
stepwise=True, out_of_sample=200)
76+
77+
print(model.summary())
78+
79+
# Fit the SARIMAX model
80+
sarimax_model = SARIMAX(endog, exog=exog, order=model.order, seasonal_order=model.seasonal_order)
81+
results = sarimax_model.fit()
82+
83+
print(results.summary())
84+
```
85+
86+
Performing stepwise search to minimize aic
87+
ARIMA(1,1,1)(1,1,1)[12] : AIC=inf, Time=1.82 sec
88+
ARIMA(0,1,0)(0,1,0)[12] : AIC=3802.747, Time=0.04 sec
89+
ARIMA(1,1,0)(1,1,0)[12] : AIC=3597.813, Time=0.15 sec
90+
ARIMA(0,1,1)(0,1,1)[12] : AIC=inf, Time=0.99 sec
91+
ARIMA(1,1,0)(0,1,0)[12] : AIC=3804.105, Time=0.04 sec
92+
ARIMA(1,1,0)(2,1,0)[12] : AIC=3525.586, Time=0.34 sec
93+
ARIMA(1,1,0)(2,1,1)[12] : AIC=inf, Time=2.90 sec
94+
ARIMA(1,1,0)(1,1,1)[12] : AIC=inf, Time=1.09 sec
95+
ARIMA(0,1,0)(2,1,0)[12] : AIC=3523.686, Time=0.26 sec
96+
ARIMA(0,1,0)(1,1,0)[12] : AIC=3596.070, Time=0.08 sec
97+
ARIMA(0,1,0)(2,1,1)[12] : AIC=inf, Time=2.63 sec
98+
ARIMA(0,1,0)(1,1,1)[12] : AIC=inf, Time=0.80 sec
99+
ARIMA(0,1,1)(2,1,0)[12] : AIC=3525.569, Time=0.34 sec
100+
ARIMA(1,1,1)(2,1,0)[12] : AIC=3526.799, Time=0.70 sec
101+
ARIMA(0,1,0)(2,1,0)[12] intercept : AIC=3525.686, Time=0.76 sec
102+
103+
Best model: ARIMA(0,1,0)(2,1,0)[12]
104+
Total fit time: 12.965 seconds
105+
SARIMAX Results
106+
==========================================================================================
107+
Dep. Variable: y No. Observations: 697
108+
Model: SARIMAX(0, 1, 0)x(2, 1, 0, 12) Log Likelihood -1758.843
109+
Date: Sun, 07 Jul 2024 AIC 3523.686
110+
Time: 00:01:08 BIC 3537.270
111+
Sample: 0 HQIC 3528.942
112+
- 697
113+
Covariance Type: opg
114+
==============================================================================
115+
coef std err z P>|z| [0.025 0.975]
116+
------------------------------------------------------------------------------
117+
ar.S.L12 -0.6850 0.032 -21.233 0.000 -0.748 -0.622
118+
ar.S.L24 -0.3251 0.036 -9.102 0.000 -0.395 -0.255
119+
sigma2 9.9300 0.420 23.621 0.000 9.106 10.754
120+
===================================================================================
121+
Ljung-Box (L1) (Q): 0.09 Jarque-Bera (JB): 50.28
122+
Prob(Q): 0.76 Prob(JB): 0.00
123+
Heteroskedasticity (H): 1.34 Skew: 0.10
124+
Prob(H) (two-sided): 0.03 Kurtosis: 4.31
125+
===================================================================================
126+
127+
Warnings:
128+
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
129+
130+
131+
132+
### Plotting
133+
134+
```python
135+
# Forecast
136+
forecast_steps = len(test)
137+
forecast = results.get_forecast(steps=forecast_steps, exog=test['SP500'])
138+
forecast_ci = forecast.conf_int(alpha=0.1)
139+
140+
# Plot the forecast
141+
plt.figure(figsize=(12, 6), dpi=200)
142+
plt.plot(train.index, train['AAPL'], label='Training Data')
143+
plt.plot(test.index, test['AAPL'], label='True Test Data')
144+
plt.plot(test.index, forecast.predicted_mean, color='r', label='SARIMAX Forecast')
145+
plt.fill_between(test.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.3)
146+
plt.title('SARIMAX Forecast of Apple Stock Prices')
147+
plt.xlabel('Date')
148+
plt.ylabel('Price')
149+
plt.legend()
150+
plt.show()
151+
152+
# Evaluate the model
153+
from sklearn.metrics import mean_squared_error, mean_absolute_error
154+
mse = mean_squared_error(test['AAPL'], forecast.predicted_mean)
155+
mae = mean_absolute_error(test['AAPL'], forecast.predicted_mean)
156+
rmse = np.sqrt(mse)
157+
print(f'Mean Squared Error: {mse:.4f}')
158+
print(f'Mean Absolute Error: {mae:.4f}')
159+
print(f'Root Mean Squared Error: {rmse:.4f}')
160+
161+
# Check the impact of the exogenous variable
162+
print(results.summary())
163+
```
164+
165+
166+
![png](sarimax_example_files/sarimax_example_2_1.png)
167+
168+
169+
170+
> Mean Squared Error: 1235.4975
171+
>
172+
> Mean Absolute Error: 28.1924
173+
>
174+
> Root Mean Squared Error: 35.1496
175+
176+
```
177+
SARIMAX Results
178+
==========================================================================================
179+
Dep. Variable: AAPL No. Observations: 697
180+
Model: SARIMAX(0, 1, 0)x(2, 1, 0, 12) Log Likelihood -1385.522
181+
Date: Sun, 07 Jul 2024 AIC 2779.044
182+
Time: 00:01:09 BIC 2797.156
183+
Sample: 0 HQIC 2786.053
184+
- 697
185+
Covariance Type: opg
186+
==============================================================================
187+
coef std err z P>|z| [0.025 0.975]
188+
------------------------------------------------------------------------------
189+
SP500 0.0475 0.001 37.409 0.000 0.045 0.050
190+
ar.S.L12 -0.6961 0.027 -25.701 0.000 -0.749 -0.643
191+
ar.S.L24 -0.3266 0.032 -10.212 0.000 -0.389 -0.264
192+
sigma2 3.3320 0.127 26.340 0.000 3.084 3.580
193+
===================================================================================
194+
Ljung-Box (L1) (Q): 1.59 Jarque-Bera (JB): 153.93
195+
Prob(Q): 0.21 Prob(JB): 0.00
196+
Heteroskedasticity (H): 0.98 Skew: -0.02
197+
Prob(H) (two-sided): 0.87 Kurtosis: 5.32
198+
===================================================================================
199+
200+
Warnings:
201+
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
202+
```
203+
204+
This model provides a good forecast for roughly the first 20 candles (trading days), then loses accuracy progressively as the confidence interval widens. One solution is to retrain the model every month, or at some other regular interval. The next section shows a simple way to do that.
205+
206+
## Updating the Model Each Month
207+
208+
209+
```python
210+
predictions = []
211+
conf_inters = []
212+
step = 20  # roughly 20 trading days per month
213+
214+
215+
for i in range(0, test_size, step):
216+
    # Extend the training window by i rows and forecast the next `step` rows
217+
    train_size = int(len(data) * 0.8) + i
218+
    train, test = data[:train_size], data[train_size:train_size+step]
219+
220+
    # Select the target and exogenous series for this window
221+
    exog = train['SP500']
222+
    endog = train['AAPL']
223+
224+
    # Refit the SARIMAX model, reusing the order found by auto_arima
225+
    sarimax_model = SARIMAX(endog, exog=exog, order=model.order, seasonal_order=model.seasonal_order)
226+
    results = sarimax_model.fit()
227+
228+
    # Forecast
229+
    forecast_steps = len(test)
230+
    forecast = results.get_forecast(steps=forecast_steps, exog=test['SP500'])
231+
    forecast_ci = forecast.conf_int()
232+
233+
    predictions.append(forecast.predicted_mean)
234+
    conf_inters.append(forecast_ci)
235+
236+
    # print(i, forecast_steps)
237+
238+
```
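
The windowing logic of the loop can be checked without fitting any model. The dependency-free sketch below only computes the row positions that each iteration trains on and forecasts. The helper name `rolling_windows` is hypothetical, and the figure of 872 rows is an assumption chosen to be consistent with the 697 training observations reported in the summaries.

```python
def rolling_windows(n_rows, train_fraction=0.8, step=20):
    """Yield (train_end, test_end) row positions for an expanding training window."""
    first_train_end = int(n_rows * train_fraction)
    for offset in range(0, n_rows - first_train_end, step):
        train_end = first_train_end + offset
        # Like a pandas slice, the last window is clipped at the end of the data
        test_end = min(train_end + step, n_rows)
        yield train_end, test_end

# 872 rows is an assumption (int(872 * 0.8) == 697 training rows)
windows = list(rolling_windows(872))
for train_end, test_end in windows:
    print(f"train rows [0, {train_end}) -> forecast rows [{train_end}, {test_end})")
```

Under that assumption the loop refits the model nine times, and the final window is shorter than `step` because the test set length is not a multiple of 20.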
239+
240+
### Plotting
241+
```python
242+
# Concatenate predictions list
243+
forecasts = pd.concat(predictions)
244+
forecasts_ci = pd.concat(conf_inters)
245+
246+
# Split the data into train and test sets
247+
train_size = int(len(data) * 0.8)
248+
train, test = data[:train_size], data[train_size:]
249+
test_size = len(test)
250+
251+
# Plot the forecast
252+
plt.figure(figsize=(12, 6), dpi=200)
253+
plt.plot(train.index[-200:], train['AAPL'].iloc[-200:], label='Training Data', color="#5e64f2")
254+
plt.plot(test.index, test['AAPL'], label='True Test Data', color="#b76426")
255+
plt.plot(test.index, forecasts, color='g', label='SARIMAX Forecast')
256+
plt.fill_between(test.index, forecasts_ci.iloc[:, 0], forecasts_ci.iloc[:, 1], color='blue', alpha=0.1)
257+
plt.title('SARIMAX Forecast of Apple Stock Prices using SP500 as exogenous data')
258+
plt.xlabel('Date')
259+
plt.ylabel('Price')
260+
plt.legend()
261+
plt.show()
262+
263+
# Evaluate the model
264+
from sklearn.metrics import mean_squared_error, mean_absolute_error
265+
mse = mean_squared_error(test['AAPL'], forecasts)
266+
mae = mean_absolute_error(test['AAPL'], forecasts)
267+
rmse = np.sqrt(mse)
268+
print(f'Mean Squared Error: {mse:.4f}')
269+
print(f'Mean Absolute Error: {mae:.4f}')
270+
print(f'Root Mean Squared Error: {rmse:.4f}')
271+
```
272+
273+
274+
275+
![png](sarimax_example_files/sarimax_example_6_0.png)
276+
277+
278+
279+
> Mean Squared Error: 72.5481
280+
>
281+
> Mean Absolute Error: 5.9930
282+
>
283+
> Root Mean Squared Error: 8.5175
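
For reference, the three metrics reported above can be computed by hand. The sketch below is a plain-Python equivalent of what the `sklearn.metrics` calls return; the function name `forecast_errors` and the four price values are made up for illustration.

```python
import math

def forecast_errors(actual, predicted):
    """Return MSE, MAE and RMSE for two equally long sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}

# Hypothetical prices, not taken from the AAPL test set
actual = [190.0, 192.0, 195.0, 191.0]
predicted = [188.0, 193.0, 199.0, 191.0]
metrics = forecast_errors(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in the units of the series (here, dollars), so they can be read directly against the price level. MSE is in squared units and penalizes large misses more heavily.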
284+
285+
286+
287+
## Interpreting the Results
288+
289+
When interpreting the SARIMAX model results, pay attention to:
290+
291+
1. The coefficient and p-value of the exogenous variable (S&P 500 in this case). A low p-value indicates that the S&P 500 is a significant predictor of Apple's stock price.
292+
293+
2. The AIC (Akaike Information Criterion) of the SARIMAX model compared to the SARIMA model without exogenous variables. A lower AIC suggests a better model fit.
294+
295+
3. The forecast accuracy metrics (MSE, MAE, RMSE) compared to the model without exogenous variables.
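
As a worked check of the first two points, the sketch below recomputes the AIC of both summaries printed earlier from their log-likelihoods, and rebuilds the 95% confidence interval of the SP500 coefficient. The numbers are copied from those summaries; the standard error is recovered as `coef / z` because the table rounds it to 0.001. The helper `aic` is only an illustration of the standard formula $\mathrm{AIC} = 2k - 2\ln L$, and the 1.96 multiplier assumes a normal approximation.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# First summary (no SP500 term): ar.S.L12, ar.S.L24, sigma2
aic_without_exog = aic(-1758.843, n_params=3)
# Second summary: the same three parameters plus the SP500 coefficient
aic_with_exog = aic(-1385.522, n_params=4)

# 95% confidence interval of the SP500 coefficient
coef, z = 0.0475, 37.409
std_err = coef / z
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(f"AIC without exog: {aic_without_exog:.3f}")
print(f"AIC with exog:    {aic_with_exog:.3f}")
print(f"SP500 coefficient 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

The AIC drops even though the second model has one more parameter, and the interval for the coefficient stays well away from zero. Read literally, the coefficient says that a one-point move in the S&P 500 is associated with a move of about 0.05 dollars in the AAPL price, within this sample.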
296+
297+
As expected, the monthly "*update*" method gives a much lower MAE (about 6, against about 28 for the single forecast). A single forecast spanning the whole test period (the last 20% of the data) is too long a horizon for the model, so retraining it each month leads to much better results.
298+
299+
## Advantages of Including Exogenous Variables
300+
301+
1. **Improved Accuracy**: Exogenous variables can capture external influences on the stock price, potentially leading to more accurate predictions.
302+
303+
2. **Better Understanding of Relationships**: The model provides insights into how external factors affect the stock price.
304+
305+
3. **Flexibility**: You can include multiple exogenous variables to capture different aspects of the market or economy.
306+
307+
## Limitations and Considerations
308+
309+
1. **Data Availability**: Ensuring that you have future values of exogenous variables for forecasting can be challenging.
310+
311+
2. **Overfitting Risk**: Including too many exogenous variables can lead to overfitting.
312+
313+
3. **Assumption of Linear Relationships**: SARIMAX assumes linear relationships between the exogenous variables and the target variable.
314+
315+
4. **Stationarity**: Exogenous variables should ideally be stationary or differenced to achieve stationarity.
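
The first limitation applies to the code in this article: `get_forecast` receives `test['SP500']`, the realized index values, which would not be known at forecast time. One naive stand-in, shown below as a sketch only, is to hold the last observed value flat over the horizon. The function `persistence_forecast` and the index levels are hypothetical, and this technique is not used anywhere in the article's results.

```python
def persistence_forecast(history, steps):
    """Naive placeholder: repeat the last observed value for every future step."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * steps

# Hypothetical closing levels of the exogenous series
sp500_history = [5400.0, 5431.6, 5473.2]
future_exog = persistence_forecast(sp500_history, steps=5)
print(future_exog)
```

A list like `future_exog` could then be passed where the article passes `test['SP500']`. Forecast errors would likely be larger than those reported above, since the uncertainty of the exogenous series is no longer hidden.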
316+
317+
## Conclusion
318+
319+
Incorporating exogenous variables through a SARIMAX model can significantly enhance our ability to forecast Apple stock prices. By including relevant external factors like the S&P 500 index, we can capture broader market trends that influence individual stock performance.
320+
321+
However, it's crucial to carefully select exogenous variables based on domain knowledge and to rigorously test their impact on model performance. Always validate your model using out-of-sample data and consider combining statistical forecasts with fundamental analysis for a comprehensive investment strategy.

public/categories/finance/index.html

Lines changed: 47 additions & 2 deletions
@@ -296,10 +296,55 @@
296296
<div class="post-card">
297297
<div class="card">
298298
<div class="card-head">
299-
<a href="/posts/finance/stock_prediction/sarima/" class="post-card-link">
299+
<a href="/posts/finance/stock_prediction/sarimax/index.md/" class="post-card-link">
300300
<img class="card-img-top" src='/images/default-hero.jpg' alt="Hero Image">
301301
</a>
302302
</div>
303+
<div class="card-body">
304+
<a href="/posts/finance/stock_prediction/sarimax/index.md/" class="post-card-link">
305+
<h5 class="card-title">SARIMAX Model Analysis of Apple Stock with Exogenous Variables</h5>
306+
<p class="card-text post-summary">Introduction to Exogenous Variables in Time Series Models Exogenous variables, also known as external regressors, are independent variables that are not part of the main time series but can influence it. In the context of stock price prediction, exogenous variables might include:
307+
Market indices (e.g., S&amp;P 500) Economic indicators (e.g., GDP growth, unemployment rate) Company-specific metrics (e.g., revenue, earnings per share) Sentiment indicators (e.g., social media sentiment) Mathematical Formulation of SARIMAX The SARIMAX model extends the SARIMA model by including exogenous variables.</p>
308+
</a>
309+
310+
<div class="tags">
311+
<ul style="padding-left: 0;">
312+
313+
314+
<li class="rounded"><a href="/tags/finance/" class="btn btn-sm btn-info">Finance</a></li>
315+
316+
317+
<li class="rounded"><a href="/tags/statistics/" class="btn btn-sm btn-info">Statistics</a></li>
318+
319+
320+
<li class="rounded"><a href="/tags/forecasting/" class="btn btn-sm btn-info">Forecasting</a></li>
321+
322+
</ul>
323+
</div>
324+
325+
326+
</div>
327+
<div class="card-footer">
328+
<span class="float-start">
329+
Saturday, July 6, 2024
330+
| 12 minutes </span>
331+
<a
332+
href="/posts/finance/stock_prediction/sarimax/index.md/"
333+
class="float-end btn btn-outline-info btn-sm">Read</a>
334+
</div>
335+
</div>
336+
</div>
337+
338+
339+
340+
341+
<div class="post-card">
342+
<div class="card">
343+
<div class="card-head">
344+
<a href="/posts/finance/stock_prediction/sarima/" class="post-card-link">
345+
<img class="card-img-top" src='/posts/finance/stock_prediction/sarima/images/sarima_example_9_1.png' alt="Hero Image">
346+
</a>
347+
</div>
303348
<div class="card-body">
304349
<a href="/posts/finance/stock_prediction/sarima/" class="post-card-link">
305350
<h5 class="card-title">Time Series Analysis and SARIMA Model for Stock Price Prediction</h5>
@@ -327,7 +372,7 @@ <h5 class="card-title">Time Series Analysis and SARIMA Model for Stock Price Pre
327372
<div class="card-footer">
328373
<span class="float-start">
329374
Thursday, July 4, 2024
330-
| 7 minutes </span>
375+
| 6 minutes </span>
331376
<a
332377
href="/posts/finance/stock_prediction/sarima/"
333378
class="float-end btn btn-outline-info btn-sm">Read</a>
