|
| 1 | +--- |
| 2 | +title: "SARIMAX Model Analysis of Apple Stock with Exogenous Variables" |
| 3 | +date: 2024-07-06T00:00:00+01:00 |
| 4 | +description: "A short stock price analysis of AAPL, followed by a forecast tested with a SARIMAX model" |
| 5 | +menu: |
| 6 | +  sidebar: |
| 7 | +    name: SARIMAX |
| 8 | +    identifier: SARIMAX |
| 9 | +    parent: stock_prediction |
| 10 | +    weight: 12 |
| 11 | +hero: sarimax_example_files/sarimax_example_6_0.png |
| 12 | +tags: ["Finance", "Statistics", "Forecasting"] |
| 13 | +categories: ["Finance"] |
| 14 | +--- |
| 15 | + |
| 16 | + |
| 17 | +## Introduction to Exogenous Variables in Time Series Models |
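| 18 | +In the previous articles ([ARIMA](/posts/finance/stock_prediction/arima) and [SARIMA](/posts/finance/stock_prediction/sarima)), the forecasts relied only on the past values of the stock price itself. This article adds exogenous variables to the model. |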
| 19 | + |
| 20 | +Exogenous variables, also known as external regressors, are independent variables that are not part of the main time series but can influence it. In the context of stock price prediction, exogenous variables might include: |
| 21 | + |
| 22 | +1. Market indices (e.g., S&P 500) |
| 23 | +2. Economic indicators (e.g., GDP growth, unemployment rate) |
| 24 | +3. Company-specific metrics (e.g., revenue, earnings per share) |
| 25 | +4. Sentiment indicators (e.g., social media sentiment) |
| 26 | + |
| 27 | +## Mathematical Formulation of SARIMAX |
| 28 | + |
| 29 | +The SARIMAX model extends the SARIMA model by including exogenous variables. The mathematical representation is: |
| 30 | + |
| 31 | +$$ \phi(B)\,\Phi(B^m)\,(1-B)^d\,(1-B^m)^D \left(Y_t - \beta_1 X_{1,t} - \beta_2 X_{2,t} - \dots - \beta_k X_{k,t}\right) = \theta(B)\,\Theta(B^m)\,\varepsilon_t $$ |
| 32 | + |
| 33 | +Where: |
| 34 | +- $Y_t$ is the dependent variable (in our case, Apple stock price) |
| 35 | +- $X_{1,t}, X_{2,t}, \dots, X_{k,t}$ are the exogenous variables |
| 36 | +- $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients of the exogenous variables |
| 37 | +- All other terms are as defined in the SARIMA model |
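
The formulation says that once the regression part is subtracted from the dependent variable, what remains follows an ordinary (S)ARIMA process. The sketch below illustrates this on simulated data. It rests on simplifying assumptions that are not part of this article: a single regressor, AR(1) errors, no differencing or seasonality, and a two-step least-squares estimate rather than the maximum-likelihood fit that `SARIMAX` performs. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
beta_true, phi_true = 0.5, 0.7  # hypothetical values chosen for the simulation

# Exogenous regressor: a random walk, loosely mimicking an index level
x = rng.normal(size=n).cumsum()

# AR(1) errors: u[t] = phi * u[t-1] + eps[t]
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = phi_true * u[t - 1] + eps[t]

# Regression with AR(1) errors: y[t] = beta * x[t] + u[t]
y = beta_true * x + u

# Step 1: least-squares estimate of beta (no intercept)
beta_hat = (x @ y) / (x @ x)

# Step 2: estimate the AR(1) coefficient from the regression residuals
resid = y - beta_hat * x
phi_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])

print(f"beta_hat = {beta_hat:.3f}, phi_hat = {phi_hat:.3f}")
```

Both estimates should land close to the values used in the simulation, which is the sense in which the exogenous term and the ARMA structure are separable in the equation above.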
| 38 | + |
| 39 | +## Implementing SARIMAX for Apple Stock |
| 40 | + |
| 41 | +Let's implement a SARIMAX model for Apple stock, using the S&P 500 index as an exogenous variable: |
| 42 | + |
| 43 | + |
| 44 | +```python |
| 45 | +import pandas as pd |
| 46 | +import numpy as np |
| 47 | +import matplotlib.pyplot as plt |
| 48 | +import yfinance as yf |
| 49 | +from statsmodels.tsa.statespace.sarimax import SARIMAX |
| 50 | +from pmdarima import auto_arima |
| 51 | + |
| 52 | +# Download Apple stock data and S&P 500 data |
| 53 | +start_date = "2021-01-01" |
| 54 | +end_date = "2024-06-24" |
| 55 | +aapl = yf.download("AAPL", start=start_date, end=end_date)['Close'] |
| 56 | +sp500 = yf.download("^GSPC", start=start_date, end=end_date)['Close'] |
| 57 | + |
| 58 | +# Align the data and remove any missing values |
| 59 | +data = pd.concat([aapl, sp500], axis=1).dropna() |
| 60 | +data.columns = ['AAPL', 'SP500'] |
| 61 | + |
| 62 | +# Split the data into train and test sets |
| 63 | +train_size = int(len(data) * 0.8) |
| 64 | +train, test = data[:train_size], data[train_size:] |
| 65 | +test_size = len(test) |
| 66 | + |
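| 67 | +# Determine the best orders. Note: the search output below contains no SP500 term, which suggests the regressor was ignored here; recent pmdarima versions expect it as `X=` (2-D) rather than `exogenous=` |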
| 68 | +exog = train['SP500'] |
| 69 | +endog = train['AAPL'] |
| 70 | + |
| 71 | +model = auto_arima(endog, exogenous=exog, seasonal=True, m=12, |
| 72 | + start_p=1, start_q=1, start_P=1, start_Q=1, |
| 73 | + max_p=3, max_q=3, max_P=2, max_Q=2, d=1, D=1, |
| 74 | + trace=True, error_action='ignore', suppress_warnings=True, |
| 75 | + stepwise=True, out_of_sample=200) |
| 76 | + |
| 77 | +print(model.summary()) |
| 78 | + |
| 79 | +# Fit the SARIMAX model |
| 80 | +sarimax_model = SARIMAX(endog, exog=exog, order=model.order, seasonal_order=model.seasonal_order) |
| 81 | +results = sarimax_model.fit() |
| 82 | + |
| 83 | +print(results.summary()) |
| 84 | +``` |
| 85 | + |
| 86 | + Performing stepwise search to minimize aic |
| 87 | + ARIMA(1,1,1)(1,1,1)[12] : AIC=inf, Time=1.82 sec |
| 88 | + ARIMA(0,1,0)(0,1,0)[12] : AIC=3802.747, Time=0.04 sec |
| 89 | + ARIMA(1,1,0)(1,1,0)[12] : AIC=3597.813, Time=0.15 sec |
| 90 | + ARIMA(0,1,1)(0,1,1)[12] : AIC=inf, Time=0.99 sec |
| 91 | + ARIMA(1,1,0)(0,1,0)[12] : AIC=3804.105, Time=0.04 sec |
| 92 | + ARIMA(1,1,0)(2,1,0)[12] : AIC=3525.586, Time=0.34 sec |
| 93 | + ARIMA(1,1,0)(2,1,1)[12] : AIC=inf, Time=2.90 sec |
| 94 | + ARIMA(1,1,0)(1,1,1)[12] : AIC=inf, Time=1.09 sec |
| 95 | + ARIMA(0,1,0)(2,1,0)[12] : AIC=3523.686, Time=0.26 sec |
| 96 | + ARIMA(0,1,0)(1,1,0)[12] : AIC=3596.070, Time=0.08 sec |
| 97 | + ARIMA(0,1,0)(2,1,1)[12] : AIC=inf, Time=2.63 sec |
| 98 | + ARIMA(0,1,0)(1,1,1)[12] : AIC=inf, Time=0.80 sec |
| 99 | + ARIMA(0,1,1)(2,1,0)[12] : AIC=3525.569, Time=0.34 sec |
| 100 | + ARIMA(1,1,1)(2,1,0)[12] : AIC=3526.799, Time=0.70 sec |
| 101 | + ARIMA(0,1,0)(2,1,0)[12] intercept : AIC=3525.686, Time=0.76 sec |
| 102 | + |
| 103 | + Best model: ARIMA(0,1,0)(2,1,0)[12] |
| 104 | + Total fit time: 12.965 seconds |
| 105 | + SARIMAX Results |
| 106 | + ========================================================================================== |
| 107 | + Dep. Variable: y No. Observations: 697 |
| 108 | + Model: SARIMAX(0, 1, 0)x(2, 1, 0, 12) Log Likelihood -1758.843 |
| 109 | + Date: Sun, 07 Jul 2024 AIC 3523.686 |
| 110 | + Time: 00:01:08 BIC 3537.270 |
| 111 | + Sample: 0 HQIC 3528.942 |
| 112 | + - 697 |
| 113 | + Covariance Type: opg |
| 114 | + ============================================================================== |
| 115 | + coef std err z P>|z| [0.025 0.975] |
| 116 | + ------------------------------------------------------------------------------ |
| 117 | + ar.S.L12 -0.6850 0.032 -21.233 0.000 -0.748 -0.622 |
| 118 | + ar.S.L24 -0.3251 0.036 -9.102 0.000 -0.395 -0.255 |
| 119 | + sigma2 9.9300 0.420 23.621 0.000 9.106 10.754 |
| 120 | + =================================================================================== |
| 121 | + Ljung-Box (L1) (Q): 0.09 Jarque-Bera (JB): 50.28 |
| 122 | + Prob(Q): 0.76 Prob(JB): 0.00 |
| 123 | + Heteroskedasticity (H): 1.34 Skew: 0.10 |
| 124 | + Prob(H) (two-sided): 0.03 Kurtosis: 4.31 |
| 125 | + =================================================================================== |
| 126 | + |
| 127 | + Warnings: |
| 128 | + [1] Covariance matrix calculated using the outer product of gradients (complex-step). |
| 129 | + |
| 130 | + |
| 131 | + |
| 132 | +### Plotting |
| 133 | + |
| 134 | +```python |
| 135 | +# Forecast |
| 136 | +forecast_steps = len(test) |
| 137 | +forecast = results.get_forecast(steps=forecast_steps, exog=test['SP500']) |
| 138 | +forecast_ci = forecast.conf_int(alpha=0.1) |
| 139 | + |
| 140 | +# Plot the forecast |
| 141 | +plt.figure(figsize=(12, 6), dpi=200) |
| 142 | +plt.plot(train.index, train['AAPL'], label='Training Data') |
| 143 | +plt.plot(test.index, test['AAPL'], label='True Test Data') |
| 144 | +plt.plot(test.index, forecast.predicted_mean, color='r', label='SARIMAX Forecast') |
| 145 | +plt.fill_between(test.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.3) |
| 146 | +plt.title('SARIMAX Forecast of Apple Stock Prices') |
| 147 | +plt.xlabel('Date') |
| 148 | +plt.ylabel('Price') |
| 149 | +plt.legend() |
| 150 | +plt.show() |
| 151 | + |
| 152 | +# Evaluate the model |
| 153 | +from sklearn.metrics import mean_squared_error, mean_absolute_error |
| 154 | +mse = mean_squared_error(test['AAPL'], forecast.predicted_mean) |
| 155 | +mae = mean_absolute_error(test['AAPL'], forecast.predicted_mean) |
| 156 | +rmse = np.sqrt(mse) |
| 157 | +print(f'Mean Squared Error: {mse:.4f}') |
| 158 | +print(f'Mean Absolute Error: {mae:.4f}') |
| 159 | +print(f'Root Mean Squared Error: {rmse:.4f}') |
| 160 | + |
| 161 | +# Check the impact of the exogenous variable |
| 162 | +print(results.summary()) |
| 163 | +``` |
| 164 | + |
| 165 | + |
| 166 | + |
| 167 | + |
| 168 | + |
| 169 | + |
| 170 | +> Mean Squared Error: 1235.4975 |
| 171 | +> |
| 172 | +> Mean Absolute Error: 28.1924 |
| 173 | +> |
| 174 | +> Root Mean Squared Error: 35.1496 |
| 175 | + |
| 176 | +``` |
| 177 | + SARIMAX Results |
| 178 | + ========================================================================================== |
| 179 | + Dep. Variable: AAPL No. Observations: 697 |
| 180 | + Model: SARIMAX(0, 1, 0)x(2, 1, 0, 12) Log Likelihood -1385.522 |
| 181 | + Date: Sun, 07 Jul 2024 AIC 2779.044 |
| 182 | + Time: 00:01:09 BIC 2797.156 |
| 183 | + Sample: 0 HQIC 2786.053 |
| 184 | + - 697 |
| 185 | + Covariance Type: opg |
| 186 | + ============================================================================== |
| 187 | + coef std err z P>|z| [0.025 0.975] |
| 188 | + ------------------------------------------------------------------------------ |
| 189 | + SP500 0.0475 0.001 37.409 0.000 0.045 0.050 |
| 190 | + ar.S.L12 -0.6961 0.027 -25.701 0.000 -0.749 -0.643 |
| 191 | + ar.S.L24 -0.3266 0.032 -10.212 0.000 -0.389 -0.264 |
| 192 | + sigma2 3.3320 0.127 26.340 0.000 3.084 3.580 |
| 193 | + =================================================================================== |
| 194 | + Ljung-Box (L1) (Q): 1.59 Jarque-Bera (JB): 153.93 |
| 195 | + Prob(Q): 0.21 Prob(JB): 0.00 |
| 196 | + Heteroskedasticity (H): 0.98 Skew: -0.02 |
| 197 | + Prob(H) (two-sided): 0.87 Kurtosis: 5.32 |
| 198 | + =================================================================================== |
| 199 | + |
| 200 | + Warnings: |
| 201 | + [1] Covariance matrix calculated using the outer product of gradients (complex-step). |
| 202 | +``` |
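
To connect this summary back to the general formula, the fitted SARIMAX(0, 1, 0)x(2, 1, 0, 12) model can be written out with the estimated coefficients substituted in. This is a sketch that assumes the statsmodels sign convention, in which the seasonal AR polynomial is $1 - \Phi_1 B^{12} - \Phi_2 B^{24}$, so the negative `ar.S` estimates appear with a plus sign. $X_t$ denotes the S&P 500 close.

$$ \left(1 + 0.6961\,B^{12} + 0.3266\,B^{24}\right)(1-B)\left(1-B^{12}\right)\left(Y_t - 0.0475\,X_t\right) = \varepsilon_t, \qquad \operatorname{Var}(\varepsilon_t) \approx 3.3320 $$

Read this way, the `SP500` coefficient says that a one-point move in the index is associated with a move of about 0.0475 dollars in the AAPL price, with the remainder modeled by the seasonal ARIMA terms.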
| 203 | + |
| 204 | +This model provides a good forecast for roughly the first 20 daily candles, then gradually loses accuracy as the confidence interval widens. One remedy is to retrain the model every month, or at some other fixed interval. The next section shows a simple way to do that. |
| 205 | + |
| 206 | +## Updating the Model Each Month |
| 207 | + |
| 208 | + |
| 209 | +```python |
| 210 | +predictions = [] |
| 211 | +conf_inters = [] |
| 212 | +step = 20  # roughly 20 trading days per month |
| 213 | + |
| 214 | + |
| 215 | +for i in range(0, test_size, step): |
| 216 | +    # Grow the training window by i days; the next `step` days form the test window |
| 217 | +    train_size = int(len(data) * 0.8) + i |
| 218 | +    train, test = data[:train_size], data[train_size:train_size+step] |
| 219 | + |
| 220 | +    # Select the target and the exogenous regressor for this window |
| 221 | +    exog = train['SP500'] |
| 222 | +    endog = train['AAPL'] |
| 223 | + |
| 224 | +    # Refit the SARIMAX model, reusing the orders found by auto_arima above |
| 225 | +    sarimax_model = SARIMAX(endog, exog=exog, order=model.order, seasonal_order=model.seasonal_order) |
| 226 | +    results = sarimax_model.fit() |
| 227 | + |
| 228 | +    # Forecast the test window |
| 229 | +    forecast_steps = len(test) |
| 230 | +    forecast = results.get_forecast(steps=forecast_steps, exog=test['SP500']) |
| 231 | +    forecast_ci = forecast.conf_int() |
| 232 | + |
| 233 | +    predictions.append(forecast.predicted_mean) |
| 234 | +    conf_inters.append(forecast_ci) |
| 235 | + |
| 236 | +    # print(i, forecast_steps) |
| 237 | + |
| 238 | +``` |
| 239 | + |
| 240 | +### Plotting |
| 241 | +```python |
| 242 | +# Concatenate predictions list |
| 243 | +forecasts = pd.concat(predictions) |
| 244 | +forecasts_ci = pd.concat(conf_inters) |
| 245 | + |
| 246 | +# Rebuild the full train/test split (the loop above overwrote `train` and `test`) |
| 247 | +train_size = int(len(data) * 0.8) |
| 248 | +train, test = data[:train_size], data[train_size:] |
| 249 | +test_size = len(test) |
| 250 | + |
| 251 | +# Plot the forecast |
| 252 | +plt.figure(figsize=(12, 6), dpi=200) |
| 253 | +plt.plot(train.index[-200:], train['AAPL'].iloc[-200:], label='Training Data', color="#5e64f2") |
| 254 | +plt.plot(test.index, test['AAPL'], label='True Test Data', color="#b76426") |
| 255 | +plt.plot(test.index, forecasts, color='g', label='SARIMAX Forecast') |
| 256 | +plt.fill_between(test.index, forecasts_ci.iloc[:, 0], forecasts_ci.iloc[:, 1], color='blue', alpha=0.1) |
| 257 | +plt.title('SARIMAX Forecast of Apple Stock Prices using SP500 as exogenous data') |
| 258 | +plt.xlabel('Date') |
| 259 | +plt.ylabel('Price') |
| 260 | +plt.legend() |
| 261 | +plt.show() |
| 262 | + |
| 263 | +# Evaluate the model |
| 264 | +from sklearn.metrics import mean_squared_error, mean_absolute_error |
| 265 | +mse = mean_squared_error(test['AAPL'], forecasts) |
| 266 | +mae = mean_absolute_error(test['AAPL'], forecasts) |
| 267 | +rmse = np.sqrt(mse) |
| 268 | +print(f'Mean Squared Error: {mse:.4f}') |
| 269 | +print(f'Mean Absolute Error: {mae:.4f}') |
| 270 | +print(f'Root Mean Squared Error: {rmse:.4f}') |
| 271 | +``` |
| 272 | + |
| 273 | + |
| 274 | + |
| 275 | + |
| 276 | + |
| 277 | + |
| 278 | + |
| 279 | +> Mean Squared Error: 72.5481 |
| 280 | +> |
| 281 | +> Mean Absolute Error: 5.9930 |
| 282 | +> |
| 283 | +> Root Mean Squared Error: 8.5175 |
| 284 | + |
| 285 | + |
| 286 | + |
| 287 | +## Interpreting the Results |
| 288 | + |
| 289 | +When interpreting the SARIMAX model results, pay attention to: |
| 290 | + |
| 291 | +1. The coefficient and p-value of the exogenous variable (S&P 500 in this case). A low p-value indicates that the S&P 500 is a statistically significant predictor of Apple's stock price; here the `SP500` coefficient is 0.0475 with a reported p-value of 0.000. |
| 292 | + |
| 293 | +2. The AIC (Akaike Information Criterion) of the SARIMAX model compared to the SARIMA model without exogenous variables. A lower AIC suggests a better model fit. |
| 294 | + |
| 295 | +3. The forecast accuracy metrics (MSE, MAE, RMSE) compared to the model without exogenous variables. |
| 296 | + |
| 297 | +As expected, the monthly "*update*" method lowers the MAE considerably (about 6, against about 28 for the single long forecast). The test set spans roughly 175 trading days (about eight months), which is too long a horizon for a single forecast from this model. Retraining the model each month therefore gives much better results. |
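
The comparisons listed above can be made concrete with a little arithmetic on the figures already reported in this article. The snippet below is only a convenience sketch: the numbers are copied from the outputs shown earlier, the first AIC comes from the order-search summary (which contains no `SP500` term), and the helper `pct_reduction` is a hypothetical name introduced here.

```python
# Values copied from the outputs reported earlier in this article
aic_without_exog, aic_with_exog = 3523.686, 2779.044
mae_single, mae_monthly = 28.1924, 5.9930
rmse_single, rmse_monthly = 35.1496, 8.5175


def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


aic_drop = aic_without_exog - aic_with_exog
mae_gain = pct_reduction(mae_single, mae_monthly)
rmse_gain = pct_reduction(rmse_single, rmse_monthly)

print(f"AIC drop with SP500 as regressor: {aic_drop:.3f}")
print(f"MAE reduction from monthly refits: {mae_gain:.1f}%")
print(f"RMSE reduction from monthly refits: {rmse_gain:.1f}%")
```

Both error metrics are in price units (dollars), so they can be compared directly between the single forecast and the monthly refits because both are evaluated on the same test period.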
| 298 | + |
| 299 | +## Advantages of Including Exogenous Variables |
| 300 | + |
| 301 | +1. **Improved Accuracy**: Exogenous variables can capture external influences on the stock price, potentially leading to more accurate predictions. |
| 302 | + |
| 303 | +2. **Better Understanding of Relationships**: The model provides insights into how external factors affect the stock price. |
| 304 | + |
| 305 | +3. **Flexibility**: You can include multiple exogenous variables to capture different aspects of the market or economy. |
| 306 | + |
| 307 | +## Limitations and Considerations |
| 308 | + |
| 309 | +1. **Data Availability**: Ensuring that you have future values of exogenous variables for forecasting can be challenging. |
| 310 | + |
| 311 | +2. **Overfitting Risk**: Including too many exogenous variables can lead to overfitting. |
| 312 | + |
| 313 | +3. **Assumption of Linear Relationships**: SARIMAX assumes linear relationships between the exogenous variables and the target variable. |
| 314 | + |
| 315 | +4. **Stationarity**: Exogenous variables should ideally be stationary or differenced to achieve stationarity. |
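
Two of these considerations, data availability and stationarity, are often handled together during preprocessing. The sketch below shows one possible approach with pandas: differencing the regressor and lagging it by one day, so that each row only uses index information known before that day's close. The prices are hypothetical illustration values (not real market data), and the column name `SP500_diff_lag1` is introduced here. A one-day lag only supports one-step-ahead forecasts, so longer horizons would still need forecasts of the regressor.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only (not real market data)
idx = pd.bdate_range("2024-01-01", periods=6)
data = pd.DataFrame(
    {
        "AAPL": [185.0, 184.0, 186.5, 187.0, 185.5, 188.0],
        "SP500": [4740.0, 4745.0, 4760.0, 4755.0, 4770.0, 4780.0],
    },
    index=idx,
)

# Difference the regressor to remove its trend (stationarity consideration)
sp500_diff = data["SP500"].diff()

# Lag it by one business day so that each row only uses information
# available before that day's AAPL close (data availability consideration)
data["SP500_diff_lag1"] = sp500_diff.shift(1)

# The first two rows have no lagged difference and are dropped
model_data = data.dropna()
print(model_data)
```

The resulting `model_data["SP500_diff_lag1"]` column could then be passed as `exog` in place of the raw index level.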
| 316 | + |
| 317 | +## Conclusion |
| 318 | + |
| 319 | +Incorporating exogenous variables through a SARIMAX model can improve forecasts of Apple stock prices. Including a relevant external factor such as the S&P 500 index lets the model capture broader market trends that influence individual stock performance. |
| 320 | + |
| 321 | +However, it's crucial to carefully select exogenous variables based on domain knowledge and to rigorously test their impact on model performance. Always validate your model using out-of-sample data and consider combining statistical forecasts with fundamental analysis for a comprehensive investment strategy. |