Esri enhanced the Forest-based Forecast tool in ArcGIS Pro 2.9 to build a Multivariate Forest model when users choose to add Other Variables in the tool. However, the validation RMSE is underestimated in the multivariate model. Therefore, if validation RMSE is used to choose the best model in the Evaluate Forecasts By Location tool, this bug might produce misleading results.
In addition to the validation issue, the forecast model uses the fitted values of the other variables rather than the observed values. As a result, the forecasted values are off.
To build a validation model, the tool excludes some of the final time steps of each time series and fits the forest model to the data that was not excluded. This validation model is then used to forecast the values of the withheld data, and the forecasted values are compared to the withheld observed values to calculate the validation RMSE. When the validation model does the forecasting, it should assume it doesn’t know any observed values in the data excluded for validation, and it should always use the forecasted values to forecast the next value. But in 2.9, we were incorrectly using the observed values to forecast the next steps in validation, so the validation RMSE tends to be underestimated.
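The validation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `naive_model` is a hypothetical stand-in that averages the window and is not the forest model the tool fits, and the function names and the sample series are made up for the example.

```python
import math

def naive_model(window):
    """Toy stand-in for the forest model (assumption): predict the window mean."""
    return sum(window) / len(window)

def validation_rmse(series, n_validation, window_size, model=naive_model):
    """Withhold the last n_validation steps, forecast them recursively,
    and compare the forecasts to the withheld observations."""
    train = series[:-n_validation]
    withheld = series[-n_validation:]

    history = list(train)  # extended with forecasts only, never with withheld values
    forecasts = []
    for _ in range(n_validation):
        window = history[-window_size:]
        prediction = model(window)
        forecasts.append(prediction)
        history.append(prediction)  # the forecast feeds the next forecast

    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, withheld)]
    rmse = math.sqrt(sum(squared_errors) / n_validation)
    return rmse, forecasts

# Made-up series for illustration only.
series = [10, 12, 14, 16, 18, 20, 22, 24]
rmse, forecasts = validation_rmse(series, n_validation=2, window_size=4)
print(forecasts, rmse)
```

The key point is that `history` is extended only with forecasts, so the withheld observations are used for scoring and never as model input.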
For example, suppose we use the fully vaccinated rate to help forecast daily new deaths and choose a sliding window size of 4. Before starting validation, the tool builds two models: one univariate model for the fully vaccinated rate and one multivariate model for daily new deaths.
When validating the multivariate model of daily new deaths, the tool should feed the model the last four observed values of daily new deaths and the last four observed fully vaccinated rates to forecast the first validation step. It should then use the last three observed values of the two variables, the first forecasted value of daily new deaths, and the first forecasted fully vaccinated rate to forecast the second validation step. Similarly, it should use the last two observed values of the two variables and the first two forecasted values of both daily new deaths and the fully vaccinated rate to forecast the third validation step, and so forth. When the validation RMSE is calculated this way, steps further into the forecast should have larger errors.
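The sliding-window bookkeeping in this example can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the two toy models are arbitrary formulas standing in for the univariate and multivariate forest models, and the numbers are invented. The sketch only shows how each window shifts from observed values to forecasted values as validation proceeds.

```python
WINDOW_SIZE = 4

def mean(values):
    return sum(values) / len(values)

def toy_univariate_model(other_window):
    """Hypothetical stand-in for the univariate model of the other variable."""
    return mean(other_window)

def toy_multivariate_model(target_window, other_window):
    """Hypothetical stand-in for the multivariate model (arbitrary formula)."""
    return mean(target_window) - 5.0 * mean(other_window)

def recursive_validation_forecast(target_train, other_train, n_steps):
    """Forecast n_steps ahead, feeding forecasts of both variables back in."""
    target_fc, other_fc, composition = [], [], []
    for step in range(n_steps):
        # (observed count, forecasted count) inside each window at this step
        n_forecasted = min(step, WINDOW_SIZE)
        composition.append((WINDOW_SIZE - n_forecasted, n_forecasted))

        target_window = (target_train + target_fc)[-WINDOW_SIZE:]
        other_window = (other_train + other_fc)[-WINDOW_SIZE:]

        target_fc.append(toy_multivariate_model(target_window, other_window))
        other_fc.append(toy_univariate_model(other_window))
    return target_fc, other_fc, composition

# Invented training data: daily new deaths and fully vaccinated rate.
deaths_train = [50.0, 48.0, 47.0, 45.0, 44.0, 42.0]
vax_train = [0.30, 0.32, 0.34, 0.36, 0.38, 0.40]

deaths_fc, vax_fc, composition = recursive_validation_forecast(
    deaths_train, vax_train, n_steps=3
)
print(composition)  # observed/forecasted counts per validation step
```

With a window size of 4, the windows for the first three validation steps contain 4, 3, and 2 observed values respectively, matching the description above.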
The validation RMSE bug in the released version of ArcGIS Pro 2.9 is that, in the Multivariate Forest-based Forecast model, the validation RMSE calculation does not incorporate the forecasted values. It uses the observed values instead. In other words, for each validation step, the tool always uses the four previous observed values of daily new deaths and the four previous observed fully vaccinated rates to forecast the next value of daily new deaths. Because these forecasts are based on more real information than would be available in practice, the validation RMSE tends to come out smaller than it should.
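To see why feeding observed values into validation shrinks the RMSE, the two behaviors can be compared side by side. This is a hedged sketch, not the tool's code: the toy models are arbitrary stand-ins for the forest models, the `feed_observed` flag is a hypothetical switch that mimics the 2.9 behavior, and the data are invented. The size of the gap depends on the data and the model.

```python
import math

WINDOW_SIZE = 4

def mean(values):
    return sum(values) / len(values)

def toy_univariate_model(other_window):
    """Hypothetical stand-in for the univariate model of the other variable."""
    return mean(other_window)

def toy_multivariate_model(target_window, other_window):
    """Hypothetical stand-in for the multivariate model (arbitrary formula)."""
    return mean(target_window) - 5.0 * mean(other_window)

def validation_rmse(target, other, n_validation, feed_observed):
    """feed_observed=True mimics the 2.9 behavior (windows use withheld
    observations); False feeds forecasts back in, as intended."""
    split = len(target) - n_validation
    target_hist, other_hist = target[:split], other[:split]
    squared_errors = []
    for step in range(n_validation):
        target_window = target_hist[-WINDOW_SIZE:]
        other_window = other_hist[-WINDOW_SIZE:]
        target_fc = toy_multivariate_model(target_window, other_window)
        other_fc = toy_univariate_model(other_window)
        squared_errors.append((target_fc - target[split + step]) ** 2)
        if feed_observed:
            target_hist = target_hist + [target[split + step]]
            other_hist = other_hist + [other[split + step]]
        else:
            target_hist = target_hist + [target_fc]
            other_hist = other_hist + [other_fc]
    return math.sqrt(mean(squared_errors))

# Invented data: daily new deaths and fully vaccinated rate.
deaths = [60.0, 58.0, 56.0, 54.0, 52.0, 50.0, 48.0, 46.0, 44.0, 42.0]
vax = [0.30, 0.32, 0.34, 0.36, 0.38, 0.40, 0.42, 0.44, 0.46, 0.48]

rmse_observed_fed = validation_rmse(deaths, vax, 3, feed_observed=True)
rmse_recursive = validation_rmse(deaths, vax, 3, feed_observed=False)
print(rmse_observed_fed, rmse_recursive)
```

On this toy data the observed-fed RMSE is the smaller of the two, because every step is effectively a one-step-ahead forecast rather than a forecast built on earlier forecasts.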
In addition to the underestimated validation RMSE, the forecasted values and forecast RMSE are also off. The reason is that when the tool builds a multivariate forecast model, it should use the observed values of the other variables where they exist, as shown in the left graph below. However, the tool in ArcGIS Pro 2.9 uses the fitted values instead, as shown in the right graph below. Since the fitted values are very close to the observed values, in most cases the incorrect and correct forecasted values will not differ by more than 1 percent.
The fix will be included in the upcoming ArcGIS Pro 2.9.3 patch.
Before the patch is released: