Using Machine Learning to Determine Phenological State Along the Appalachian Trail in Massachusetts

Fall is one of the peak tourist seasons in the Berkshires, as visitors come to see the colorful leaves, and many people time their hikes along the Appalachian Trail to coincide with autumn foliage. This notebook develops models to predict the current phenological (i.e., seasonal) state of forested pixels along the Appalachian Trail in Massachusetts. The code is run locally, but the best models are stored on GitHub, where a GitHub Actions workflow uses them to update a web page of current predictions daily.

This is a land cover change question, but rather than detecting changes between ecosystem types, these models classify where the canopy is within the seasonal greendown curve. Simpler approaches could be built from specific EVI or NDVI thresholds, the percent change relative to summer or winter values, or rates of change across recent observations. However, these methods are sensitive to noisy satellite data. Smoothing can reduce noise but makes it difficult to track the precise start and end of greendown in near-real time, and logistic greendown curves can only be fit robustly after the season has ended. The machine learning models developed here instead predict, from a snapshot of current and recent conditions, what stage of the greendown curve each pixel is currently experiencing.
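
For comparison, the percent-change threshold approach described above could be sketched as follows. This is a minimal illustration, not code from the analysis: the function names, the 0.5 threshold, and the NDVI values are all assumptions chosen to show how a single noisy observation can flip the predicted state.

```python
def greendown_fraction(value, summer_value, winter_value):
    """Fraction of the summer-to-winter decline completed (0 = summer, 1 = winter)."""
    return (summer_value - value) / (summer_value - winter_value)


def threshold_state(value, summer_value, winter_value, threshold=0.5):
    """Label an observation by whether it has crossed a fixed greendown fraction."""
    frac = greendown_fraction(value, summer_value, winter_value)
    return "after threshold" if frac >= threshold else "before threshold"


# Hypothetical NDVI series for one pixel. The fourth value is a noisy dip
# (for example, an undetected cloud) that recovers in the next observation.
series = [0.85, 0.84, 0.83, 0.55, 0.82, 0.70, 0.50, 0.40]
states = [
    threshold_state(v, summer_value=0.85, winter_value=0.35) for v in series
]

for value, state in zip(series, states):
    print(f"{value:.2f} -> {state}")
```

In this made-up series the noisy fourth observation crosses the threshold and the fifth crosses back, which is the kind of instability that motivates a model using more context than a single cutoff.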

This notebook trains and evaluates machine learning models to classify the fall phenological state of forested pixels along the Appalachian Trail in Massachusetts. Each pixel-day observation is assigned one of four labels based on its day of year relative to transition dates from a fitted logistic curve: before (pre-greendown), early (early color change), late (peak to near-complete greendown), or after (post-greendown).
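
The labeling rule can be illustrated with a short sketch. How the transition dates are derived from the logistic fit is not specified here, so the three boundaries below (`start_doy`, `mid_doy`, `end_doy`), the function name, and the example dates are hypothetical.

```python
from bisect import bisect_right

LABELS = ("before", "early", "late", "after")


def phenology_label(doy, start_doy, mid_doy, end_doy):
    """Assign a class from day of year and three ordered transition dates."""
    # bisect_right counts how many transition dates fall on or before `doy`,
    # and that count is the index of the matching label.
    return LABELS[bisect_right((start_doy, mid_doy, end_doy), doy)]


# Hypothetical transition dates for one pixel-year (days of year).
transitions = dict(start_doy=265, mid_doy=285, end_doy=305)
labels = [phenology_label(d, **transitions) for d in (250, 265, 290, 310)]
print(labels)  # ['before', 'early', 'late', 'after']
```

With this convention an observation falling exactly on a transition date belongs to the later class; the opposite convention would use `bisect_left` instead.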

Satellite-derived vegetation indices (EVI and NDVI) from the Harmonized Landsat Sentinel-2 (HLS) dataset are used as input features, along with day length and recent temperature. Two model families are explored:

Decision trees — fast and interpretable, trained on individual observations independently
Recurrent neural networks (LSTM) — trained on temporal sequences of observations per pixel per year, able to exploit within-season context

Multiple variants of each model type are trained to evaluate regularization and training strategies. All models are compared on a held-out test set in the final section.
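
The difference between the two input layouts comes down to grouping: decision trees see each pixel-day row on its own, while the recurrent models see one time-ordered sequence per pixel per year. A minimal sketch of that grouping step, with hypothetical field names (`pixel_id`, `year`, `doy`, `evi`) and placeholder values, might look like this:

```python
from collections import defaultdict


def build_sequences(observations):
    """Group pixel-day records into one time-ordered sequence per (pixel, year)."""
    grouped = defaultdict(list)
    for obs in observations:
        grouped[(obs["pixel_id"], obs["year"])].append(obs)
    # Sort each group by day of year so observations are in temporal order.
    return {
        key: sorted(records, key=lambda r: r["doy"])
        for key, records in grouped.items()
    }


# Hypothetical records; field names and values are placeholders.
observations = [
    {"pixel_id": "a", "year": 2022, "doy": 280, "evi": 0.41},
    {"pixel_id": "a", "year": 2022, "doy": 270, "evi": 0.52},
    {"pixel_id": "a", "year": 2023, "doy": 275, "evi": 0.47},
    {"pixel_id": "b", "year": 2022, "doy": 272, "evi": 0.55},
]
sequences = build_sequences(observations)

for key, records in sequences.items():
    print(key, [r["doy"] for r in records])
```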

The best-performing models are deployed via GitHub Actions to generate daily predictions of phenological state within a 50 m buffer of the Appalachian Trail in Massachusetts. Predictions are available at https://k-wheeler.github.io/phenology/. Satellite and weather data are pulled from Google Earth Engine.

This project was developed by Kathryn Wheeler, Ph.D., with the assistance of Claude Code.

Below is a summary of the analysis decisions. The full analysis is at https://github.com/k-wheeler/AppTrail_Phenology.

Model Comparison

All trained models are loaded from disk and evaluated on the same held-out test set. The summary table is sorted by test accuracy. Metrics shown include training time, plus overall accuracy and per-class precision for both the training and test data.

=== Summary (sorted by test accuracy) ===
                             Type Training Time  Train Accuracy  Test Accuracy
Model                                                                         
RNN Dropout                   RNN      00:05:16          0.9699         0.9539
RNN Dropout 50m           RNN 50m      00:02:56          0.9547         0.9514
RNN Class Balance             RNN      01:57:58          0.9856         0.9491
RNN L1 50m                RNN 50m      00:05:14          0.9323         0.9403
RNN L1                        RNN      00:10:12          0.9403         0.9366
RNN Early Stop                RNN      00:26:44          0.9471         0.9336
RNN Soft Labels               RNN      00:09:58          0.9239         0.9224
DT No Pruning       Decision Tree      00:00:00          1.0000         0.8820
DT Select Features  Decision Tree      00:02:43          0.8731         0.8610
DT Pruned           Decision Tree      00:03:32          0.8727         0.8566

=== Test Precision by Class ===
                    before  early   late  after
Model                                          
RNN Dropout         0.9643 0.6898 0.6541 0.9635
RNN Dropout 50m     0.9495 0.5970 0.7093 0.9782
RNN Class Balance   0.9627 0.5823 0.5509 0.9719
RNN L1 50m          0.9430 0.6543 0.6010 0.9528
RNN L1              0.9416 0.6041 0.5783 0.9561
RNN Early Stop      0.9502 0.5369 0.4912 0.9630
RNN Soft Labels     0.9363 0.0000 0.4925 0.9068
DT No Pruning       0.9906 0.9154 0.7892 0.8376
DT Select Features  0.9918 0.9012 0.7279 0.8359
DT Pruned           0.9918 0.9102 0.7279 0.8105

=== Train Precision by Class ===
                    before  early   late  after
Model                                          
RNN Dropout         0.9720 0.7563 0.8082 0.9831
RNN Dropout 50m     0.9486 0.7005 0.7676 0.9797
RNN Class Balance   0.9821 0.9577 0.8855 0.9983
RNN L1 50m          0.9339 0.6287 0.6737 0.9434
RNN L1              0.9415 0.5839 0.6252 0.9651
RNN Early Stop      0.9529 0.6256 0.6380 0.9783
RNN Soft Labels     0.9321 0.0000 0.7792 0.9142
DT No Pruning       1.0000 1.0000 1.0000 1.0000
DT Select Features  0.9889 0.9114 0.7533 0.8498
DT Pruned           0.9889 0.9222 0.7533 0.8378


Conclusions

Model Performance

These analyses were initially run using only pixels within 50 m of the Appalachian Trail for training and testing, then expanded to 100 m after model performance proved unsatisfactory.

Overall accuracy (the fraction of all observations classified correctly) is a misleading metric here because correctly identifying the transitional classes (early and late greendown) matters more than classifying the more common before and after states. The canopy spends more time in the before and after states than in active senescence (i.e., color change), so the early and late classes are inherently underrepresented in the training data. Per-class precision (the fraction of predictions for a given class that are actually correct) is therefore a more informative metric for evaluating performance on these transitional classes.
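
A small worked example makes the point concrete. The labels below are invented for illustration, and returning 0.0 when a class is never predicted is a convention assumed here, not something taken from the analysis code.

```python
def accuracy(y_true, y_pred):
    """Fraction of all observations classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, label):
    """Fraction of predictions of `label` that are correct.

    Returns 0.0 when `label` is never predicted (an assumed convention).
    """
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted:
        return 0.0
    return sum(t == label for t in predicted) / len(predicted)


# Hypothetical imbalanced sample: 45 before, 5 early, 5 late, 45 after.
y_true = ["before"] * 45 + ["early"] * 5 + ["late"] * 5 + ["after"] * 45
# A classifier that never predicts the transitional classes.
y_pred = ["before"] * 50 + ["after"] * 50

overall = accuracy(y_true, y_pred)
per_class = {c: precision(y_true, y_pred, c) for c in ("before", "early", "late", "after")}
print(overall)    # 0.9
print(per_class)  # {'before': 0.9, 'early': 0.0, 'late': 0.0, 'after': 0.9}
```

A model that ignores early and late greendown entirely still scores 90% accuracy on this sample, while its precision for both transitional classes is zero.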

When trained on 50 m buffer data, the recurrent neural network (RNN, a model that processes observations as a time series, drawing on the history of prior observations within a season) with dropout regularization (a technique that randomly disables a subset of model connections during each training step to prevent the model from over-relying on any single feature) achieved the best overall test accuracy but had poor precision for the early (0.60) and late (0.71) greendown classes. Adding class balancing (oversampling the underrepresented early and late training sequences so all classes are equally represented), early stopping (halting training once validation performance stops improving), or L1 regularization (adding a penalty on the absolute size of model weights to favor simpler solutions) did not improve accuracy or precision for these transitional classes. Expanding the training data to 100 m improved overall accuracy: the 100 m dropout-only RNN achieved the highest test accuracy of any model (0.954, compared with 0.951 for its 50 m counterpart). None of the additional regularization strategies improved upon it, though early stopping did meaningfully reduce training time at the expense of model performance.
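
Class balancing by oversampling can be sketched in a few lines. This is an illustration under stated assumptions rather than the notebook's implementation: it assumes each training sequence carries a single class label, and the function name, seed, and toy data are hypothetical.

```python
import random
from collections import Counter


def oversample(sequences, labels, seed=0):
    """Resample minority classes with replacement until all match the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for seq, label in zip(sequences, labels):
        by_class.setdefault(label, []).append(seq)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw only the shortfall, so every original sequence is kept once.
        extra = rng.choices(items, k=target - len(items))
        balanced.extend((seq, label) for seq in items + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical sequences, each tagged with one class label.
labels = ["before"] * 6 + ["early"] * 2 + ["late"] * 1 + ["after"] * 6
sequences = [f"seq_{i}" for i in range(len(labels))]
balanced = oversample(sequences, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Because minority sequences are duplicated rather than newly observed, a model can fit them very closely during training without that carrying over to unseen data, which is one reason to compare train and test precision for the oversampled classes.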

Among the decision trees (models that classify observations by applying a series of yes/no thresholds on input features), the unpruned full-feature model had the highest training accuracy (1.00) and the highest test accuracy (0.88). However, a training accuracy of 1.0 indicates severe overfitting: the model has memorized the training data rather than learned patterns that generalize to new observations and years. Of the two pruned models, the feature-selected model, which also applied cost-complexity pruning to limit tree size, achieved the better test accuracy (0.861 vs. 0.857), with a similar degree of overfitting to the pruning-only model. Removing the three lowest-importance spectral delta features slightly reduced precision for the early greendown class compared to pruning alone (0.90 vs. 0.91), but probably not enough to warrant retaining them given the additional computational cost (slower training and larger stored data tables). The unpruned full-feature model achieved the highest precision for the early and late greendown classes, but its overfitting makes it unlikely to generalize well to future years.
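
One simple way to quantify the overfitting described here is the gap between training and test accuracy. The snippet below is a quick check, not part of the original analysis; the accuracies are copied from the summary table above.

```python
# (train accuracy, test accuracy) copied from the summary table.
decision_trees = {
    "DT No Pruning": (1.0000, 0.8820),
    "DT Select Features": (0.8731, 0.8610),
    "DT Pruned": (0.8727, 0.8566),
}

# Generalization gap: how much accuracy is lost going from train to test.
gaps = {
    name: round(train - test, 4)
    for name, (train, test) in decision_trees.items()
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {gap:.4f}")
```

The unpruned tree loses about 0.12 in accuracy between training and test data, while both pruned trees lose less than 0.02.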

All decision tree variants had lower overall test accuracy than the RNN models but notably higher precision for the early and late greendown classes. Precision for these transitional classes declined in both model types when the training footprint expanded from 50 m to 100 m. This is consistent with the general observation that decision trees tend to outperform neural network models on small datasets, while the latter benefit more from additional data to learn robust temporal patterns.

For operational greendown state predictions, the feature-selected pruned decision tree and the 100 m dropout-only RNN will be used. Future improvements could include training the RNN on more pixels and incorporating higher-level spectral features such as satellite embeddings.

Efficiency Choices

Smoothed the Appalachian Trail geometry to reduce complexity.
More computationally expensive steps were run locally once and their outputs stored:
- Pulling data from Google Earth Engine, storing as processed data locally, and uploading historical average GeoTIFFs to GitHub.
- Training models locally and then uploading them to GitHub.
GitHub Actions pull the most recent observation values and run the prediction.