Authors
Andrew Zhou¹ and Ivan Revilla², ¹USA, ²California State Polytechnic University, USA
Abstract
Nutrient enrichment of aquatic environments is a prevalent issue with wide-reaching negative implications for ecological stability, tourism and recreation, and vital drinking water supplies. Proper management of nutrient influxes, primarily nitrogen and phosphorus, into aquatic environments is facilitated by continuous monitoring of nutrient levels within water bodies of interest, which offers a more complete understanding of seasonal trends and faster response times than traditional lab testing. However, continuous nutrient monitoring systems are prohibitively expensive, with ongoing energy and maintenance requirements that limit deployment. Machine learning shows potential for virtual sensor development, enabling real-time nutrient prediction from continuously monitored surrogate indicators. In this study, we test the feasibility of this premise by evaluating the performance of Random Forest regression (RF), k-Nearest Neighbors (kNN), Support Vector Machine regression (SVM), Decision Tree regression (DT), Artificial Neural Network (ANN), Gradient Boosting Regressor (GBR), and Histogram Gradient Boosting Regressor (HGBR) models on one year of water quality testing data from sites across the Continental United States (CONUS). To address values missing not at random (MNAR), an issue prevalent in water quality testing data, important surrogate indicators are identified by permutation importance. Models are then trained and tuned with Bayesian Optimization to identify hyperparameters optimal for explaining target variance. Across both phosphorus and nitrogen prediction, RF achieved the highest validation performance, with GBR and HGBR trailing marginally. Ensemble tree models appear to be well suited to continuous nutrient monitoring and may offer a cost-efficient way to greatly supplement the existing high-frequency testing network.
Keywords
Freshwater, Total Phosphorus, Total Nitrogen, Ensemble Tree, Bayesian Optimization