BIDS Senior Fellow Laurel Larsen and ESDL Project Scientist Dino Bellugi are offering this project (#4) through UC Berkeley's Undergraduate Research Apprentice Program (URAP) for the Fall 2021 academic semester.
The Environmental Systems Dynamics Laboratory (ESDL) focuses on the interplay between biological, physical, and human aspects of the environment using a combination of physically-based and data-driven models. Research topics include how river deltas grow or shrink, how landslides occur and mobilize, how deforestation affects precipitation, and how to forecast the response of environmental systems under changing forcing scenarios. This internship aims to expand on our current work exploring the use of deep learning (DL) for environmental predictions.
DL methods often outperform other models (including physical ones) in making environmental predictions, but they are typically used as a “black box”, which limits our ability to gain insight into the physical processes involved. For example, Long Short-Term Memory (LSTM) networks are highly effective at predicting river streamflow, even in snow-dominated watersheds, because they can capture the lags between the forcing and response variables. Unlike a physical model, the LSTM does not know that winter precipitation falls as snow and does not become streamflow until the melt season. Yet it learns from data that the system has a memory, and in many cases it can generate accurate streamflow predictions from precipitation and temperature time series. In such cases, the LSTM state variables indeed track observed snow measurements, even though these were never provided to the LSTM as input variables. This suggests that the internal states of a trained LSTM represent hydrologic processes that control streamflow, and that they can be identified by their correspondence to independent, collocated observational datasets that the model has not seen. Thus, analyzing the LSTM state variables could provide insight into how the response may change under different climatic regimes, and could make it possible to approximate basin-wide variables that are not measured in many watersheds.
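As a rough illustration of how an internal state might be matched to an independent observation, the sketch below correlates each state unit's time series with an observed series and reports the closest match. The function names (`pearson`, `best_matching_state`) and the toy numbers are hypothetical and for illustration only; they are not model output or real measurements, and correlation is just one simple way to compare the two series.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series cannot be correlated with anything
    return cov / math.sqrt(var_x * var_y)

def best_matching_state(cell_states, observed):
    """Return (unit_index, correlation) for the state unit that best tracks `observed`.

    cell_states: list of time steps, each a list with one value per LSTM unit.
    observed: independent observations (e.g. snow measurements), one per time step.
    """
    n_units = len(cell_states[0])
    scores = []
    for unit in range(n_units):
        series = [step[unit] for step in cell_states]
        scores.append(pearson(series, observed))
    # Rank by absolute value: a state may track the variable with either sign.
    best = max(range(n_units), key=lambda u: abs(scores[u]))
    return best, scores[best]

# Made-up toy values (assumption): 5 time steps, 2 state units.
toy_states = [[0.1, 0.0], [0.3, 0.2], [0.2, 0.5], [0.4, 0.9], [0.1, 0.4]]
toy_snow = [0.0, 10.0, 25.0, 45.0, 20.0]

unit, r = best_matching_state(toy_states, toy_snow)
print(f"state unit {unit} tracks the observed series with r = {r:.3f}")
```

In a real analysis the states would be extracted from a trained network, and the comparison would be made against observations withheld from training.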
In addition, we seek to introduce physical constraints (such as water balance) into the LSTM by modifying the loss function used in optimization and/or by including process-based model outputs among the input variables. We expect this to improve streamflow prediction, particularly in non-stationary conditions where out-of-sample data are more frequent, and to support more robust generalization to other watersheds where measurements are sparser.
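One way to picture a loss function modified by a water-balance constraint is a data-fit term plus a penalty that grows when the predictions violate mass conservation. The sketch below is a deliberately simplified assumption, not the project's formulation: it ignores evapotranspiration and storage carried over between periods, expresses all quantities in the same depth units (e.g. mm per time step), and penalizes only the case where total predicted streamflow exceeds total precipitation. The names `constrained_loss` and `weight` are hypothetical. It is written in plain Python for readability; in training, the same expression would be written with differentiable tensor operations.

```python
def mse(pred, obs):
    """Mean squared error between predicted and observed streamflow."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred)

def water_balance_excess(pred_flow, precip):
    """Depth by which total predicted flow exceeds total precipitation (0 if it does not)."""
    return max(0.0, sum(pred_flow) - sum(precip))

def constrained_loss(pred_flow, obs_flow, precip, weight=0.1):
    """Data-fit term plus a quadratic penalty on water-balance violations.

    `weight` (assumed value) trades off fit against the physical constraint.
    """
    penalty = water_balance_excess(pred_flow, precip) ** 2
    return mse(pred_flow, obs_flow) + weight * penalty

# Made-up toy values (assumption), all in mm per time step.
precip = [5.0, 0.0, 3.0, 0.0]          # total input: 8 mm
obs_flow = [1.0, 2.0, 1.0, 2.0]
plausible = [1.0, 2.0, 2.0, 2.0]       # total 7 mm, within the water budget
implausible = [3.0, 3.0, 2.0, 2.0]     # total 10 mm, more water out than in

loss_plausible = constrained_loss(plausible, obs_flow, precip)
loss_implausible = constrained_loss(implausible, obs_flow, precip)
print(loss_plausible, loss_implausible)
```

The penalty is zero whenever the predictions respect the budget, so it only changes the optimization when the model would otherwise produce more water than the watershed received.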
Similar applications include the prediction of soil moisture, evapotranspiration, solute concentration, and subsurface pore water pressure. In addition to generating good predictions, we would like to learn how the response times of these systems change across time and space. We also want to explore how transferable DL methods are across different landscapes or climate gradients, as transferability is essential in developing larger-scale models that can be trained concurrently on many different watersheds in different climatic and topographic settings. Finally, we want to explore how introducing physical constraints, through physics-based loss functions and hybrid data-driven and process-based models, can aid generalization and performance in non-stationary conditions.
The student intern will work with a variety of time series data from intensely monitored Critical Zone observatories, as well as from state and national discharge and precipitation datasets. The student will work collaboratively to develop DL models and to interpret the LSTM state variables and their relative importance. Example tasks involved in this project:
- Experiment with diverse LSTM model architectures
- Parallelize code to tune hyperparameters of the LSTM model at large scale using UC Berkeley high-performance computing clusters
- Implement physical constraints in the LSTM model structures
- Apply transfer learning to LSTM models
- Analyze the importance of physical inputs in the LSTM using methods such as layer-wise relevance propagation
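To give a flavor of the hyperparameter-tuning task, the sketch below enumerates a search grid and assigns one configuration to each parallel job. Everything here is an assumption for illustration: the hyperparameter names and values are placeholders, and the use of the `SLURM_ARRAY_TASK_ID` environment variable presumes a cluster that runs Slurm job arrays, which this posting does not specify.

```python
import itertools
import os

# Hypothetical search space; names and values are placeholders.
SEARCH_SPACE = {
    "hidden_size": [32, 64, 128],
    "learning_rate": [1e-3, 1e-4],
    "sequence_length": [90, 365],
}

def enumerate_configs(space):
    """Return every combination in the grid as a list of dicts, in a stable order."""
    keys = sorted(space)  # sort keys so every job sees the same ordering
    value_lists = [space[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]

def config_for_task(space, task_id):
    """Select the configuration assigned to one array task (0-based index)."""
    configs = enumerate_configs(space)
    if not 0 <= task_id < len(configs):
        raise IndexError(f"task_id {task_id} outside grid of {len(configs)} configurations")
    return configs[task_id]

if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID inside array jobs; fall back to 0 when run locally.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(config_for_task(SEARCH_SPACE, task_id))
```

Because each job derives its configuration from its own index, the jobs need no communication with one another, and the results can be collected afterwards by index.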
Learning outcomes include:
- Mastering how to train, calibrate, and optimize deep neural network models
- Learning how to use artificial intelligence to understand physical processes and improve environmental science
- Achieving an improved understanding of environmental systems and hydrology in particular
- Improving data processing skills, including time series analyses
- Becoming familiar with major issues in environmental forecasting and the underlying science
- Gaining hands-on experience with data-driven approaches to catchment hydrology
Day-to-day supervisors for this project: Dino Bellugi (Staff Researcher) and Liang Zhang (Graduate Student)
Qualifications: This project will be of interest to students in Computer Science, Data Science, and Statistics (though students from other majors are also welcome to apply) who have an interest in applying their Machine Learning (ML) experience to the domains of Earth Science, Geography, and Civil and Environmental Engineering.
Required: Students should have programming capabilities in Python, and in PyTorch in particular. Familiarity with MATLAB and with other ML libraries, such as Scikit-Learn, Keras, and the MATLAB Statistics and Machine Learning Toolbox, is desirable. Students should demonstrate a strong ML background, highlighting courses they have taken and applications they have developed. Students should be willing to work as a member of a research team and have strong communication skills.
Off-Campus Research Site: Predominantly remote, pending campus re-opening progress.
Related website: http://esdlberkeley.com.