Author(s): Isaac Amouzou
Mentor(s): Ben Seiyon Lee, Department of Statistics
Abstract
What are DBPs?
Disinfection byproducts (DBPs) are chemical compounds that form when disinfectants in water react with natural organic matter.
Chronic DBP exposure can cause significant negative health effects, such as bladder cancer, colon cancer, and pregnancy complications.
A person can be exposed to DBPs through heavily disinfected water sources, for example where chlorine has reacted with organic matter in the water. To ensure drinking water safety, it is imperative that DBP levels in public water sources be properly monitored.
Unfortunately, DBP exposure is difficult to measure directly. Instead, epidemiologists have used trihalomethanes (THMs), which are easier to measure, as a surrogate for DBP exposure, on the belief that THM concentrations are proportional to the concentrations of other DBP classes.
A previous study examined the link between THMs and another DBP class, haloacetonitriles (HANs), using over 9500 measurements from 248 public water systems. It found that THMs could explain only 30% of the variance in HAN concentrations.
For the project, we wanted to create a statistical framework to model the concentrations of hard-to-measure unregulated DBPs that drive toxicity using a wide array of co-occurring DBPs.
We also wanted to take into account the public water system (PWS) of origin for the measurements.
The data come from the Information Collection Request database of the Environmental Protection Agency (EPA). The dataset has more than 13,000 measurements from 295 public water systems.
For modeling, we used linear mixed models (LMMs). LMMs are useful for data with high variability between groups; in this case, they let us account for the variability between public water systems.
LMMs allow estimation of the fixed effects (e.g., DBP concentrations or categorical variables) that can be measured, while accounting for the variability among groups (e.g., public water systems) with random effects.
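To make the fixed-effect and random-effect split concrete, here is one common form of an LMM, a random-intercept model. This is only a sketch: the talk does not state the exact random-effects structure, so the single per-system intercept and all of the symbols below are assumptions for illustration.

```latex
y_{ij} = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ijk} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Here y_{ij} would be the concentration of the target DBP in measurement i from public water system j, the x_{ijk} are the measured predictors (co-occurring DBPs and categorical variables) with fixed-effect coefficients \beta_k, u_j is the random effect shared by all measurements from system j, and \varepsilon_{ij} is the residual error.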
For variable selection, LASSO regression was used, which allows us to select important variables using a penalization approach with a tuning parameter, lambda.
As you can see in the top graph here, as we increase lambda the coefficients shrink toward 0.
In the bottom graph you can see how we select lambda: we run LASSO regression with multiple values of lambda and measure the mean squared error, a metric used to evaluate prediction accuracy.
We select all DBPs that have nonzero coefficients.
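The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: it uses synthetic data rather than the EPA measurements, a hand-written coordinate-descent solver, and a single train/validation split to score each lambda (the talk does not say how the mean squared error was estimated, and cross-validation is the more usual choice). The names `lasso_cd`, `soft_threshold`, and the lambda grid are all hypothetical.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    No intercept is fitted, so this assumes roughly centered data.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def mse(X, y, beta):
    """Mean squared prediction error of the linear model with coefficients beta."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Synthetic stand-in data (NOT the study's data): 5 predictors, only the first two matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Fit once per candidate lambda and keep the one with the lowest validation MSE.
lambdas = [1.0, 0.3, 0.1, 0.03, 0.01]
val_errors = {lam: mse(X_val, y_val, lasso_cd(X_train, y_train, lam)) for lam in lambdas}
best_lam = min(val_errors, key=val_errors.get)

# Keep every predictor whose coefficient is nonzero at the chosen lambda.
best_beta = lasso_cd(X_train, y_train, best_lam)
selected = [j for j, b in enumerate(best_beta) if b != 0.0]
print("best lambda:", best_lam, "selected predictors:", selected)
```

With a very large lambda every coefficient is driven to exactly zero, which is the behavior the top graph shows; the predictors that survive at the chosen lambda play the role of the selected DBPs.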
Multiple model structures were tested against the selected model structure (the full LMM).
The metrics used to evaluate the models included AIC and BIC, which compare goodness of fit while penalizing model complexity (lower is better).
We also used conditional R-squared, which measures the proportion of variance in the response explained by the fixed and random effects together (higher is better).
Finally, we used RMSE, which compares prediction accuracy (lower is better).
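For reference, these metrics can be written out as small Python functions. This is a sketch using the standard textbook definitions; the conditional R-squared follows the common Nakagawa-Schielzeth variance-component form, which is an assumption about what the project used, and all function and argument names are hypothetical.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*logL, with k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL, with n observations."""
    return k * math.log(n) - 2 * log_lik


def conditional_r2(var_fixed, var_random, var_resid):
    """Share of total variance explained by fixed and random effects together."""
    return (var_fixed + var_random) / (var_fixed + var_random + var_resid)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


# Illustrative numbers only, not results from the study.
print(aic(-100.0, 5))                 # 210.0
print(conditional_r2(2.0, 1.0, 1.0))  # 0.75
```

BIC penalizes each extra parameter by ln(n) instead of 2, so on a dataset of this size it favors smaller models than AIC does.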
We found that for four different DBP classes (haloacetic acids-5, haloacetic acids-6, haloacetonitriles, and haloketones), we can model a significant amount of the variance with a unique group of DBPs when we take into account the water system and key categorical variables.
In the upcoming summer, we plan to conduct an in-depth missing-data analysis and prepare new models for better estimation of the concentrations, using the information gained from this model. We also plan to develop a methodology to classify at-risk water treatment systems.
These are my works cited. Thank you for listening to my presentation.
2 replies on “Modeling the relationship between regulated and unregulated disinfectant-by-products (DBPs) and other difficult-to-measure DBP classes”
Great job explaining your methods and the importance of your work. I look forward to your results from the summer.
Thank you, Isaac. I think you did a great job with your presentation and are tackling very interesting research. I do hope you continue with new methodology developments. Well done!