A Machine-Learning Approach for Studying Disinfectant By-Products (DBPs) – OSCAR Celebration of Student Scholarship and Impact

Author(s): Isaac Amouzou

Mentor(s): Ben Seiyon Lee, Department of Statistics

Abstract

Disinfectant byproducts (DBPs) are chemical compounds that form when disinfectants in water (e.g., chlorine) react with natural organic matter. Chronic DBP exposure can cause significant negative health effects, such as bladder cancer, colon cancer, and pregnancy complications. DBP exposure is difficult to measure, so a surrogate is needed to model it. In the previous spring, we developed statistical models that could accurately model the relationship between specific DBPs (selected using a feature selection method called lasso regression), categorical variables, and toxic DBP classes when taking into account the public water system of origin. This project developed a classification framework for detecting at-risk public water systems using a machine learning method known as feed-forward neural networks. We also created an interactive tool that allows anyone, regardless of prior experience, to explore the data used for the project, generate predictions for the DBP classes using the previously developed models, and evaluate those models using either ICR data (over 13,000 measurements from 295 public water systems) or data supplied by the user.

Audio Transcript

Hello, I am Isaac Amouzou, and this is my project: a machine learning approach for studying disinfectant byproducts, or DBPs. Disinfectant byproducts are chemical compounds that form when disinfectants in water react with natural organic matter. Some negative health effects that can come from chronic disinfectant byproduct exposure are bladder cancer, colon cancer, and pregnancy complications. Highly disinfected water sources can lead to DBP exposure. To ensure drinking water safety, it is imperative that DBP levels in public water sources be properly monitored.

DBP exposure is difficult to measure directly, so surrogates have been used. For example, trihalomethanes, or THMs, are easier to measure, and it was believed that THM concentrations are proportional to concentrations of other DBP classes. A past study examined the link between THMs and a DBP class, haloacetonitriles, or HANs. The data consisted of over 9,500 measurements from 248 public water systems. The result was that THMs could only explain 30% of the variance in HAN concentrations.

Previously, in spring, we examined four DBP classes using statistical models. For this current project, our goals were to create a framework to classify at-risk public water systems given a DBP threshold using machine learning methods, and to create a publicly available interactive tool to explore our data, generate and download predictions, and evaluate model performance. Our data is from the Information Collection Request (ICR) database from the Environmental Protection Agency. The data set contains over 13,000 measurements from 295 public water systems.

This is our framework for classifying at-risk public water systems. For the classification we use a neural network model. In this section we prepare the data to be processed and passed into the neural network for classification. Here we created a threshold to classify at-risk water systems and to prepare the DBP target class, HAA9. The specific threshold we used here was 72 micrograms per liter, and we went 1.5 standard deviations below that limit, so any public water system that is above 72 minus one and a half standard deviations will be classified as at risk, and any below will be classified as safe. Here we split the data into training and validation sets. This is done because when we train our neural network model, we want to train with the specific training set and then validate only with data that hasn’t been seen before.
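As a rough illustration of the thresholding and splitting step described above, here is a minimal Python sketch. It is not the project's code. It assumes the standard deviation is the sample standard deviation of the HAA9 measurements themselves, and the 80/20 split, the function names, and the example concentrations are all made up for illustration.

```python
import random
import statistics

def label_at_risk(concentrations, limit=72.0, k=1.5):
    """Label each value 1 (at risk) if it exceeds limit - k * SD, else 0.

    Assumption: SD is the sample standard deviation of the values passed in.
    """
    sd = statistics.stdev(concentrations)
    cutoff = limit - k * sd
    labels = [1 if c > cutoff else 0 for c in concentrations]
    return labels, cutoff

def train_val_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical HAA9 concentrations in micrograms per liter.
haa9 = [10.0, 25.0, 40.0, 55.0, 70.0, 85.0]
labels, cutoff = label_at_risk(haa9)
train, val = train_val_split(list(zip(haa9, labels)))
```

The validation rows are never used to fit the model, which is what makes the validation score an estimate of performance on unseen data.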

Here we prepare our data loader and batch the data. Batching is essential for stochastic gradient descent, or SGD, because the model updates its parameters after processing each batch instead of updating after processing the entire data set. This introduces randomness in the update process and can lead to faster convergence and better generalization.
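To make the idea of per-batch updates concrete, here is a small sketch of mini-batch SGD in plain Python. It fits a straight line rather than a neural network, purely to keep the example short. The batch size, learning rate, epoch count, and toy data are assumptions chosen for illustration, not the project's settings.

```python
import random

def iterate_batches(data, batch_size, rng):
    """Yield the data in shuffled mini-batches (one full pass = one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def sgd_fit_line(data, epochs=200, batch_size=4, lr=0.05, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in iterate_batches(data, batch_size, rng):
            # Gradients are averaged over the current batch only,
            # and the parameters are updated right away.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noise-free data generated from y = 3x + 1.
data = [(i / 10, 3.0 * (i / 10) + 1.0) for i in range(20)]
w, b = sgd_fit_line(data)
```

Because each epoch reshuffles the data, the sequence of updates differs from epoch to epoch, which is the source of the randomness mentioned above.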

This is the structure of a feed-forward neural network. A feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. Each layer in the neural network consists of neurons, which are nodes that take input and pass output. A neuron performs a computation with weights and biases before passing the result through an activation function, which is a function that adds non-linearity to a neural network and helps the model learn complex relationships. Some common activation functions are ReLU, tanh, and softmax. Here is the training and validation procedure for our neural network.

Here’s our neural network structure. We have 326 inputs and two outputs, where the two outputs are either at risk, which is represented by one, meaning above the threshold, or zero, which means not at risk, or below the threshold. We have one hidden layer of 164 neurons, and we have selected ReLU as our activation function.
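The layer sizes above can be sketched as a forward pass in plain Python. Only the shape (326 inputs, 164 hidden neurons with ReLU, 2 outputs) comes from the presentation. The random weight initialization, the helper names, and the random input vector are assumptions, and the network here is untrained.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_OUTPUTS = 326, 164, 2

def init_layer(n_in, n_out, rng):
    """Create a weight matrix (n_out rows of n_in values) and a zero bias vector."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(x, weights, biases):
    """Weighted sum plus bias for every neuron in the layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(values):
    """ReLU activation: negative values become zero."""
    return [max(0.0, v) for v in values]

def forward(x, hidden, output):
    """Input -> hidden layer with ReLU -> two raw output scores."""
    h = relu(linear(x, *hidden))
    return linear(h, *output)

rng = random.Random(0)
hidden = init_layer(N_INPUTS, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUTS, rng)

x = [rng.random() for _ in range(N_INPUTS)]  # one hypothetical feature vector
scores = forward(x, hidden, output)
predicted_class = max(range(N_OUTPUTS), key=lambda k: scores[k])  # 1 = at risk, 0 = not at risk

# Weights and biases of both layers.
n_params = N_INPUTS * N_HIDDEN + N_HIDDEN + N_HIDDEN * N_OUTPUTS + N_OUTPUTS
```

With these sizes the network has 53,958 trainable parameters, most of them in the input-to-hidden weights.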

Here are the metrics we will use. We have precision, which is sometimes known as positive predictive value. It measures how well a model classifies positives: of the public water systems the model labels as at risk, how many really are at risk. We also have recall, which is sometimes known as true positive rate. It measures, of all the cases that really are positive, or at-risk public water systems, how many the model correctly classifies as positive.
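The two definitions above, together with accuracy, can be computed from true and predicted labels as in the sketch below. The function name and the example labels are invented for illustration and are not the project's data.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels, where 1 = at risk."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # at risk, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe, but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # at risk, but missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe, not flagged

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Hypothetical labels for ten water systems.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
```

For a screening task like this one, recall is often the metric to watch, since a missed at-risk system (a false negative) is usually the more costly mistake.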

Here we have an epoch, which is a single pass through the entire training data set, and in the process of training this model we use 18 epochs. This is the final evaluation section, and here we have a recall score of 0.72, or 72 percent, a precision score of 0.79, or 79 percent, and an accuracy score of 0.83, or 83 percent, where accuracy just checks how often the model was correct at classification.

This interactive tool was made using R Shiny. The first panel shows a map of the public water systems being used. Here, the All tab denotes all public water systems for each DBP class. When hovering over a specific marker, you will be able to see the average concentrations of these DBP classes, the public water system ID, and the public water system name. When you click on a marker, you’ll be presented with a link to the EPA water system report on the specific public water system as well as the ICR water system report.

When you select a DBP from the drop-down, you’ll be shown all public water systems that reported that specific DBP level, and it will denote how close they are to a risk threshold, with red being above the risk threshold, yellow being within one standard deviation of the risk threshold, and green being more than one standard deviation away from the risk threshold.
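The color rule described above could look something like the following sketch. The tool itself was built in R Shiny, so this Python function is only an illustration of the logic. It assumes that "within one standard deviation" means at or below the threshold but no more than one standard deviation under it, and that green means more than one standard deviation below.

```python
def risk_color(concentration, threshold, sd):
    """Map an average concentration to a marker color relative to a risk threshold."""
    if concentration > threshold:
        return "red"     # above the risk threshold
    if concentration >= threshold - sd:
        return "yellow"  # within one standard deviation below the threshold
    return "green"       # more than one standard deviation below the threshold

# Hypothetical values: threshold of 72 and a standard deviation of 10.
colors = [risk_color(c, threshold=72.0, sd=10.0) for c in (80.0, 65.0, 50.0)]
```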

On the Predict panel, if you have data that you would like to generate predictions for, you can upload it as a CSV file, like so, and the dashboard will automatically remove any rows that have missing DBP or categorical values. You can select which model you would like to predict with as well as which DBP class you would like to predict, and generate your predictions. Here you can see the predicted value of the DBP class HKs, or haloketones, and the actual value, although to generate predictions you do not need the actual value. If you want to download the data, you can simply click this button and download it as well.

On the Evaluation panel, similarly, you will also have to select a model that you wish to evaluate as well as which class you would like to evaluate the model on, and again I’ll select haloketones. You have two options here. If you have data that you wish to test the model on yourself, you can input the data in a similar process as before, and it will also remove any rows with missing values. This time you actually will need the actual class value in order to evaluate how well the model performs. You’ll still have to click the Evaluate button, and you will get these metrics that appear. The average prediction error is the average difference between the predicted DBP and the actual DBP. The average percentage prediction error is the average percentage difference between the predictions and the actual DBP. MSE, or mean squared error, assesses the average squared difference between the observed and predicted values. RMSE is the square root of MSE, so it’s called root mean squared error, and while MSE is measured in units that are the square of the target variable’s units, RMSE is measured in the same units as the target variable. Here, adjusted R squared is the proportion of the variance in the DBP class which the model can explain. I could also evaluate a linear mixed model as well.

If I wanted to use the ICR data, I can input a seed, which allows for random selection and reproducibility of results. I can also input the training proportion, which is how much of the data set the model will use for training, so in this case 0.7, or 70% of the data, will be used for training, and I can evaluate here.
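The evaluation metrics listed for this panel can be written out as in the sketch below. It is not the dashboard's code. It assumes the average prediction error is the signed mean of predicted minus actual (the presentation does not say whether it is signed or absolute), that actual values are nonzero so the percentage error is defined, and that adjusted R squared uses the usual correction for the number of predictors. The function name and example numbers are invented.

```python
import math

def regression_metrics(actual, predicted, n_predictors):
    """Compute the evaluation metrics for one DBP class (illustrative only)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    avg_error = sum(errors) / n                                       # signed mean error
    avg_pct_error = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    mse = sum(e * e for e in errors) / n                              # squared units
    rmse = math.sqrt(mse)                                             # same units as target

    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

    return {"avg_error": avg_error, "avg_pct_error": avg_pct_error,
            "mse": mse, "rmse": rmse, "adj_r2": adj_r2}

# Hypothetical actual and predicted concentrations in micrograms per liter.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 19.0, 33.0, 38.0]
metrics = regression_metrics(actual, predicted, n_predictors=1)
```

Reporting RMSE alongside MSE is convenient because RMSE can be read directly in micrograms per liter.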

The metrics are going to be the same, but instead they are calculated on a test set, which is kept separate from the training set. Here we again have the test average prediction error, which is the average prediction error on the test set, the average percentage prediction error on the test set as well, and MSE and RMSE on the test set too. The new metrics we have here are the conditional R squared, which is the proportion of the variance in the DBP class, in this case HKs, or haloketones, that the whole model can explain, and the marginal R squared, which is the proportion that the fixed effects, or variables without taking into account public water system levels, can explain. This is an interactive tool that you can use to explore the data as well as predict and evaluate using the models.

Using neural networks, we have created a framework to classify water systems that fall above or below a risk threshold. We have created an interactive tool that allows users to examine the data that was used, and to predict with and evaluate the models. We plan to continue to develop and improve our classification framework and interactive tool. We also plan to implement imputation to take into account the missing records for certain DBP classes.

And these are the citations. Thank you for listening to my presentation.

2 replies on “A Machine-Learning Approach for Studying Disinfectant By-Products (DBPs)”

It’s great to start off the video with an introduction to what the DBPs are; it’s helpful to revisit along the way of the presentation, and it’s good that you have your citations available. Wow! I don’t think I have ever thought about DBPs and their impact on the public and community; working in healthcare, we use these products all the time. I was quite intrigued by the data you found being mainly located on the East Coast. The work you did to calculate the amount of the DBP in specific locations also looks cool; I’m sure it took a while to input all that data and find an appropriate system to categorize it. I like the friendly user experience of the map that was created to see which areas contained what amounts of the byproduct, the hovering option, and the color coding for the potential hazards of that public water source. Awesome job man!

Great job, Isaac. What a useful tool. What is the next step? Will you be testing the tool on actual bodies of water? Will you make it available?

