Estimating Uncertainty in Machine Learning Models — Part 1
Dhruv Nair is a Data Scientist.
"We demand rigidly defined areas of doubt and uncertainty!"
Douglas Adams, The Hitchhiker's Guide to the Galaxy
Why is uncertainty important?
Imagine for a second that we are building a computer vision model for a construction company, ABC Construction. The company is interested in automation and would like our program to run on their drones.
We deploy our algorithm onto their drones and think the project is a great success. Then ABC Construction tells us that the drones keep crashing into the white trucks parked at their sites. We rush to one of the sites to examine the vision model, and realize that it has mistaken the side of a white truck for the bright sky. The drones are flying straight into the trucks, thinking there is nothing there.
When making predictions about data in the real world, it is a good idea to include an estimate of how sure your model is about its predictions. This is even more true when models are required to make decisions that have real consequences. In applications such as self-driving cars, measures of uncertainty can help prevent accidents.
Sources of uncertainty
There are two types of uncertainty when modeling any process.
Aleatoric uncertainty: this is the uncertainty inherent in the process we are trying to explain. For example, a ping pong ball dropped from the same spot above a table will land in a slightly different place every time. In practice, uncertainty in this category is irreducible.
Epistemic uncertainty: this is the uncertainty attributed to inadequate knowledge of the model best suited to explain the data. It is reducible given more knowledge about the problem at hand, for example by gathering more data or adding more parameters to the model.
So how do we estimate uncertainty?
A bakery is trying to estimate the number of cakes it will sell in a given month based on the number of customers that come in. We will use a simple linear regression model to model this problem. We will use the available data to estimate the different types of uncertainty in this model.
Equation for Linear Regression
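For reference, a simple linear regression model is commonly written as below. The symbols are notation assumed here, not taken from the original figure: y is the number of cakes sold, x is the number of customers, \beta_0 and \beta_1 are the intercept and coefficient, and \varepsilon is the noise term. The normality of the noise is the assumption made later in this article.

```latex
y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```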
It is unlikely that we will ever determine the true parameters of the model from the sample data because the coefficients are subject to sampling uncertainty. Providing an estimate of the set of possible values for these coefficients will tell us how appropriately our current model is able to explain the data.
Let's generate some data. We are going to sample our x values from a scaled and shifted unit normal distribution. Our y values are just perturbations of these x values.
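import numpy as np
from numpy.random import randn, randint
from numpy.random import seed

# random number seed (the value 1 is an assumption; any fixed integer works)
seed(1)

# number of samples
size = 100

# x: scaled and shifted standard normal samples
x = 20 * (2.5 + randn(size))
# y: x plus Gaussian noise and a constant offset
y = x + (10 * randn(size) + 50)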
Our data ends up looking like this.
Distribution of Sample Data
We will use bootstrap sampling to estimate the uncertainty in our model parameters. Bootstrap sampling builds new datasets by sampling with replacement from the original dataset. It generates variants of our dataset and can give us an idea of the range of parameters that could describe the data.
In the code below, we run 1000 iterations of bootstrap sampling, fit a linear regression model to each sample dataset, and log the coefficients and intercepts of the model at every iteration.
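from sklearn.utils import resample
from sklearn.linear_model import LinearRegression

coefficients = []
intercepts = []
for _ in range(1000):
    # draw row indices with replacement
    idx = randint(0, size, size)
    # scikit-learn expects a 2-D feature array
    x_train = x[idx].reshape(-1, 1)
    y_train = y[idx]

    model = LinearRegression().fit(x_train, y_train)

    coefficients.append(model.coef_.item())
    intercepts.append(model.intercept_)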
Once we have the logged coefficients and intercepts, we take their 97.5th and 2.5th percentiles to get the 95% confidence interval for the coefficients and intercepts. The advantage of using percentiles to determine the interval is that you don't have to make assumptions about the sampling distribution of the coefficients.
upper_coefficient = np.percentile(coefficients, 97.5)
upper_intercept = np.percentile(intercepts, 97.5)

lower_coefficient = np.percentile(coefficients, 2.5)
lower_intercept = np.percentile(intercepts, 2.5)
We can now use these coefficients to plot the 95% confidence interval for a family of curves that can describe the data.
Confidence Interval for Model Parameters
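As a rough sketch of how such a band could be computed, the example below fits a line to each bootstrap resample and then takes pointwise percentiles across all fitted lines. This is a variation on the approach described above, which takes percentiles of the coefficients and intercepts separately. The data is assumed synthetic data with a noisy linear relationship, np.polyfit stands in for the regression model, and the names bootstrap_lines, x_grid, lower_band and upper_band are hypothetical.

```python
import numpy as np

# Assumed synthetic data: a noisy linear relationship between x and y.
rng = np.random.default_rng(1)
size = 100
x = 20 * (2.5 + rng.standard_normal(size))
y = x + (10 * rng.standard_normal(size) + 50)


def bootstrap_lines(x, y, n_iterations=1000, rng=rng):
    """Fit a straight line to each bootstrap resample; return (slopes, intercepts)."""
    slopes, intercepts = [], []
    n = len(x)
    for _ in range(n_iterations):
        idx = rng.integers(0, n, n)  # sample row indices with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        slopes.append(slope)
        intercepts.append(intercept)
    return np.array(slopes), np.array(intercepts)


slopes, intercepts = bootstrap_lines(x, y)

# Evaluate every bootstrap line on a grid of x values,
# then take the 2.5th and 97.5th percentiles at each grid point.
x_grid = np.linspace(x.min(), x.max(), 50)
curves = slopes[:, None] * x_grid[None, :] + intercepts[:, None]
lower_band = np.percentile(curves, 2.5, axis=0)
upper_band = np.percentile(curves, 97.5, axis=0)
```

Plotting lower_band and upper_band against x_grid gives a shaded region that contains roughly 95% of the bootstrap lines at each value of x.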
Now let's estimate the uncertainty in the model predictions. Our linear regression model predicts the mean number of cakes sold given the number of customers in the store. We expect different values of x to produce different mean responses in y, and we are going to assume that for a fixed x, the response y is normally distributed.
Based on this assumption, we can use the residuals from our predictions to approximate the variance in y conditioned on x. With this variance in hand, we can calculate the standard error of the mean response and use it to build the confidence interval of the mean response. This is a measure of how well we are approximating the true mean response of y. The smaller we can make this value, the better.
Confidence Intervals of the Mean from the Standard Error of the Model
The variance of the conditional mean depends on the variance in the coefficient and intercept. The standard error is the square root of this variance. Because the standard error of the conditional mean grows with the distance of x from the mean of x, the interval gets narrower as x approaches its mean value.
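A minimal sketch of this calculation, using the textbook standard error formula for simple linear regression, might look like the following. The function name mean_response_interval and the synthetic data are hypothetical, and the code uses a normal quantile in place of the Student-t quantile, which is an approximation that is reasonable only for larger samples.

```python
import numpy as np
from statistics import NormalDist


def mean_response_interval(x, y, x_new, confidence=0.95):
    """Confidence interval for the mean of y at x_new (simple linear regression)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n = len(x)

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)

    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    s2 = np.sum(residuals ** 2) / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)

    # Standard error of the mean response; it grows as x_new moves away from mean(x).
    se_mean = np.sqrt(s2 * (1.0 / n + (x_new - x.mean()) ** 2 / sxx))

    # Assumption: normal quantile instead of the Student-t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = slope * x_new + intercept
    return y_hat - z * se_mean, y_hat + z * se_mean


# Assumed synthetic data: a noisy linear relationship between x and y.
rng = np.random.default_rng(1)
x = 20 * (2.5 + rng.standard_normal(100))
y = x + (10 * rng.standard_normal(100) + 50)

# Compare the interval at the mean of x with the interval at the largest x.
x_query = np.array([x.mean(), x.max()])
lower, upper = mean_response_interval(x, y, x_query)
widths = upper - lower
```

Here widths[0], the interval width at the mean of x, comes out smaller than widths[1], the width at the largest x, which is the narrowing described above.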
With the confidence interval, the bakery can determine the interval for the average number of cakes it will sell for a given number of customers. It still does not know the interval for the actual number of cakes it might sell for a given number of customers.
A confidence interval only accounts for uncertainty in the mean response of y. It does not give an interval for all the possible values of y at a given x. For that, we need a prediction interval.
The prediction interval is derived in a similar way to the confidence interval. The only difference is that we include the variance of our dependent variable y when calculating the standard error, which leads to a wider interval.
Prediction Interval for the Model Predictions
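The sketch below shows one way the prediction interval could be computed under the same assumptions. The only change from the standard error of the mean response is the extra "1 +" term, which accounts for the variance of y itself. The function name prediction_interval and the synthetic data are hypothetical, and a normal quantile is again used as an approximation to the Student-t quantile.

```python
import numpy as np
from statistics import NormalDist


def prediction_interval(x, y, x_new, confidence=0.95):
    """Prediction interval for a new observation of y at x_new (simple linear regression)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n = len(x)

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)

    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    s2 = np.sum(residuals ** 2) / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)

    # The leading 1.0 adds the variance of y around its mean to the
    # variance of the estimated mean response.
    se_pred = np.sqrt(s2 * (1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / sxx))

    # Assumption: normal quantile instead of the Student-t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = slope * x_new + intercept
    return y_hat - z * se_pred, y_hat + z * se_pred


# Assumed synthetic data: a noisy linear relationship between x and y.
rng = np.random.default_rng(1)
x = 20 * (2.5 + rng.standard_normal(100))
y = x + (10 * rng.standard_normal(100) + 50)

lower, upper = prediction_interval(x, y, x)

# Fraction of observed y values that fall inside their own interval.
coverage = np.mean((y >= lower) & (y <= upper))
width_at_mean = np.diff(prediction_interval(x, y, x.mean()))[0]
```

On data like this, coverage should land near the nominal 95%, and the interval is much wider than the confidence interval of the mean because it has to cover individual observations.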
In this first part of our series on estimating uncertainty, we looked at ways to estimate sources of uncertainty in a simple regression model. These estimates become a lot harder as the size and complexity of your data and model increase.
Bootstrapping becomes computationally expensive for large models, since it requires refitting the model many times, and intervals built from the standard error only hold when normality assumptions are made about the sampling distributions of the model's residuals. When these assumptions are violated, how do we measure uncertainty?
We will look at ways to quantify uncertainty in more complex models in the next part of the series.