Inspired by a post on visually weighted regression plots in R, I’ve been playing with shading to visually represent uncertainty in a model fit. For these plots I used Python and matplotlib, with Gaussian process regression from sklearn to model a synthetic data set, based on this example.
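As a rough sketch of the setup, here is a fit along the lines of the sklearn example. The generating function x·sin(x) and the noise level are assumptions borrowed from that example; note also that current sklearn has replaced the old GaussianProcess class with GaussianProcessRegressor, which is what I use below.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Generating function assumed from the sklearn example.
def f(x):
    return x * np.sin(x)

# Noisy synthetic observations with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
X = np.linspace(0.1, 9.9, 20).reshape(-1, 1)
y = f(X).ravel() + rng.normal(0, 0.5, X.shape[0])

# Current sklearn API: GaussianProcessRegressor with an explicit kernel
# (the original post used the older, since-removed GaussianProcess class).
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

# Predict the mean curve and a per-point error estimate on a dense grid.
x_pred = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred, y_std = gp.predict(x_pred, return_std=True)
```

In the modern API, `return_std=True` plays the role of the old `eval_MSE=True` flag, returning a standard deviation rather than a variance.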
In the first plot, I’ve simply used the error estimate returned by GaussianProcess.predict(), which is the mean squared error of the prediction. Assuming normally distributed errors and shading according to the normalized probability density, the result is the figure below.
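The shading step can be sketched as follows. The idea is to evaluate a normal density on a grid of y values at each x, normalize each column so its peak is 1, and render the result as a grayscale image under the mean curve. The mean and error arrays here are stand-ins; in the real plot they come from the GP prediction.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt
from scipy.stats import norm

# Stand-ins for the GP outputs: a mean curve and a per-point error estimate.
x_pred = np.linspace(0, 10, 200)
y_pred = x_pred * np.sin(x_pred)
y_std = 0.3 + 0.2 * np.abs(np.cos(x_pred))

# Normal density on a y-grid at each x; normalizing each column makes
# the darkest shade mark the most probable y at that x.
y_grid = np.linspace(y_pred.min() - 3, y_pred.max() + 3, 300)
density = norm.pdf(y_grid[:, None], loc=y_pred[None, :], scale=y_std[None, :])
density /= density.max(axis=0)

fig, ax = plt.subplots()
ax.imshow(density, origin="lower", aspect="auto", cmap="Greys",
          extent=[x_pred[0], x_pred[-1], y_grid[0], y_grid[-1]])
ax.plot(x_pred, y_pred, "k", lw=1)
fig.savefig("density_fit.png")
```

The column-wise normalization is what makes narrow, well-constrained regions appear darkest, rather than simply having more total ink where the error bars are wide.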
The dark shading is proportional to the probability that the curve passes through that point. It should be noted that unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained. A contour plot of the probability density would narrow where the traditional error bars are widest.
The errors aren’t really Gaussian, of course. We can estimate them empirically by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution in every trial but takes different values in each, thanks to the random number generator; setting the random seeds explicitly keeps the trials reproducible. This is similar to the bootstrap method of error estimation from a sample population (although it isn’t a true bootstrap, since we generate new trials rather than simulating them by resampling subsets of the data). After fitting each of the sample populations, the predicted curves for 200 trials are shown in the spaghetti plot below.
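The trial loop can be sketched like this, again assuming the x·sin(x) generating function and using the modern GaussianProcessRegressor API in place of the original GaussianProcess class. I use a smaller trial count here to keep the sketch quick; the post's plots use 200.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Assumed generating function, as in the sklearn example.
def f(x):
    return x * np.sin(x)

X = np.linspace(0.1, 9.9, 20).reshape(-1, 1)
x_pred = np.linspace(0, 10, 100).reshape(-1, 1)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)

n_trials = 50  # the post uses 200; fewer here for speed
curves = np.empty((n_trials, x_pred.shape[0]))
fig, ax = plt.subplots()
for seed in range(n_trials):
    # One explicit seed per trial: same noise distribution, different
    # draws, and every trial is reproducible.
    rng = np.random.default_rng(seed)
    y = f(X).ravel() + rng.normal(0, 0.5, X.shape[0])
    gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
    curves[seed] = gp.predict(x_pred)
    # Low alpha so overlapping curves build up the spaghetti effect.
    ax.plot(x_pred.ravel(), curves[seed], color="steelblue", alpha=0.1)
fig.savefig("spaghetti.png")
```

Keeping all the predicted curves in one array (rather than only plotting them) is what makes the empirical density plot in the next step straightforward.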
If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:
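One way to compute that density is to histogram the curve values at each x into a shared y-grid and shade by the counts. The `curves` array below is a stand-in for the stack of fitted curves from the trials, one row per trial, evaluated on a common x grid.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Stand-in for the fitted curves from the trials: shape (n_trials, n_x).
x_pred = np.linspace(0, 10, 100)
rng = np.random.default_rng(1)
curves = x_pred * np.sin(x_pred) + rng.normal(0, 0.3, (200, x_pred.size))

# Count how many trial curves pass through each (x, y) cell: this is
# the empirical analogue of the analytic density plot.
y_edges = np.linspace(curves.min(), curves.max(), 101)
density = np.stack([np.histogram(curves[:, j], bins=y_edges)[0]
                    for j in range(x_pred.size)], axis=1)

fig, ax = plt.subplots()
ax.imshow(density, origin="lower", aspect="auto", cmap="Greys",
          extent=[x_pred[0], x_pred[-1], y_edges[0], y_edges[-1]])
fig.savefig("empirical_density.png")
```

With only a few hundred trials the histogram columns are noticeably grainy, which is exactly the roughness described below; more trials or wider bins would smooth it at extra cost.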
It’s not as clean as the analytic error. In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times). That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).
Nevertheless, the computational infrastructure of NumPy and SciPy, as well as tools like matplotlib and sklearn, make Python a great environment for this kind of data exploration and modeling.