Inspired by a post on visually weighted regression plots in R, I’ve been playing with shading to visually represent uncertainty in a model fit. I made these plots with Python and matplotlib, using Gaussian process regression from sklearn to model a synthetic data set, based on this example.

In the first plot, I’ve simply taken the mean squared error returned from GaussianProcess.predict() as the error estimate. Assuming normally distributed error with that variance, and shading according to the normalized probability density, the result is the figure below.
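
As a rough sketch of that shading step (not the notebook's actual code), the density can be evaluated on a grid of y values, one Gaussian per x position. The helper name `gaussian_density_image`, the grid sizes, and the stand-in `y_pred` and `sigma` arrays are all assumptions for illustration; in the real plot they would be the predicted mean and the square root of the returned MSE. Dividing by the global maximum is just one reading of "normalized", chosen here so the values map onto a colormap.

```python
import numpy as np

def gaussian_density_image(y_pred, sigma, y_grid):
    """Return an array of shape (len(y_grid), len(y_pred)).

    Column j holds the normal pdf with mean y_pred[j] and standard
    deviation sigma[j], evaluated at every value in y_grid.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (y_grid[:, None] - y_pred[None, :]) / sigma[None, :]
    return np.exp(-0.5 * z**2) / (sigma[None, :] * np.sqrt(2.0 * np.pi))

# Stand-in values (assumed): a made-up mean curve and error estimate.
x = np.linspace(0.0, 10.0, 200)
y_pred = x * np.sin(x)
sigma = 0.3 + 0.5 * np.abs(np.cos(x))  # would be sqrt(MSE) from the model
y_grid = np.linspace(-12.0, 12.0, 400)

density = gaussian_density_image(y_pred, sigma, y_grid)
shade = density / density.max()  # scale to [0, 1] for a colormap
```

Passing `shade` to matplotlib's `imshow` with `origin='lower'`, `aspect='auto'`, `extent=[x.min(), x.max(), y_grid.min(), y_grid.max()]` and a reversed grayscale colormap such as `cmap='gray_r'` should give dark shading where the density is high.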

The darkness of the shading is proportional to the probability that the curve passes through that point. Unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained. A contour plot of the probability density would narrow where the traditional error bars are widest.

The errors aren’t really Gaussian, of course. We can estimate the errors empirically, by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution, but will be different in each trial because each draw from the random number generator is different. We can ensure that the trials are reproducible by explicitly setting the random seeds. This is similar to the method of error estimation from a sample population known as the bootstrap (although this is not a true bootstrap, as we are generating new trials instead of simulating them by resampling the existing data with replacement). I fit each of the sample populations separately; the predicted curves for 200 trials are shown in the spaghetti plot below.
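
A minimal sketch of such seeded trials follows. To keep it self-contained it swaps in a polynomial least-squares fit for the Gaussian process regression, so it shows the repeated-trial structure rather than the actual model. The generating function `f`, the noise level, the polynomial degree, and the helper name `run_trials` are assumptions for illustration.

```python
import numpy as np

def f(x):
    # Assumed generating function, for illustration only.
    return x * np.sin(x)

def run_trials(n_trials=200, n_points=40, noise_sd=1.0, degree=7, base_seed=0):
    """Fit one curve per trial; return (x_eval, curves).

    curves has shape (n_trials, len(x_eval)).
    """
    x = np.linspace(0.0, 10.0, n_points)     # fixed sample locations
    x_eval = np.linspace(0.0, 10.0, 100)     # where each fit is evaluated
    curves = np.empty((n_trials, x_eval.size))
    for i in range(n_trials):
        # One explicit seed per trial makes every trial reproducible.
        rng = np.random.default_rng(base_seed + i)
        y = f(x) + rng.normal(0.0, noise_sd, n_points)
        # Stand-in fit (NOT a Gaussian process): polynomial least squares.
        poly = np.polynomial.Polynomial.fit(x, y, degree)
        curves[i] = poly(x_eval)
    return x_eval, curves

x_eval, curves = run_trials()
```

With matplotlib, something like `plt.plot(x_eval, curves.T, color='k', alpha=0.05)` would draw all the fits at once as a spaghetti plot.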

If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:
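
One simple way to compute such a density, sketched here under assumptions, is to histogram the fitted values at each x position. The helper name `curve_density`, the bin edges, and the toy `curves` array are illustrative; the toy array uses independent noise at each x rather than smooth fitted curves, purely so the block runs on its own.

```python
import numpy as np

def curve_density(curves, y_edges):
    """Fraction of curves falling in each y bin, per x position.

    curves: array of shape (n_trials, n_x).
    Returns an array of shape (len(y_edges) - 1, n_x).
    """
    curves = np.asarray(curves, dtype=float)
    n_trials, n_x = curves.shape
    density = np.empty((len(y_edges) - 1, n_x))
    for j in range(n_x):
        counts, _ = np.histogram(curves[:, j], bins=y_edges)
        density[:, j] = counts / n_trials
    return density

# Toy stand-in for a stack of fitted curves (assumed data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
spread = 0.2 + 0.1 * x
curves = x * np.sin(x) + spread * rng.normal(size=(200, x.size))

y_edges = np.linspace(-15.0, 15.0, 121)
density = curve_density(curves, y_edges)
```

The bin width trades resolution against noise: with only a couple of hundred trials, narrow bins leave many cells with zero or one curve, which is one reason an empirical plot can look blotchy.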

It’s not as clean as the analytic error. In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times). That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).

Nevertheless, the computational infrastructure of NumPy and SciPy, together with tools like matplotlib and sklearn, makes Python a great environment for this kind of data exploration and modeling.

The code that generated these plots is in an IPython notebook file, which you can view online or download directly.

I love how this looks, and for the example given it seems very appropriate. Of course it’s not as quantitatively useful as a traditional error bar, but when you have the sampling required it looks better than some cursed whisker plot.

One could imagine a similar plotting technique for error ellipses on individual points too. And if the errors are correlated, you’d get some kind of funky error banana contours.

Really nice. I haven’t worked with R, but years ago I thought this would be a great feature for matplotlib. At the time, I was thinking that the markers for experimental data could be density-shaded based on sigma_x and sigma_y to give a better visual estimation of the experimental uncertainty.

hey, nice blog, i’ve been following for a while.

i was just wondering how to plot, in python with matplotlib, a best-fit line through a scatter plot with multiple points. i have not found anything on the matplotlib website; i don’t know if you know how to do it. i suppose there must be a way….

Thanks for all, and congrats for your site.