What Is Your Python Budget?

Jan 18 2013 Published under General, New York, Python

C programmers, by necessity, generally develop a mental model of the performance characteristics of their code. Developing this intuition in a high-level language like Python can be more of a challenge. While good Python tools exist for profiling time and memory use (line_profiler by Robert Kern and guppy by Sverker Nilsson), you are largely on your own if you want to develop intuition for code that is yet to be written. Understanding the cost of basic operations in your Python implementation can help guide design decisions by ruling out extensive use of expensive operations.

Why is this important, you ask? Interactive applications appear responsive when they react to user behaviour within a given time budget. In our consulting engagements, we often find that a lack of awareness of the cost of common operations can lead to sluggish application performance. Some examples of user interaction thresholds:

  • If you are targeting 60 fps in a multimedia application, you have about 16.7 milliseconds (1/60 of a second) of processing time per frame. In this time, you need to update state, figure out what is visible, and then draw it.
  • Well-behaved applications load a functional screen that a user can interact with in under a second. Depending on your application, you may need to build an expensive data structure up front before your user can interact with the application. Often one needs to find a way to at least make it feel like the one-second constraint is being respected.
  • You run Gentoo / Arch. In this case, obsessing over performance is a way of life.

Obviously rules are meant to be broken, but knowing where to be frugal can help you avoid or troubleshoot performance problems. Performance data for Python and PyPy are listed below.
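
As a rough illustration of turning a time budget into an operation count, the sketch below uses the standard library's timeit module to estimate how many times a statement could run inside one 60 fps frame. It is not the measure.py script from the repository mentioned in this post. The helper names cost_per_call and calls_per_frame and the sample statements are assumptions chosen for this example, and the numbers will differ between machines and interpreters.

```python
import timeit

# One frame at 60 fps: 1/60 s, roughly 16.7 ms.
FRAME_BUDGET_S = 1.0 / 60


def cost_per_call(stmt, setup="pass", number=100000, repeat=5):
    """Estimate seconds per execution of stmt (best of `repeat` runs)."""
    timings = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings) / number


def calls_per_frame(stmt, setup="pass", budget=FRAME_BUDGET_S, **timing_args):
    """Estimate how many executions of stmt fit into `budget` seconds."""
    return int(budget / cost_per_call(stmt, setup, **timing_args))


if __name__ == "__main__":
    # Hypothetical sample operations; substitute the ones your design relies on.
    samples = [
        ("d['key']", "d = {'key': 1}"),
        ("lst.append(1)", "lst = []"),
        ("'-'.join(parts)", "parts = ['a', 'b', 'c']"),
    ]
    for stmt, setup in samples:
        print("%-20s ~%d per frame" % (stmt, calls_per_frame(stmt, setup)))
```

Dividing the budget by a per-call cost gives only an upper bound, since a real frame also pays for loop overhead, memory allocation, and drawing. It is still useful for spotting operations that cannot possibly fit.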

Machine configuration

CPU – AMD 8150

RAM – 16 GB PC3-12800

Windows 7 64 bit

Python 2.7.2 — EPD 7.3-1 (64-bit)

PyPy 2.0.0-beta1 with MSC v.1500 32 bit

Steps to obtain timings and create table from data

python measure.py cpython.data
pypy measure.py pypy.data

python draw_table.py cpython.data cpython.png
python draw_table.py pypy.data pypy.png

To obtain the code that measures the timings and creates the associated tables for your own machine, check out https://github.com/deepankarsharma/cost-of-python

CPython timing data

[Image: table of CPython timings (cpython.png)]

PyPy timing data

[Image: table of PyPy timings (pypy.png)]

Visualizing Uncertainty

Dec 07 2012 Published under General, New York, Open Source, Python

Inspired by a post on visually weighted regression plots in R, I’ve been playing with shading to visually represent uncertainty in a model fit. To make these plots, I used Python and matplotlib, and modeled a synthetic data set with Gaussian process regression from sklearn, based on this example.

In the first plot, I’ve simply taken the mean squared error returned by GaussianProcess.predict() as the error estimate. Assuming normally distributed errors and shading according to the normalized probability density gives the figure below.

The darkness of the shading is proportional to the probability that the curve passes through that point. Note that, unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained. A contour plot of the probability density would narrow where the traditional error bars are widest.
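
A minimal sketch of how such a shading grid could be computed, assuming the model supplies a predicted mean and a mean squared error at each x position. The inputs y_pred and mse are hypothetical stand-ins for the model output, and the function name density_grid is made up for this example. Normalizing by the single highest density in the grid is also an assumption, chosen because it makes the most tightly constrained regions the darkest, as described above.

```python
from math import sqrt
from statistics import NormalDist


def density_grid(y_pred, mse, y_grid):
    """Return normalized densities: one row per y value, one column per x.

    y_pred and mse are the predicted mean and mean squared error at each
    x position (hypothetical inputs; every mse value must be positive).
    """
    # One normal distribution per x position, with sigma = sqrt(MSE).
    dists = [NormalDist(mu, sqrt(var)) for mu, var in zip(y_pred, mse)]
    grid = [[dist.pdf(y) for dist in dists] for y in y_grid]
    # Scale so the largest density in the whole grid maps to 1.0.
    peak = max(max(row) for row in grid)
    return [[value / peak for value in row] for row in grid]
```

The resulting rows could be passed to an image-style plotting call such as matplotlib's imshow with a grayscale colormap. A column with a small error then shows a narrow, dark band, while a column with a large error shows a wide, faint one.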

The errors aren’t really Gaussian, of course. We can estimate the errors empirically by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution, but it will differ from trial to trial because each trial draws new values from the random number generator. We can ensure that the trials are reproducible by explicitly setting the random seeds. This is similar to the method of error estimation from a sample population known as the bootstrap (although this is not a true bootstrap, as we are generating new trials instead of simulating them by resampling the existing data with replacement). After fitting each of the sample populations, the predicted curves for 200 trials are shown in the spaghetti plot below.
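
The difference between generating fresh trials and a true bootstrap can be sketched with the standard library alone. The generating function x * sin(x), the noise level, and the names new_trial and bootstrap_resample are assumptions made for this illustration, not the code from the notebook. Fitting a model to each trial is left out.

```python
import math
import random


def generating_function(x):
    # Hypothetical stand-in for the post's generating function.
    return x * math.sin(x)


def new_trial(xs, noise_sd, seed):
    """Draw a fresh noisy data set from the generating function."""
    rng = random.Random(seed)  # an explicit seed makes the trial reproducible
    return [generating_function(x) + rng.gauss(0.0, noise_sd) for x in xs]


def bootstrap_resample(data, seed):
    """For contrast: a true bootstrap resamples existing data with replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]


xs = [0.5 * i for i in range(20)]
# 200 independent trials, each tied to its own seed.
trials = [new_trial(xs, noise_sd=1.0, seed=s) for s in range(200)]
```

Giving every trial its own seeded generator means any single trial can be regenerated later without rerunning the others.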

If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:

It’s not as clean as the analytic error. In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times). That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).
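
One way the density of the fitted curves might be computed is to bin the curves vertically at each x position and count how many fall into each bin. The sketch below assumes every curve is evaluated on the same x grid. The function name curve_density, the equal-width bins, and the normalization by the number of curves are assumptions for this example rather than the method used in the notebook.

```python
def curve_density(curves, y_min, y_max, n_bins):
    """Fraction of curves passing through each (y bin, x position) cell.

    curves is a list of equal-length lists of y values, one list per
    fitted curve. Values outside [y_min, y_max) are ignored.
    """
    n_x = len(curves[0])
    counts = [[0] * n_x for _ in range(n_bins)]
    width = (y_max - y_min) / n_bins
    for curve in curves:
        for j, y in enumerate(curve):
            if y_min <= y < y_max:
                # Guard against rounding pushing the index past the last bin.
                i = min(int((y - y_min) / width), n_bins - 1)
                counts[i][j] += 1
    n_curves = len(curves)
    return [[count / n_curves for count in row] for row in counts]
```

With only a few hundred curves, each cell holds a small count, which is why the empirical picture looks grainier than the analytic one. Coarser bins or more trials both smooth it, at the cost of resolution or computation time respectively.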

Nevertheless, the computational infrastructure of NumPy and SciPy, as well as tools like matplotlib and sklearn, make Python a great environment for this kind of data exploration and modeling.

The code that generated these plots is in an IPython notebook file, which you can view online or download directly.
