Category Archives: Open Source

Plotting in Excel with PyXLL and Matplotlib

Author: Tony Roberts, creator of PyXLL, a Python library that makes it possible to write add-ins for Microsoft Excel in Python. Download a FREE 30-day trial of PyXLL here.


Python has a broad range of tools for data analysis and visualization. While Excel is able to produce various types of plots, sometimes it’s either not quite good enough or it’s just preferable to use matplotlib.

Users already familiar with matplotlib will know that when a plot is shown from a Python script, the script stops until the user closes the plot window. In an IPython console, by contrast, control returns to the prompt immediately after a plot is shown, which is useful for interactive development.

Something that has been asked a couple of times is how to use matplotlib within Excel using PyXLL. As matplotlib is just a Python package like any other, it can be imported and used in the same way as from any Python script. The difficulty is that the call to show a plot blocks, so control isn’t returned to Excel until the user closes the window.

This blog shows how to plot data from Excel using matplotlib and PyXLL so that Excel can continue to be used while a plot window is active, and so that the same window can be updated whenever the data in Excel is updated.
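As a minimal standalone sketch of the behavior in question (plain matplotlib here, not PyXLL’s actual integration, which the full post covers):

# A minimal sketch of non-blocking plotting with plain matplotlib -- not
# PyXLL's integration mechanism, just the blocking behavior discussed above.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([1, 2, 3], [4, 2, 5])

# plt.show() would block here until the window is closed; block=False
# returns control immediately, which is what an Excel add-in needs.
plt.show(block=False)

# Later, when the data changes, the same window can be updated in place:
line.set_ydata([5, 1, 4])
fig.canvas.draw()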

Enthought Canopy v1.2 is Out: PTVS, Mavericks, and Qt

Author: Jason McCampbell

Canopy 1.2 is out! The release of Mac OS “Mavericks” as a free update broke a few features, primarily IPython, so we held the release to make sure everything worked. That ended up taking longer than we wanted, but 1.2 is finally out and adds support for Mavericks. One Mavericks-specific Qt font issue remains, which causes the wrong system font to be selected so UIs look less polished than they should; we are working on a fix.

[Figure: Enthought Canopy integrated into PTVS]

The biggest new feature is integration with Microsoft’s Python Tools for Visual Studio (PTVS) package. PTVS is a full, professional-grade development IDE for Python based on Visual Studio and provides mixed Python/C debugging. The ability to do mixed-mode debugging is a huge boon to software developers creating C (or FORTRAN) extensions to Python. Canopy v1.2 includes a custom DLL that allows us to integrate more completely with PTVS and solves some issues with auto-completion of Python standard library calls.

Beyond PTVS, we have added the Qt development tools, such as qmake and the uic compiler, to the Canopy installation tree. These tools are now available on all platforms, so Qt developers can use them directly from Canopy rather than having to build them themselves.

Canopy 1.2 includes a large number of smaller additions and stability improvements. Highlights can be found in the release notes and we encourage all users to update existing installs. As always, thanks for using Canopy and please don’t hesitate to drop us a note letting us know what you like or what you would like to see improved. You can contact us via the Help -> Suggestions/Feedback menu item or by sending email to canopy.support@enthought.com.

And you can download Canopy from the Enthought Store page.

“venv” in Python 2.7 and how it Simplifies Life

Virtual environments, specifically the ‘venv’ support we backported from Python 3.x, enable the creation of multiple, lightweight, independent Python environments. Each virtual environment appears to be a self-contained Python installation, but loads the Python standard library and other common resources from a common base Python installation. Optionally, a virtual environment can also load packages from its base Python environment, whether that’s Canopy Core itself or another virtual environment.

What makes virtual environments so interesting? They save disk space, since the full Python environment doesn’t have to be duplicated each time. But more than that, making Python environments far “lighter” enables several interesting capabilities.

First, the most common use of virtual environments is to allow separate projects to run in separate environments with different package requirements. Each Python application runs in a separate virtual environment, so package updates needed for one application don’t break the others. This model has long been used by web developers as well as a few scientific software developers.

The second case is specifically enabled by Canopy. Sharp-eyed readers will have noted in the first paragraph that a virtual environment can have Canopy Core or another virtual environment as its base. But virtual environments can’t be layered, right? Now they can.

We have extended venv to support arbitrary numbers of layers, so we can do something like this:

[Figure: ‘venv’ layering in Canopy]

‘Project1’ can be created with the following Canopy command:

canopy_cli setup ./Project1

Canopy constructs Project1 with all of the standard Canopy packages installed, and Project1 can now be customized to run the application. Once we’ve got Project1 working with a particular Python configuration, what if we want to see if the application works with the latest version of NumPy? We could update and potentially break the stable environment. Or, we can do this:

./Project1/bin/venv -s ./Project1_play

Now ‘Project1_play’ is a virtual environment which, by default, has all of Project1’s packages and package versions available. We can now update NumPy or other packages in Project1_play and test the application. If it doesn’t work, no big deal: we just delete it. We now have the ability to rapidly experiment with different (safe) Python environments without breaking our stable working area.
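As a quick sanity check, a layered environment can report where its packages actually come from. A minimal sketch, assuming NumPy was updated only in the play environment:

# Run with ./Project1_play/bin/python to confirm the layered environment
# resolves its own NumPy rather than the copy in the base Project1 layer.
import sys
import numpy

print(sys.prefix)          # should point at Project1_play
print(numpy.__version__)   # the updated version, if the upgrade took
print(numpy.__file__)      # shows which layer the package loads from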

Canopy makes use of virtual environments to provide a protected Python environment for the Canopy GUI application to run in, and to provide one or more User Python environments which you can customize and run your own code in. Canopy Core is the base for each of these virtual environments, providing the core Python components and several common, large packages such as Qt and PySide. This structure means that the Canopy GUI can be updated without impacting your code, and any package updates you install won’t destabilize the Canopy GUI.

Canopy Core can be updated if you want, such as to move to a new version of Python, and each of the virtual environments will be updated automatically as well. This eliminates the need to install a new Python environment and then re-install any third-party packages into that new environment just to update Python.

For more information on how to set up virtual environments with Canopy, check the online docs, or get Canopy v1.1 and try it out.

Our next post will detail how to use Canopy and virtual environments to set up multi-user networks and cluster environments.

Fun with QtWebKit HTML5 Video

Solving the QtWebKit HTML5 Video DirectShow Problem

A while back I was given the task of fixing the problems that our development team was having with playing H.264 or WebM video on Windows in a QWebView widget using the HTML5 <video> tag. The application in question is a hybrid of a traditional desktop application and a web-based application, and there is a need to be able to use the HTML5 video capabilities of WebKit in one of the application’s components, in order to deliver some training content.

Had I known how big of a headache it was going to be, I may have just set my hair on fire and run screaming from the room at the beginning. But instead, being blissfully ignorant, I said, “Sure, I can take a look at that.” Oh, by the way, this was my very first day of work at Enthought (within the first hour or so, if I remember correctly), and also the first time I had dived deeply into the Qt and PySide code. It was also the first time I had worked with the DirectShow API, and the first time in many years that I had worked much with COM. Yeah, I don’t think setting my hair on fire and screaming would have made a big enough statement.

Anyway, the purpose of this article is to discuss some of the problems I ran into and to show you how I solved or worked around them. On that first day I got a few high-level reports of the problems people were having. I heard things like:

  • It doesn’t work at all on some computers and sorta works on others.
  • When it works it crashes upon reloading the page.
  • It’s probably just a codec problem.
  • I don’t think it can ever work because of _____________.
  • It should work out of the box because of _____________.
  • Etc.

First things first

So the first thing I do, of course, is ask the smartest guy I know about it: Google. I uncovered many tales of woe from people experiencing the same problems, hopes that it would be fixed in Qt 5, and disappointments in discovering that it wasn’t. And here and there, a few little tidbits and clues about how to make things work.

Background

One of the components of the Qt toolkit is the Phonon library, which provides various classes related to streaming and playing media, and until recently QtWebKit used Phonon to embed media players in web pages. It was decided that the Phonon API was higher-level than many multimedia applications need, so the QtMultiMediaKit API was started as a lower-level replacement and Phonon was deprecated. The Windows QtWebKit code was ported to use QtMultiMediaKit instead of Phonon, and as of the Qt 4.8 release it no longer uses the Phonon back-end.

However, at about the same time the transition of Qt from Nokia to Digia happened, and development of QtMultiMediaKit (as part of the qt-mobility libraries and plugins) was paused in a not-quite-completed state, so it hasn’t been fully incorporated into the Qt distribution yet. This means that out of the box QtWebKit is not able to play HTML5 media on Windows, because the code for the multimedia plugin it expects to use is not included. QtWebKit’s HTML5 media features on Windows are basically caught in a gap between past and future technologies. I believe that QtWebKit is still using Phonon on the other platforms.

How to build Qt + qt-mobility

So the first thing that needed to be done to solve this problem is to figure out how to build Qt plus the qt-mobility libraries, such that QtWebKit is able to use the multimedia plugins for displaying video. I quickly found out that this is a classic chicken-and-egg problem because qt-mobility needs an existing Qt build to be able to use the classes that it provides, but the Qt build also needs to have qt-mobility present so that it knows to include the code that will use the multimedia plugins. So to make this work we will need to use two chickens to get an egg. In other words, we’ll need to build Qt twice.

But first, here are some prerequisites:

  • Since my end goal is to build PySide for Python 2.7, I used MS Visual Studio 2008 as the compiler.
  • A fairly recent Windows SDK is needed, since the one with VS 2008 doesn’t have new enough DirectShow support. So I used version 7.1 of the Windows SDK. I initially went down the rat-hole of trying to install the DirectX SDK to use with the older platform SDK included with VS 2008, but that just caused more trouble and I never got that build fully working. Just use the 7.1 SDK instead and you can avoid wasting a few days like I did.
  • Get and install the OpenSSL library
  • I recommend using JOM instead of nmake for the build, as it is able to parallelize the build steps to take advantage of multiple processor cores if you have them. It is fully nmake-compatible and greatly reduces the time needed to build Qt (from several hours to around 40 minutes on one of my computers). I copied it to “nmake.exe” and put it in a location on the PATH that is found before the Visual Studio nmake.
  • Some parts of the configuration process need a working Perl interpreter. I had trouble getting it to work with the cygwin perl that I already had installed, so I installed Strawberry Perl instead. Make sure it is found first on the PATH.

The next step is configuring and building Qt. To save time you can tell it to skip building the WebKit components for this first build, and then turn it back on for the second build after qt-mobility has been built.

Follow the regular Qt build instructions for setting up the environment and such. For example I set QTDIR to the root of the Qt source tree, added $QTDIR/bin to the PATH, and set QMAKESPEC to win32-msvc2008. I used the stock qt-everywhere-opensource-src-4.8.4 tarball.

Next go to your Qt source tree and run configure, followed by running nmake. I’m using a cygwin bash shell, so if you are using a stock Windows cmd.exe shell or something else then you may have to adapt a few things. Here is how I run configure:

./configure.exe -release -opensource -platform $QMAKESPEC \
       -qt-zlib -qt-libpng -qt-libmng -qt-libtiff -qt-libjpeg \
       -openssl -I ${OPENSSL_BASE}include -L ${OPENSSL_BASE}lib \
       -nomake demos -nomake examples \
       -no-webkit
nmake

The next step is to build qt-mobility. You can fetch the code from the project’s git repository at http://qt.gitorious.org/qt-mobility. I run configure in the qt-mobility source tree like this:

cmd.exe /c configure.bat -prefix $QTDIR -no-wmf -release \
       -modules "sensors multimedia"
nmake
nmake install

Let’s break that down a little. I used the $QTDIR value as the prefix so when the qt-mobility libraries are installed they will be in the same place as the Qt libraries, which means that it’s easier for the Qt code and other applications as they do not have to do anything extra to find them. On the other hand, it clutters up the Qt tree a little and when doing a “nmake clean” there I have to clean up qt-mobility stuff and a few other things by hand.

I used the -no-wmf flag (No Windows Media Framework) because we need our application to work on XP and WMF is not available there. Plus, although WMF is supposed to be “the future” it isn’t all there yet and DirectShow is still more capable.

The -modules flag tells configure to set up the build for only the sensors and multimedia components of qt-mobility. Those are the only modules needed for the QtMultiMediaKit library and plugins that we want.

Finally, we run “nmake” followed by “nmake install”.

The 2nd chicken

The final step is to reconfigure and rebuild Qt. Just run the same configure command as before, substituting “-webkit” in place of “-no-webkit” and then run nmake again.

Bugs

But wait, there’s more!

At this point we were able to play WebM and H.264 video in a QWebView widget using the HTML5 video tag, but it was still crashing hard when reloading the page, or when navigating a page or two away from the page with the video. Not good.

After much debugging, experimenting, rebuilding and cursing I found the problem. I won’t go into a complete explanation of DirectShow here; to be informative enough it would take a huge amount of text, and this article is already too long. The nutshell version of the pertinent bits is that DirectShow constructs a “graph” of “filter components” that takes the media stream as input, splits it into audio and video streams if necessary, and runs it through various transformation components until it can provide whatever format the output devices require. Depending on the DirectShow components that are installed, the format of the source, and the needs of the output, very different graphs can be constructed. Here are a couple of simple ones that I was working with:

[Figure: two example DirectShow filter graphs]

DirectShow components, like most of COM, use a reference-counting pattern to manage their life-cycle. Every time another component wants to hold on to a reference to something, it increments the reference count; when it is done, it releases its reference and the count is decremented. When the reference count reaches zero, the object deletes itself.

I found that one of the DirectShow components in QtMultiMediaKit was not following that pattern: it was being explicitly deleted from the class that created it. Via some debugging code I found that the reference count of this component, the VideoSurfaceFilter class shown as “VideoOutput” in the graphs above, was still around 3 or 4 when that deletion happened. That meant that although the class instance was gone, there were still 3 or 4 other components that thought it still existed. When QtWebKit cleaned up the page’s resources, on reload or after the next page was loaded, the filter graph was released, one of those other components tried to access the now-invalid VideoSurfaceFilter, and the application crashed.
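To make the failure mode concrete, here is a toy sketch of the COM-style reference-counting contract (plain Python, not real COM or DirectShow code):

# Toy illustration of COM-style reference counting; not real COM/DirectShow.
class RefCounted:
    def __init__(self):
        self._refs = 1  # the creator holds the first reference

    def add_ref(self):
        self._refs += 1
        return self._refs

    def release(self):
        self._refs -= 1
        if self._refs == 0:
            print("last reference released; object destroys itself")
        return self._refs

# The contract: the object dies only when the count hits zero. Deleting it
# outright while other components still hold references -- as the buggy
# filter code did -- leaves them with a dangling pointer, hence the crash.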

My qt-mobility changes fixing this have been submitted to the Qt bug tracker; you can see them here: https://bugreports.qt-project.org/browse/QTMOBILITY-2091

For the record, here is a simple little application I used for testing:

import sys

from PySide import QtCore, QtGui, QtWebKit

TESTURL = "http://camendesign.com/code/video_for_everybody/test.html"

app = QtGui.QApplication(sys.argv)

# Plugins must be enabled for WebKit to instantiate media players.
QtWebKit.QWebSettings.globalSettings().setAttribute(
    QtWebKit.QWebSettings.PluginsEnabled, True)

view = QtWebKit.QWebView()

# Load the URL given on the command line, or the test page by default.
url = sys.argv[1] if len(sys.argv) > 1 else TESTURL
view.load(QtCore.QUrl(url))
view.show()

app.exec_()

Working Codecs

But wait! There is still more!

As alluded to above, DirectShow is not an all-in-one solution. It is a collection of components that are plugged into a filter graph, where each provides just one part of the transformation of the media’s ones and zeros into audio sound waves and dancing pixels on the screen. And as with any collection, more components from various 3rd-party sources can be added. These other components can enhance existing capabilities, or even add new capabilities to the system.

For example, out of the box Windows is not able to decode and render WebM media. It is able to decode and render H.264 audio/video streams, but it doesn’t know how to split those streams out from the typical container formats used today, such as .MP4 files. By installing and registering some 3rd-party DirectShow filters, functionality gaps such as these can be filled.

For our application we want to be able to include a set of DirectShow filters with our installer, so we can be sure that our customers have at least basic functionality on their systems and that our application can work out of the box. In order to do that we needed something with a permissive license, and the OpenCodecs package from Xiph.org fit the bill. It provides filters that can handle WebM video streams and uses a BSD-style license, so we can distribute it without risk of GPL infection. It still has some issues, though.

The filter pack I experimented with and had the best results with was the LAV Filters package from https://code.google.com/p/lavfilters/. It is able to transform the video streams directly to the input format required by qt-mobility’s VideoSurfaceFilter, so fewer transformation steps are required; compare the two graphs above to see the difference. As nearly any engineer will tell you, fewer transformations of anything is almost always better than more. However, it is licensed under the GPL, and we felt that it is still too much of a grey area for us to distribute along with a non-GPL’d application. Since we don’t link to it directly and it is only accessed via operating system services, using it at runtime is fine; distributing it as part of our installer such that it looks like part of the whole thing, however, would probably ring a bell somewhere and start some lawyers salivating. But we will be suggesting that our users install it themselves if they experience problems.

Conclusion

Yes, there is still even more, including some other changes and enhancements that I’ve been making to other related projects along the way. But I won’t go into details here. If they become significant enough I’ll probably write another blog post or two.

To conclude this article I’d just like to mention that, despite the problems I’ve been dealing with, I have been very impressed with the Qt source code and its capabilities. I can tell that a lot of thought went into the design and implementation, and I look forward to being able to contribute more to it and also to PySide.

Visualizing Uncertainty

Inspired by a post on visually weighted regression plots in R, I’ve been playing with shading to visually represent uncertainty in a model fit. In making these plots, I’ve used Python and matplotlib. I used Gaussian process regression from sklearn to model a synthetic data set, based on this example.

In the first plot, I’ve just used the error estimate returned from GaussianProcess.predict() as the Mean Squared Error. Assuming normally distributed error, and shading according to the normalized probability density, the result is the figure below.
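Here is a minimal sketch of how such a density shading can be produced. It uses sklearn’s current GaussianProcessRegressor API (this post predates it and used the older GaussianProcess class) and a made-up synthetic data set:

# Hedged sketch: analytic density shading for a GP fit on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, 30)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)

# alpha is the assumed observation-noise variance.
gp = GaussianProcessRegressor(alpha=0.3 ** 2).fit(X, y)
xs = np.linspace(0, 10, 200)[:, None]
mu, sigma = gp.predict(xs, return_std=True)

# Evaluate the normal density of the fit on an (x, y) grid and shade by it;
# normalizing each column gives every x position equal visual weight.
ys = np.linspace(y.min() - 1.0, y.max() + 1.0, 200)
density = norm.pdf(ys[:, None], loc=mu[None, :], scale=sigma[None, :])
density /= density.max(axis=0)

plt.imshow(density, origin="lower", aspect="auto",
           extent=(0, 10, ys[0], ys[-1]), cmap="Blues")
plt.scatter(X, y, c="k", s=10)
plt.show()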

The dark shading is proportional to the probability that the curve passes through that point.  It should be noted that unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained.  A contour plot of the probability density would narrow where the traditional error bars are widest.

The errors aren’t really Gaussian, of course. We can estimate the errors empirically by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution, but will be different in each trial due to the random number generator used. We can ensure that the trials are reproducible by explicitly setting the random seeds. This is similar to the method of error estimation from a sample population known as the bootstrap (although this is not a true bootstrap, as we are generating new trials instead of simulating them by sampling subsets of the data). After fitting each of the sample populations, the predicted curves for 200 trials are shown in the spaghetti plot below.
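Continuing the sketch above (the same synthetic X, xs, and noise level are assumed), the trial loop might look like this:

# Hedged sketch of the trial loop: refit on freshly generated noisy data
# many times, with explicit seeds so every trial is reproducible.
preds = []
for seed in range(200):
    rng_t = np.random.RandomState(seed)
    y_t = np.sin(X).ravel() + rng_t.normal(0, 0.3, len(X))
    gp_t = GaussianProcessRegressor(alpha=0.3 ** 2).fit(X, y_t)
    preds.append(gp_t.predict(xs))

# Overplotting all trials at low opacity gives the spaghetti plot.
for p in preds:
    plt.plot(xs, p, color="steelblue", alpha=0.05)
plt.show()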

If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:
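One simple way to do that, again continuing the sketch, is a 2-D histogram over the pooled curves:

# Hedged sketch: histogram the 200 fitted curves on an (x, y) grid to get
# an empirical density, then shade by it as before.
flat_x = np.tile(xs.ravel(), len(preds))
flat_y = np.concatenate(preds)
H, xe, ye = np.histogram2d(flat_x, flat_y, bins=80)

plt.imshow(H.T / H.max(), origin="lower", aspect="auto",
           extent=(xe[0], xe[-1], ye[0], ye[-1]), cmap="Blues")
plt.show()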

It’s not as clean as the analytic error.  In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times).  That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).

Nevertheless, the computational infrastructure of NumPy and SciPy, as well as tools like matplotlib and sklearn, make Python a great environment for this kind of data exploration and modeling.

The code that generated these plots is in an IPython notebook file, which you can view online or download directly.

Explore NYC 311 Data

Datagotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on “urban science” got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggests there isn’t much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other data sets on the site, but we focused on this data set to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.

NYC 311 calls are categorized into approximately 200 different complaint types, ranging from “squeegee” and “radioactive” to “noise” and “street condition.” There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and longitude and latitude information but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, “garbage in | garbage out” holds, so be advised: we are taking the data at face value.

Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing-related complaints to HPD (NYC Dept of Housing Preservation and Development) represented the vast majority of calls (1,671,245). My personal favorite, “squeegee,” was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. “Radioactive” is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we’ve included visualizations for a handful of categories above. The maps show the number of calls per complaint type organized by census tract.

Tinkering With Visualizations

The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I’m looking at you), but this likely reflects the density of restaurants in that part of the city. One might also make the knee jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.

Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.

[Figure: histograms of # of calls per census tract]
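A hedged sketch of the quantile idea (the stand-in data below is made up; in practice counts would be the per-tract call counts for one complaint type):

# Bin tract counts by quantile rather than on a linear scale, so each of
# the five color classes holds roughly the same number of tracts.
import numpy as np

counts = np.random.lognormal(mean=3.0, sigma=1.0, size=2000)  # stand-in data
edges = np.percentile(counts, [20, 40, 60, 80])
color_class = np.digitize(counts, edges)  # classes 0-4, one per tract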

Adding Other Data

As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the census bureau and NYC to compare with call numbers.  For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people that consume one or more sugary drinks per day, on average (2009 survey, the darkest areas are on the order of 40-45%).

[Figure: % drinking one or more sugary drinks per day]

Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).

[Figure: sugary drink vs. public housing complaints]

While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.

Tools: PostGIS, Psycopg2, Pandas, D3

PostGIS did most of the heavy lifting in this preliminary exploration. Psycopg2 provided our Python link, and Pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post, and QGIS for some visualization and to generate the static images above. All of these tools are open source.
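A minimal sketch of that pipeline, with hypothetical table and column names (the actual schema isn’t shown in the post):

# Hedged sketch: pull the 311 extract out of PostGIS and count calls per
# complaint type and census tract with Pandas. Names are hypothetical.
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=nyc311")  # assumed local PostGIS database
calls = pd.read_sql(
    "SELECT complaint_type, census_tract FROM calls_311", conn)

counts = (calls.groupby(["complaint_type", "census_tract"])
               .size()
               .rename("n_calls")
               .reset_index())
print(counts.head())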

Semi-Final Thoughts

The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompt some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask this data? What other data sets might you bring to bear?


Scipy 2012

No Mas

Scipy 2012 is wrapping up as bands of sprinters come together to hack away on their projects of interest. Many thanks to the sponsors and volunteers who made this year’s Scipy another success. The Scipy conference has always been highly technical. Although “data science” and “big data” have become buzzwords recently, Scipy has been exploring these themes for many years. Projects featuring machine learning, high-performance computing, and visualization were in full attendance at this year’s Scipy. Stay tuned for links to talk videos (care of Next Day Video)!

GPU-palooza: May 21, NYC

Enthought is proud to sponsor the NYC HPC-GPU Supercomputing meetup on May 21st, 2012. This meeting will focus on the GPU and Python (what else?).

A happy coincidence brings together three GPU wise men at this meetup. Andreas Klockner is the author of PyCUDA and PyOpenCL. Nicolas Pinto comes to us via MIT with an extensive background in machine learning/GPU-related research. Last but not least, Enthought’s own Sean Ross Ross will share the latest on OpenCL and CLyther.

It promises to be a GPU-accelerated night! The room is filled to capacity I believe, but it’s not uncommon for people to get off the waiting list.

Please come by and say hello!