Category Archives: Training

Webinar: Work Better, Smarter, and Faster in Python with Enthought Training on Demand


We’ll demonstrate how Enthought Training on Demand can help both new Python users and experienced Python developers be better, smarter, and faster at the scientific and analytic computing tasks that directly impact their daily productivity and drive results.

View a recording of the “Work Better, Smarter, and Faster in Python with Enthought Training on Demand” webinar here.


Exploring NumPy/SciPy with the “House Location” Problem

Author: Aaron Waters

I created a Notebook that describes how to examine, illustrate, and solve a geometric mathematical problem called “House Location” using Python mathematical and numeric libraries. The discussion uses symbolic computation, visualization, and numerical computations to solve the problem while exercising the NumPy, SymPy, Matplotlib, IPython and SciPy packages.

I hope that this discussion will be accessible to people with a minimal background in programming and a high-school level background in algebra and analytic geometry. There is a brief mention of complex numbers, but the use of complex numbers is not important here except as “values to be ignored”. I also hope that this discussion illustrates how to combine different mathematically oriented Python libraries and explains how to smooth out some of the rough edges between the library interfaces.

http://nbviewer.ipython.org/urls/raw.github.com/awatters/CanopyDemoArchive/master/misc/house_locations.ipynb
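For a flavor of what the notebook does, here is a minimal, made-up sketch (not code from the notebook) of the SymPy-to-NumPy hand-off it relies on: pose a small distance problem symbolically, discard any complex roots, and lambdify the expression for numerical work. The objective below is a hypothetical stand-in, not the actual “House Location” setup.

import numpy as np
import sympy as sp

x = sp.Symbol('x')

# Hypothetical objective: total squared distance from a point (x, x) on the
# line y = x to two "houses" at (0, 1) and (3, 2).
d2 = (x - 0)**2 + (x - 1)**2 + (x - 3)**2 + (x - 2)**2

# Solve symbolically for the critical points of the objective.
roots = sp.solve(sp.diff(d2, x), x)

# Keep only the real solutions; complex roots are the "values to be ignored".
real_roots = [complex(r).real for r in roots if abs(complex(r).imag) < 1e-12]

# Smoothing a seam between libraries: lambdify turns the SymPy expression
# into a NumPy-aware function for plotting or numerical checks.
f = sp.lambdify(x, d2, modules='numpy')
xs = np.linspace(-1.0, 4.0, 200)
print(real_roots, xs[np.argmin(f(xs))])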

Enthought Sponsors First NY QPUG Meetup

Though all eyes are probably on the aftermath of PyCon (which, from all reports, was another great conference), Enthought was happy to sponsor the first New York Quantitative Python User Group Meetup (wow, that’s a mouthful) on March 6th. If you are in the New York area, you can sign up for the group here.

The program for the evening featured Marcos Lopez de Prado and our own Kelsey Jordahl (with an assist from yours truly). The meetup focused on the topic of portfolio optimization and some of its foibles. Marcos conducted an in-depth discussion of portfolio optimization in general and outlined his open source implementation of the CLA algorithm. He also discussed why he is such a fan of Python.

Our contribution to the evening focused on the theme “From Research to Application.” And by “research” we meant both research code (Marcos’ CLA code is one example) and actual investment research. Firms are wrestling with data and trying to marshal all the expertise within the organization to make decisions. Increasingly, software is being used to help synthesize this information. In our thought experiment, we imagined a hypothetical portfolio manager or strategist who is trying to integrate the quantitative and fundamental expertise within the firm. What kind of information would this PM want to see? How could we make the application visually appealing and intuitively interactive?

We chose to use the Black-Litterman model to tie some of these threads together. In a nutshell, Black-Litterman takes a Bayesian approach to portfolio optimization. It assumes that the capital allocations in the market are decent and reverses the classical optimization process to infer expected returns (rather than weights). It also allows modification of these expected returns to reflect analyst views on a particular asset. For those of you not familiar with this subject, you can find an accessible discussion of the approach in He and Litterman (1999). Using the Black-Litterman model as our organizing principle, we put together an application that provides context for historical returns, relative value, and pairwise asset correlations, all wired together to provide full interactivity.
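For the curious, here is a minimal sketch of the reverse-optimization step just described. This is not the demo’s code; the covariance matrix, market weights, analyst view, and the delta and tau parameters are all toy assumptions.

import numpy as np

# Toy inputs (assumed for illustration): covariance of three assets, their
# market-cap weights, and a risk-aversion coefficient.
Sigma = np.array([[0.040, 0.006, 0.010],
                  [0.006, 0.090, 0.015],
                  [0.010, 0.015, 0.060]])
w_mkt = np.array([0.5, 0.3, 0.2])
delta = 2.5

# Reverse optimization: infer the equilibrium expected returns implied by the
# market weights, rather than solving for weights from assumed returns.
Pi = delta * Sigma.dot(w_mkt)

# One analyst view: the second asset outperforms the third by 2%, with
# view uncertainty Omega.
P = np.array([[0.0, 1.0, -1.0]])
Q = np.array([0.02])
tau = 0.05
Omega = P.dot(tau * Sigma).dot(P.T)

# Standard Black-Litterman blend of the equilibrium returns with the view.
inv = np.linalg.inv
posterior = inv(inv(tau * Sigma) + P.T.dot(inv(Omega)).dot(P)).dot(
    inv(tau * Sigma).dot(Pi) + P.T.dot(inv(Omega)).dot(Q))

print("equilibrium returns:", Pi)
print("posterior returns:  ", posterior)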

Given the limited time we had to put this together, there are obviously things we would have changed and things we would have liked to include. Nevertheless, we think the demo is a good example of how one can use open source technology to not only take advantage of research code but also integrate quantitative models and fundamental research.

FYI, the libraries used in the app are: NumPy, Pandas, SciPy, Traits, Chaco, and Enaml.

Videos of the talks are below. Tell us what you think!

QPUG_20130306_PortfolioDemo from NYQPUG on Vimeo.

QPUG_20130306_Marcos from NYQPUG on Vimeo.

What Is Your Python Budget?

C programmers, by necessity, generally develop a mental model for understanding the performance characteristics of their code. Developing this intuition in a high level language like Python can be more of a challenge. While good Python tools exist for identifying time and memory performance (line_profiler by Robert Kern and guppy by Sverker Nilsson), you are largely on your own if you want to develop intuition for code that is yet to be written. Understanding the cost of basic operations in your Python implementation can help guide design decisions by ruling out extensive use of expensive operations.

Why is this important, you ask? Interactive applications appear responsive when they react to user behavior within a given time budget. In our consulting engagements, we often find that a lack of awareness regarding the cost of common operations can lead to sluggish application performance. Some examples of user interaction thresholds:

  • If you are targeting 60 fps in a multimedia application you have 16 milliseconds of processing time per frame. In this time, you need to update state, figure out what is visible, and then draw it.
  • Well-behaved applications will load a functional screen that a user can interact with in under a second. Depending on your application, you may need to create an expensive data structure up front before your user can interact with the application. Often one needs to find a way to at least make it feel like the one-second constraint is being respected.
  • You run Gentoo / Arch. In this case, obsessing over performance is a way of life.

Obviously rules are meant to be broken, but knowing where to be frugal can help you avoid or troubleshoot performance problems. Performance data for Python and PyPy are listed below.

Machine configuration

  • CPU – AMD 8150
  • RAM – 16 GB PC3-12800
  • OS – Windows 7, 64-bit
  • Python – 2.7.2 (EPD 7.3-1, 64-bit)
  • PyPy – 2.0.0-beta1 (MSC v.1500, 32-bit)

Steps to obtain timings and create tables from the data

python measure.py cpython.data
pypy measure.py pypy.data

python draw_table.py cpython.data cpython.png
python draw_table.py pypy.data pypy.png

To obtain the code to measure timings and create the associated tables for your own machine, check out https://github.com/deepankarsharma/cost-of-python
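If you just want a feel for the approach without cloning the repo, the sketch below shows the kind of micro-benchmark measure.py performs, using the standard-library timeit module. It is not the repository’s script, and the operations listed are only examples.

import timeit

# A few representative basic operations, each with the setup code it needs.
OPERATIONS = {
    'dict lookup': ('d["key"]', 'd = {"key": 1}'),
    'list append': ('lst.append(1)', 'lst = []'),
    'attribute access': ('obj.x', 'class C(object): x = 1\nobj = C()'),
    'function call': ('f()', 'def f(): pass'),
}

for name, (stmt, setup) in sorted(OPERATIONS.items()):
    n = 1000000
    # Best of three runs, reported as nanoseconds per operation.
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=n))
    print('%-18s %8.1f ns' % (name, best / n * 1e9))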

CPython timing data

[CPython timing table: cpython.png]

PyPy timing data

[PyPy timing table: pypy.png]

Visualizing Uncertainty

Inspired by a post on visually weighted regression plots in R, I’ve been playing with shading to visually represent uncertainty in a model fit. In making these plots, I’ve used Python and matplotlib. I used Gaussian process regression from sklearn to model a synthetic data set, based on this example.

In the first plot, I’ve just used the error estimate returned from GaussianProcess.predict() as the Mean Squared Error. Assuming normally distributed error, and shading according to the normalized probability density, the result is the figure below.

The dark shading is proportional to the probability that the curve passes through that point.  It should be noted that unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained.  A contour plot of the probability density would narrow where the traditional error bars are widest.
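As a rough illustration of the shading, here is a minimal sketch along the same lines. It is not the notebook’s code, and it uses the current scikit-learn API (GaussianProcessRegressor with return_std=True) rather than the older GaussianProcess class; the synthetic data and kernel settings are assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor

# Synthetic data: noisy samples of x*sin(x).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 20))[:, None]
y = X.ravel() * np.sin(X.ravel()) + rng.normal(0, 0.5, X.shape[0])

gp = GaussianProcessRegressor(alpha=0.25, normalize_y=True).fit(X, y)

x_grid = np.linspace(0, 10, 400)
y_grid = np.linspace(y.min() - 3.0, y.max() + 3.0, 300)
mu, sigma = gp.predict(x_grid[:, None], return_std=True)

# Gaussian density at each (x, y) pixel; the 1/sigma factor is what makes the
# shading darkest where the fit is most tightly constrained.
z = (y_grid[:, None] - mu[None, :]) / sigma[None, :]
density = np.exp(-0.5 * z**2) / sigma[None, :]
density /= density.max()

plt.imshow(density, origin='lower', aspect='auto', cmap='Greys',
           extent=(x_grid[0], x_grid[-1], y_grid[0], y_grid[-1]))
plt.plot(X.ravel(), y, 'o')
plt.plot(x_grid, mu, lw=1.5)
plt.show()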

The errors aren’t really Gaussian, of course. We can estimate the errors empirically, by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution, but will be different in each trial due to the random number generator used. We can ensure that the trials are reproducible by explicitly setting the random seeds. This is similar to the method of error estimation from a sample population known as the bootstrap (although this is not a true bootstrap, as we are generating new trials instead of simulating them by sampling subsets of the data). After fitting each of the sample populations, the predicted curves for 200 trials are shown in the spaghetti plot below.
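The trial loop itself is short. Reusing the hypothetical setup from the sketch above (with x*sin(x) standing in for the generating function), it might look like this:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def generating_function(x):
    return x * np.sin(x)              # stand-in for the true curve

x_grid = np.linspace(0, 10, 400)
curves = []
for seed in range(200):               # one reproducible trial per seed
    rng = np.random.RandomState(seed)
    X = np.sort(rng.uniform(0, 10, 20))[:, None]
    y = generating_function(X.ravel()) + rng.normal(0, 0.5, X.shape[0])
    gp = GaussianProcessRegressor(alpha=0.25, normalize_y=True).fit(X, y)
    curves.append(gp.predict(x_grid[:, None]))

# Plot each row of `curves` for the spaghetti plot; a 2-D histogram of the
# same array gives the empirical density plot that follows.
curves = np.array(curves)             # shape (200, len(x_grid))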

If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:

It’s not as clean as the analytic error.  In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times).  That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).

Nevertheless, the computational infrastructure of NumPy and SciPy, as well as tools like matplotlib and sklearn, make Python a great environment for this kind of data exploration and modeling.

The code that generated these plots is in an IPython notebook file, which you can view online or download directly.

LFPUG: Python in the enterprise + Pandas

Over 80 people attended last night’s London Financial Python User Group (LFPUG) meetup, with presentations given by Den Pilsworth of AHL/Man Group, Eric Jones of Enthought, and Wes McKinney of Pandas fame. It was an evening filled with practical content, so come on out for the next meetup if you are in town (or for drinks at the pub afterwards)!

The agenda for the evening:

1. “Moving an algo business from R and Java to Python”, Dennis Pilsworth, AHL, Man Group
2. “Financial data analysis in Python with pandas”, Wes McKinney
3. “Fostering Python Adoption within a Company”, Eric Jones, Enthought.

Den presented a case study of how his firm introduced Python into production and ensured that “network distributed” deployment worked quickly enough to give good local response times without overloading the network. He also discussed visualization and pointed out that native Python tools need some work to remain competitive with the R user’s sweetheart, ggplot2. He graciously acknowledged the role Enthought’s training played in getting things rolling.

Wes McKinney discussed the latest Pandas developments, particularly its group-by functionality. A number of attendees were interested in potentially using this functionality to replace Excel pivot tables. Make sure to check out Wes’ new book, “Python for Data Analysis.”
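For readers coming from Excel, here is a minimal sketch of that group-by / pivot-table pattern. It is not from the talk, and the column names are invented for illustration.

import pandas as pd

trades = pd.DataFrame({
    'desk':   ['rates', 'rates', 'fx', 'fx', 'equity'],
    'trader': ['ann', 'bob', 'ann', 'bob', 'ann'],
    'pnl':    [120.0, -35.0, 80.0, 15.0, 42.0],
})

# Group-by: total and mean P&L per desk.
print(trades.groupby('desk')['pnl'].agg(['sum', 'mean']))

# Pivot table: desks as rows, traders as columns, summed P&L in the cells.
print(trades.pivot_table(values='pnl', index='desk', columns='trader',
                         aggfunc='sum'))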

Eric Jones discussed how to get Python adopted in the face of opposition, featuring some of the classic objections (e.g. “Python is too slow”).

LFPUG meets  roughly every other month, so look us up on LinkedIn and keep an eye out for the next meeting!

Explore NYC 311 Data

DataGotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on “urban science” got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggested there isn’t much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other data sets on the site, but we focused on this data set to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.

NYC 311 calls are categorized into approximately 200 different complaint types, ranging from “squeegee” and “radioactive” to “noise” and “street condition.” There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and longitude and latitude information but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, “garbage in | garbage out” holds, so be advised that we are taking the data at face value.

Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing related complaints to HPD (NYC Dept of Housing Preservation and Development) represented the vast majority of calls (1,671,245). My personal favorite, “squeegee,” was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. “Radioactive” is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we’ve included visualizations for a handful of categories above. The maps show the number of calls per complaint type, organized by census tract.
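In code, the aggregation step is straightforward once the records are in a DataFrame. The sketch below is not the actual analysis; the file name and the complaint_type and census_tract column names are assumptions about the dataset’s layout.

import pandas as pd

calls = pd.read_csv('nyc_311.csv')   # hypothetical export of the 311 dataset

# Overall counts per complaint type (HPD housing complaints, "squeegee", ...).
by_type = calls['complaint_type'].value_counts()

# Counts per complaint type within each census tract, ready for a choropleth.
by_tract = (calls.groupby(['census_tract', 'complaint_type'])
                 .size()
                 .unstack(fill_value=0))

print(by_type.head())
print(by_tract.head())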

Tinkering With Visualizations

The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I’m looking at you), but this likely reflects the density of restaurants in that part of the city. One might also make the knee-jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.

Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.

# of calls per census tract
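The quantile idea is easy to prototype. Below is a minimal sketch using pandas qcut, with a synthetic heavy-tailed calls_per_tract Series standing in for the real per-tract counts; each color class then holds roughly the same number of tracts.

import numpy as np
import pandas as pd

# Stand-in for the real data: a heavy-tailed distribution of per-tract counts.
rng = np.random.RandomState(0)
calls_per_tract = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000))

# Five quantile-based color bins; compare with pd.cut, which bins by value.
color_bin = pd.qcut(calls_per_tract, q=5, labels=['q1', 'q2', 'q3', 'q4', 'q5'])

print(color_bin.value_counts())                      # ~400 tracts per bin
print(calls_per_tract.groupby(color_bin).agg(['min', 'max']))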

Adding Other Data

As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the census bureau and NYC to compare with call numbers.  For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people that consume one or more sugary drinks per day, on average (2009 survey, the darkest areas are on the order of 40-45%).

% drinking one or more sugary drinks per day

Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).

sugary drink vs. public housing complaints

While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.

Tools: PostGIS, Psycopg2, Pandas, D3

PostGIS did most of the heavy lifting in this preliminary exploration. Psycopg2 provided our Python link and Pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post. We used QGIS for some visualization and to generate the static images above. All of these tools are open source.
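To make that pipeline concrete, here is a minimal sketch of how the pieces might fit together: PostGIS performs the spatial join that assigns each call to a census tract, psycopg2 runs the query, and Pandas picks up the result. The database, table, and column names are all assumptions, not the project’s actual schema.

import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname='nyc311', user='analyst')
cur = conn.cursor()

# Spatial join in PostGIS: count calls per complaint type per census tract.
cur.execute("""
    SELECT t.tract_id,
           c.complaint_type,
           count(*) AS n_calls
    FROM calls_311 AS c
    JOIN census_tracts AS t
      ON ST_Contains(t.geom, c.geom)
    GROUP BY t.tract_id, c.complaint_type;
""")

counts = pd.DataFrame(cur.fetchall(),
                      columns=['tract_id', 'complaint_type', 'n_calls'])
print(counts.head())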

Semi-Final Thoughts

The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompt some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask this data? What other data sets might you bring to bear?


DataGotham…Complete!

Well, DataGotham is over. The conference featured a wide cross-section of the data community in NYC. Talks spanned topics from “urban science” to “finding racism on Foursquare” to “creating an API for spaces.” Don’t worry, the videos will be online soon so you can investigate for yourself. The organizers did a great job putting a conference of this size together on relatively short notice. Bravo, NYC data crunchers!

One thing I somehow missed was a network graph created by the organizers to illustrate the tools used by attendees. I am happy to see Python leading the way! The thickness of each edge indicates the number of people using both of the tools it connects. It seems there are a lot of people trying to make Python and R “two great tastes that go great together.” I’m curious as to why more Python users aren’t using NumPy and SciPy. Food for thought…

Got tools?