Category Archives: New York

Enthought Sponsors First NY QPUG Meetup

Though all eyes are probably on the aftermath of PyCon (which, from all reports, was another great conference), Enthought was happy to sponsor the first New York Quantitative Python User Group Meetup (wow, that's a mouthful) on March 6th. If you are in the New York area, you can sign up for the group here.

The program for the evening featured Marcos Lopez de Prado and our own Kelsey Jordahl (with an assist from yours truly). The meetup focused on the topic of portfolio optimization and some of its foibles. Marcos conducted an in-depth discussion of portfolio optimization in general and outlined his open source implementation of the CLA algorithm. He also discussed why he is such a fan of Python.

Our contribution to the evening focused on the theme "From Research to Application." And by "research" we meant both research code (Marcos' CLA code is one example) and actual investment research. Firms are wrestling with data and trying to marshal all the expertise within the organization to make decisions. Increasingly, software is being used to help synthesize this information. In our thought experiment, we imagined a hypothetical portfolio manager or strategist who is trying to integrate the quantitative and fundamental expertise within the firm. What kind of information would this PM want to see? How could we make the application visually appealing and intuitively interactive?

We chose to use the Black-Litterman model to tie some of these threads together. In a nutshell, Black-Litterman takes a Bayesian approach to portfolio optimization. It assumes that the capital allocations in the market are decent and reverses the classical optimization process to infer expected returns (rather than weights). It also allows modification of these expected returns to reflect analyst views on a particular asset. For those of you not familiar with this subject, you can find an accessible discussion of the approach in He and Litterman (1999). Using the Black-Litterman model as our organizing principle, we put together an application that provides context for historical returns, relative value, and pairwise asset correlations, all wired together to provide full interactivity.
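To make the reverse-optimization step concrete, here is a minimal NumPy sketch (not the demo's code): starting from market-cap weights and a return covariance matrix, it infers the implied equilibrium returns and then blends in a single analyst view, following He and Litterman (1999). The asset numbers and the delta/tau parameters are purely illustrative.

import numpy as np

# Illustrative inputs: three assets, market-cap weights and return covariance
w_mkt = np.array([0.5, 0.3, 0.2])
Sigma = np.array([[0.040, 0.012, 0.006],
                  [0.012, 0.025, 0.008],
                  [0.006, 0.008, 0.016]])
delta, tau = 2.5, 0.05   # risk aversion and prior uncertainty (assumed values)

# Reverse optimization: infer implied equilibrium excess returns, not weights
pi = delta * Sigma.dot(w_mkt)

# One analyst view: asset 1 outperforms asset 2 by 2%, with uncertainty Omega
P = np.array([[1.0, -1.0, 0.0]])
Q = np.array([0.02])
Omega = P.dot(tau * Sigma).dot(P.T)

# Posterior expected returns blend the equilibrium prior with the view
tS_inv = np.linalg.inv(tau * Sigma)
O_inv = np.linalg.inv(Omega)
mu_bl = np.linalg.solve(tS_inv + P.T.dot(O_inv).dot(P),
                        tS_inv.dot(pi) + P.T.dot(O_inv).dot(Q))

print("implied equilibrium returns:", pi)
print("posterior expected returns: ", mu_bl)

These posterior returns can then be fed back into a standard optimizer, which is what ties the analyst views to the final allocation.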

Given the limited time we had to put this together, there are obviously things we would have changed and things we would have liked to include. Nevertheless, we think the demo is a good example of how one can use open source technology to not only take advantage of research code but also integrate quantitative models and fundamental research.

FYI, the libraries used in the app are: NumPy/pandas, SciPy, Traits, Chaco, and Enaml.

Videos of the talks are below. Tell us what you think!

QPUG_20130306_PortfolioDemo from NYQPUG on Vimeo.

QPUG_20130306_Marcos from NYQPUG on Vimeo.

What Is Your Python Budget?

C programmers, by necessity, generally develop a mental model for understanding the performance characteristics of their code. Developing this intuition in a high-level language like Python can be more of a challenge. While good Python tools exist for identifying time and memory performance (line_profiler by Robert Kern and guppy by Sverker Nilsson), you are largely on your own if you want to develop intuition for code that has yet to be written. Understanding the cost of basic operations in your Python implementation can help guide design decisions by ruling out extensive use of expensive operations.

Why is this important, you ask? Interactive applications appear responsive when they react to user behaviour within a given time budget. In our consulting engagements, we often find that a lack of awareness regarding the cost of common operations can lead to sluggish application performance. Some examples of user interaction thresholds:

  • If you are targeting 60 fps in a multimedia application, you have 16 milliseconds of processing time per frame. In this time, you need to update state, figure out what is visible, and then draw it.
  • Well-behaved applications will load a functional screen that a user can interact with in under a second. Depending on your application, you may need to create an expensive data structure up front before your user can interact with the application. Often one needs to find a way to at least make it feel like the one-second constraint is being respected.
  • You run Gentoo / Arch. In this case, obsessing over performance is a way of life.

Obviously rules are meant to be broken, but knowing where to be frugal can help you avoid or troubleshoot performance problems. Performance data for Python and PyPy are listed below.

Machine configuration

  • CPU – AMD 8150
  • RAM – 16 GB PC3-12800
  • OS – Windows 7, 64-bit
  • Python 2.7.2 – EPD 7.3-1 (64-bit)
  • PyPy 2.0.0-beta1 with MSC v.1500, 32-bit

Steps to obtain timings and create tables from the data

python measure.py cpython.data
pypy measure.py pypy.data

python draw_table.py cpython.data cpython.png
python draw_table.py pypy.data pypy.png

To obtain the code to measure timings and create the associated tables for your own machine, check out https://github.com/deepankarsharma/cost-of-python
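As a flavor of what such measurements look like (this is a minimal sketch, not the repository's measure.py), timeit can be used to compare the per-call cost of a few common operations; the operation list here is just illustrative.

import timeit

# A handful of operations whose relative costs are worth internalizing
snippets = [
    ("dict lookup",      "d[500]",        "d = dict((i, i) for i in range(1000))"),
    ("list append",      "lst.append(1)", "lst = []"),
    ("attribute access", "obj.x",         "class C(object): x = 1\nobj = C()"),
    ("function call",    "f()",           "def f(): pass"),
    ("string concat",    "s + 'a'",       "s = 'x' * 100"),
]

for name, stmt, setup in snippets:
    # best of three runs, reported in microseconds per operation
    usec = min(timeit.repeat(stmt, setup, repeat=3, number=100000)) / 100000 * 1e6
    print("%-18s %8.3f microseconds" % (name, usec))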

CPython timing data

[timing table image]

PyPy timing data

[timing table image]

Visualizing Uncertainty

Inspired by a post on visually weighted regression plots in R, I've been playing with shading to visually represent uncertainty in a model fit. In making these plots, I've used Python and matplotlib. I used Gaussian process regression from sklearn to model a synthetic data set, based on this example.

In the first plot, I've used the error estimate (the mean squared error) returned from GaussianProcess.predict(). Assuming normally distributed error, and shading according to the normalized probability density, the result is the figure below.

The dark shading is proportional to the probability that the curve passes through that point.  It should be noted that unlike traditional standard deviation plots, this view emphasizes the regions where the fit is most strongly constrained.  A contour plot of the probability density would narrow where the traditional error bars are widest.
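For readers who want to reproduce this kind of shading, here is a minimal sketch using the current scikit-learn API (GaussianProcessRegressor with return_std=True, rather than the older GaussianProcess class mentioned above); the synthetic data and kernel choices are illustrative, not the ones from the notebook.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative synthetic data: noisy samples of x*sin(x)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 30))[:, None]
y = X.ravel() * np.sin(X.ravel()) + rng.normal(0, 0.5, X.shape[0])

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.25), normalize_y=True)
gp.fit(X, y)

x_grid = np.linspace(0, 10, 200)
y_mean, y_std = gp.predict(x_grid[:, None], return_std=True)

# Evaluate the normal density of the fit on an (x, y) grid and normalize each
# column, so the darkest shade marks where the curve is most tightly constrained
y_grid = np.linspace(y_mean.min() - 3, y_mean.max() + 3, 300)
density = norm.pdf(y_grid[:, None], loc=y_mean[None, :], scale=y_std[None, :])
density /= density.max(axis=0)

plt.imshow(density, origin="lower", aspect="auto", cmap="Blues",
           extent=(x_grid.min(), x_grid.max(), y_grid.min(), y_grid.max()))
plt.plot(X.ravel(), y, "k.")
plt.show()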

The errors aren't really Gaussian, of course. We can estimate the errors empirically, by generating sample data many times from the same generating function, adding random noise as before. The noise has the same distribution, but will be different in each trial due to the random number generator used. We can ensure that the trials are reproducible by explicitly setting the random seeds. This is similar to the method of error estimation from a sample population known as the bootstrap (although this is not a true bootstrap, as we are generating new trials instead of simulating them by sampling subsets of the data). After fitting each of the sample populations, the predicted curves for 200 trials are shown in the spaghetti plot below.
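A sketch of that resampling loop, continuing the snippet above (the trial count and noise level are again illustrative):

# Refit the model on many noisy realizations of the same generating function,
# seeding each trial explicitly so the experiment is reproducible
n_trials = 200
preds = np.empty((n_trials, x_grid.size))
for i in range(n_trials):
    trial_rng = np.random.RandomState(i)
    y_trial = X.ravel() * np.sin(X.ravel()) + trial_rng.normal(0, 0.5, X.shape[0])
    gp_trial = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.25),
                                        normalize_y=True)
    preds[i] = gp_trial.fit(X, y_trial).predict(x_grid[:, None])

# Spaghetti plot: every fitted curve, drawn with low alpha
plt.plot(x_grid, preds.T, color="steelblue", alpha=0.05)
plt.plot(X.ravel(), y, "k.")
plt.show()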

If we calculate the density of the fitted curves, we can make the empirical version of the density plot, like so:

It’s not as clean as the analytic error.  In theory, we should be able to make it smoother by adding more trials, but this is computationally expensive (we’re already solving our problem hundreds of times).  That isn’t an issue for a problem this size, but this method would require some additional thought and care for a larger dataset (and/or higher dimensions).
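Continuing the sketch, one way to build the empirical density plot is to histogram the fitted curves column by column and reuse the same column normalization as before:

# Histogram the 200 fitted values at each x position, then normalize columns
counts = np.stack([np.histogram(preds[:, j], bins=y_grid)[0]
                   for j in range(x_grid.size)], axis=1).astype(float)
counts /= np.maximum(counts.max(axis=0), 1)

plt.imshow(counts, origin="lower", aspect="auto", cmap="Blues",
           extent=(x_grid.min(), x_grid.max(), y_grid.min(), y_grid.max()))
plt.show()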

Nevertheless, the computational infrastructure of NumPy and SciPy, as well as tools like matplotlib and sklearn, make Python a great environment for this kind of data exploration and modeling.

The code that generated these plots is in an IPython notebook file, which you can view online or download directly.

Explore NYC 311 Data

DataGotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on "urban science" got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggested there isn't much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other data sets on the site, but we focused on this data set to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.

NYC 311 calls are categorized into approximately 200 different complaint types, ranging from "squeegee" and "radioactive" to "noise" and "street condition." There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and longitude and latitude information but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, "garbage in | garbage out" holds, so be advised, we are taking the data at face value.

Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing-related complaints to HPD (NYC Dept of Housing Preservation and Development) represented the vast majority of calls (1,671,245). My personal favorite, "squeegee," was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. "Radioactive" is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we've included visualizations for a handful of categories above. The maps show the number of calls per complaint type organized by census tract.

Tinkering With Visualizations

The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I'm looking at you), but this likely reflects the density of restaurants in that part of the city. One might also make the knee-jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.

Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.

# of calls per census tract
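As a rough illustration of the quantile idea (the counts below are simulated, not the actual 311 numbers), pandas makes it easy to compare equal-width color bins with quantile bins:

import numpy as np
import pandas as pd

# Simulated, roughly log-normal call counts per census tract (not real 311 data)
rng = np.random.RandomState(0)
calls = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000))

# Equal-width bins pile most tracts into the lowest color class...
equal_width = pd.cut(calls, bins=5)
print(equal_width.value_counts().sort_index())

# ...while quantile bins spread tracts evenly across the five classes
by_quantile = pd.qcut(calls, q=5, labels=["q1", "q2", "q3", "q4", "q5"])
print(by_quantile.value_counts().sort_index())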

Adding Other Data

As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the census bureau and NYC to compare with call numbers.  For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people that consume one or more sugary drinks per day, on average (2009 survey, the darkest areas are on the order of 40-45%).

% drinking one or more sugary drinks per day

Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).

sugary drink vs. public housing complaints

While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.

Tools: PostGIS, Psycopg2, Pandas, D3

PostGIS did most of the heavy lifting in this preliminary exploration. Psycopg2 provided our Python link and Pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post. We used QGIS for some visualization and to generate the static images above. All of these tools are open source.
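A rough sketch of how that pipeline might look (the table and column names here are hypothetical; the actual schema isn't shown in the post): PostGIS handles the spatial join, psycopg2 carries the query, and pandas reshapes the result.

import pandas as pd
import psycopg2

# Hypothetical connection and schema, for illustration only
conn = psycopg2.connect(dbname="nyc311", user="analyst")
query = """
    SELECT t.tract_id, c.complaint_type, count(*) AS n_calls
    FROM calls_311 c
    JOIN census_tracts t
      ON ST_Contains(t.geom, c.geom)   -- PostGIS point-in-polygon join
    GROUP BY t.tract_id, c.complaint_type;
"""
df = pd.read_sql(query, conn)

# Pivot into a tract-by-complaint-type matrix, ready to join with census data
by_tract = df.pivot(index="tract_id", columns="complaint_type",
                    values="n_calls").fillna(0)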

Semi-Final Thoughts

The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompt some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask this data? What other data sets might you bring to bear?


DataGotham…Complete!

Well, DataGotham is over. The conference featured a wide cross section of the data community in NYC. Talks spanned topics from “urban science” to “finding racism on FourSquare” to “creating an API for spaces.” Don’t worry, the videos will be online soon so you can investigate yourself. The organizers did a great job putting a conference of this size together on relatively short notice. Bravo NYC data crunchers!

One thing I somehow missed was a network graph created by the organizers to illustrate the tools used by attendees. I am happy to see Python leading the way! The thickness of an edge indicates the number of people using both tools. It seems there are a lot of people trying to make Python and R "two great tastes that go great together." I'm curious as to why more Python users aren't using NumPy and SciPy. Food for thought…

Got tools?

DataGotham: Sept 13 & 14 @ NYU

Enthought is proud to announce its sponsorship of the upcoming DataGotham conference in NYC. DataGotham is meant to be a "celebration of New York City's data community." Organized by the likes of Drew Conway and Hilary Mason (and other pillars of the community), DataGotham is shaping up to be a highly concentrated concoction of everything data science-y. Just take a look at the ever-increasing line-up of speakers. Even for those of you who don't live or work in Manhattan, it's probably worth the trip!

At Enthought, we use Python because it offers an effective combination of pragmatism, clarity, and power. We have always argued that good tools "get out of the way," allowing you to focus on the real problem at hand. As such, we are excited to support DataGotham's effort to highlight how the NYC scientific community is tackling a wide variety of questions in urban planning, social media, education, etc.

Don't forget to check out the tutorials! There will be tutorials on Data Journalism, MongoDB/R, Julia (recently featured at SciPy), and Real Time Data Science.

Chaco PyGotham Talk

The penultimate video in our series of talks is an overview of Chaco, Enthought's interactive plotting toolkit. Sit back and enjoy! You can find the GitHub page here.

Clyther PyGotham Talk

Here is a screencast of the Clyther talk at PyGotham. Clyther is an open source project, along the lines of Cython, that allows users to program a GPU with Python via a JIT OpenCL engine. Unfortunately, the multiple-monitor setup resulted in some odd aspect ratios. Nevertheless, the content has been captured for posterity. You can find the GitHub page here. Enjoy!