Datagotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on “urban science” got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggests there isn’t much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other data sets on the site, but we focused on this data set to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.
NYC 311 calls are categorized into approximately 200 different complaint types, ranging from “squeegee” and “radioactive” to “noise” and “street condition.” There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and longitude and latitude but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, “garbage in, garbage out” holds, so be advised: we are taking the data at face value.
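As a rough illustration of the "throw out incomplete records" step, here is a minimal sketch in plain Python. The field names (`complaint_type`, `created_date`, `latitude`, `longitude`) and the sample rows are made up for the example; the real dataset's columns, and the exact rule we applied, may differ.

```python
# Hypothetical field names; the actual 311 export may label these differently.
REQUIRED_FIELDS = ("complaint_type", "created_date", "latitude", "longitude")


def is_complete(record):
    """Return True when every required field is present and non-blank."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True


def split_records(records):
    """Separate usable records from incomplete ones so the loss can be counted."""
    complete, dropped = [], []
    for record in records:
        (complete if is_complete(record) else dropped).append(record)
    return complete, dropped


# Made-up sample rows, for illustration only.
sample = [
    {"complaint_type": "Noise", "created_date": "2011-07-04",
     "latitude": "40.71", "longitude": "-74.00"},
    {"complaint_type": "Squeegee", "created_date": "2010-03-12",
     "latitude": "", "longitude": ""},
    {"complaint_type": "", "created_date": "2012-01-15",
     "latitude": "40.65", "longitude": "-73.95"},
]

complete, dropped = split_records(sample)
print(len(complete), "kept;", len(dropped), "dropped")
```

Keeping the dropped rows around, rather than silently discarding them, makes it easy to check later whether the missing data is concentrated in particular complaint types or neighborhoods.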
Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing-related complaints to HPD (NYC Dept of Housing Preservation and Development) represented by far the largest share of calls (1,671,245 of roughly four million). My personal favorite, “squeegee,” was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. “Radioactive” is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we’ve included visualizations for a handful of categories above. The maps show the number of calls per complaint type organized by census tract.
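The counts behind those maps are nothing fancier than a group-and-count. We did the grouping in Pandas; the sketch below uses only the standard library so it runs anywhere. The complaint names and tract IDs are invented for the example.

```python
from collections import Counter

# Each call reduced to (complaint_type, census_tract). Made-up sample data.
calls = [
    ("Heating", "tract_A"),
    ("Heating", "tract_A"),
    ("Heating", "tract_B"),
    ("Noise", "tract_A"),
    ("Noise", "tract_C"),
    ("Squeegee", "tract_C"),
]

# Calls per complaint type, across the whole city.
by_type = Counter(complaint for complaint, _ in calls)

# Calls per (complaint type, census tract) pair: the number each map cell shows.
by_type_and_tract = Counter(calls)

for complaint, n in by_type.most_common():
    print(complaint, n)
```

`most_common()` returns the types in descending order of count, which is how you quickly see what sits at the top and bottom of the list.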
Tinkering With Visualizations
The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I’m looking at you), but this likely reflects the density of restaurants in that part of the city. One might also jump to the knee-jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.
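To make the normalization idea concrete, here is a small sketch that turns raw counts into a rate per 1,000 units of some denominator (residents, restaurants, road miles, whatever fits the complaint). The function name, tract IDs, and numbers are all hypothetical; we have not actually run this normalization on the 311 data.

```python
def rate_per_unit(call_counts, denominators, per=1000):
    """Convert raw call counts into calls per `per` units of a denominator.

    Areas with a missing or zero denominator (parks, airports) get None
    instead of a misleading or undefined rate.
    """
    rates = {}
    for area, count in call_counts.items():
        base = denominators.get(area, 0)
        rates[area] = per * count / base if base > 0 else None
    return rates


# Made-up numbers: two tracts with identical raw counts, very different sizes.
food_poisoning_calls = {"tract_A": 50, "tract_B": 50, "tract_C": 5}
restaurants = {"tract_A": 10000, "tract_B": 1000, "tract_C": 0}

rates = rate_per_unit(food_poisoning_calls, restaurants)
print(rates)
```

In this toy example the two tracts look the same on a raw-count map, but the rate is ten times higher in the second one, which is exactly the kind of difference raw counts hide.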
Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.
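Here is a quick sketch of why quantile scaling might help, comparing equal-width bins with quantile bins on a skewed set of counts. The numbers are invented to mimic the long-tailed shape described above, and the five-color choice is arbitrary. It relies on `statistics.quantiles`, which needs Python 3.8 or later.

```python
from bisect import bisect_left
from statistics import quantiles

# Made-up calls-per-tract counts with a long right tail.
counts = [1, 2, 2, 3, 4, 5, 8, 13, 40, 400]
N_BINS = 5


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins):
    """Assign each value to a bin bounded by the data's own quantiles."""
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [bisect_left(cuts, v) for v in values]


width_bins = equal_width_bins(counts, N_BINS)
quant_bins = quantile_bins(counts, N_BINS)

print("equal width:", width_bins)
print("quantile:   ", quant_bins)
```

With equal-width bins a single extreme tract claims the top color and nearly everything else collapses into the bottom one (hence all that yellow). Quantile bins spread the tracts across all five colors, at the cost of hiding how extreme the outlier really is.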
Adding Other Data
As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the Census Bureau and NYC to compare with call numbers. For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people who consume one or more sugary drinks per day, on average (2009 survey, the darkest areas are on the order of 40-45%).
Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).
While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.
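Since each call carries a date (though not a time of day), a first look at seasonality could be as simple as counting calls per calendar month. This is a sketch of something we have not done yet; the dates below are invented.

```python
from collections import Counter
from datetime import date

# Made-up "noise" call dates, for illustration only.
noise_calls = [
    date(2010, 1, 5),
    date(2010, 7, 4),
    date(2010, 7, 18),
    date(2011, 7, 2),
    date(2011, 7, 3),
    date(2011, 12, 31),
]

# Calls per (year, month): the raw monthly time series.
monthly = Counter((d.year, d.month) for d in noise_calls)

# Calls per month of the year, pooled across years: a crude seasonal profile.
seasonal = Counter(d.month for d in noise_calls)

for (year, month), n in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {n}")
```

Pooling across years only makes sense when every month is covered the same number of times. Our data stops in August 2012, so a real seasonal profile would need to divide by the number of years each month appears in.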
Tools: PostGIS, Psycopg2, Pandas, D3
PostGIS did most of the heavy lifting in this preliminary exploration. Psycopg2 provided our Python link and Pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post. We used QGIS for some visualization and to generate the static images above. All of these tools are open source.
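For a flavor of what the PostGIS step can look like, here is one plausible shape for a point-in-polygon count. This is not our actual query. The schema is hypothetical: a `calls` table with `complaint_type`, `lon`, and `lat` columns, and a `tracts` table with `tract_id` and a `geom` polygon column stored in SRID 4326. The function takes any open DB-API connection, such as one returned by `psycopg2.connect(...)`.

```python
# Hypothetical schema: calls(complaint_type, lon, lat), tracts(tract_id, geom).
# Assumes tract polygons are stored in SRID 4326 (plain longitude/latitude).
TRACT_COUNT_SQL = """
SELECT t.tract_id, c.complaint_type, COUNT(*) AS n_calls
FROM calls AS c
JOIN tracts AS t
  ON ST_Contains(t.geom, ST_SetSRID(ST_MakePoint(c.lon, c.lat), 4326))
GROUP BY t.tract_id, c.complaint_type
ORDER BY n_calls DESC;
"""


def fetch_tract_counts(conn):
    """Run the count query on an open DB-API connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(TRACT_COUNT_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```

Each returned row is a `(tract_id, complaint_type, n_calls)` tuple, which drops straight into a Pandas DataFrame for further grouping. On a table with millions of points, a spatial index on `tracts.geom` is what keeps a join like this tolerable.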
The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompt some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask this data? What other data sets might you bring to bear?