Explore NYC 311 Data

DataGotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on “urban science” got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggests there isn’t much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other datasets on the site, but we focused on this one to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.

NYC 311 calls are categorized into approximately 200 different complaint types, ranging from “squeegee” and “radioactive” to “noise” and “street condition.” There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and longitude and latitude information but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, “garbage in, garbage out” holds, so be advised: we are taking the data at face value.
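The filtering step can be sketched in Pandas. This is a minimal illustration, not the code we ran: the column names and the three sample rows are made up, and "incomplete" is assumed to mean that any of the listed fields is missing.

```python
import pandas as pd

# Hypothetical column names; the real export may label these differently.
REQUIRED = ["created_date", "complaint_type", "latitude", "longitude"]


def drop_incomplete(calls: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where every required field is present."""
    return calls.dropna(subset=REQUIRED).reset_index(drop=True)


# Made-up sample: one complete record, two with a missing field.
calls = pd.DataFrame({
    "created_date": ["2010-01-01", "2010-01-02", None],
    "complaint_type": ["Noise", "Squeegee", "Noise"],
    "latitude": [40.71, None, 40.65],
    "longitude": [-74.00, -73.95, -73.90],
})

clean = drop_incomplete(calls)
print(f"kept {len(clean)} of {len(calls)} records")
```

Logging how many rows were dropped, as the last line does, makes it easier to notice when a cleaning rule is throwing away more data than expected.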

Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing-related complaints to HPD (NYC Dept of Housing Preservation and Development) represented by far the largest share of calls (1,671,245). My personal favorite, “squeegee,” was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. “Radioactive” is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we’ve included visualizations for a handful of categories above. The maps show the number of calls per complaint type organized by census tract.
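For readers who want to try the same kind of counting, here is a small Pandas sketch. The column names (complaint_type, tract_id) and the five sample rows are invented for illustration; it assumes each call has already been assigned to a census tract.

```python
import pandas as pd

# Made-up sample; assumes each call already carries a census tract id.
calls = pd.DataFrame({
    "complaint_type": ["Noise", "Noise", "Taxi Complaint", "Noise", "Squeegee"],
    "tract_id": ["A", "A", "B", "B", "A"],
})

# Total calls per complaint type, largest first.
by_type = calls["complaint_type"].value_counts()

# Calls per (complaint type, census tract): the numbers a map would color by.
by_type_and_tract = (
    calls.groupby(["complaint_type", "tract_id"])
    .size()
    .reset_index(name="n_calls")
)

print(by_type)
print(by_type_and_tract)
```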

Tinkering With Visualizations

The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I’m looking at you), but this likely reflects the density of restaurants in that part of the city. One might also jump to the knee-jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.
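One simple form of normalization is a per-capita rate. The sketch below assumes a table of tract populations is available; the column names and numbers are made up, and population is only one possible denominator (for food poisoning, a restaurant count per tract would arguably be a better one).

```python
import pandas as pd


def calls_per_1000(counts: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    """Join call counts to tract population and compute calls per 1,000 residents."""
    merged = counts.merge(population, on="tract_id", how="inner")
    # Tracts with no residents (parks, airports) have no meaningful rate.
    merged = merged[merged["population"] > 0].copy()
    merged["calls_per_1000"] = 1000 * merged["n_calls"] / merged["population"]
    return merged


# Made-up numbers: tracts A and B have the same raw count but very different rates.
counts = pd.DataFrame({"tract_id": ["A", "B", "C"], "n_calls": [50, 50, 5]})
population = pd.DataFrame({"tract_id": ["A", "B", "C"], "population": [10000, 2000, 0]})

rates = calls_per_1000(counts, population)
print(rates)
```

In this toy example, tracts A and B look identical on a raw-count map but differ by a factor of five once population is taken into account.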

Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.
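To see why quantile scaling might help, here is a small Pandas comparison of equal-width bins and quantile bins on a made-up, heavy-tailed set of tract counts. The choice of four bins is arbitrary, and this is only a sketch of the idea, not how the maps were actually colored.

```python
import pandas as pd

# Made-up heavy-tailed counts: most tracts have few calls, one has many.
counts = pd.Series([1, 2, 3, 5, 8, 13, 40, 400], name="n_calls")

# Equal-width bins: the single large value stretches the range,
# so almost every tract lands in the lowest color class.
equal_width = pd.cut(counts, bins=4, labels=False)

# Quantile bins: each color class holds roughly the same number of tracts.
# (With many tied counts, qcut may need duplicates="drop".)
quantile = pd.qcut(counts, q=4, labels=False)

print(pd.DataFrame({"n_calls": counts, "equal_width": equal_width, "quantile": quantile}))
```

With equal-width bins, seven of the eight tracts share one color; with quartiles, each color covers two tracts.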

# of calls per census tract

Adding Other Data

As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the Census Bureau and NYC to compare with call numbers. For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people who consume one or more sugary drinks per day, on average (2009 survey; the darkest areas are on the order of 40-45%).

% drinking one or more sugary drinks per day

Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).

sugary drink vs. public housing complaints

While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.
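Since each call carries a date, a first look at seasonality only needs a grouping by month. The sketch below uses invented dates and a hypothetical created_date column; it shows the shape of the computation, not any result from the real data.

```python
import pandas as pd

# Made-up sample of "Noise" calls; created_date is a hypothetical column name.
calls = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2011-01-03", "2011-01-20",
        "2011-06-05", "2011-06-11", "2011-06-30",
        "2011-07-04",
    ]),
    "complaint_type": ["Noise"] * 6,
})

noise = calls[calls["complaint_type"] == "Noise"]

# Calls per calendar month (a time series; months with no calls are absent).
monthly = noise.groupby(noise["created_date"].dt.to_period("M")).size()

# Calls per month of the year, pooled across years, as a rough seasonal profile.
by_month_of_year = noise.groupby(noise["created_date"].dt.month).size()

print(monthly)
print(by_month_of_year)
```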

Tools: PostGIS, Psycopg2, Pandas, D3

PostGIS did most of the heavy lifting in this preliminary exploration. Psycopg2 provided our Python link and Pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post. We used QGIS for some visualization and to generate the static images above. All of these tools are open source.
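A rough sketch of how these pieces can fit together follows. It is not our actual code: the table and column names (calls_311, census_tracts, geom, tract_id) are hypothetical, and it assumes the tract geometries are stored in SRID 4326 so that they can be compared directly with longitude/latitude points. ST_Contains, ST_SetSRID, and ST_MakePoint are standard PostGIS functions.

```python
import pandas as pd

# Point-in-polygon join: count calls of one complaint type per census tract.
# Table and column names are hypothetical; geometries are assumed to be SRID 4326.
TRACT_COUNT_SQL = """
SELECT t.tract_id, c.complaint_type, COUNT(*) AS n_calls
FROM calls_311 AS c
JOIN census_tracts AS t
  ON ST_Contains(t.geom, ST_SetSRID(ST_MakePoint(c.longitude, c.latitude), 4326))
WHERE c.complaint_type = %s
GROUP BY t.tract_id, c.complaint_type
ORDER BY n_calls DESC;
"""


def calls_per_tract(dsn: str, complaint_type: str) -> pd.DataFrame:
    """Run the spatial join in PostGIS and return the result as a DataFrame."""
    import psycopg2  # imported here so the module loads without a database driver

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # The complaint type is passed as a query parameter, not formatted in.
            cur.execute(TRACT_COUNT_SQL, (complaint_type,))
            rows = cur.fetchall()
            columns = [col[0] for col in cur.description]
    finally:
        conn.close()
    return pd.DataFrame(rows, columns=columns)
```

A call such as calls_per_tract("dbname=nyc", "Taxi Complaint") (the connection string is also a placeholder) would return one row per tract, ready for grouping in Pandas or for export to a mapping tool.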

Semi-Final Thoughts

The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompted some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask this data? What other datasets might you bring to bear?



Well, DataGotham is over. The conference featured a wide cross section of the data community in NYC. Talks spanned topics from “urban science” to “finding racism on FourSquare” to “creating an API for spaces.” Don’t worry, the videos will be online soon so you can investigate yourself. The organizers did a great job putting a conference of this size together on relatively short notice. Bravo NYC data crunchers!

One thing I somehow missed was a network graph created by the organizers to illustrate the tools used by attendees. I am happy to see Python leading the way! The thickness of the edge indicates the number of people using both tools. It seems there are a lot of people trying to make Python and R “two great tastes that go great together.” I’m curious as to why more Python users aren’t using NumPy and SciPy. Food for thought…

Got tools?

DataGotham: Sept 13 & 14 @ NYU

Enthought is proud to announce its sponsorship of the upcoming DataGotham conference in NYC. DataGotham is meant to be a “celebration of New York City’s data community.” Organized by the likes of Drew Conway and Hilary Mason (and other pillars of the community), DataGotham is shaping up to be a highly concentrated concoction of everything data science-y. Just take a look at the ever-increasing line-up of speakers. Even for those of you who don’t live or work in Manhattan, it’s probably worth the trip!

At Enthought, we use Python because it offers an effective combination of pragmatism, clarity, and power. We have always argued that good tools “get out of the way,” allowing you to focus on the real problem at hand. As such, we are excited to support DataGotham’s effort to highlight how the NYC scientific community is tackling a wide variety of questions in urban planning, social media, education, etc.

Don’t forget to check out the tutorials! There will be tutorials on Data Journalism, MongoDB/R, Julia (recently featured at SciPy), and Real Time Data Science.