Category Archives: News

PyXLL: Deploy Python to Excel Easily

PyXLL Solution Home | Buy PyXLL | Press Release

Today Enthought announced that it is now the worldwide distributor for PyXLL, and we’re excited to offer this key product for deploying Python models, algorithms and code to Excel. Technical teams can use the full power of Enthought Canopy, or another Python distro, and end-users can access the results in their familiar Excel environment. And it’s straightforward to set up and use.

Installing PyXLL from Enthought Canopy

PyXLL is available as a package subscription (with significant discounts for multiple users). Once you’ve purchased a subscription, you can install it via Canopy’s Package Manager as shown in the screenshots below (note that at this time PyXLL is only available for Windows). The rest of the configuration instructions are in the Quick Start portion of the documentation. PyXLL itself is a plug-in to Excel: when you start Excel, PyXLL loads and imports the Python modules you have written for it. This makes PyXLL especially useful for organizations that want to manage their code centrally and deploy to multiple Excel users.

[Screenshots: the Enthought Canopy Package Manager, and installing PyXLL from the Package Manager]

Creating Excel Functions with PyXLL

To create a PyXLL Python Excel function, you use the @xl_func decorator to tell PyXLL that the following function should be registered with Excel, what its argument types are, and optionally what its return type is. PyXLL also reads the function’s docstring and provides it in the Excel function description. As an example, I created a module my_pyxll_module.py and registered it with PyXLL via its configuration file; a minimal sketch of such a module is shown below.
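Here is a rough sketch of what such a function might look like. The module and function names are invented for illustration; the @xl_func decorator and its type-signature string follow the pattern described in the PyXLL documentation.

    # my_pyxll_module.py -- illustrative sketch of a PyXLL worksheet function
    from pyxll import xl_func

    @xl_func("float principal, float rate, int periods: float")
    def future_value(principal, rate, periods):
        """Future value of `principal` compounded at `rate` for `periods` periods."""
        return principal * (1.0 + rate) ** periods

Once the module is listed in PyXLL’s configuration and Excel is restarted, the function can be called from a worksheet cell just like any built-in Excel function.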

SciPy 2013 Conference Recap

Author: Eric Jones

Another year, another great conference.  Man, this thing grew a ton this year.  At final count, we had something like 340 participants, which is way up from last year’s 200 or so attendees.  In fact, we had to close registration a couple of weeks before the conference because that is all our venue could hold.  We’ll solve that next year.  Invite your friends.  We’d love to see 600 or even more.

Many thanks to the organizing team.  Andy Terrell and Jonathan Rocher did an amazing job as conference chairs this year both managing that growth and keeping the trains on time.  We expanded to 3 parallel sessions this year, which often made me want to be in 3 places at once.  Didn’t work.  Thankfully, the videos for all the talks and sessions are available online.  The video team really did a great job — thanks a ton.

I’ve wondered whether the size would change the feel of the conference, but I’m happy to report it still feels like a gathering of friends, new and old.  Aric Hagberg mentioned he thinks this is because it’s such a varied (motley?) crowd from disparate fields gathered to teach, learn, and share software tools and ideas.  This fosters a different atmosphere than some academic conferences where sparring about details of a talk is a common sport.  Hmh.  Re-watching the videos, I see Fernando Perez mentions this as well.

Thanks again to all who organized and all who attended.  I’m already looking forward to seeing you again next year.  Below are my personal musings on various topics at the conference:

  • The tutorials were, as usual, extremely well attended.  I spent the majority of my time there in the scikits learn track by Gael Varoquaux, Olivier Grisel, and Jake VanderPlas.  Jeez, has this project gone far.  It is stunning to see the breadth and quality of the algorithms that they have.  It’s obviously a hot topic these days; it is great to have such an important tool set at our disposal.
  • Fernando Perez gave a keynote this year about IPython.  We can safely say that 2013 is the year of the IPython notebook.  It was *everywhere*.  I’d guess 80+% of the talks and tutorials for the conference used it in their presentations.  Fernando went one step further, and his slide deck was actually live IPython notebooks.  Quite cool.  I do believe it’ll change the way people teach Python…  But, the most impressive thing is that Fernando still has and can execute the original 250 line script that was IPython 0.00001.  Scratch that.  The most impressive thing is to hear how Fernando has managed to build a community and a project that is now supported by a $1.1M grant from the Sloan foundation.  Well done sir.  The IPython project really does set the standard on so many levels.
  • Olivier Grisel, of scikits learn fame, gave a keynote on trends in machine learning.  It was really nice because he talked about the history of neural networks and the advances that have been made in “deep learning” in recent years.  I began grad school in NN research, and was embarrassed to realize how recent (1986) the back propagation learning algorithm was when I first coded it for research (1993).  It seemed old to me then — but I guess 7 years to a 23 year old is, well, pretty old.  Over the years, I became a bit disenchanted with neural nets because they didn’t reveal the underlying physical process within the data.  I still have this bias, but Olivier’s discussion of the “deep learning” advances convinced me that I should get re-educated.  And, perhaps I’m getting more pragmatic as the gray hairs fill in (and the bald spot grows).  It does look like it’s effective for multiple problems in the detection and classification world.
  • William Schroeder, CEO of Kitware, gave a keynote on the importance of reproducible research, which was one of the conference themes.  It was a privilege to have him because of the many ways Kitware illuminated the path for high quality scientific software in the open source world with VTK.  I’ve used it both in C++ and, of course, from Python for many, many years.  In his talk, Will argued that the existing scientific publication model doesn’t work so well anymore and that, in fact, with the web and the tools that are now available, directly publishing results together with the data sets and the code that generated them is the future.  This actually dovetailed really well with Fernando’s talk, and I can’t help but think that we are on this track.
  • David Li has been working with the SymPy team, and his talk showed off the SymPy Live site that they have built to interactively try out symbolic calculations on the web.  I believe David is the 2nd high school student to present in the history of SciPy, yes? (Evan Patterson was the other that I remember)  Heh.  Aaand, what were you doing your senior year?  Both were composed, confident, and dang good — bodes well for our future.
  • There are always a few talks of the “what I have learned” flavor at SciPy.  This year, Brian Granger of IPython fame gave one about the dangers of features and the benefits of bugs.  Brian’s talks are almost always among my favorites (sorta like I always make sure to see what crazy stuff David Beazley presents at PyCon).  Part of it is that he often talks about parallel computing for the masses, which is dear to my heart, but it is also because he organizes his topics so well.
  • Nicholas Kridler also unexpectedly hooked me with another one of these talks.  I was walking out of the conference hall after the keynote to go see what silly things the ever smiling Jake Vanderplas might be up to in his astronomy talk.  But derned if Nicholas didn’t start walking through how he approaches new machine learning problems in interesting ways.  My steps slowed, and I finally sat down, happy to know that I could watch Jake’s talk later.  Nicholas used his wits and scikits learn to win(!) the Kaggle whale detection competition earlier this year, and he gave us a great overview of how he did it.  Well worth a listen.
  • Both Brian and Nicholas’ talks started me thinking how much I like to see how experts approach problems.  The pros writing all the cool libraries often give talks on the features of their tools or the results of their research, but we rarely get a glimpse into their day-to-day process.  Sorta like pair programming with Martin Chilvers is a life changing experience (heh.  for better or worse… :-)), could we have a series of talks where we get to ride shotgun with a number of different people and see how they work?  How does Ondrej Certik work through a debugging session on SymPy development?  Does his shiny new cowboy hat from Allen Boots help or not?  When approaching a new simulation or analysis, how does Aric Hagberg use graph theory (and Networkx) to set the problem up?  When Serge Rey gets a new set of geospatial data, what are the common things he does to clean and organize the data for analysis with PySAL?  How does Wes McKinney think through API design trade-offs as he builds Pandas?  And, most importantly, how does Stefan Van Der Walt get the front of his hair to stand up like that? (comb or brush? hair dryer on low or high?)  Ok, maybe not Stefan, but you get the idea.  We always see a polished 25 minute presentation that sums up months or years of work that we all know had many false starts and painful points.  If we could learn about where people stubbed their toe and how to avoid it in our work, it would be pretty cool.  Just an idea, but I will run it by the committee for next year and see if there is any interest.
  • The sprints were nothing short of awesome.  Something like 130+ people were there on the first day sprinting on 10-20 different libraries including SymPy, NumPy, IPython, and Matplotlib, as well as more specific tools like scikits image and PySAL.  Amazing to see.  Perhaps the bigger surprise was that at least half also stayed for Saturday’s sprints.  scikits learn had a team of about 10 people that worked two full days together (Friday and Saturday activity visible on the commit graph), and I think multiple other groups did as well.  While we’ve held sprints for a while, we had 2 to 3 times as many people as in 2012, and this year’s can only be described as wildly successful.

  • While I was there, I spent most of my time checking in on the PySide sprint, where John Ehresman of Wingware got a new release ready for the 4.8 series of Qt (bless him), and Robin Dunn, Corran Webster, Stefan Landgovt, and John Wiggins investigated paths forward toward 5.x compatibility.  No one was too excited about Shiboken, but the alternatives are also not a walk in the park.  I think the feeling is, long term, we’ll need to bite the bullet and go a different direction than Shiboken.

Enthought and edX Come Together for Open Source Education

edX is a non-profit founded by Harvard and MIT that aims to create an open-source learning platform that allows anyone with an internet connection to take classes for free.

We welcomed edX students to EPD in a previous post, but Enthought’s own Josephine Dickinson has been auditing the class since then. Short lectures, exercises, and problem sets are the name of the game. Subjects such as “Recursion and Objects” have recently been introduced with students exploring polynomials, derivatives, the Newton-Raphson Method for root finding, and (of course) the venerable challenge of implementing Hangman. Good luck on those midterms!
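For anyone curious what the root-finding portion of those problem sets looks like, here is a rough sketch in the spirit of the course exercises (the function names, polynomial representation, and tolerance are our own illustration, not the official assignment):

    # Newton-Raphson root finding for a polynomial given as coefficients
    # (lowest degree first), sketched in the spirit of the 6.00x exercises.
    def evaluate_poly(coeffs, x):
        """Evaluate the polynomial at x."""
        return sum(c * x ** i for i, c in enumerate(coeffs))

    def compute_deriv(coeffs):
        """Return the coefficients of the derivative polynomial."""
        return tuple(i * c for i, c in enumerate(coeffs))[1:]

    def compute_root(coeffs, x, epsilon=1e-4):
        """Iterate x <- x - f(x)/f'(x) until |f(x)| < epsilon."""
        deriv = compute_deriv(coeffs)
        while abs(evaluate_poly(coeffs, x)) >= epsilon:
            x = x - evaluate_poly(coeffs, x) / evaluate_poly(deriv, x)
        return x

    # Example: a root of x**2 - 4, starting the iteration at x = 10.
    print(compute_root((-4, 0, 1), 10.0))   # converges to ~2.0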

Enthought is proud to be a part of this education initiative and was happy to issue the following statement:

Enthought, Inc., the Python-based technical computing solutions company, announced today that it has supplied over 20,000 downloads of its Enthought Python Distribution (EPD) Free to students taking MITx’s “6.00x: Introduction to Computer Science and Programming” on edX, the online learning initiative founded by Harvard University and the Massachusetts Institute of Technology (MIT). The free introductory course provides students with a survey of basic computer science concepts and an introduction to the Python programming language. EPD Free gives students a robust and reliable tool for learning Python.

“Enthought’s generous offer of free Python downloads is enabling tens of thousands of students around the world who are taking edX 6.00x to improve their understanding of basic computer science concepts,” said Anant Agarwal, President of edX. “We’re grateful for Enthought’s support of edX’s mission to improve access to higher education for students worldwide.”

EdX is a not-for-profit enterprise of its founding partners, the Massachusetts Institute of Technology (MIT) and Harvard University, that offers online learning to on-campus students and to millions of people around the world. To do so, edX is building an open-source online learning platform and hosts a web portal for online education at www.edx.org. EdX offers HarvardX, MITx, BerkeleyX, and soon UTx classes online for free. These institutions aim to extend their collective reach to build a global community of online students. Along with offering online courses, the three universities undertake research on how students learn and how technology can transform learning – both on-campus and online throughout the world.

“edX’s mission to provide high-quality education worldwide using open-source software meshes very well with our mission to provide open-source technical computing software and solutions,” says Dr. Eric Jones, Enthought’s CEO. “Enthought contributes to the development of numerous Python tools and works hard to bring the benefits of open-source to companies and developers worldwide. We’re very excited to help edX provide its students their Python programming environment.”

Enthought Python Distribution (EPD) Free is a lightweight distribution of scientific and analytic Python essentials: SciPy, NumPy, IPython, matplotlib, Traits, and Chaco. EPD Free provides a free, cross-platform installer of the libraries Enthought considers fundamental for scientists, engineers, and analysts. It is ideal for beginners who seek a simple but powerful Python stack and developers who want to distribute a small, reliable environment for their applications.

“Many MIT students have used EPD Free in their computer science courses over the years, and we’ve supported the instructors and the students,” said William Cowan, Enthought’s COO. “When edX told us they had over 70,000 registrants for the class, we knew this was a major milestone for edX and for Enthought, and we are thrilled to see so many students downloading and getting started with Python.”

LFPUG: Python in the enterprise + Pandas

Over 80 people attended last night’s London Financial Python User Group (LFPUG), with presentations given by Den Pilsworth of AHL/Man Group, Eric Jones of Enthought, and Wes McKinney of pandas fame. It was an evening filled with practical content, so come on out for the next meetup if you are in town (or for drinks at the pub afterwards)!

The agenda for the evening:

1. “Moving an algo business from R and Java to Python”, Dennis Pilsworth, AHL, Man Group
2. “Financial data analysis in Python with pandas”, Wes McKinney
3. “Fostering Python Adoption within a Company”, Eric Jones, Enthought.

Den presented a case study of how his firm introduced Python into production and made sure that “network distributed” deployment ran quickly enough to give good local response times without overloading the network. He also discussed visualization and pointed out that native Python tools need some work to remain competitive with the R user’s sweetheart, ggplot2. He graciously acknowledged the role Enthought’s training played in getting things rolling.

Wes McKinney discussed the latest pandas developments, particularly the groupby functionality. A number of attendees were interested in potentially using this functionality to replace Excel pivot tables. Make sure to check out Wes’ new book, “Python for Data Analysis.”
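For readers wondering what that pivot-table replacement looks like in practice, here is a small sketch; the data and column names are invented for illustration, while pivot_table and groupby are standard pandas features.

    import pandas as pd

    # Made-up trade data standing in for whatever might otherwise live in a spreadsheet.
    trades = pd.DataFrame({
        "desk":   ["rates", "rates", "fx", "fx", "equities"],
        "region": ["EU", "US", "EU", "US", "EU"],
        "pnl":    [1.2, -0.4, 2.3, 0.7, 1.1],
    })

    # Equivalent of an Excel pivot table: P&L summed by desk and region.
    pivot = trades.pivot_table(values="pnl", index="desk", columns="region", aggfunc="sum")

    # The same aggregation expressed with groupby.
    by_desk_region = trades.groupby(["desk", "region"])["pnl"].sum()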

Eric Jones discussed how to get Python adopted in the face of opposition, featuring some of the classic objections (e.g. “Python is too slow”).

LFPUG meets roughly every other month, so look us up on LinkedIn and keep an eye out for the next meeting!

Explore NYC 311 Data

DataGotham explored how organizations and individuals use data to gain insight and improve decision-making. Some interesting talks on “urban science” got us curious about what data is publicly available on the city itself. Of course, the first step in this process is actually gaining access to interesting data and seeing what it looks like. Open311.org seemed like a good first stop, but a quick glance suggested there isn’t much data available there. A look through the NYC open data website yielded a 311 dataset that aggregates about four million 311 calls from January 1, 2010 to August 25, 2012. There are a number of other data sets on the site, but we focused on this one to keep things simple. What follows is really just the first step in a multi-step process, but we found it interesting.

NYC 311 calls are categorized into approximately 200 different complaint types, ranging from “squeegee” and “radioactive” to “noise” and “street condition.” There are an additional ~1000 descriptors (e.g. Radioactive -> Contamination). Each call is tagged with date, location type, incident address, and latitude and longitude information but, weirdly, does not include the time of day. We had to throw out approximately 200,000 records because they were incomplete. As always, “garbage in | garbage out” holds, so be advised, we are taking the data at face value.

Simple aggregations can help analysts develop intuition about the data and provide fodder for additional inquiry. Housing-related complaints to HPD (NYC Dept of Housing Preservation and Development) represented the vast majority of calls (1,671,245). My personal favorite, “squeegee,” was far down at the bottom of the list with only 21 complaints over the last two years. I seem to remember a crackdown several years ago…perhaps it had an impact. “Radioactive” is another interesting category that deserves some additional explanation (could these be medical facilities?). Taxi complaints, as one would expect, are clustered in Manhattan and the airports. Sewage backups seem to be concentrated in areas with tidal exposure. Understanding the full set of complaint types would take some work, but we’ve included visualizations for a handful of categories above. The maps show the number of calls per complaint type organized by census tract.
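As a rough illustration of the sort of aggregation involved (the column names and rows below are made up; the real dataset’s fields differ slightly), a couple of pandas one-liners produce the counts behind these maps:

    import pandas as pd

    # A few invented rows standing in for the ~4 million-row NYC 311 export.
    calls = pd.DataFrame({
        "complaint_type": ["HEATING", "HEATING", "Sewer", "Noise", "Sewer"],
        "census_tract":   ["36047001100", "36047001100", "36081002900",
                           "36061014300", "36081002900"],
    })

    # Top complaint types citywide.
    top_complaints = calls["complaint_type"].value_counts()

    # Calls per complaint type within each census tract -- the numbers behind the maps.
    by_tract = calls.groupby(["census_tract", "complaint_type"]).size()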

Tinkering With Visualizations

The immediate reaction to some of these visualizations is the need for normalization. For example, food poisoning calls are generally clustered in Manhattan (with a couple of hot spots in Queens…Flushing, I’m looking at you), but this likely reflects the density of restaurants in that part of the city. One might also make the knee-jerk conclusion that Staten Island has sub-par snow plowing service and road conditions. As we all know, this is crazy talk! Nevertheless, whatever the eventual interpretation, simple counts and visualizations provide a frame of reference for future exploration.

Another immediate critique is the color binning of the maps. My eye would like some additional resolution or different scaling to get a better feeling for the distribution of complaints (you’ll notice a lot of yellow in most of the maps). A quick look at some histograms (# of calls per census tract) illustrates the “power law” or log-normal shape of the resulting distributions. Perhaps scaling by quantile would yield more contrast.

[Histograms: # of calls per census tract]
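One way to experiment with that idea (a sketch on synthetic data; the real per-tract counts would come from an aggregation like the one above) is to compare fixed-width bins with quantile bins:

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for per-tract call counts, with the long right tail
    # we see in the real histograms.
    rng = np.random.default_rng(0)
    counts = pd.Series(rng.lognormal(mean=5, sigma=1, size=2000))

    # Fixed-width bins put most tracts in the lowest classes (lots of yellow);
    # quantile bins give each of the 7 color classes roughly the same number of tracts.
    equal_width = pd.cut(counts, bins=7, labels=False)
    by_quantile = pd.qcut(counts, q=7, labels=False)

    # A log transform is another way to spread out the long tail before binning.
    log_binned = pd.cut(np.log1p(counts), bins=7, labels=False)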

Adding Other Data

As mentioned, we are only scratching the surface here. More interesting analyses can be done by correlating other datasets in space and time with the 311 calls. We chose to aggregate the data initially by census tract so we can use population and demographic data from the Census Bureau and NYC to compare with call numbers.  For other purposes, we may want to aggregate by borough, zip code, or the UHF neighborhoods used by the public health survey. For example, below we show the percentage of people who consume one or more sugary drinks per day, on average (2009 survey; the darkest areas are on the order of 40-45%).

[Map: % drinking one or more sugary drinks per day]

Here’s the same map with an underlay of the housing 311 data (health survey data at 50% alpha).

[Map: sugary drink consumption vs. public housing complaints]

While we make no conclusions about the underlying factors, it isn’t hard to imagine visualizing different data to broaden the frame of reference for future testing and analysis. Furthermore, we have not even explored the time dimension of this data. The maps above represent raw aggregations. Time series information on “noise” and other complaint categories can yield useful seasonality information, etc. If you find correlations between separate pieces of data, well, now you are getting into real analysis! Testing for causation, however, can make a big difference in how you interpret the data.

Tools: PostGIS, Psycopg2, Pandas, D3

PostGIS did most of the heavy lifting in this preliminary exploration. psycopg2 provided our Python link, and pandas made it easy to group calls by category and census tract. We used D3 to build the maps at the beginning of the post, and QGIS for some visualization and to generate the static images above. All of these tools are open source.
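As a sketch of how those pieces fit together (the table and column names are hypothetical; ST_Contains is standard PostGIS), the Python side of the aggregation can be as short as:

    import pandas as pd
    import psycopg2

    # Hypothetical database: a `calls` table with point geometries and a
    # `tracts` table of census-tract polygons loaded into PostGIS.
    conn = psycopg2.connect("dbname=nyc311 user=analyst")

    query = """
        SELECT t.tract_id, c.complaint_type, COUNT(*) AS n_calls
        FROM calls c
        JOIN tracts t ON ST_Contains(t.geom, c.geom)
        GROUP BY t.tract_id, c.complaint_type;
    """
    counts = pd.read_sql(query, conn)

    # pandas takes over from here, e.g. one column per complaint type for mapping.
    wide = counts.pivot(index="tract_id", columns="complaint_type", values="n_calls")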

Semi-Final Thoughts

The point here is that we haven’t really done any serious number crunching yet, but the process of exploration has already helped develop useful intuition about the city and prompted some interesting questions about the underlying data. What is it about South Jamaica and sewer backups? What’s up with Chinatown and graffiti? The question is the most important thing. Without it, it’s easy to lose your way. What kind of questions would you ask of this data? What other data sets might you bring to bear?

 

Welcome EdX students!

Welcome to EPD Free, edX students! We are honored to partner with edX to offer up a Python environment you can use for your studies. For those of you who don’t know, “EdX is a not-for-profit enterprise of its founding partners Harvard University and the Massachusetts Institute of Technology that features learning designed specifically for interactive study via the web.”

We’ve always had a strong relationship with the academic community and are excited about accelerating the speed of science with our tools and consulting work. It’s exciting to know we will be a part of this ambitious initiative to train the next generation of scientists around the world. Good luck on those problem sets!

SIGGRAPH 2012: Mobile, OpenGL 4.3 and Python

I recently had the opportunity to attend SIGGRAPH in LA. For those of you who don’t know, SIGGRAPH is an annual conference for Graphics and Visualization that does a great job of attracting people from both the scientific and artistic halves of the visualization community. For broader coverage, you can read some of the usual blog coverage here. I was particularly interested, however, in the OpenGL developments.

Increased focus on performance/watt for mobile applications

SIGGRAPH 2012 was important for many reasons, but particularly for those of us that use OpenGL (this year was the 20th anniversary of the OpenGL API). OpenGL 4.3 and OpenGL ES 3.0 were announced and there were many interesting sessions on the new OpenGL release (more about this later) and graphics on mobile devices.

The rapid ascent of mobile and its dominance as the primary computer that people interact with on a daily basis has opened up an interesting challenge for people designing games and visualizations — the difference in the power envelope between mobile devices and their desktop brethren. The power envelope of high-end desktop devices is ~300 watts, while the power envelope of mobile GPUs is < 1 watt. This power disparity implies a massive gulf in performance across the full spectrum of devices, assuming similar architectures are in use.

Multiple speakers stressed that power consumption should be a first-class design metric when designing graphics algorithms, along with the traditional metric of performance. Currently, the tools for profiling and measuring the power consumption of algorithms are almost non-existent and nowhere near the sophistication of tools for measuring performance. Nevertheless, data transfers over a bus were recognized as an expensive activity in power terms and a place where power savings can be realized.

To this end, Khronos (finally) announced new royalty-free texture compression formats (ETC2/EAC) that work on both OpenGL and OpenGL ES and are guaranteed on all compliant OpenGL implementations. This is great for developers, since we can finally assume that this functionality will be available across all devices. More information about the new texture compression formats lives here.

Using OpenGL 4.3 from Python: Rabbit of Caerbannog

The OpenGL ARB started an effort to modernise OpenGL around OpenGL 3.x, and large parts of the old fixed-function pipeline were deprecated. These changes were great from a driver implementor’s point of view and should allow developers to write code that runs faster on modern GPUs. However, the deprecations have made much of the OpenGL tutorial material on the internet obsolete. I have listed two examples here which do not use deprecated functionality and can be used as a starting point for writing modern OpenGL graphics examples in Python.

So, how do you use the brand spanking new 4.3 APIs from Python? I’ve written a code sample that uses modern OpenGL (no deprecated functionality) to draw a triangle from Python.

[Screenshot: triangle drawn using modern OpenGL and Python]

For the code please refer to https://gist.github.com/3494203

For fun, here’s a screenshot of the Stanford bunny drawn using modern OpenGL:

[Screenshot: Stanford bunny drawn using modern OpenGL and Python]

For the code please refer to https://gist.github.com/3494560

Notes:

  1. Right now only Nvidia has OpenGL 4.3 drivers available
  2. OS X only supports OpenGL versions up to 3.2 as of today
  3. You will require a trunk release of PyOpenGL to create an OpenGL 4.3 context
  4. Code for the examples can be obtained from https://github.com/enthought/glfwpy.git
  5. I have conservatively marked the OpenGL version as 3.2 here since many readers will not have 4.3 working on their machines. To enable OpenGL 4.3, change the OpenWindowHint call for OPENGL_VERSION_MAJOR to 4 and the OpenWindowHint call for OPENGL_VERSION_MINOR to 3 (see the context-creation sketch after these notes).
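For readers who just want to see how a modern core-profile context gets requested, here is a minimal sketch using the separate pyGLFW bindings (a different wrapper from the glfwpy repository above, so the calls differ from the OpenWindowHint API in note 5; bump the version hints if your driver supports 4.3):

    import glfw
    from OpenGL import GL

    if not glfw.init():
        raise RuntimeError("GLFW initialisation failed")

    # Ask for an OpenGL 3.2 core profile; change 3/2 to 4/3 for OpenGL 4.3.
    glfw.window_hint(glfw.CONTEXT_VERSION_MAJOR, 3)
    glfw.window_hint(glfw.CONTEXT_VERSION_MINOR, 2)
    glfw.window_hint(glfw.OPENGL_PROFILE, glfw.OPENGL_CORE_PROFILE)
    glfw.window_hint(glfw.OPENGL_FORWARD_COMPAT, True)   # needed on OS X

    window = glfw.create_window(640, 480, "Modern OpenGL context", None, None)
    glfw.make_context_current(window)
    print(GL.glGetString(GL.GL_VERSION))   # confirm which context the driver gave us

    glfw.terminate()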

EuroScipy 2012

EuroScipy 2012 starts tomorrow! Four days of exciting tutorials and talks. The conference is hosted in Brussels at ULB (which you probably know if you went to FOSDEM).

The first two days are dedicated to a great set of tutorials. The introductory track should please any new data analyst starting with Python:

  • array manipulation with NumPy
  • plotting with Matplotlib
  • introduction to scientific computing with SciPy.

In the advanced track, HPC and parallel computing are the main focus, but the tutorials also cover:

  • advanced numpy and scipy
  • time series data analysis with Pandas
  • visualisation
  • packaging and scientific software development insights.

 

Last but not least, the European Enthought team will offer:

  • a tutorial on Enaml, a new library that makes GUI programming fun
  • a tutorial on how to write robust scientific code with testing
  • a tutorial about Bento, a pythonic packaging system for Python software

Plan for an exciting weekend as well, with talks covering everything from finance to geophysics to biology. Don’t forget to come for the keynote sessions with David Beazley on Saturday and Eric Jones, Enthought’s CEO, on Sunday!

 

See you in Brussels!