by: Tim Diller, Product Manager and Scientific Software Developer, Enthought
Last week I attended the AIChE (American Institute of Chemical Engineers) Spring Meeting in San Antonio, Texas. It was a great time of year to visit this cultural gem deep in the heart of Texas (and just down the road from our Austin offices), with plenty of good food, sights and sounds to take in on top of the conference and its sessions.
The AIChE Spring Meeting focuses on applications of chemical engineering in industry, and Enthought was invited to present a poster and deliver a “vendor perspective” talk on the Canopy Platform for Process Monitoring and Optimization as part of the “Big Data Analytics” track. This was my first time at AIChE, so some of the names were new, but in a lot of ways it felt very similar to many other engineering conferences I have participated in over the years (for instance, ASME (American Society of Mechanical Engineers), SAE (Society of Automotive Engineers), etc.).
This event underscored that regardless of industry, engineers are bringing the same kinds of practical ingenuity to bear on similar kinds of problems, and with the cost of data acquisition and storage plummeting in the last decade, many engineers are now sitting on more data than they know how to effectively handle.
What exactly is “big data”? Does it really matter for solving hard engineering problems?
One theme that came up time and again in the “Big Data Analytics” sessions Enthought participated in was what exactly “big data” is. In many circles, a good working definition of what makes data “big” is that it exceeds the size of the physical RAM on the machine doing the computation, so that something other than simply loading the data into memory has to be done to make meaningful computations, and thus a working definition of some tens of GB delimits “big” data from “small”.
For others, and many at the conference indeed, a more mundane definition of “big” means that the data set doesn’t fit within the row or column limits of a Microsoft Excel Worksheet.
But the question of whether your data is “big” is really a moot one as far as we at Enthought are concerned; really, being “big” just adds complexity to an already hard problem, and the kind of complexity is an implementation detail dependent on the details of the problem at hand.
And that relates to the central message of my talk, which was that an analytics platform (in this case I was talking about our Canopy Platform) should abstract away the tedious complexities, and help an expert get to the heart of the hard problem at hand.
At AIChE, the “hard problems” at hand seemed invariably to involve one or both of two things: (1) increasing safety/reliability, and (2) increasing plant output.
To solve these problems, two general kinds of activity were on display: different pattern recognition algorithms and tools, and modeling, typically through some kind of regression-based approach. Both of these things are straightforward in the Canopy Platform.
The Canopy Platform is a collection of related technologies that work together in an integrated way to support the scientist/analyst/engineer.
What is the Canopy Platform?
If you’re using Python for science or engineering, you have probably used or heard of Canopy, Enthought’s Python-based data analytics application offering an integrated code editor and interactive command prompt, package manager, documentation browser, debugger, variable browser, data import tool, and lots of hidden features like support for many kinds of proxy systems that work behind the scenes to make a seamless work environment in enterprise settings.
However, this is just one part of the Canopy Platform. Over the years, Enthought has been building other components and related technologies that work together in an integrated way to support the engineer/analyst/scientist solving hard problems.
At the center of the this is the Enthought Python Distribution, with runtime interpreters for Python 2.7 and 3.x and over 450 pre-built Python packages for scientific computing, including tools for machine learning and the kind of regression modeling that was shown in some of the other presentations in the Big Data sessions. Other components of the Canopy Platform include interface modules for Excel (PyXLL) and for National Instruments’ LabView software (Python Integration Toolkit for LabVIEW), among others.
A key component of our Canopy Platform is our Deployment Server, which simplifies the tricky tasks of deploying proprietary applications and packages or creating customized, reproducible Python environments inside an organization, especially behind a firewall or an air-gapped network.
Finally, (and this is what we were really showing off at the AIChE Big Data Analytics session) there are the Data Catalog and the Cloud Compute layers within the Canopy Platform.
The Data Catalog provides an indexed interface to potentially heterogeneous data sources, making them available for search and query based on various kinds of metadata.
The Data Catalog provides an indexed interface to potentially heterogeneous data sources. These can range from a simple network directory with a collection of HDF5 files to a server hosting files with the Byzantine complexity of the IRIG 106 Ch. 10 Digital Recorder Standard used by US military test flight ranges. The nice thing about the Data Catalog is that it lets you query and select data based on computed metadata, for example “factory A, on Tuesdays when Ethylene output was below 10kg/hr”, or in a test flight data example “test flights involving a T-38 that exceeded 10,000 ft but stayed subsonic.”
With the Cloud Compute layer, an expert user can write code and test it locally on some subset of data from the Data Catalog. Then, when it is working to satisfaction, he or she can publish the code as a computational kernel to run on some other, larger subset of the data in the Data Catalog, using remote compute resources, which might be an HPC cluster or an Apache Spark server. That kernel is then available to other users in the organization, who do not have to understand the algorithm to run it on other data queries.
In the demo below, I showed hooking up the Data Catalog to some historical factory data stored on a remote machine.
The Data Catalog allows selection of subsets of the data set for inspection and ad hoc analysis. Here, three channels are compared using a time window set on the time series data shown on the top plot.
Then using a locally tested and developed compute kernel, I did a principal component analysis on the frequencies of the channel data for a subset of the data in the Data Catalog. Then I published the kernel and ran it on the entire data set using the remote compute resource.
After the compute kernel has been published and run on the entire data set, then the result explorer tool enables further interactions.
Ultimately, the Canopy Platform is for building and distributing applications that solve hard problems. Some of the products we have built on the platform are available today (for instance, Canopy Geoscience and Virtual Core), others are in prototype stage or have been developed for other companies with proprietary components and are not publicly available.
It was exciting to participate in the Big Data Analytics track this year, to see what others are doing in this area, and to be a part of many interesting and fruitful discussions. Thanks to Ivan Castillo and Chris Reed at Dow for arranging our participation.