Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN valuesThis is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.

Continue reading

Webinar- Get More From Your Core: Applying Artificial Intelligence to CT, Photo, and Well Log Analysis with Virtual Core

What: Presentation, demo, and Q&A with Brendon Hall, Geoscience Product Manager, Enthought

Who should watch this webinar:

  • Oil and gas industry professionals who are looking for ways to extract more value from expensive science wells
  • Those interested in learning how artificial intelligence and machine learning techniques can be applied to core analysis

VIEW 


Geoscientists and petroleum engineers rely on accurate core measurements to characterize reservoirs, develop drilling plans and de-risk play assessments. Whole-core CT scans are now routinely performed on extracted well cores, however the data produced from these scans are difficult to visualize and integrate with other measurements.

Virtual Core automates aspects of core description for geologists, drastically reducing the time and effort required for core description, and its unified visualization interface displays cleansed whole-core CT data alongside core photographs and well logs. It provides tools for geoscientists to analyze core data and extract features from sub-millimeter scale to the entire core.

In this webinar and demo, we’ll start by introducing the Clear Core processing pipeline, which automatically removes unwanted artifacts (such as tubing) from the CT image. We’ll then show how the machine learning capabilities in Virtual Core can be used to describe the core, extracting features such as bedding planes and dip angle. Finally, we’ll show how the data can be viewed and analyzed alongside other core data, such as photographs, wellbore images, well logs, plug measurements, and more.

What You’ll Learn:

  • How core CT data, photographs, well logs, borehole images, and more can be integrated into a digital core workshop
  • How digital core data can shorten core description timelines and deliver business results faster
  • How new features can be extracted from digital core data using artificial intelligence
  • Novel workflows that leverage these features, such as identifying parasequences and strategies for determining net pay

VIEW 

Presenter:

Brendon Hall, Geoscience Applications Engineer, Enthought Brendon Hall, Enthought
Geoscience Product Manager and Application Engineer

Continue reading

Enthought Presents the Canopy Platform at the 2017 American Institute of Chemical Engineers (AIChE) Spring Meeting

by: Tim Diller, Product Manager and Scientific Software Developer, Enthought

Last week I attended the AIChE (American Institute of Chemical Engineers) Spring Meeting in San Antonio, Texas. It was a great time of year to visit this cultural gem deep in the heart of Texas (and just down the road from our Austin offices), with plenty of good food, sights and sounds to take in on top of the conference and its sessions.

The AIChE Spring Meeting focuses on applications of chemical engineering in industry, and Enthought was invited to present a poster and deliver a “vendor perspective” talk on the Canopy Platform for Process Monitoring and Optimization as part of the “Big Data Analytics” track. This was my first time at AIChE, so some of the names were new, but in a lot of ways it felt very similar to many other engineering conferences I have participated in over the years (for instance, ASME (American Society of Mechanical Engineers), SAE (Society of Automotive Engineers), etc.).

This event underscored that regardless of industry, engineers are bringing the same kinds of practical ingenuity to bear on similar kinds of problems, and with the cost of data acquisition and storage plummeting in the last decade, many engineers are now sitting on more data than they know how to effectively handle.

What exactly is “big data”? Does it really matter for solving hard engineering problems?

One theme that came up time and again in the “Big Data Analytics” sessions Enthought participated in was what exactly “big data” is. In many circles, a good working definition of what makes data “big” is that it exceeds the size of the physical RAM on the machine doing the computation, so that something other than simply loading the data into memory has to be done to make meaningful computations, and thus a working definition of some tens of GB delimits “big” data from “small”.

For others, and many at the conference indeed, a more mundane definition of “big” means that the data set doesn’t fit within the row or column limits of a Microsoft Excel Worksheet.

But the question of whether your data is “big” is really a moot one as far as we at Enthought are concerned; really, being “big” just adds complexity to an already hard problem, and the kind of complexity is an implementation detail dependent on the details of the problem at hand.

And that relates to the central message of my talk, which was that an analytics platform (in this case I was talking about our Canopy Platform) should abstract away the tedious complexities, and help an expert get to the heart of the hard problem at hand.

At AIChE, the “hard problems” at hand seemed invariably to involve one or both of two things: (1) increasing safety/reliability, and (2) increasing plant output.

To solve these problems, two general kinds of activity were on display: different pattern recognition algorithms and tools, and modeling, typically through some kind of regression-based approach. Both of these things are straightforward in the Canopy Platform.

The Canopy Platform is a collection of related technologies that work together in an integrated way to support the scientist/analyst/engineer.

What is the Canopy Platform?

If you’re using Python for science or engineering, you have probably used or heard of Canopy, Enthought’s Python-based data analytics application offering an integrated code editor and interactive command prompt, package manager, documentation browser, debugger, variable browser, data import tool, and lots of hidden features like support for many kinds of proxy systems that work behind the scenes to make a seamless work environment in enterprise settings.

However, this is just one part of the Canopy Platform. Over the years, Enthought has been building other components and related technologies that work together in an integrated way to support the engineer/analyst/scientist solving hard problems.

At the center of the this is the Enthought Python Distribution, with runtime interpreters for Python 2.7 and 3.x and over 450 pre-built Python packages for scientific computing, including tools for machine learning and the kind of regression modeling that was shown in some of the other presentations in the Big Data sessions. Other components of the Canopy Platform include interface modules for Excel (PyXLL) and for National Instruments’ LabView software (Python Integration Toolkit for LabVIEW), among others.

A key component of our Canopy Platform is our Deployment Server, which simplifies the tricky tasks of deploying proprietary applications and packages or creating customized, reproducible Python environments inside an organization, especially behind a firewall or an air-gapped network.

Finally, (and this is what we were really showing off at the AIChE Big Data Analytics session) there are the Data Catalog and the Cloud Compute layers within the Canopy Platform.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources, making them available for search and query based on various kinds of metadata.

The Data Catalog provides an indexed interface to potentially heterogeneous data sources. These can range from a simple network directory with a collection of HDF5 files to a server hosting files with the Byzantine complexity of the IRIG 106 Ch. 10 Digital Recorder Standard used by US military test flight ranges. The nice thing about the Data Catalog is that it lets you query and select data based on computed metadata, for example “factory A, on Tuesdays when Ethylene output was below 10kg/hr”, or in a test flight data example “test flights involving a T-38 that exceeded 10,000 ft but stayed subsonic.”

With the Cloud Compute layer, an expert user can write code and test it locally on some subset of data from the Data Catalog. Then, when it is working to satisfaction, he or she can publish the code as a computational kernel to run on some other, larger subset of the data in the Data Catalog, using remote compute resources, which might be an HPC cluster or an Apache Spark server. That kernel is then available to other users in the organization, who do not have to understand the algorithm to run it on other data queries.

In the demo below, I showed hooking up the Data Catalog to some historical factory data stored on a remote machine.

Data Catalog View The Data Catalog allows selection of subsets of the data set for inspection and ad hoc analysis. Here, three channels are compared using a time window set on the time series data shown on the top plot.

Then using a locally tested and developed compute kernel, I did a principal component analysis on the frequencies of the channel data for a subset of the data in the Data Catalog. Then I published the kernel and ran it on the entire data set using the remote compute resource.

After the compute kernel has been published and run on the entire data set, then the result explorer tool enables further interactions.

Ultimately, the Canopy Platform is for building and distributing applications that solve hard problems.  Some of the products we have built on the platform are available today (for instance, Canopy Geoscience and Virtual Core), others are in prototype stage or have been developed for other companies with proprietary components and are not publicly available.

It was exciting to participate in the Big Data Analytics track this year, to see what others are doing in this area, and to be a part of many interesting and fruitful discussions. Thanks to Ivan Castillo and Chris Reed at Dow for arranging our participation.

Webinar: Using Python and LabVIEW Together to Rapidly Solve Engineering Problems

What: Presentation, demo, and Q&A with Collin Draughon, Software Product Manager, National Instruments, and Andrew Collette, Scientific Software Developer, Enthought

View Now  


Engineers and scientists all over the world are using Python and LabVIEW to solve hard problems in manufacturing and test automation, by taking advantage of the vast ecosystem of Python software.  But going from an engineer’s proof-of-concept to a stable, production-ready version of Python, smoothly integrated with LabVIEW, has long been elusive.

In this on-demand webinar and demo, we take a LabVIEW data acquisition app and extend it with Python’s machine learning capabilities, to automatically detect and classify equipment vibration.  Using a modern Python platform and the Python Integration Toolkit for LabVIEW, we show how easy and fast it is to install heavy-hitting Python analysis libraries, take advantage of them from live LabVIEW code, and finally deploy the entire solution, Python included, using LabVIEW Application Builder.


Python_LabVIEW_VI_Diagram

In this webinar, you’ll see how easy it is to solve an engineering problem by using LabVIEW and Python together.

What You’ll Learn:

  • How Python’s machine learning libraries can simplify a hard engineering problem
  • How to extend an existing LabVIEW VI using Python analysis libraries
  • How to quickly bundle Python and LabVIEW code into an installable app

Who Should Watch:

  • Engineers and managers interested in extending LabVIEW with Python’s ecosystem
  • People who need to easily share and deploy software within their organization
  • Current LabVIEW users who are curious what Python brings to the table
  • Current Python users in organizations where LabVIEW is used

How LabVIEW users can benefit from Python:

  • High-level, general purpose programming language ideally suited to the needs of engineers, scientists, and analysts
  • Huge, international user base representing industries such as aerospace, automotive, manufacturing, military and defense, research and development, biotechnology, geoscience, electronics, and many more
  • Tens of thousands of available packages, ranging from advanced 3D visualization frameworks to nonlinear equation solvers
  • Simple, beginner-friendly syntax and fast learning curve

View Now  

Presenters:

Collin Draughon, National Instruments, Software Product Manager Collin Draughon, National Instruments
Software Product Manager
Andrew Collette, Enthought, Scientific Software Developer Andrew Collette, Enthought
Scientific Software Developer
Python Integration Toolkit for LabVIEW core developer

Continue reading

Webinar – Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

View the Python for Professionals Webinar

What: Presentation and Q&A with Dr. Michael Connell, VP, Enthought Training Solutions
Who Should Watch: Anyone who wants to develop proficiency in Python for scientific, engineering, analytic, quantitative, or data science applications, including team leaders considering Python training for a group, learning and development coordinators supporting technical teams, or individuals who want to develop their Python skills for professional applications

View Recording  


Python is an uniquely flexible language – it can be used for everything from software engineering (writing applications) to web app development, system administration to “scientific computing” — which includes scientific analysis, engineering, modeling, data analysis, data science, and the like.

Unlike some “generalist” providers who teach generic Python to the lowest common denominator across all these roles, Enthought specializes in Python training for professionals in scientific and analytic fields. In fact, that’s our DNA, as we are first and foremost scientists, engineers, and data scientists ourselves, who just happen to use Python to drive our daily data wrangling, modeling, machine learning, numerical analysis, simulation, and more.

If you’re a professional using Python, you’ve probably had the thought, “how can I be better, smarter, and faster in using Python to get my work done?” That’s where Enthought comes in – we know that you don’t just want to learn generic Python syntax, but instead you want to learn the key tools that fit the work you do, you want hard-won expert insights and tips without having to discover them yourself through trial and error, and you want to be able to immediately apply what you learn to your work.

Bottom line: you want results and you want the best value for your invested time and money. These are some of the guiding principles in our approach to training.

In this webinar, we’ll give you the information you need to decide whether Enthought’s Python training is the right solution for your or your team’s unique situation, helping answer questions such as:

  • What kinds of Python training does Enthought offer? Who is it designed for? 
  • Who will benefit most from Enthought’s training (current skill levels, roles, job functions)?
  • What are the key things that make Enthought’s training different from other providers and resources?
  • What are the differences between Enthought’s training courses and who is each one best for?
  • What specific skills will I have after taking an Enthought training course?
  • Will I enjoy the curriculum, the way the information is presented, and the instructor?
  • Why do people choose to train with Enthought? Who has Enthought worked with and what is their feedback?

We’ll also provide a guided tour and insights about our our five primary course offerings to help you understand the fit for you or your team:

View Recording  


michael_connell-enthought-vp-training

Presenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Continue reading

Traits and TraitsUI: Reactive User Interfaces for Rapid Application Development in Python

 The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.The Enthought Tool Suite team is pleased to announce the release of Traits 4.6. Together with the release of TraitsUI 5.1 last year, these core packages of Enthought’s open-source rapid application development tools are now compatible with Python 3 as well as Python 2.7.  Long-time fans of Enthought’s open-source offerings will be happy to hear about the recent updates and modernization we’ve been working on, including the recent release of Mayavi 4.5 with Python 3 support, while newcomers to Python will be pleased that there is an easy way to get started with GUI programming which grows to allow you to build applications with sophisticated, interactive 2D and 3D visualizations.

A Brief Introduction to Traits and TraitsUI

Traits is a mature reactive programming library for Python that allows application code to respond to changes on Python objects, greatly simplifying the logic of an application.  TraitsUI is a tool for building desktop applications on top of the Qt or WxWidgets cross-platform GUI toolkits. Traits, together with TraitsUI, provides a programming model for Python that is similar in concept to modern and popular Javascript frameworks like React, Vue and Angular but targeting desktop applications rather than the browser.

Traits is also the core of Enthought’s open source 2D and 3D visualization libraries Chaco and Mayavi, drives the internal application logic of Enthought products like Canopy, Canopy Geoscience and Virtual Core, and Enthought’s consultants appreciate its the way it facilitates the rapid development of desktop applications for our consulting clients. It is also used by several open-source scientific software projects such as the HyperSpy multidimensional data analysis library and the pi3Diamond application for controlling diamond nitrogen-vacancy quantum physics experiments, and in commercial projects such as the PyRX Virtual Screening software for computational drug discovery.

 The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

Traits is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here) or via Enthought’s new edm command line package and environment management tool. Running

edm install traits

at the command line will install Traits into your current environment.

Traits

The Traits library provides a new type of Python object which has an event stream associated with each attribute (or “trait”) of the object that tracks changes to the attribute.  This means that you can decouple your application model much more cleanly: rather than an object having to know all the work which might need to be done when it changes its state, instead other parts of the application register the pieces of work that each of them need when the state changes and Traits automatically takes care running that code.  This results in simpler, more modular and loosely-coupled code that is easier to develop and maintain.

Traits also provides optional data validation and initialization that dramatically reduces the amount of boilerplate code that you need to write to set up objects into a working state and ensure that the state remains valid.  This makes it more likely that your code is correct and does what you expect, resulting in fewer subtle bugs and more immediate and useful errors when things do go wrong.

When you consider all the things that Traits does, it would be reasonable to expect that it may have some impact on performance, but the heart of Traits is written in C and knows more about the structure of the data it is working with than general Python code. This means that it can make some optimizations that the Python interpreter can’t, the net result of which is that code written with Traits is often faster than equivalent pure Python code.

Continue reading

New Year, New Enthought Products!

We’ve had a number of major product development efforts underway over the last year, and we’re pleased to share a lot of new announcements for 2017:

A New Chapter for the Enthought Python Distribution (EPD):
Python 3 and Intel MKL 2017

In 2004, Enthought released the first “Python: Enthought Edition,” a Python package distribution tailored for a scientific and analytic audience. In 2008 this became the Enthought Python Distribution (EPD), a self-contained installer with the "enpkg" command-line tool to update and manage packages. Since then, over a million users have benefited from Enthought’s tested, pre-compiled set of Python packages, allowing them to focus on their science by eliminating the hassle of setting up tools.

Enthought Python Distribution logo

Fast forward to 2017, and we now offer over 450 Python packages and a new era for the Enthought Python Distributionaccess to all of the packages in the new EPD is completely free to all users and includes packages and runtimes for both Python 2 and Python 3 with some exciting new additions. Our ever-growing list of packages includes, for example, the 2017 release of the MKL (Math Kernel Library), the fruit of an ongoing collaboration with Intel.

The New Enthought Deployment Server:
Secure, Onsite Access to EPD and Private Packages

enthought-deployment-server-centralized-management-illustration-v2

For those who are interested in having a private copy of the Enthought Python Distribution behind their firewall, as well as the ability to upload and manage internal private packages alongside it, we now offer the Enthought Deployment Server, an onsite version of the server we have been using for years to serve millions of Python packages to our users.

enthought-deployment-server-logoWith a local Enthought Deployment Server, your private copy will periodically synchronize with our master repository, on a schedule of your choosing, to keep you up to date with the latest releases. You can also set up private package repositories and control access to them using your existing LDAP or Active Directory service in a way that suits your organization.  We can even give you access to the packages (and their historical versions) inside of air-gapped networks! See our webinar introducing the Enthought Deployment Server.

Command Line Access to the New EPD and Flat Environments
via the Enthought Deployment Manager (EDM)

In 2013, we expanded the original EPD to introduce Enthought Canopy, coupling an integrated analysis environment with additional features such as a graphical package manager, documentation browser, and other user-friendly tools together with the Enthought Python Distribution to provide even more features to help “make science and analysis easy.”

With its MATLAB-like experience, Canopy has enabled countless engineers, scientists and analysts to perform sophisticated analysis, build models, and create cutting-edge data science algorithms. The all-in-one analysis platform for Python has also been widely adopted in organizations who want to provide a single, unified platform that can be used by everyone from data analysts to software engineers.

But we heard from a number of you that you also still wanted the capability to have flat, standalone environments not coupled to any editor or graphical tool. And we listened!  

enthought-deployment-manager-cli-screenshot2So last year, we finished building out our next-generation command-line tool that makes producing flat, standalone Python environments super easy.  We call it the Enthought Deployment Manager (or EDM for short), because it’s a tool to quickly deploy one or multiple Python environments with the full control over package versions and runtime environments.

EDM is also a valuable tool for use cases such as command line deployment on local machines or servers, web application deployment on AWS using Ansible and Amazon CloudFormation, rapid environment setup on continuous integration systems such as Travis-CI, Appveyor, or Jenkins/TeamCity, and more.

Finally, a new state-of-the-art package dependency solver included in the tool guarantees the consistency of your environment, and if your workflow requires switching between different environments, its sandboxed architecture makes it a snap to switch contexts.  All of this has also been designed with a focus on providing robust backward compatibility to our customers over time.  Find out more about EDM here.

Enthought Canopy 2.0:
Python 3 packages and New EDM Back End Infrastructure

Enthought Canopy LogoThe new Enthought Python Distribution (EPD) and Enthought Deployment Manager (EDM) will also provide additional benefits for Canopy.  Canopy 2.0 is just around the corner, which will be the first version to include Python 3 packages from EPD.

In addition, we have re-worked Canopy’s graphical package manager to use EDM as its back end, to take advantage of both the consistency and stability of the environments EDM provides, as well as its new package dependency solver.  By itself, this will provide a big boost in stability for users (ever found yourself wrapped up in a tangle of inconsistent package versions?).  Alongside the conversion of Canopy’s back end infrastructure to EDM, we have also included a substantial number of stability improvements and bug fixes.

Canopy’s Graphical Debugger adds external IPython kernel debugging support

On the integrated analysis environment side of Canopy, the graphical debugger and variable browser, first introduced in 2015, has gotten some nifty new features, including the ability to connect to and debug an external IPython kernel, in addition to a number of stability improvements.  (Weren’t aware you could connect to an external process?  Look for the context menu in the IPython console, use it to connect to the IPython kernel running, say, a Jupyter notebook, and debug away!)

Canopy Data Import Tool adds CSV exports and input file templates

Enthought Canopy Data Import ToolAlso, we’ve continued to add new features to the Canopy Data Import Tool since its initial release in May of 2016. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

The latest version of the tool (v. 1.0.9, shipping with Canopy 2.0) has some nice new features like CSV exporting, input file templates, and more. See Enthought’s blog for some great examples of how the Data Import Tool speeds data loading, wrangling and analysis.

What to Look Forward to in 2017

So where are we headed in 2017?  We have put a lot of effort into building a strong foundation with our core suite of products, and now we’re focused on continuing to deliver new value (our enterprise users in particular have a number of new features to look forward to).  First up, for example, you can look for expanded capabilities around Python environments, making it easy to manage multiple environments, or even standardize and distribute them in your organization.  With the tremendous advancements in our core products that took place in 2016, there are a lot of follow-on features we can deliver. Stay tuned for updates!

Have a specific feature you’d like to see in one of Enthought’s products? E-mail our product team at canopy.support@enthought.com and tell us about it!

Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

See the webinar

Enthought’s Pandas Mastery Workshop is designed to accelerate the development of skill and confidence with Python’s Pandas data analysis package — in just three days, you’ll look like an old pro! This course was created ground up by our training experts based on insights from the science of human learning, as well as what we’ve learned from over a decade of extensive practical experience of teaching thousands of scientists, engineers, and analysts to use Python effectively in their everyday work.

In this webinar, we’ll give you the key information and insight you need to evaluate whether the Pandas Mastery Workshop is the right solution to advance your data analysis skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous workshop attendees say about the course

See the Webinar


michael_connell-enthought-vp-trainingPresenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Continue reading