Category Archives: Canopy

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect, and most datasets that we deal with on a day-to-day basis have missing values, often represented by “NA” or “NaN”. One reason the Pandas library is so popular in the data science community is its ability to handle data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem, you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the scikit-learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can handle missing values in three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note: Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.


Identifying Missing Values – The Hard Way: Using Pandas

If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, and all methods on a DataFrame.
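As a rough illustration of how these three methods combine, here is a minimal sketch. The sample DataFrame and its column names are made up for this example and are not taken from any dataset discussed in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are invented for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Boolean mask that is True wherever a cell is missing.
mask = df.isnull()

# Columns containing at least one missing value.
cols_with_any = mask.any(axis=0)
# Columns consisting entirely of missing values.
cols_all_missing = mask.all(axis=0)
# Rows containing at least one missing value.
rows_with_any = mask.any(axis=1)

print(cols_with_any[cols_with_any].index.tolist())
print(cols_all_missing[cols_all_missing].index.tolist())
print(df.index[rows_with_any].tolist())
```

The mask returned by isnull has the same shape as the DataFrame, so any and all simply reduce it along the chosen axis.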

Taking a detour, we have so far described missing values as being represented by NA or NaN. But what if the missing values in a column are instead values that aren’t of the same type as the rest of the cells in the column, for example a string in a column containing integers? Identifying such values in Pandas is not trivial.
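One possible approach, sketched below, is to coerce the column to numbers with pandas.to_numeric and then look for cells that failed the conversion. The column name and values are hypothetical, and this is only one way to tackle the problem, not necessarily what the Data Import Tool does internally.

```python
import pandas as pd

# Hypothetical column that should hold integers but contains a stray string.
raw = pd.Series([10, 20, "n/a", 40], name="count")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Cells that were present in the raw data but failed the conversion.
bad_cells = raw[numeric.isnull() & raw.notnull()]
print(bad_cells)
```

Comparing against raw.notnull() keeps genuinely empty cells out of the result, so only the mistyped values are reported.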

Identifying Missing Values – The Easy Way: Using the Data Import Tool

Highlighting Null Values using the Data Import Tool

Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values


Replacing Missing Values – The Hard Way: Using Pandas

While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.

After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace missing values. The arguments you pass to the fillna method will determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
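To make the effect of those arguments more concrete, here is a small sketch with made-up data. It uses the ffill method for forward filling, since recent pandas releases deprecate passing method= to fillna in favour of ffill and bfill.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are invented for illustration.
df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0],
    "units": [10.0, np.nan, np.nan],
})

# Replace every missing value with one constant.
filled_const = df.fillna(0)

# Replace per column by passing a dict of column name -> fill value.
filled_per_col = df.fillna({"price": df["price"].mean(), "units": 0})

# Propagate the last valid observation forward.
filled_forward = df.ffill()

# Integer-valued columns holding NaN are stored as float64;
# once filled, they can be cast back to an integer dtype.
units_as_int = filled_const["units"].astype("int64")
```

The last step shows the dtype issue mentioned above: filling alone leaves the column as float64, so the cast to an integer dtype is a separate, explicit step.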

Replacing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values and selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific value will open an additional dialog, prompting you to enter the value.

Fill missing values

Replace missing values in your DataFrame using the Canopy Data Import Tool


Removing Missing Values – The Hard Way: Using Pandas

While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
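Here is a short sketch of how the axis, how, thresh and subset arguments change what gets dropped. The DataFrame is hypothetical and exists only to show the behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are invented for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing (removes "c").
no_empty_cols = df.dropna(axis=1, how="all")

# Drop rows in which every remaining value is missing (removes row 3).
no_empty_rows = no_empty_cols.dropna(axis=0, how="all")

# Drop rows with any missing value among the remaining columns.
complete_rows = no_empty_cols.dropna(axis=0, how="any")

# Keep only rows with at least 2 non-missing values.
at_least_two = no_empty_cols.dropna(thresh=2)

# Only consider column "a" when deciding which rows to drop.
a_present = df.dropna(subset=["a"])
```

None of these calls modifies df in place; each returns a new DataFrame, which is the default behaviour when inplace is left as False.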

Removing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.

Delete Empty Rows & Columns

Delete empty cells in rows/columns using the Canopy Data Import Tool

Delete Empty Columns

Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool

We now have data that contains no missing values. So far, we’ve used the Data Import Tool to easily discover the missing values in our dataset and to remove or replace them. Finally, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now free of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other machine learning library of your choice.
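As a final sanity check before handing the data to a machine learning library, you might confirm that nothing is missing and then convert the DataFrame to arrays. The sketch below uses a made-up DataFrame with hypothetical feature columns x1 and x2 and a label column y.

```python
import pandas as pd

# Hypothetical cleaned DataFrame standing in for the imported data.
df = pd.DataFrame({
    "x1": [0.1, 0.2, 0.3],
    "x2": [1.0, 2.0, 3.0],
    "y": [0, 1, 0],
})

# Guard against missing values slipping through the cleaning step.
assert not df.isnull().values.any()

# Split into a 2-D feature array and a 1-D label array.
X = df[["x1", "x2"]].to_numpy()
y = df["y"].to_numpy()
```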

Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

New Year, New Enthought Products!

We’ve had a number of major product development efforts underway over the last year, and we’re pleased to share a lot of new announcements for 2017:

A New Chapter for the Enthought Python Distribution (EPD):
Python 3 and Intel MKL 2017

In 2004, Enthought released the first “Python: Enthought Edition,” a Python package distribution tailored for a scientific and analytic audience. In 2008 this became the Enthought Python Distribution (EPD), a self-contained installer with the "enpkg" command-line tool to update and manage packages. Since then, over a million users have benefited from Enthought’s tested, pre-compiled set of Python packages, allowing them to focus on their science by eliminating the hassle of setting up tools.

Fast forward to 2017, and we now offer over 450 Python packages and a new era for the Enthought Python Distribution: access to all of the packages in the new EPD is completely free to all users and includes packages and runtimes for both Python 2 and Python 3, with some exciting new additions. Our ever-growing list of packages includes, for example, the 2017 release of the MKL (Math Kernel Library), the fruit of an ongoing collaboration with Intel.

The New Enthought Deployment Server:
Secure, Onsite Access to EPD and Private Packages

For those who are interested in having a private copy of the Enthought Python Distribution behind their firewall, as well as the ability to upload and manage internal private packages alongside it, we now offer the Enthought Deployment Server, an onsite version of the server we have been using for years to serve millions of Python packages to our users.

With a local Enthought Deployment Server, your private copy will periodically synchronize with our master repository, on a schedule of your choosing, to keep you up to date with the latest releases. You can also set up private package repositories and control access to them using your existing LDAP or Active Directory service in a way that suits your organization.  We can even give you access to the packages (and their historical versions) inside of air-gapped networks! See our webinar introducing the Enthought Deployment Server.

Command Line Access to the New EPD and Flat Environments
via the Enthought Deployment Manager (EDM)

In 2013, we expanded the original EPD to introduce Enthought Canopy, coupling an integrated analysis environment with additional features such as a graphical package manager, documentation browser, and other user-friendly tools together with the Enthought Python Distribution to provide even more features to help “make science and analysis easy.”

With its MATLAB-like experience, Canopy has enabled countless engineers, scientists and analysts to perform sophisticated analysis, build models, and create cutting-edge data science algorithms. The all-in-one analysis platform for Python has also been widely adopted in organizations who want to provide a single, unified platform that can be used by everyone from data analysts to software engineers.

But we heard from a number of you that you also still wanted the capability to have flat, standalone environments not coupled to any editor or graphical tool. And we listened!  

So last year, we finished building out our next-generation command-line tool that makes producing flat, standalone Python environments super easy.  We call it the Enthought Deployment Manager (or EDM for short), because it’s a tool to quickly deploy one or multiple Python environments with full control over package versions and runtime environments.

EDM is also a valuable tool for use cases such as command line deployment on local machines or servers, web application deployment on AWS using Ansible and Amazon CloudFormation, rapid environment setup on continuous integration systems such as Travis-CI, Appveyor, or Jenkins/TeamCity, and more.

Finally, a new state-of-the-art package dependency solver included in the tool guarantees the consistency of your environment, and if your workflow requires switching between different environments, its sandboxed architecture makes it a snap to switch contexts.  All of this has also been designed with a focus on providing robust backward compatibility to our customers over time.  Find out more about EDM here.

Enthought Canopy 2.0:
Python 3 packages and New EDM Back End Infrastructure

The new Enthought Python Distribution (EPD) and Enthought Deployment Manager (EDM) will also provide additional benefits for Canopy.  Canopy 2.0, which is just around the corner, will be the first version to include Python 3 packages from EPD.

In addition, we have re-worked Canopy’s graphical package manager to use EDM as its back end, to take advantage of both the consistency and stability of the environments EDM provides, as well as its new package dependency solver.  By itself, this will provide a big boost in stability for users (ever found yourself wrapped up in a tangle of inconsistent package versions?).  Alongside the conversion of Canopy’s back end infrastructure to EDM, we have also included a substantial number of stability improvements and bug fixes.

Canopy’s Graphical Debugger adds external IPython kernel debugging support

On the integrated analysis environment side of Canopy, the graphical debugger and variable browser, first introduced in 2015, has gotten some nifty new features, including the ability to connect to and debug an external IPython kernel, in addition to a number of stability improvements.  (Weren’t aware you could connect to an external process?  Look for the context menu in the IPython console, use it to connect to the IPython kernel running, say, a Jupyter notebook, and debug away!)

Canopy Data Import Tool adds CSV exports and input file templates

Also, we’ve continued to add new features to the Canopy Data Import Tool since its initial release in May of 2016. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

The latest version of the tool (v. 1.0.9, shipping with Canopy 2.0) has some nice new features like CSV exporting, input file templates, and more. See Enthought’s blog for some great examples of how the Data Import Tool speeds data loading, wrangling and analysis.

What to Look Forward to in 2017

So where are we headed in 2017?  We have put a lot of effort into building a strong foundation with our core suite of products, and now we’re focused on continuing to deliver new value (our enterprise users in particular have a number of new features to look forward to).  First up, for example, you can look for expanded capabilities around Python environments, making it easy to manage multiple environments, or even standardize and distribute them in your organization.  With the tremendous advancements in our core products that took place in 2016, there are a lot of follow-on features we can deliver. Stay tuned for updates!

Have a specific feature you’d like to see in one of Enthought’s products? E-mail our product team at canopy.support@enthought.com and tell us about it!

Canopy Geoscience: Python-Based Analysis Environment for Geoscience Data

Today we officially release Canopy Geoscience 0.10.0, our Python-based analysis environment for geoscience data.

Canopy Geoscience integrates data I/O, visualization, and programming, in an easy-to-use environment. Canopy Geoscience is tightly integrated with Enthought Canopy’s Python distribution, giving you access to hundreds of high-performance scientific libraries to extract information from your data.


The Canopy Geoscience environment allows easy exploration of your data in 2D or 3D. The data is accessible from the embedded Python environment, and can be analyzed, modified, and immediately visualized with simple Python commands.

Feature and capability highlights for Canopy Geoscience version 0.10.0 include:

  • Read and write common geoscience data formats (LAS, SEG-Y, Eclipse, …)
  • 3D and 2D visualization tools
  • Well log visualization
  • Conversion from depth to time domain is integrated in the visualization tools using flexible depth-time models
  • Integrated IPython shell to programmatically access and analyse the data
  • Integrated with the Canopy editor for scripting
  • Extensible with custom-made plugins to fit your personal workflow

Contact us to learn more about Canopy Geoscience!

PyXLL: Deploy Python to Excel Easily

Today Enthought announced that it is now the worldwide distributor for PyXLL, and we’re excited to offer this key product for deploying Python models, algorithms and code to Excel. Technical teams can use the full power of Enthought Canopy, or another Python distro, and end-users can access the results in their familiar Excel environment. And it’s straightforward to set up and use.

Installing PyXLL from Enthought Canopy

PyXLL is available as a package subscription (with significant discounts for multiple users). Once you’ve purchased a subscription you can easily install it via Canopy’s Package Manager as shown in the screenshots below (note that at this time PyXLL is only available for Windows users). The rest of the configuration instructions are in the Quick Start portion of the documentation. PyXLL itself is a plug-in to Excel. When you start Excel, PyXLL loads into Excel and reads in Python modules that you have created for PyXLL. This makes PyXLL especially useful for organizations that want to manage their code centrally and deploy to multiple Excel users.

Enthought Canopy Package Manager   Install PyXLL from Enthought Canopy's Package Manager

Creating Excel Functions with PyXLL

To create a PyXLL Python Excel function, you use the @xl_func decorator to tell PyXLL the following function should be registered with Excel, what its argument types are, and optionally what its return type is. PyXLL also reads the function’s docstring and provides that in the Excel function description. As an example, I created a module my_pyxll_module.py and registered it with PyXLL via the …

Enthought Canopy 1.3 Released: Includes Move to Python 2.7.6

Enthought Canopy 1.3 is now available and users should see the update notification in the bottom right corner of the Canopy welcome screen (as shown in the image below). This is a fairly small update primarily focused on bug fixing and stability improvement. The biggest change is the move to Python 2.7.6 from 2.7.3.

Enthought Canopy Update Available Notification
The bottom right of the Enthought Canopy window notifies users of available updates

Python 2.7.6 rolls up a couple of minor updates to the core Python environment. The most important changes from our perspective are a number of security fixes required by some users as well as fixes for Mac OS “Mavericks.” Details can be found in the Python release notes, but in general the change should be transparent to most users. The only caveat is for users building Python eggs with native C or FORTRAN extensions and publishing those eggs to users who may still be running earlier versions of Canopy or Python 2.7.3 in general. In this case, it is safest to continue building against earlier versions of Canopy.

But isn’t updating Python versions painful, you may ask? In the past, yes, updating to a new Python version often required a new Python install and then re-installing all of your custom packages. However, with Canopy’s auto-update mechanism, it’s simply a matter of clicking the “Update available” link and choosing “Install and relaunch” or “Install after quit.” Canopy will automatically update the core Python installation and restart without impacting your environment. Additionally, whether you are running Canopy 1.1, 1.1.1, or 1.2, Canopy will jump straight to 1.3 and get you all of the latest updates.

We encourage all users to update to Canopy 1.3 as the 1.2 and 1.3 versions include a large number of stability fixes as well as cleaning up a lot of other less serious, but still important aspects of the user experience. For those new to Canopy, you can get Canopy here.

Enthought Canopy makes Python updates convenient
Enthought Canopy makes updates convenient with automatic downloads that install without impacting user environments

Enthought Canopy v1.2 is Out: PTVS, Mavericks, and Qt

Author: Jason McCampbell

Canopy 1.2 is out! The release of Mac OS “Mavericks” as a free update broke a few features, primarily IPython, so we held the release to try to make sure everything worked. That ended up taking longer than we wanted, but 1.2 is finally out and adds support for Mavericks. There is one Mavericks-specific Qt font issue that we are working on correcting; it causes the wrong system font to be selected, so UIs look less nice than they should.

Enthought Canopy integrated into PTVS

The biggest new feature is integration with Microsoft’s Python Tools for Visual Studio (PTVS) package. PTVS is a full, professional-grade development IDE for Python based on Visual Studio and provides mixed Python/C debugging. The ability to do mixed-mode debugging is a huge boon to software developers creating C (or FORTRAN) extensions to Python. Canopy v1.2 includes a custom DLL that allows us to integrate more completely with PTVS and solves some issues with auto-completion of Python standard library calls.

Beyond PTVS, we have added the Qt development tools, such as qmake and the UIC compiler, to the Canopy installation tree. These tools are available on all platforms now and enable Qt developers to access them from Canopy directly rather than having to build the tools themselves.

Canopy 1.2 includes a large number of smaller additions and stability improvements. Highlights can be found in the release notes and we encourage all users to update existing installs. As always, thanks for using Canopy and please don’t hesitate to drop us a note letting us know what you like or what you would like to see improved. You can contact us via the Help -> Suggestions/Feedback menu item or by sending email to canopy.support@enthought.com.

And you can download Canopy from the Enthought Store page.

Enthought Tool Suite Release 4.4 (Traits, Chaco, and more)

Authors: The ETS Developers

We’re happy to announce the release of multiple major projects, including:

  • Traits 4.4.0
  • Chaco 4.4.1
  • TraitsUI 4.4.0
  • Envisage 4.4.0
  • Pyface 4.4.0
  • Codetools 4.2.0
  • ETS 4.4.1

These packages form the core of the Enthought Tool Suite (ETS, http://code.enthought.com/projects), a collection of free, open-source components developed by Enthought and our partners to construct custom scientific applications. ETS includes a wide variety of components, including:

  • an extensible application framework (Envisage)

  • application building blocks (Traits, TraitsUI, Enaml, Pyface, Codetools)

  • 2-D and 3-D graphics libraries (Chaco, Mayavi, Enable)

  • scientific and math libraries (Scimath)

  • developer tools (Apptools)

You can install any of the packages using Canopy’s package manager, using the Canopy or EPD ‘enpkg’ command, from PyPI (using pip or easy_install), or by building them from source code on GitHub. For more details, see the ETS installation page.

Contributors

==================

This set of releases was an 8-month effort of Enthought developers along with:

  • Yves Delley
  • Pieter Aarnoutse
  • Jordan Ilott
  • Matthieu Dartiailh
  • Ian Delaney
  • Gregor Thalhammer

Many thanks to them!

General release notes

==================

  1. The major new feature in this Traits release is a new adaptation mechanism in the `traits.adaptation` package.  The new mechanism is intended to replace the older `traits.protocols` package.  Code written against `traits.protocols` will continue to work, although the `traits.protocols` API has been deprecated, and a warning will be logged on first use of `traits.protocols`.  See the ‘Advanced Topics’ section of the user manual for more details.

  2. These new releases of TraitsUI, Envisage, Pyface and Codetools include an update to this new adaptation mechanism.

  3. All ETS projects are now on TravisCI, making it easier to contribute to them.

  4. As of this release, the only Python versions that are actively supported are 2.6 and 2.7. As we are moving to future-proof ETS over the coming months, more code that supported Python 2.5 will be removed.

  5. We will retire chaco-users@enthought.com since it is lightly used, and we now recommend that all Chaco users send questions, requests and comments to enthought-dev@enthought.com or to StackOverflow (tag “enthought” and possibly “chaco”).

More details about the release of each project are given below. Please see the CHANGES.txt file inside each project for full details of the changes.

Happy coding!

The ETS developers

Traits 4.4.0 release notes

=====================

The Traits library enhances Python by adding optional type-checking and an event notification system, making it an ideal platform for writing data-driven applications.  It forms the foundation of the Enthought Tool Suite.

In addition to the above-mentioned rework of the adaptation mechanism, the release also includes improved support for using Cython with `HasTraits` classes, some new helper utilities for writing unit tests for Traits events, and a variety of bug fixes, stability enhancements, and internal code improvements.

Chaco 4.4.0 release notes

=====================

Chaco is a Python package for building efficient, interactive and custom 2-D plots and visualizations. While Chaco generates attractive static plots, it works particularly well for interactive data visualization and exploration.

This release introduces many improvements and bug fixes, including fixes to the generation of image files from plots, improvements to the ArrayPlotData to change multiple arrays at a time, and improvements to multiple elements of the plots such as tick labels and text overlays.

TraitsUI 4.4.0 release notes

======================

The TraitsUI project contains a toolkit-independent GUI abstraction layer, which is used to support the “visualization” features of the Traits package. TraitsUI allows developers to write against the TraitsUI API (views, items, editors, etc.), and let TraitsUI and the selected toolkit and back-end take care of the details of displaying them.

In addition to the above-mentioned update to the new Traits 4.4.0 adaptation mechanism, there have also been a number of improvements to drag and drop support for the Qt backend and some modernization of the use of WxPython to support Wx 2.9.  This release also includes a number of bug-fixes and minor functionality enhancements.

Envisage 4.4.0 release notes

=======================

Envisage is a Python-based framework for building extensible applications, providing a standard mechanism for features to be added to an application, whether by the original developer or by someone else.

In addition to the above-mentioned update to the new Traits 4.4.0 adaptation mechanism, this release also adds a new method to retrieve a service that is required by the application and provides documentation and test updates.

Pyface 4.4.0 release notes

======================

The pyface project provides a toolkit-independent library of Traits-aware widgets and GUI components, which are used to support the “visualization” features of Traits.

The biggest change in this release is support for the new adaptation mechanism in Traits 4.4.0. This release also includes Tasks support for Enaml 0.8 and a number of other minor changes, improvements and bug-fixes.

Codetools release notes

====================

The codetools project includes packages that simplify meta-programming and help the programmer separate data from code in Python. This library provides classes for performing dependency-analysis on blocks of Python code, and Traits-enhanced execution contexts that can be used as execution namespaces.

In addition to the above-mentioned update to the new Traits 4.4.0 adaptation mechanism, this release also includes a number of modernizations of the code base, including the consistent use of absolute imports, and a new execution manager for deferring events from Contexts.

PyQL and QuantLib: A Comprehensive Finance Framework

Authors: Kelsey Jordahl, Brett Murphy

Earlier this month at the first New York Finance Python User’s Group (NY FPUG) meetup, Kelsey Jordahl talked about how PyQL streamlines the development of Python-based finance applications using QuantLib. There were about 30 people attending the talk at the Cornell Club in New York City. We have a recording of the presentation below.

FPUG Meetup Presentation Screenshot

QuantLib is a free, open-source (BSD-licensed) quantitative finance package. It provides tools for financial instruments, yield curves, pricing engines, creating simulations, and date / time management. There is a lot more detail on the QuantLib website along with the latest downloads. Kelsey refers to a really useful blog / open-source book by one of the core QuantLib developers on implementing QuantLib. QuantLib also comes with different language bindings, including Python.

So why use PyQL if there are already Python bindings in QuantLib? Well, PyQL provides a much more Pythonic set of APIs, in short. Kelsey discusses some of the differences between the original QuantLib Python API and the PyQL API and how PyQL streamlines the resulting Python code. You get better integration with other packages like NumPy, better namespace usage and better documentation. PyQL is available up on GitHub in the PyQL repo. Kelsey uses the IPython Notebooks in the examples directory to explore PyQL and QuantLib and compares the use of PyQL versus the standard (SWIG) QuantLib Python APIs.

PyQL remains a work in progress, with goals to make its QuantLib coverage more complete, to make the API even more Pythonic, and to get a successful build on Windows (it works on Mac OS and Linux now). It’s open source, so feel free to step up and contribute!

For the details, check out the video of Kelsey’s presentation (44 minutes).

And here are the slides online if you want to check the links in the presentation.

If you are interested in working on either QuantLib or PyQL, let the maintainers know!