Tag Archives: Data Import Tool

What’s New in the Canopy Data Import Tool Version 1.1

New features in the Canopy Data Import Tool Version 1.1:
Support for Pandas v. 20, Excel / CSV export capabilities, and more

Enthought Canopy Data Import ToolWe’re pleased to announce a significant new feature release of the Canopy Data Import Tool, version 1.1. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling. Here are some of the notable updates in version 1.1:

1. Support for PyQt
The Data Import Tool now supports both PyQt and PySide backends. Python 3 support will also be available shortly.

2. Exporting DataFrames to csv/xlsx file formats
We understand that data exploration and manipulation are only one part of your data analysis process, which is why the Data Import Tool now provides a way for you to save the DataFrame as a CSV/XLSX file. This way, you can share processed data with your colleagues or feed this processed file to the next step in your data analysis pipeline.

3. Column Sort Indicators
In earlier versions of the Data Import Tool, it was not obvious that clicking on the right-end of the column header sorted the columns. With this release, we added sort indicators on every column, which can be pressed to sort the column in an ascending or descending fashion. And given the complex nature of the data we get, we know sorting the data based on single column is never enough, so we also made sorting columns using the Data Import Tool stable (ie, sorting preserves any existing order in the DataFrame).

Continue reading

Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution

Python 3 and multi-environment support, new state of the art package dependency solver, and over 450 packages now available free for all users

Enthought Canopy logoEnthought is pleased to announce the release of Canopy 2.1, a significant feature release that includes Python 3 and multi-environment support, a new state of the art package dependency solver, and access to over 450 pre-built and tested scientific and analytic Python packages completely free for all users. We highly recommend that all current Canopy users upgrade to this new release.

Ready to dive in? Download Canopy 2.1 here.


For those currently familiar with Canopy, in this blog we’ll review the major new features in this exciting milestone release, and for those of you looking for a tool to improve your workflow with Python, or perhaps new to Python from a language like MATLAB or R, we’ll take you through the key reasons that scientists, engineers, data scientists, and analysts use Canopy to enable their work in Python.

First, let’s talk about the latest and greatest in Canopy 2.1!

  1. Support for Python 3 user environments: Canopy can now be installed with a Python 3.5 user environment. Users can benefit from all the Canopy features already available for Python 2.7 (syntax checking, debugging, etc.) in the new Python 3 environments. Python 3.6 is also available (and will be the standard Python 3 in Canopy 2.2).
  2. All 450+ Python 2 and Python 3 packages are now completely free for all users: Technical support, full installers with all packages for offline or shared installation, and the premium analysis environment features (graphical debugger and variable browser and Data Import Tool) remain subscriber-exclusive benefits. See subscription options here to take advantage of those benefits.
  3. Built in, state of the art dependency solver (EDM or Enthought Deployment Manager): the new EDM back end (which replaces the previous enpkg) provides additional features for robust package compatibility. EDM integrates a specialized dependency solver which automatically ensures you have a consistent package set after installation, removal, or upgrade of any packages.
  4. Environment bundles, which allow users to easily share environments directly with co-workers, or across various deployment solutions (such as the Enthought Deployment Server, continuous integration processes like Travis-CI and Appveyor, cloud solutions like AWS or Google Compute Engine, or deployment tools like Ansible or Docker). EDM environment bundles not only allow the user to replicate the set of installed dependencies but also support persistence for constraint modifiers, the list of manually installed packages, and the runtime version and implementation. Continue reading

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN valuesThis is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.

Continue reading

Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

This is the first blog in a series. See the second blog here: Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

Importing files or data into Pandas with the Canopy Data Import ToolData exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help, instead of having to learn the details of programming with the Pandas library.

The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table), without having to know the details of the Pandas-specific function calls and arguments. The Data Import Tool keeps track of all of the changes you make (in the form of Python code). That way, when you are done finding the right workflow for your data set, the Tool has a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling in the future.

At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code.

With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, it automatically determines the format of the file, whether or not the file is compressed, and intelligently loads the contents of the file into a Pandas DataFrame. It does so while taking into account various possibilities that often throw a monkey wrench into initial data loading: that the file might contain lines that are comments, it might contain a header row, the values in different columns could be of different types e.g. DateTime or Boolean, and many more possibilities as well.

Importing files or data into Pandas with the Canopy Data Import Tool

The Data Import Tool makes loading data into a Pandas DataFrame as simple as choosing a file or pasting a URL.

A Glimpse into Loading Data into Pandas DataFrames (The Hard Way)

The following 4 “inconvenience” examples show typical problems (and the manual solutions) that might arise if you are writing Pandas code to load data, which are automatically solved by the Data Import Tool, saving you time and frustration, and allowing you to get to the important work of data analysis more quickly.

Continue reading