Author Archives: cgodshall

Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features

Enthought Canopy Data Import Tool

Download Canopy to try the Data Import Tool

In November 2016, we released Version 1.0.6 of the Data Import Tool (DIT), an addition to the Canopy data analysis environment. With the Data Import Tool, you can quickly import structured data files as Pandas DataFrames, clean and manipulate the data using a graphical interface, and create reusable Python scripts to speed future data wrangling.

For example, the Data Import Tool lets you delete rows and columns containing Null values or replace the Null values in the DataFrame with a specific value. It also allows you to create new columns from existing ones. All operations are logged and are reversible in the Data Import Tool so you can experiment with various workflows with safeguards against errors or forgetting steps.


What’s New in the Data Import Tool November 2016 Release

Pandas 0.19 support, re-usable templates for data munging, and more.

Over the last couple of releases, we added a number of new features and enhanced a number of existing ones. A few notable changes are:

  1. The Data Import Tool now supports the recently released Pandas version 0.19.0. With this update, the Tool now supports Pandas versions 0.16 through 0.19.
  2. The Data Import Tool now allows you to delete empty columns in the DataFrame, similar to existing option to delete empty rows.
  3. Tdelete-empty-columnshe Data Import Tool allows you to choose how to delete rows or columns containing Null values: “Any” or “All” methods are available.
  4. autosaved_scripts

    The Data Import Tool automatically generates a corresponding Python script for data manipulations performed in the GUI and saves it in your home directory re-use in future data wrangling.

    Every time you successfully import a DataFrame, the Data Import Tool automatically saves a generated Python script in your home directory. This way, you can easily review and reproduce your earlier work.

  5. The Data Import Tool generates a Template with every successful import. A Template is a file that contains all of the commands or actions you performed on the DataFrame and a unique Template file is generated for every unique data file. With this feature, when you load a data file, if a Template file exists corresponding to the data file, the Data Import Tool will automatically perform the operations you performed the last time. This way, you can save progress on a data file and resume your work.

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in Version 1.0.6 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation.

 

 


Example Use Case: Using the Data Import Tool to Speed Data Cleaning and Transformation

Now let’s take a look at how the Data Import Tool can be used to speed up the process of cleaning up and transforming data sets. As an example data set, let’s take a look at the Employee Compensation data from the city of San Francisco.

NOTE: You can follow the example step-by-step by downloading Canopy and starting a free 7 day trial of the data import tool

Step 1: Load data into the Data Import Tool

import-data-canopy-menuFirst we’ll download the data as a .csv file from the San Francisco Government data website, then open it from File -> Import Data -> From File… menu item in the Canopy Editor (see screenshot at right).

After loading the file, you should see the DataFrame below in the Data Import Tool:
data-frame-view
Continue reading