Tag Archives: Pandas

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.


Identifying Missing Values – The Hard Way: Using Pandas

If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, all methods on a DataFrame.

Taking a detour, we have so far described missing values as being represented by NA or NaN. Instead, what if missing values in a column are values that aren’t of the same type as the rest of the cells in the column, say for example a string in a column containing integers? Doing so in Pandas is not trivial.

Identifying Missing Values – The Easy Way: Using the Data Import Tool

Highlighting Null Values using the Data Import Tool

Highlighting null values using the Data Import Tool

Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values


Replacing Missing Values – The Hard Way: Using Pandas

While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.

After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace a missing values. The arguments you pass to the fillna method will determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Replacing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific column will open an additional dialog, prompting you to enter the value.

Fill missing values

Replace missing values in your DataFrame using the Canopy Data Import Tool


Removing Missing Values – The Hard Way: Using Pandas

While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Removing Missing Values – The Easy Way: Using the Data Import Tool

With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.

Delete Empty Rows & Columns

Delete empty cells in rows/columns using the Canopy Data Import Tool

Delete Empty Columns

Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool

Finally, we have data that contains no missing values. So far, we’ve used the DIT to easily discover the missing values in our dataset and to remove/replace the missing values. Finally, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now void of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other Machine Learning library of your choice.

Ready to try the Canopy Data Import Tool?

Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today

This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way


Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

See the webinar

Enthought’s Pandas Mastery Workshop is designed to accelerate the development of skill and confidence with Python’s Pandas data analysis package — in just three days, you’ll look like an old pro! This course was created ground up by our training experts based on insights from the science of human learning, as well as what we’ve learned from over a decade of extensive practical experience of teaching thousands of scientists, engineers, and analysts to use Python effectively in their everyday work.

In this webinar, we’ll give you the key information and insight you need to evaluate whether the Pandas Mastery Workshop is the right solution to advance your data analysis skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous workshop attendees say about the course

See the Webinar


michael_connell-enthought-vp-trainingPresenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Continue reading

Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

This is the first blog in a series. See the second blog here: Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help, instead of having to learn the details of programming with the Pandas library.

The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table), without having to know the details of the Pandas-specific function calls and arguments. The Data Import Tool keeps track of all of the changes you make (in the form of Python code). That way, when you are done finding the right workflow for your data set, the Tool has a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling in the future.

At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code.

With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, it automatically determines the format of the file, whether or not the file is compressed, and intelligently loads the contents of the file into a Pandas DataFrame. It does so while taking into account various possibilities that often throw a monkey wrench into initial data loading: that the file might contain lines that are comments, it might contain a header row, the values in different columns could be of different types e.g. DateTime or Boolean, and many more possibilities as well.

Importing files or data into Pandas with the Canopy Data Import Tool

The Data Import Tool makes loading data into a Pandas DataFrame as simple as choosing a file or pasting a URL.

A Glimpse into Loading Data into Pandas DataFrames (The Hard Way)

The following 4 “inconvenience” examples show typical problems (and the manual solutions) that might arise if you are writing Pandas code to load data, which are automatically solved by the Data Import Tool, saving you time and frustration, and allowing you to get to the important work of data analysis more quickly.

Continue reading

Webinar: Fast Forward Through the “Dirty Work” of Data Analysis: New Python Data Import and Manipulation Tool Makes Short Work of Data Munging Drudgery

Python Import & Manipulation Tool Intro Webinar

Whether you are a data scientist, quantitative analyst, or an engineer, or if you are evaluating consumer purchase behavior, stock portfolios, or design simulation results, your data analysis workflow probably looks a lot like this:

Acquire > Wrangle > Analyze and Model > Share and Refine > Publish

The problem is that often 50 to 80 percent of time is spent wading through the tedium of the first two stepsacquiring and wrangling data – before even getting to the real work of analysis and insight. (See The New York Times, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights)

WHAT YOU’LL LEARN:

Enthought Canopy Data Import Tool

Try the Data Import Tool with your own data. Download here.

In this webinar we’ll demonstrate how the new Canopy Data Import Tool can significantly reduce the time you spend on data analysis “dirty work,” by helping you:

  • Load various data file types and URLs containing embedded tables into Pandas DataFrames
  • Perform common data munging tasks that improve raw data
  • Handle complicated and/or messy data
  • Extend the work done with the tool to other data files

WEBINAR RECORDING:
Continue reading