This is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way
No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.
But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.
Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.
If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an
Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the
Imputer class is used. Similarly, the
decode_csv function in the TensorFlow library can be passed a
record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.
The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.
Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:
- Identifying missing values in data
- Replacing missing values in data, and
- Removing missing values from data.
Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.