Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

This is the first blog in a series. See the second blog here: Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help you do so without first having to learn the details of programming with the Pandas library.

The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table) without having to know the details of the Pandas-specific function calls and arguments. The Tool keeps track of all of the changes you make, in the form of Python code, so when you have found the right workflow for your data set, it holds a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling.

At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code.

With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, it automatically determines the format of the file, whether or not the file is compressed, and intelligently loads the contents of the file into a Pandas DataFrame. It does so while taking into account various possibilities that often throw a monkey wrench into initial data loading: the file might contain comment lines, it might contain a header row, the values in different columns could be of different types (e.g., DateTime or Boolean), and more.

Importing files or data into Pandas with the Canopy Data Import Tool

The Data Import Tool makes loading data into a Pandas DataFrame as simple as choosing a file or pasting a URL.

A Glimpse into Loading Data into Pandas DataFrames (The Hard Way)

The following four “inconvenience” examples show typical problems (and the manual solutions) that can arise if you are writing Pandas code to load data. The Data Import Tool solves them automatically, saving you time and frustration and letting you get to the important work of data analysis more quickly.

Let’s say you want to load data from a file yourself. After searching the Pandas documentation a bit, you will come across the pandas.read_table function, which loads the contents of a file into a Pandas DataFrame. But it’s never so easy in practice: pandas.read_table and the other functions you might find assume certain defaults, which might be at odds with the data in your file.

Inconvenience #1: Data in the first row will automatically be used as a header.  Let’s say that your file (like this one: [wind.data]) uses whitespace as the separator between columns and doesn’t have a row containing column names. pandas.read_table assumes by default that your file contains a header row and uses tabs for delimiters. If you don’t tell it otherwise, Pandas will use the data from the first row in your file as column names, which is clearly wrong in this case.

From the docs, you can discover that this behavior can be turned off by passing header=None (to inform Pandas that a header row doesn’t exist) and sep='\s+' (to use runs of whitespace as the separator) to pandas.read_table. First, here is what happens when only the separator is specified:

In [1]: df = pandas.read_table('wind.data', sep='\s+')
In [2]: df.head()
61  1  1.1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83  \
0  61  1    2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79
1  61  1    3  18.50  16.88  12.33  10.13  11.17  6.17  11.25   8.04   8.50
12.58  18.50  15.04.1
0   9.67  17.54    13.83
1   7.67  12.75    12.71

Without the header=None kwarg, you can see above that the first row of data was treated as column names. Adding header=None gives us the expected behavior:

In [3]: df = pandas.read_table('wind.data', header=None, sep='\s+')
In [4]: df.head()
0   1   2      3      4      5      6      7     8      9      10     11  \
0  61   1   1  15.04  14.96  13.17   9.29  13.96  9.87  13.67  10.25  10.83
1  61   1   2  14.71  16.88  10.83   6.50  12.62  7.67  11.50  10.04   9.79
12     13     14
0  12.58  18.50  15.04
1   9.67  17.54  13.83

This is the behavior we expected, once we tell Pandas (via header=None) that the file does not contain a row of column names, and specify the separator.
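Once Pandas knows there is no header row, you can also supply your own column names via the names keyword. Here is a minimal sketch, using a small inline sample shaped like wind.data; the column names are made up for illustration and are not the real wind.data schema:

```python
import io
import pandas

# Inline sample mimicking wind.data: whitespace-separated, no header row.
raw = io.StringIO(
    "61 1 1 15.04 14.96\n"
    "61 1 2 14.71 16.88\n"
)

# header=None stops Pandas from consuming the first data row as names;
# sep=r'\s+' treats runs of whitespace as one delimiter; names supplies
# our own (hypothetical) column labels.
df = pandas.read_table(raw, header=None, sep=r'\s+',
                       names=['year', 'month', 'day', 'RPT', 'VAL'])
print(df.columns.tolist())
```

With names given, header=None is implied for the data rows, but passing both makes the intent explicit.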

[File : test_data_comments.txt]

Inconvenience #2: Commented lines cause the data load to fail.  Next, let’s say that your file contains commented lines which start with a #. Pandas doesn’t understand this by default, and trying to load the data into a DataFrame will either fail with an error or, worse, succeed without notifying you that rows in the DataFrame might contain erroneous data from the commented lines. (Unrecognized comment lines can also prevent correct inference of column types.)

Again, you can tell pandas.read_table that commented lines exist in your file and to skip them using comment='#':

In [1]: df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
CParserError                              Traceback (most recent call last)
<ipython-input-10-b5cd8eee4851> in <module>()
----> 1 df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5

As mentioned earlier, if you are lucky, Pandas will fail with a CParserError, complaining that rows in the data file contain different numbers of columns. Needless to say, the error message gives no hint that the culprit is an unrecognized comment line. Passing comment='#' fixes the load:

In [2]: df = pandas.read_table('test_data_comments.txt', sep=',', comment='#', header=None)
In [3]: df
0   1    2      3            4
0  1  False  1.0    one   2015-01-01
1  2   True  2.0    two   2015-01-02
2  3  False  3.0  three  2015-01-03
3  4   True  4.0   four  2015-01-04

And we can read the file contents correctly once we tell Pandas that ‘#’ is the character that commented lines in the file start with.
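Skipping comment lines also lets Pandas infer sensible column types, as hinted above. A small sketch with synthetic data standing in for test_data_comments.txt:

```python
import io
import pandas

# Synthetic stand-in for test_data_comments.txt: one comment line,
# then rows of int, bool, and float values.
raw = io.StringIO(
    "# this line is a comment\n"
    "1,False,1.0\n"
    "2,True,2.0\n"
)

# With comment='#', the commented line never enters the DataFrame,
# so dtypes are inferred from clean data only.
df = pandas.read_table(raw, sep=',', comment='#', header=None)
print(df.dtypes.tolist())
```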

[File : ncaa_basketball_2016.txt]

Inconvenience #3: Fixed-width formatted data will cause the data load to fail.  Now let’s say that your file contains data in a fixed-width format. Trying to load this data using pandas.read_table will fail.

Dig around a little and you will come across the function pandas.read_fwf, which is the suggested way to load data from fixed-width files, not pandas.read_table.

In [1]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None)
In [2]: df.head()
0  2016-02-25 @Ark Little Rock          72  UT Ar...
1  2016-02-25  ULM                      66 @South...

Those of you familiar with Pandas will recognize that the DataFrame above, created from the file, contains only one column, labelled 0, which is clearly wrong because there are 4 distinct columns in the file.

In [3]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
CParserError                              Traceback (most recent call last)
<ipython-input-28-db4f2f128b37> in <module>()
----> 1 df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
CParserError: Error tokenizing data. C error: Expected 8 fields in line 55, saw 9

If we didn’t know better, we would have assumed that the delimiter/separator used in the file was whitespace. We can tell Pandas to load the file again, treating runs of whitespace (\s+) as the separator. But, as you can clearly see above, that raises a CParserError, complaining that one row contains more columns of data than the previous rows.

In [4]: df = pandas.read_fwf('ncaa_basketball_2016.txt', header=None)
In [5]: df.head()
0                 1   2               3   4    5
0  2016-02-25  @Ark Little Rock  72    UT Arlington  60  NaN
1  2016-02-25               ULM  66  @South Alabama  59  NaN

And finally, using pandas.read_fwf instead of pandas.read_table gives us a DataFrame that is close to what we expected, given the data in the file.
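When read_fwf’s automatic column detection guesses wrong, you can pin the column boundaries yourself with the colspecs argument (a list of half-open character ranges). A sketch using two inline rows shaped like the excerpt above; the boundaries here fit this sample, not necessarily the real file:

```python
import io
import pandas

# Two fixed-width rows shaped like ncaa_basketball_2016.txt.
raw = io.StringIO(
    "2016-02-25 @Ark Little Rock  72 UT Arlington    60\n"
    "2016-02-25 ULM               66 @South Alabama  59\n"
)

# colspecs pins the exact [start, stop) character range of each column,
# which is more robust than letting read_fwf infer the boundaries.
df = pandas.read_fwf(
    raw,
    colspecs=[(0, 10), (11, 27), (29, 31), (32, 46), (48, 50)],
    header=None,
    names=['date', 'away', 'away_pts', 'home', 'home_pts'],
)
print(df['home'].tolist())
```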

Inconvenience #4: NA is not recognized as text; it is automatically converted to a missing value.  Finally, let’s assume that your raw data contains the string NA, which in this specific case is used to represent North America. By default, pandas.read_table interprets this string as a missing value and silently converts it to NaN, without informing the user. One of the things that the Zen of Python says is that explicit is better than implicit. In that spirit, the Tool explicitly lists the values which will be interpreted as missing (NaN).

The user can remove NA (or any of the other values) from this list to prevent it from being interpreted as missing, as demonstrated with the following file:

[File : test_data_na_values.csv]

In [2]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None)
In [3]: df
0  1       2
0 NaN  1    True
1 NaN  2   False
2 NaN  3   False
3 NaN  4    True
In [4]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None, keep_default_na=False, na_values=[])
In [5]: df
0  1       2
0  NA  1    True
1  NA  2   False
2  NA  3   False
3  NA  4    True
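The same pair of keywords also lets you keep NA as text while still treating other markers, such as empty fields, as missing. A sketch with inline data; read_csv is used here, but read_table with sep=',' behaves the same way:

```python
import io
import pandas

# 'NA' means North America here, but an empty field is genuinely missing.
raw = io.StringIO(
    "NA,1,True\n"
    "NA,2,\n"
    "EU,3,False\n"
)

# keep_default_na=False drops Pandas' built-in list of NA strings
# (which includes 'NA'); na_values then re-adds only what we want.
df = pandas.read_csv(raw, header=None, keep_default_na=False, na_values=[''])
print(df[0].tolist())
```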

If your intentions were to jump straight into data exploration and manipulation, then the above points are some of the inconveniences that you will have to deal with, requiring you to learn the various arguments that need to be passed to pandas.read_table before you can load your data correctly and get to your analysis.
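Put together, handling these inconveniences by hand means stacking up keyword arguments in a single call. A minimal sketch over synthetic data (not one of the files above) combining the fixes:

```python
import io
import pandas

# Synthetic file exhibiting inconveniences #1, #2, and #4 at once:
# no header row, a comment line, and literal 'NA' text.
raw = io.StringIO(
    "# region counts\n"
    "NA 1 True\n"
    "NA 2 False\n"
)

df = pandas.read_table(
    raw,
    sep=r'\s+',             # runs of whitespace as the separator (#1)
    header=None,            # no row of column names (#1)
    comment='#',            # skip commented lines (#2)
    keep_default_na=False,  # don't treat 'NA' as missing (#4)
    na_values=[],           # ...and add no NA markers back
)
print(df[0].tolist())
```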

Loading Data with the Data Import Tool (The Easy Way)

Use the Data Import Tool to automatically set up the correct file assumptions

The Canopy Data Import Tool automatically circumvents several common data loading inconveniences and errors by simply setting up the correct file assumptions in the Edit Command dialog box.

The Data Import Tool takes care of all of these problems for you, allowing you to fast forward to the important work of data exploration and manipulation. It automatically:

  1. Infers whether your file contains a row of column names;
  2. Detects whether your file contains commented lines, and what the comment character is;
  3. Infers what delimiter the file uses, or whether the data is in a fixed-width format.
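The Tool’s own inference logic isn’t public, but the standard library’s csv.Sniffer illustrates the same idea: guess the delimiter and the presence of a header row from a sample of the file.

```python
import csv

# A small sample of a hypothetical CSV file.
sample = "name,score\nalice,10\nbob,7\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # guesses the delimiter
has_header = sniffer.has_header(sample)  # guesses if row 1 is a header

print(dialect.delimiter, has_header)
```

csv.Sniffer only handles delimited text, of course; detecting comment characters or fixed-width layouts needs heuristics beyond what the standard library provides.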

The Data Import Tool as a Learning Resource: Using Auto-Generated Python/Pandas code

So far, we have talked about how the Tool can help you get started with data exploration without needing to understand the Pandas library and its intricacies. But what if you are also interested in learning about the Pandas library? That’s where the Python Code pane in the Data Import Tool can help.

As you can see from the screenshot below, the Data Import Tool generates Pandas/Python code for every command you execute. This way, you can explore and learn about the Pandas library using the Tool.

See generated Python / Pandas code to help learn underlying code for data wrangling tasks.

View the underlying Python / Pandas code in the Data Import Tool to help learn Pandas code, without slowing down your work.

Finally, once you are done loading data from the file and manipulating the DataFrame, you can export the DataFrame to Canopy’s IPython console for further analysis and visualization. Simply click Use DataFrame at the bottom-right corner and the Tool will export the DataFrame to Canopy’s IPython pane, as you can see below.

Import the cleaned data into the Canopy IPython console for further data analysis and visualization.

Additional resources:

Watch a 2-minute demo video to see how the Canopy Data Import Tool works.

See the webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging.
