Tag Archives: DataFrames

Cheat Sheets: Pandas, the Python Data Analysis Library

Download all 8 Pandas Cheat Sheets

Learn more about the Python for Data Analysis and Pandas Mastery Workshop training courses

Pandas (the Python Data Analysis library) provides a powerful and comprehensive toolset for working with data. Fundamentally, Pandas provides a data structure, the DataFrame, that closely matches real world data, such as experimental results, SQL tables, and Excel spreadsheets, that no other mainstream Python package provides. In addition to that, it includes tools for reading and writing diverse files, data cleaning and reshaping, analysis and modeling, and visualization. Using Pandas effectively can give you super powers, regardless of whether you’re working in data science, finance, neuroscience, economics, advertising, web analytics, statistics, social science, or engineering.

However, learning Pandas can be a daunting task because the API is so rich and large. This is why we created a set of cheat sheets built around the data analysis workflow illustrated below. Each cheat sheet focuses on a given task. It shows you the 20% of functions you will be using 80% of the time, accompanied by simple and clear illustrations of the different concepts. Use them to speed up your learning, or as a quick reference to refresh your mind.

Here’s the summary of the content of each cheat sheet:

  1. Reading and Writing Data with Pandas: This cheat sheet presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
  2. Pandas Data Structures: Series and DataFrames: It presents the two main data structures, the DataFrame, and the Series. It explain how to think about them in terms of common Python data structure and how to create them. It gives guidelines about how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
  3. Plotting with Series and DataFrames: This cheat sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and difference of plotting data stored in Series or DataFrames.
  4. Computation with Series and DataFrames: This one codifies the behavior of DataFrames and Series as following 3 rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
  5. Manipulating Dates and Times Using Pandas: The first part of this cheatsheet describes how to create and manipulate time series data, one of Pandas’ most celebrated features. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column, while automatically excluding missing values.
  6. Combining Pandas DataFrames: The sixth cheat sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
  7. Split/Apply/Combine with DataFrames: “Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The cheat sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling and ewm (exponentially weighted functions).
  8. Reshaping Pandas DataFrames and Pivot Tables: The last cheatsheet introduces the concept of “tidy data”, where each observation, or sample, is a row, and each variable is a column. Tidy data is the optimal layout when working with Pandas. It illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or to a “wide” form.

Download all 8 Pandas Cheat Sheets

Data Analysis Workflow

Ready to accelerate your skills with Pandas?

Enthought’s Pandas Mastery Workshop (for experienced Python users) and Python for Data Analysis (for those newer to Python) classes are ideal for those who work heavily with data. Contact us to learn more about onsite corporate or open class sessions.


Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features

Enthought Canopy Data Import Tool

Download Canopy to try the Data Import Tool

In November 2016, we released Version 1.0.6 of the Data Import Tool (DIT), an addition to the Canopy data analysis environment. With the Data Import Tool, you can quickly import structured data files as Pandas DataFrames, clean and manipulate the data using a graphical interface, and create reusable Python scripts to speed future data wrangling.

For example, the Data Import Tool lets you delete rows and columns containing Null values or replace the Null values in the DataFrame with a specific value. It also allows you to create new columns from existing ones. All operations are logged and are reversible in the Data Import Tool so you can experiment with various workflows with safeguards against errors or forgetting steps.

What’s New in the Data Import Tool November 2016 Release

Pandas 0.19 support, re-usable templates for data munging, and more.

Over the last couple of releases, we added a number of new features and enhanced a number of existing ones. A few notable changes are:

  1. The Data Import Tool now supports the recently released Pandas version 0.19.0. With this update, the Tool now supports Pandas versions 0.16 through 0.19.
  2. The Data Import Tool now allows you to delete empty columns in the DataFrame, similar to existing option to delete empty rows.
  3. Tdelete-empty-columnshe Data Import Tool allows you to choose how to delete rows or columns containing Null values: “Any” or “All” methods are available.
  4. autosaved_scripts

    The Data Import Tool automatically generates a corresponding Python script for data manipulations performed in the GUI and saves it in your home directory re-use in future data wrangling.

    Every time you successfully import a DataFrame, the Data Import Tool automatically saves a generated Python script in your home directory. This way, you can easily review and reproduce your earlier work.

  5. The Data Import Tool generates a Template with every successful import. A Template is a file that contains all of the commands or actions you performed on the DataFrame and a unique Template file is generated for every unique data file. With this feature, when you load a data file, if a Template file exists corresponding to the data file, the Data Import Tool will automatically perform the operations you performed the last time. This way, you can save progress on a data file and resume your work.

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in Version 1.0.6 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation.



Example Use Case: Using the Data Import Tool to Speed Data Cleaning and Transformation

Now let’s take a look at how the Data Import Tool can be used to speed up the process of cleaning up and transforming data sets. As an example data set, let’s take a look at the Employee Compensation data from the city of San Francisco.

NOTE: You can follow the example step-by-step by downloading Canopy and starting a free 7 day trial of the data import tool

Step 1: Load data into the Data Import Tool

import-data-canopy-menuFirst we’ll download the data as a .csv file from the San Francisco Government data website, then open it from File -> Import Data -> From File… menu item in the Canopy Editor (see screenshot at right).

After loading the file, you should see the DataFrame below in the Data Import Tool:
Continue reading