Pandas (the Python Data Analysis library) provides a powerful and comprehensive toolset for working with data. Fundamentally, Pandas provides a data structure, the DataFrame, that closely matches real world data, such as experimental results, SQL tables, and Excel spreadsheets, that no other mainstream Python package provides. In addition to that, it includes tools for reading and writing diverse files, data cleaning and reshaping, analysis and modeling, and visualization. Using Pandas effectively can give you super powers, regardless of whether you’re working in data science, finance, neuroscience, economics, advertising, web analytics, statistics, social science, or engineering.
However, learning Pandas can be a daunting task because the API is so rich and large. This is why we created a set of cheat sheets built around the data analysis workflow illustrated below. Each cheat sheet focuses on a given task. It shows you the 20% of functions you will be using 80% of the time, accompanied by simple and clear illustrations of the different concepts. Use them to speed up your learning, or as a quick reference to refresh your mind.
Here’s the summary of the content of each cheat sheet:
- Reading and Writing Data with Pandas: This cheat sheet presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
- Pandas Data Structures: Series and DataFrames: It presents the two main data structures, the DataFrame, and the Series. It explain how to think about them in terms of common Python data structure and how to create them. It gives guidelines about how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
- Plotting with Series and DataFrames: This cheat sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and difference of plotting data stored in Series or DataFrames.
- Computation with Series and DataFrames: This one codifies the behavior of DataFrames and Series as following 3 rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
- Manipulating Dates and Times Using Pandas: The first part of this cheatsheet describes how to create and manipulate time series data, one of Pandas’ most celebrated features. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column, while automatically excluding missing values.
- Combining Pandas DataFrames: The sixth cheat sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
- Split/Apply/Combine with DataFrames: “Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The cheat sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling and ewm (exponentially weighted functions).
- Reshaping Pandas DataFrames and Pivot Tables: The last cheatsheet introduces the concept of “tidy data”, where each observation, or sample, is a row, and each variable is a column. Tidy data is the optimal layout when working with Pandas. It illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or to a “wide” form.
Data Analysis Workflow
Ready to accelerate your skills with Pandas?
Enthought’s Pandas Mastery Workshop (for experienced Python users) and Python for Data Analysis (for those newer to Python) classes are ideal for those who work heavily with data. Contact us to learn more about onsite corporate or open class sessions.