Tag Archives: Python

Cheat Sheets: Pandas, the Python Data Analysis Library

Download all 8 Pandas Cheat Sheets

Learn more about the Python for Data Analysis and Pandas Mastery Workshop training courses

Pandas (the Python Data Analysis library) provides a powerful and comprehensive toolset for working with data. Fundamentally, Pandas provides a data structure, the DataFrame, that closely matches real world data, such as experimental results, SQL tables, and Excel spreadsheets, that no other mainstream Python package provides. In addition to that, it includes tools for reading and writing diverse files, data cleaning and reshaping, analysis and modeling, and visualization. Using Pandas effectively can give you super powers, regardless of whether you’re working in data science, finance, neuroscience, economics, advertising, web analytics, statistics, social science, or engineering.

However, learning Pandas can be a daunting task because the API is so rich and large. This is why we created a set of cheat sheets built around the data analysis workflow illustrated below. Each cheat sheet focuses on a given task. It shows you the 20% of functions you will be using 80% of the time, accompanied by simple and clear illustrations of the different concepts. Use them to speed up your learning, or as a quick reference to refresh your mind.

Here’s the summary of the content of each cheat sheet:

  1. Reading and Writing Data with Pandas: This cheat sheet presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
  2. Pandas Data Structures: Series and DataFrames: It presents the two main data structures, the DataFrame, and the Series. It explain how to think about them in terms of common Python data structure and how to create them. It gives guidelines about how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
  3. Plotting with Series and DataFrames: This cheat sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and difference of plotting data stored in Series or DataFrames.
  4. Computation with Series and DataFrames: This one codifies the behavior of DataFrames and Series as following 3 rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
  5. Manipulating Dates and Times Using Pandas: The first part of this cheatsheet describes how to create and manipulate time series data, one of Pandas’ most celebrated features. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column, while automatically excluding missing values.
  6. Combining Pandas DataFrames: The sixth cheat sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
  7. Split/Apply/Combine with DataFrames: “Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The cheat sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling and ewm (exponentially weighted functions).
  8. Reshaping Pandas DataFrames and Pivot Tables: The last cheatsheet introduces the concept of “tidy data”, where each observation, or sample, is a row, and each variable is a column. Tidy data is the optimal layout when working with Pandas. It illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or to a “wide” form.

Download all 8 Pandas Cheat Sheets

Data Analysis Workflow

Ready to accelerate your skills with Pandas?

Enthought’s Pandas Mastery Workshop (for experienced Python users) and Python for Data Analysis (for those newer to Python) classes are ideal for those who work heavily with data. Contact us to learn more about onsite corporate or open class sessions.

 

Webinar: Machine Learning Mastery Workshop: An Exclusive Peek “Under the Hood” of Enthought Training

What: A guided walkthrough and live Q&A about Enthought’s new “Machine Learning Mastery Workshop” training course.

Who Should Watch: If predictive modeling and analytics would be valuable in your work, come to the webinar to find out what all the fuss is about and what there is to know. Whether you are looking to get started with machine learning, interested in refining your machine learning skills, or want to transfer your skills from another toolset to Python, come to the webinar to find out if Enthought’s highly interactive, expertly taught Machine Learning Mastery Workshop might be a good fit for accelerating your development!

View


Why Has Machine Learning Become So Popular?

Artificial Intelligence and Machine Learning are a defining feature of the 21st century and are quickly becoming a key factor in gaining and maintaining competitive advantage in each industry which incorporates them. Why is machine learning so beneficial?  Because it provides a fast and flexible way to build models that can surface signal, find patterns, and predict future behavior.  These powerful models are used for:

  • Forecasting supply chain availability
  • Clustering product defects for QA
  • Anticipating movements in financial markets
  • Predicting chemical tolerances
  • Optimizing the placement of advertisements
  • Managing process engineering
  • Modeling reservoir production
  • and much more.

In response to growing demand for Machine Learning expertise, Enthought has developed an intensive 3-day guided practicum to bring you up to speed quickly on key concepts and skills in this exciting realm. Join us in this webinar for an in-depth overview of Enthought’s Machine Learning Mastery Workshop — a training course designed to accelerate the development of intuition, skill, and confidence in applying machine learning methods to solve real-world problems.

In the webinar we’ll describe how Enthought’s training course combines conceptual knowledge of machine learning models with intensive experience applying them to real-world data to develop skill in applying Python’s machine learning tools, such as the scikit-learn package, to make predictions about complicated phenomena by leveraging the information contained in numerical data, natural language, 2D images, and discrete categories.

The hands-on, interactive course was created ground up by our training experts to enable you to develop transferable skills in Machine Learning that you can apply back at work the next day.

In this webinar, we’ll give you the key information and insight you need to quickly evaluate whether Enthought’s Machine Learning Mastery Workshop course is the right solution for you to build skills in using Python for advanced analytics, including:

  • Who will benefit most from the course, and what pre-requisite knowledge is required
  • What topics the course covers – a guided tour
  • What new knowledge, skills, and capabilities you’ll take away, and how the course design supports those outcomes
  • What the (highly interactive) learning experience is like
  • Why this course is different from other training alternatives (with a preview of actual course materials!)
  • What previous workshop attendees say about our courses

View


Presenter: Dr. Dillon Niederhut,

Enthought Training Instructor

Ph.D., University of California at Berkeley

 


 

Additional Resources

Upcoming Open Machine Learning Mastery Workshop Sessions:

Austin, TX, Feb. 21-23, 2017
Houston, TX, Apr. 18-20, 2018
Cambridge, UK, May 9-11, 2018

Upcoming Open Python for Data Science Sessions:

New York City, NY, Dec. 4-8, 2018
London, UK, Feb. 19-23, 2018
Washington, DC, Apr. 23-27, 2018
San Jose, CA, May 14-18, 2018

Have a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.

Download Enthought’s Machine Learning with Python’s Scikit-Learn Cheat Sheets

Enthought's Machine Learning with Python Cheat Sheets

Additional Webinars in the Training Series:

Python for MATLAB Users: What You Need to Know

Python for Scientists and Engineers: A Tour of Enthought’s Professional Technical Training Course

Python for Data Science: A Tour of Enthought’s Professional Technical Training Course

Python for Professionals: The Complete Guide to Enthought’s Technical Training Courses

An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

What’s New in the Canopy Data Import Tool Version 1.1

New features in the Canopy Data Import Tool Version 1.1:
Support for Pandas v. 20, Excel / CSV export capabilities, and more

Enthought Canopy Data Import ToolWe’re pleased to announce a significant new feature release of the Canopy Data Import Tool, version 1.1. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling. Here are some of the notable updates in version 1.1:

1. Support for PyQt
The Data Import Tool now supports both PyQt and PySide backends. Python 3 support will also be available shortly.

2. Exporting DataFrames to csv/xlsx file formats
We understand that data exploration and manipulation are only one part of your data analysis process, which is why the Data Import Tool now provides a way for you to save the DataFrame as a CSV/XLSX file. This way, you can share processed data with your colleagues or feed this processed file to the next step in your data analysis pipeline.

3. Column Sort Indicators
In earlier versions of the Data Import Tool, it was not obvious that clicking on the right-end of the column header sorted the columns. With this release, we added sort indicators on every column, which can be pressed to sort the column in an ascending or descending fashion. And given the complex nature of the data we get, we know sorting the data based on single column is never enough, so we also made sorting columns using the Data Import Tool stable (ie, sorting preserves any existing order in the DataFrame).

Continue reading

Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution

Python 3 and multi-environment support, new state of the art package dependency solver, and over 450 packages now available free for all users

Enthought Canopy logoEnthought is pleased to announce the release of Canopy 2.1, a significant feature release that includes Python 3 and multi-environment support, a new state of the art package dependency solver, and access to over 450 pre-built and tested scientific and analytic Python packages completely free for all users. We highly recommend that all current Canopy users upgrade to this new release.

Ready to dive in? Download Canopy 2.1 here.


For those currently familiar with Canopy, in this blog we’ll review the major new features in this exciting milestone release, and for those of you looking for a tool to improve your workflow with Python, or perhaps new to Python from a language like MATLAB or R, we’ll take you through the key reasons that scientists, engineers, data scientists, and analysts use Canopy to enable their work in Python.

First, let’s talk about the latest and greatest in Canopy 2.1!

  1. Support for Python 3 user environments: Canopy can now be installed with a Python 3.5 user environment. Users can benefit from all the Canopy features already available for Python 2.7 (syntax checking, debugging, etc.) in the new Python 3 environments. Python 3.6 is also available (and will be the standard Python 3 in Canopy 2.2).
  2. All 450+ Python 2 and Python 3 packages are now completely free for all users: Technical support, full installers with all packages for offline or shared installation, and the premium analysis environment features (graphical debugger and variable browser and Data Import Tool) remain subscriber-exclusive benefits. See subscription options here to take advantage of those benefits.
  3. Built in, state of the art dependency solver (EDM or Enthought Deployment Manager): the new EDM back end (which replaces the previous enpkg) provides additional features for robust package compatibility. EDM integrates a specialized dependency solver which automatically ensures you have a consistent package set after installation, removal, or upgrade of any packages.
  4. Environment bundles, which allow users to easily share environments directly with co-workers, or across various deployment solutions (such as the Enthought Deployment Server, continuous integration processes like Travis-CI and Appveyor, cloud solutions like AWS or Google Compute Engine, or deployment tools like Ansible or Docker). EDM environment bundles not only allow the user to replicate the set of installed dependencies but also support persistence for constraint modifiers, the list of manually installed packages, and the runtime version and implementation. Continue reading

SciPy 2017 Conference to Showcase Leading Edge Developments in Scientific Computing with Python

Renowned scientists, engineers and researchers from around the world to gather July 10-16, 2017 in Austin, TX to share and collaborate to advance scientific computing tool


AUSTIN, TX – June 6, 2017 –
Enthought, as Institutional Sponsor, today announced the SciPy 2017 Conference will be held July 10-16, 2017 in Austin, Texas. At this 16th annual installment of the conference, scientists, engineers, data scientists and researchers will participate in tutorials, talks and developer sprints designed to foster the continued rapid growth of the scientific Python ecosystem. This year’s attendees hail from over 25 countries and represent academia, government, national research laboratories, and industries such as aerospace, biotechnology, finance, oil and gas and more.

“Since 2001, the SciPy Conference has been a highly anticipated annual event for the scientific and analytic computing community,” states Dr. Eric Jones, CEO at Enthought and SciPy Conference co-founder. “Over the last 16 years we’ve witnessed Python emerge as the de facto open source programming language for science, engineering and analytics with widespread adoption in research and industry. The powerful tools and libraries the SciPy community has developed are used by millions of people to advance scientific inquest and innovation every day.”

Special topical themes for this year’s conference are “Artificial Intelligence and Machine Learning Applications” and the “Scientific Python (SciPy) Tool Stack.” Keynote speakers include:

  • Kathryn Huff, Assistant Professor in the Department of Nuclear, Plasma, and Radiological Engineering at the University of Illinois at Urbana-Champaign  
  • Sean Gulick, Research Professor at the Institute for Geophysics at the University of Texas at Austin
  • Gaël Varoquaux, faculty researcher in the Neurospin brain research institute at INRIA (French Institute for Research in Computer Science and Automation)

In addition to the special conference themes, there will also be over 100 talk and poster paper speakers/presenters covering eight mini-symposia tracks including: Astronomy; Biology, Biophysics, and Biostatistics; Computational Science and Numerical Techniques; Data Science; Earth, Ocean, and Geo Sciences; Materials Science and Engineering; Neuroscience; and Open Data and Reproducibility.

Continue reading

Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way

The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN valuesThis is the second blog in a series. See the first blog here: Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way

No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.

But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.

Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.

If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.

The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.

Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:

  1. Identifying missing values in data
  2. Replacing missing values in data, and
  3. Removing missing values from data.

Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.

Continue reading

Traits and TraitsUI: Reactive User Interfaces for Rapid Application Development in Python

 The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.The Enthought Tool Suite team is pleased to announce the release of Traits 4.6. Together with the release of TraitsUI 5.1 last year, these core packages of Enthought’s open-source rapid application development tools are now compatible with Python 3 as well as Python 2.7.  Long-time fans of Enthought’s open-source offerings will be happy to hear about the recent updates and modernization we’ve been working on, including the recent release of Mayavi 4.5 with Python 3 support, while newcomers to Python will be pleased that there is an easy way to get started with GUI programming which grows to allow you to build applications with sophisticated, interactive 2D and 3D visualizations.

A Brief Introduction to Traits and TraitsUI

Traits is a mature reactive programming library for Python that allows application code to respond to changes on Python objects, greatly simplifying the logic of an application.  TraitsUI is a tool for building desktop applications on top of the Qt or WxWidgets cross-platform GUI toolkits. Traits, together with TraitsUI, provides a programming model for Python that is similar in concept to modern and popular Javascript frameworks like React, Vue and Angular but targeting desktop applications rather than the browser.

Traits is also the core of Enthought’s open source 2D and 3D visualization libraries Chaco and Mayavi, drives the internal application logic of Enthought products like Canopy, Canopy Geoscience and Virtual Core, and Enthought’s consultants appreciate its the way it facilitates the rapid development of desktop applications for our consulting clients. It is also used by several open-source scientific software projects such as the HyperSpy multidimensional data analysis library and the pi3Diamond application for controlling diamond nitrogen-vacancy quantum physics experiments, and in commercial projects such as the PyRX Virtual Screening software for computational drug discovery.

 The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

The open-source pi3Diamond application built with Traits, TraitsUI and Chaco by Swabian Instruments.

Traits is part of the Enthought Tool Suite of open source application development packages and is available to install through Enthought Canopy’s Package Manager (you can download Canopy here) or via Enthought’s new edm command line package and environment management tool. Running

edm install traits

at the command line will install Traits into your current environment.

Traits

The Traits library provides a new type of Python object which has an event stream associated with each attribute (or “trait”) of the object that tracks changes to the attribute.  This means that you can decouple your application model much more cleanly: rather than an object having to know all the work which might need to be done when it changes its state, instead other parts of the application register the pieces of work that each of them need when the state changes and Traits automatically takes care running that code.  This results in simpler, more modular and loosely-coupled code that is easier to develop and maintain.

Traits also provides optional data validation and initialization that dramatically reduces the amount of boilerplate code that you need to write to set up objects into a working state and ensure that the state remains valid.  This makes it more likely that your code is correct and does what you expect, resulting in fewer subtle bugs and more immediate and useful errors when things do go wrong.

When you consider all the things that Traits does, it would be reasonable to expect that it may have some impact on performance, but the heart of Traits is written in C and knows more about the structure of the data it is working with than general Python code. This means that it can make some optimizations that the Python interpreter can’t, the net result of which is that code written with Traits is often faster than equivalent pure Python code.

Continue reading

Webinar: An Exclusive Peek “Under the Hood” of Enthought Training and the Pandas Mastery Workshop

See the webinar

Enthought’s Pandas Mastery Workshop is designed to accelerate the development of skill and confidence with Python’s Pandas data analysis package — in just three days, you’ll look like an old pro! This course was created ground up by our training experts based on insights from the science of human learning, as well as what we’ve learned from over a decade of extensive practical experience of teaching thousands of scientists, engineers, and analysts to use Python effectively in their everyday work.

In this webinar, we’ll give you the key information and insight you need to evaluate whether the Pandas Mastery Workshop is the right solution to advance your data analysis skills in Python, including:

  • Who will benefit most from the course
  • A guided tour through the course topics
  • What skills you’ll take away from the course, how the instructional design supports that
  • What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
  • What previous workshop attendees say about the course

See the Webinar


michael_connell-enthought-vp-trainingPresenter: Dr. Michael Connell, VP, Enthought Training Solutions

Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT


Continue reading