Who Should Watch: individuals, team leaders, and learning & development coordinators who are looking to better understand the options to increase professional capabilities in Python for data science and machine learning applications
Enthought’s Python for Data Science training course is designed to accelerate the development of skill and confidence in using Python’s core data science tools — including the standard Python language, the fast array programming package NumPy, and the Pandas data analysis package, as well as tools for database access (DBAPI2, SQLAlchemy), machine learning (scikit-learn), and visual exploration (Matplotlib, Seaborn).
In this webinar, we give you the key information and insight you need to evaluate whether Enthought’s Python for Data Science course is the right solution to advance your professional data science skills in Python, including:
Who will benefit most from the course
A guided tour through the course topics
What skills you’ll take away from the course, how the instructional design supports that
What the experience is like, and why it is different from other training alternatives (with a sneak peek at actual course materials)
What previous course attendees say about the course
Presenter: Dr. Michael Connell, VP, Enthought Training Solutions
Ed.D, Education, Harvard University
M.S., Electrical Engineering and Computer Science, MIT
Considering Moving to Python for Data Science?
Then Enthought’s Python for Data Science training course is definitely for you! This class has been particularly appealing to people who have been using other tools like R or SAS (or even Excel) for their data science work and want to start applying their analytic skills using the Python toolset. And it’s no wonder — Python has been identified as the most popular coding language for five years in a row for good reason.
One reason for Python’s broad popularity across a range of disciplines is its efficiency and ease-of-use. Many people consider Python more fun to work in than other languages (and we agree!). Another reason for its popularity among data analysts and data scientists in particular is Python’s extensive (and growing) open source library of powerful tools for preparing, visualizing, analyzing, and modeling data.
Python is also an extraordinarily comprehensive toolset – it supports everything from interactive analysis to automation to software engineering to web app development within a single language and plays very well with other languages like C/C++ or FORTRAN so you can continue leveraging your existing code libraries written in those other languages.
Many organizations are moving to Python so they can consolidate all of their technical work streams under a single comprehensive toolset. In the first part of this class we’ll give you the fundamentals you need to switch from another language to Python and then we cover the core tools that will enable you to do in Python what you were doing with other tools, only faster!
Upcoming Open Python for Data Science Sessions:
Austin, TX, June 12-16, 2017
San Jose, CA, July 17-21, 2017Learn MoreHave a group interested in training? We specialize in group and corporate training. Contact us or call 512.536.1057.
No dataset is perfect and most datasets that we have to deal with on a day-to-day basis have values missing, often represented by “NA” or “NaN”. One of the reasons why the Pandas library is as popular as it is in the data science community is because of its capabilities in handling data that contains NaN values.
But spending time looking up the relevant Pandas commands might be cumbersome when you are exploring raw data or prototyping your data analysis pipeline. This is one of the places where the Canopy Data Import Tool helps make data munging faster and easier, by simplifying the task of identifying missing values in your raw data and removing/replacing them.
Why are missing values a problem you ask? We can answer that question in the context of machine learning. scikit-learn and TensorFlow are popular and widely used libraries for machine learning in Python. Both of them caution the user about missing values in their datasets. Various machine learning algorithms expect all the input values to be numerical and to hold meaning. Both of the libraries suggest removing rows and/or columns that contain missing values.
If removing the missing values is not an option, given the size of your dataset, then they suggest replacing the missing values. The scikit-learn library provides an Imputer class, which can be used to replace missing values. See the sci-kit learn documentation for an example of how the Imputer class is used. Similarly, the decode_csv function in the TensorFlow library can be passed a record_defaults argument, which will replace missing values in the dataset. See the TensorFlow documentation for specifics.
The Data Import Tool provides capabilities to handle missing values in your dataset because we strongly believe that discovering and handling missing values in your dataset is a part of the data import and cleaning phase and not the analysis phase of the data science process.
Digging into the specifics, here we’ll compare how you can go about handling missing values with three typical scenarios, first using the Pandas library, then contrasting with the Data Import Tool:
Identifying missing values in data
Replacing missing values in data, and
Removing missing values from data.
Note : Pandas’ internal representation of your data is called a DataFrame. A DataFrame is simply a tabular data structure, similar to a spreadsheet or a SQL table.
Identifying Missing Values – The Hard Way: Using Pandas
If you are interested in identifying missing values in a row/column of a DataFrame, you need to understand the isnull, any, all methods on a DataFrame.
Taking a detour, we have so far described missing values as being represented by NA or NaN. Instead, what if missing values in a column are values that aren’t of the same type as the rest of the cells in the column, say for example a string in a column containing integers? Doing so in Pandas is not trivial.
Identifying Missing Values – The Easy Way: Using the Data Import Tool
Highlighting null values using the Data Import Tool
Instead of giving you the column names and index values of the cells containing missing values, the Data Import Tool shows them to you. Simply checking the `Highlight Missing Values` checkbox in the bottom-left corner of the Data Import Tool will paint the DataFrame to show you the cells that contain missing values. Further, the Data Import Tool understands that your data file might have errors, like having a string value in a column otherwise containing integers. The Data Import Tool highlights the cell and displays the underlying content too.
The Data Import Tool can highlight missing value cells, helping you easily identify columns or rows containing NaN values
Replacing Missing Values – The Hard Way: Using Pandas
While Pandas does a great job at handling column operations even if the columns contain NaN values, our data analysis workflow might need us to replace the missing values in our data.
After spending a little time browsing through the Pandas documentation, you will come across the `fillna` method on a DataFrame, which can be used to replace a missing values. The arguments you pass to the fillna method will determine what value the missing values in your DataFrame are replaced with and how the underlying column dtypes change after replacing the missing values.
Replacing Missing Values – The Easy Way: Using the Data Import Tool
With the Data Import Tool, you can replace missing values by right-clicking on the column containing missing values selecting the appropriate Fill Missing Values item. Opting to replace missing values in the column with a specific column will open an additional dialog, prompting you to enter the value.
Replace missing values in your DataFrame using the Canopy Data Import Tool
Removing Missing Values – The Hard Way: Using Pandas
While removing columns or rows containing missing values might be a little extreme, it might be necessary. Pandas suggests that you use the dropna method on the DataFrame to drop columns or rows that contain missing values. The arguments you pass to the dropna method will determine what rows/columns are removed from the DataFrame.
Removing Missing Values – The Easy Way: Using the Data Import Tool
With the Data Import Tool on the other hand, you can remove rows/columns containing missing values by selecting the “Delete Empty Columns” or “Delete Empty Rows” item from the “Transform” menu. An additional dialog will pop up asking you how lenient you want to be in removing rows/columns containing missing values – if you choose ‘any’, the Data Import Tool will remove rows/columns that contain any missing values; if you choose ‘all’, the Data Import Tool will only remove those rows/columns which contain only missing values.
Delete empty cells in rows/columns using the Canopy Data Import Tool
Choose to delete columns containing any null value or columns full of null values using the Canopy Data Import Tool
Finally, we have data that contains no missing values. So far, we’ve used the DIT to easily discover the missing values in our dataset and to remove/replace the missing values. Finally, by clicking on ‘Use DataFrame’, you can import the dataset as a pandas DataFrame into the IPython workspace of the Canopy Editor. If you’re a data scientist, your data is now void of missing values and can be converted to arrays or variables and passed on to scikit-learn, TensorFlow or any other Machine Learning library of your choice.