# Data Cleaning

Ha Khanh Nguyen (hknguyen)

## 1. Handling Missing Data

• One of the goals of pandas is to make working with missing data as painless as possible.
• For example, all of the descriptive statistics on pandas objects exclude missing data by default.
• For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
• The isnull() method checks every value of a Series or DataFrame and returns an object of the same shape, holding True where a value is missing and False everywhere else.
• The built-in Python None value is also treated as NA in object arrays:
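
The points above can be sketched with a couple of small Series. The values and variable names here are made up for illustration and are not taken from the lecture's own examples; `dtype=object` is passed explicitly so that the second Series really is an object array.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value (NaN).
numbers = pd.Series([1.0, np.nan, 3.5, 7.0])

# Descriptive statistics skip NaN by default,
# so this is the mean of 1.0, 3.5 and 7.0 only.
mean_without_nan = numbers.mean()

# isnull() returns a Boolean Series of the same shape.
numeric_mask = numbers.isnull()

# In an object array, the built-in None is treated as NA as well.
words = pd.Series(["aardvark", "artichoke", None, "avocado"], dtype=object)
object_mask = words.isnull()

print(mean_without_nan)
print(numeric_mask.tolist())  # [False, True, False, False]
print(object_mask.tolist())   # [False, False, True, False]
```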

## 2. Filtering Out Missing Data

• There are a few ways to filter out missing data. You always have the option to do it by hand using pandas.isnull() and boolean indexing, but the dropna() method is usually more convenient.
• On a Series, it returns the Series with only the non-null data and index values:
• The above is equivalent to selecting with the Boolean mask returned by notnull():
• With DataFrame objects, things are a bit more complex.
• You may want to drop rows or columns that are all NA or only those containing any NAs.
• dropna() by default drops any row containing a missing value:
• Passing how='all' will only drop rows that are all NA:
• To drop columns in the same way, pass axis=1:
• Suppose you want to keep only rows containing at least a certain number of non-missing observations. You can indicate this with the thresh argument:
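
A minimal sketch of the dropna() variants listed above, using a small hypothetical Series and DataFrame (the data and variable names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Series: dropna() keeps only the non-null values and their index labels.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
dropped = s.dropna()
by_hand = s[s.notnull()]  # the same result via boolean indexing

# DataFrame with one complete row, two partial rows and one all-NA row.
df = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]],
    columns=["a", "b", "c"],
)

# Default: drop every row that contains at least one missing value.
any_na_rows = df.dropna()

# how="all": drop only the rows in which every value is missing.
all_na_rows = df.dropna(how="all")

# axis=1 applies the same logic to columns. An all-NA column "d" is
# added to a copy so that there is something to drop.
with_empty_col = df.assign(d=np.nan)
all_na_cols = with_empty_col.dropna(axis=1, how="all")

# thresh=2: keep rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)

print(dropped.index.tolist())       # [0, 2, 4]
print(any_na_rows.index.tolist())   # [0]
print(all_na_rows.index.tolist())   # [0, 1, 3]
print(all_na_cols.columns.tolist()) # ['a', 'b', 'c']
print(at_least_two.index.tolist())  # [0, 3]
```

Note that thresh counts non-missing values, so `thresh=2` keeps row 3 (two observations) but drops row 1 (one observation).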

## 3. Filling In Missing Data

• Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways.
• For most purposes, the fillna() method is the workhorse function to use.
• Calling fillna() with a constant replaces missing values with that value:
• Calling fillna() with a dict lets you use a different fill value for each column:
• fillna() returns a new object, but you can modify the existing object with inplace=True.
• We can also fill in missing values using interpolation methods such as forward fill:
• With fillna() you can do lots of other things with a little creativity.
• For example, you might pass the mean or median value of a Series:
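
The fillna() options above might look like the following on a small made-up DataFrame. One assumption to flag: forward filling is written here with the ffill() method rather than fillna(method='ffill'), because recent pandas releases deprecate the method argument of fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical data with holes in both columns.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan],
     "b": [np.nan, 2.0, np.nan, 8.0]}
)

# A constant replaces every missing value.
filled_const = df.fillna(0)

# A dict gives each column its own fill value.
filled_by_col = df.fillna({"a": 0.5, "b": -1.0})

# fillna() returns a new object; inplace=True modifies the existing one.
# A copy is used so that df itself stays untouched for the next steps.
in_place = df.copy()
in_place.fillna(0, inplace=True)

# Forward fill: propagate the last valid observation downwards.
# The first value of "b" stays NaN because nothing precedes it.
forward = df.ffill()

# Fill with a summary statistic, here the mean of the non-missing values.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
filled_mean = s.fillna(s.mean())

print(filled_by_col["a"].tolist())  # [1.0, 0.5, 3.0, 0.5]
print(forward["a"].tolist())        # [1.0, 1.0, 3.0, 3.0]
print(filled_mean.tolist())
```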

## 4. Exercise

Let's clean the Ramen Ratings dataset using the methods we just discussed! This time, we will be working with the "raw" version instead of the clean version that we have used previously.

These lecture notes reference material from Chapter 7 of Wes McKinney's Python for Data Analysis, 2nd Ed.