Data Transformation¶

Ha Khanh Nguyen (hknguyen)

1. Removing Duplicates¶

• Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
• The DataFrame method duplicated() returns a Boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:
• Relatedly, drop_duplicates() returns a DataFrame where the duplicated array is False:
• Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates.
• Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:
• duplicated() and drop_duplicates() by default keep the first observed value combination. Passing keep='last' will return the last one:

2. Transforming Data Using a Function or Mapping¶

• For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.
• Consider the following hypothetical data collected about various kinds of meat:
• Suppose you wanted to add a column indicating the type of animal that each food came from.
• Let’s write down a mapping of each distinct meat type to the kind of animal:
• The map() method on a Series accepts a function or dict-like object containing a mapping.
• Notice the problem? That is because some of the food are capitalized, some are not!
• We need to convert each value to lowercase using the str.lower() method:
• Using map() is a convenient way to perform element-wise transformations and other data cleaning–related operations.

3. Replacing Values¶

• Filling in missing data with the fillna() method is a special case of more general value replacement.
• map() can be used to modify a subset of values in an object but replace() provides a simpler and more flexible way to do so.
• Let’s consider this Series:
• The -999 values might be sentinel values for missing data.
• To replace these with NA values that pandas understands, we can use replace(), producing a new Series (unless you pass inplace=True):
• If you want to replace multiple values at once, you instead pass a list and then the substitute value:
• To use a different replacement for each value, pass a list of substitutes:
• The argument passed can also be a dict:

4. Discretization and Binning¶

• Continuous data is often discretized or otherwise separated into “bins” for analysis.

4.1 Specified Bins¶

• Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:
• Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas:
• The object pandas returns is a special Categorical object.
• The output you see describes the bins computed by pandas.cut(). You can treat it like an array of strings indicating the bin name.
• Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive).
• You can change which side is closed by passing right=False or right=True (default):
• You can also pass your own bin names by passing a list or array to the labels option:

4.2 Number of bins¶

• If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.
• Consider the case of some uniformly distributed data chopped into fourths:
• The precision=2 option limits the decimal precision to two digits.

4.3 Cut using quantiles¶

• A closely related function, qcut(), bins the data based on sample quantiles.
• Depending on the distribution of the data, using cut() will not usually result in each bin having the same number of data points.
• Since qcut() uses sample quantiles instead, by definition you will obtain roughly equal-size bins:
• Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive):

5. Detecting and Filtering Outliers¶

• Filtering or transforming outliers is largely a matter of applying array operations.
• Consider a DataFrame with some normally distributed data:
• Suppose you wanted to find values in column 2 that exceed 3 in absolute value:
• To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame:
• Reminder: any(1) means any() applied on axis=1 which is across the columns (for each row).

This lecture notes reference materials from Chapter 7 of Wes McKinney's Python for Data Analysis 2nd Ed.