pandas: Summarizing and Computing Descriptive Statistics

Ha Khanh Nguyen (hknguyen)

1. Basic Statistics

• pandas objects are equipped with a set of common mathematical and statistical methods.
• Most of them fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.
• They are very similar to the NumPy methods we discussed previously, except that they have built-in handling for missing data.
• Calling a DataFrame's sum() method returns a Series containing the column sums.
• Passing axis='columns' or axis=1 sums across the columns instead, giving one sum per row.
• By default, NA values are skipped, so they effectively count as 0 in a sum. This can be disabled with the skipna option by passing skipna=False.
• With skipna=False, NA values behave as we expected: any sum that involves an NA value is itself NA.
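
The bullets above can be illustrated with a minimal sketch. The DataFrame df below is a made-up toy example (its values and labels are assumptions chosen for easy arithmetic, not the data used in the lecture).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values (NaN).
df = pd.DataFrame(
    {"one": [1.0, 7.0, np.nan, 0.5], "two": [np.nan, -4.0, np.nan, -1.5]},
    index=["a", "b", "c", "d"],
)

# Default: one sum per column, NaN values are skipped.
col_sums = df.sum()                      # one: 8.5, two: -5.5

# axis="columns" (or axis=1): one sum per row, NaN values still skipped.
row_sums = df.sum(axis="columns")        # a: 1.0, b: 3.0, c: 0.0, d: -1.0

# skipna=False: any row containing a NaN now sums to NaN.
row_sums_strict = df.sum(axis="columns", skipna=False)  # a: NaN, b: 3.0, c: NaN, d: -1.0

print(col_sums)
print(row_sums)
print(row_sums_strict)
```

Note that row c, which is entirely NaN, sums to 0.0 under the default settings and to NaN with skipna=False.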

2. Unique Values, Value Counts, and Membership

2.1 unique()

• We have seen the unique() function at work before.
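
As a quick refresher, here is a minimal sketch of unique() on a Series. The countries Series is a hypothetical stand-in for a column such as Country, not the lecture's actual dataset.

```python
import pandas as pd

# Hypothetical column of country names with repeats.
countries = pd.Series(["Japan", "US", "Japan", "Hong Kong", "US", "Japan"])

# unique() returns the distinct values in order of first appearance.
unique_countries = countries.unique()

print(list(unique_countries))  # ['Japan', 'US', 'Hong Kong']
```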

2.2 value_counts()

• With categorical data, these functions are invaluable for getting summary statistics.
• value_counts() returns the number of times each value appears in the column.
• The output is a Series sorted by count in descending order as a convenience.
• Counting over two columns at once, such as Brand and Country, finds all the unique pairs of Brand and Country, then computes the number of ramen produced by each brand in each country.
• For example, Nissin produces ramen in Japan, the US, and Hong Kong (and many more countries).
• In some cases, you may want to compute a "histogram" on multiple related columns in a DataFrame. Here is an example:
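
The ideas in this list can be sketched as follows. Both DataFrames are hypothetical: ramen is a tiny made-up stand-in with Brand and Country columns (not the real ramen data), and survey is an invented table of three related question columns. Using DataFrame.value_counts() for the pair counts is one possible approach and is not necessarily the statement used in the original notes.

```python
import pandas as pd

# Hypothetical miniature of a ramen table; the rows are invented.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Nissin", "Nissin", "Maruchan", "Maruchan"],
    "Country": ["Japan", "Japan", "US", "Hong Kong", "US", "US"],
})

# Count how often each brand appears, most frequent first.
brand_counts = ramen["Brand"].value_counts()   # Nissin: 4, Maruchan: 2

# Count each unique (Brand, Country) pair; the result has a MultiIndex.
pair_counts = ramen.value_counts(subset=["Brand", "Country"])

# "Histogram" over several related columns: apply value_counts() to each
# column, then fill values that never occur in a column with 0.
survey = pd.DataFrame({
    "Qu1": [1, 3, 4, 3, 4],
    "Qu2": [2, 3, 1, 2, 3],
    "Qu3": [1, 5, 2, 4, 4],
})
histogram = survey.apply(lambda col: col.value_counts()).fillna(0)

print(brand_counts)
print(pair_counts)
print(histogram)
```

In histogram, the row labels are the distinct values found across all three columns, and each cell holds how many times that value occurs in that column.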

2.3 isin()

• isin() performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame.
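
A minimal sketch of filtering with isin() follows. The ramen DataFrame is a hypothetical toy table, and the list of countries to keep is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical miniature of a ramen table; the rows are invented.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Nissin", "Nissin", "Maruchan", "Maruchan"],
    "Country": ["Japan", "Japan", "US", "Hong Kong", "US", "US"],
})

# Boolean mask: True where Country is one of the listed values.
mask = ramen["Country"].isin(["Japan", "Hong Kong"])

# Use the mask to keep only the matching rows.
subset = ramen[mask]

print(mask.tolist())  # [True, True, False, True, False, False]
print(subset)
```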

These lecture notes reference material from Chapter 5 of Wes McKinney's Python for Data Analysis, 2nd Ed.