# Data Aggregation

Ha Khanh Nguyen (hknguyen)

## 1. Data Aggregation

• Aggregations refer to any data transformation that produces scalar values from arrays.
• Here is an example of data aggregation:
• Many common aggregations, such as those found in Table 10-1, have optimized implementations.
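
As a minimal sketch of such an aggregation, the code below groups a small made-up DataFrame by a key column and reduces each group to a single scalar with a few of the optimized GroupBy methods (mean, sum, count). The column names key1, data1 and data2 and all values are assumptions for illustration, not data from the lecture.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

grouped = df.groupby("key1")

# Each call reduces every group's array of values to one scalar
group_means = grouped["data1"].mean()
group_sums = grouped["data1"].sum()
group_counts = grouped["data1"].count()

print(group_means)
```

Each result is a Series indexed by the unique values of key1, with one scalar per group.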

• But we're not limited to these methods.
• You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.
• For example, quantile() is not explicitly implemented for GroupBy; it is a Series method and is thus available whenever each group is a Series.
• Internally, GroupBy splits the original Series into smaller Series, applies quantile() to each piece, and then assembles those results into the output object.
• To use your own aggregation functions, pass any function that aggregates an array to the aggregate() or agg() method:
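
The two ideas above can be sketched as follows: calling quantile() on a grouped Series, and passing a function of your own to agg(). The helper peak_to_peak and the made-up data are hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df.groupby("key1")

# quantile() is evaluated once per group; 0.5 gives the group median
medians = grouped["data1"].quantile(0.5)

def peak_to_peak(arr):
    """Hypothetical custom aggregation: largest value minus smallest."""
    return arr.max() - arr.min()

# agg() accepts any function that reduces an array to a scalar
ranges = grouped["data1"].agg(peak_to_peak)

print(medians)
print(ranges)
```

Custom functions passed to agg() are generally slower than the optimized built-in methods, since pandas has to call them once per group.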

## 2. Column-wise and Multiple Function Application

• As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate() with the desired function or calling a method like mean or std.
• But what if we want to aggregate using a different function depending on the column, or multiple functions at once?
• In this example, we will look at the tips.csv dataset provided by the textbook author.
• As the name suggests, this data records the tip amount along with attributes such as whether the guest was a smoker, the day of the week, etc.
• Let's say we care about the tip percentage (not the tip amount)!
• We need to add a column called tip_pct, which is the tip as a fraction of the total bill (tip / total_bill).
• We're interested in the tip percentage for each day of the week and whether the guest is a smoker.
• Note that for descriptive statistics like those in Table 10-1 (see above), you can pass the name of the function as a string:
• If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
• Here we passed a list of aggregation functions to agg() to evaluate independently on the data groups.
• If you pass a list of (name, function) tuples to agg(), the first element of each tuple is used as the DataFrame column name (you can think of a list of 2-tuples as an ordered mapping):
• With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns OR different functions per column.
• To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
• Now, suppose you wanted to apply potentially different functions to one or more of the columns.
• To do this, pass a dict to agg() that contains a mapping of column names to any of the function specifications listed so far:
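
The function specifications described in this section can be sketched in one place as below. Since tips.csv is not reproduced here, the code uses a handful of made-up rows with the assumed column names total_bill, tip, smoker and day; the helper peak_to_peak and the labels "average" and "spread" are likewise hypothetical.

```python
import pandas as pd

# Made-up rows standing in for tips.csv (column names are assumptions)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0, 25.0, 20.0],
    "tip": [1.0, 4.0, 4.0, 10.0, 5.0, 2.0],
    "smoker": ["No", "No", "Yes", "Yes", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sat", "Sat"],
})

# Tip as a fraction of the total bill
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

grouped = tips.groupby(["day", "smoker"])
grouped_pct = grouped["tip_pct"]

# 1. A function name passed as a string
mean_pct = grouped_pct.agg("mean")

# 2. A list of functions or names: one output column per function
def peak_to_peak(arr):
    """Hypothetical custom aggregation: largest value minus smallest."""
    return arr.max() - arr.min()

stats = grouped_pct.agg(["mean", "std", peak_to_peak])

# 3. (name, function) tuples: the first element becomes the column name
named = grouped_pct.agg([("average", "mean"), ("spread", "std")])

# 4. The same list of functions applied to several columns;
#    the result has hierarchical columns of the form (column, function)
multi = grouped[["tip_pct", "total_bill"]].agg(["count", "mean", "max"])

# 5. A dict mapping column names to function specifications
per_column = grouped.agg({"tip": "max", "total_bill": "sum"})
mixed = grouped.agg({"tip_pct": ["min", "max", "mean"], "total_bill": "sum"})

print(stats)
print(mixed)
```

Note that the result gets hierarchical columns only when at least one column is given multiple functions, as in mixed; per_column keeps a flat column index.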

## 3. Returning Aggregated Data Without Row Indexes

• By default, the aggregated data comes back with an index, potentially hierarchical, composed of the unique group key combinations.
• Since this isn’t always desirable, you can disable this behavior in most cases by passing as_index=False to groupby():
• Another way to get this result is to call the reset_index() method on the resulting DataFrame/Series.
• But passing as_index=False avoids some unnecessary computation, so it is the more efficient option.
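
A minimal sketch comparing the two approaches, again on made-up rows with assumed column names rather than the real tips.csv:

```python
import pandas as pd

# Made-up rows for illustration only (column names are assumptions)
tips = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sun", "Sat", "Sat"],
    "smoker": ["No", "No", "Yes", "Yes", "No", "No"],
    "total_bill": [10.0, 20.0, 40.0, 50.0, 25.0, 20.0],
    "tip_pct": [0.1, 0.2, 0.1, 0.2, 0.2, 0.1],
})

# Default: the group keys become a hierarchical row index
indexed = tips.groupby(["day", "smoker"]).mean()

# as_index=False: the group keys stay as ordinary columns
flat = tips.groupby(["day", "smoker"], as_index=False).mean()

# reset_index() turns the index of the default result back into columns
via_reset = indexed.reset_index()

print(flat)
```

Here flat and via_reset hold the same columns and values; they differ only in how they were produced.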

## 4. Example: Filling Missing Values with Group-specific Values

• We have "manually" done this before.
• But now, we will use groupby() to help us simplify as well as generalize this process.
• Here, we will read in the raw Ramen Ratings dataset.
• Just like in HW 7, we will fill the NA values in the Stars column with the average rating of observations that share the same Brand and Country.
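
One possible way to do this is sketched below, using transform("mean") to broadcast each group's average back to its rows and fillna() to fill only the missing entries. This is an illustration under stated assumptions, not necessarily the approach used in lecture: the rows are made up, and only the column names Brand, Country and Stars come from the description above.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Ramen Ratings data
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Nissin", "Maruchan", "Maruchan", "Maruchan"],
    "Country": ["Japan", "Japan", "Japan", "USA", "USA", "USA"],
    "Stars": [4.0, np.nan, 5.0, 3.0, 3.5, np.nan],
})

# Mean rating of each (Brand, Country) group, aligned to the original rows;
# missing values are skipped when the mean is computed
group_mean = ramen.groupby(["Brand", "Country"])["Stars"].transform("mean")

# Replace only the missing ratings with their group's mean
ramen["Stars"] = ramen["Stars"].fillna(group_mean)

print(ramen)
```

A group whose ratings are all missing has no mean to fall back on, so its rows would remain NA after this step.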

These lecture notes reference material from Chapter 10 of Wes McKinney's Python for Data Analysis, 2nd Ed.