# Data Aggregation

Ha Khanh Nguyen (hknguyen)

## 1. Data Aggregation

• Aggregations refer to any data transformation that produces scalar values from arrays.
• Here is an example of data aggregation:
• Many common aggregations, such as those found in Table 10-1, have optimized implementations.
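A minimal sketch of such an aggregation follows. The toy DataFrame and its column names (`key`, `value`) are assumptions made up for illustration, not data from the lecture.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby("key")["value"]

# Each call reduces every group's array of values to one scalar per group
group_means = grouped.mean()    # a -> 3.0, b -> 3.0
group_counts = grouped.count()  # a -> 3,   b -> 2
group_max = grouped.max()       # a -> 5.0, b -> 4.0

print(group_means)
```

Each result is a Series indexed by the unique values of `key`, with one scalar per group.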

• But we're not limited to these methods.
• You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.
• For example, the quantile() function is not explicitly implemented for GroupBy, but it is a Series method and is therefore available when each group is a Series.
• Internally, GroupBy splits the original Series into smaller Series, applies the quantile() function to each piece (smaller Series), and then assembles those results into the output object.
• To use your own aggregation functions, pass any function that aggregates an array to the aggregate() or agg() method:
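The two ideas above can be sketched as follows. The toy data and the helper name `peak_to_peak` are assumptions introduced for illustration; any function that reduces an array to a scalar could be used in its place.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("key")["value"]

# quantile() is evaluated once per group
medians = grouped.quantile(0.5)


def peak_to_peak(arr):
    """Illustrative custom aggregation: range (max - min) of a group."""
    return arr.max() - arr.min()


# agg() calls the function on each group and collects the scalars
ranges = grouped.agg(peak_to_peak)  # a -> 4.0, b -> 2.0

print(medians)
print(ranges)
```

Custom Python functions passed to agg() are typically slower than the built-in optimized aggregations, so prefer the built-in ones when one fits.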

## 2. Column-wise and Multiple Function Application

• As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate() with the desired function or calling a method like mean or std.
• But what if we want to aggregate using a different function depending on the column, or multiple functions at once?
• In this example, we will look at the tips.csv dataset provided by the textbook author.
• As the name suggests, this data focuses on the amount of tip received based on different attributes such as whether the guest was a smoker, which day of the week, etc.
• Let's say we care about the tip percentage (not the tip amount)!
• We need to add a column called tip_pct which is the tip percentage (tip/total_bill).
• We're interested in the tip percentage for each day of the week and whether the guest is a smoker.
• Note that for descriptive statistics like those in Table 10-1 (see above), you can pass the name of the function as a string:
• If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
• Here we passed a list of aggregation functions to agg() to evaluate independently on the data groups.
• If you pass a list of (name, function) tuples to agg(), the first element of each tuple will be used as the DataFrame column names (you can think of a list of 2-tuples as an ordered mapping):
• With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns OR different functions per column.
• To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
• Now, suppose you wanted to apply potentially different functions to one or more of the columns.
• To do this, pass a dict to agg() that contains a mapping of column names to any of the function specifications listed so far:
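The steps in this section can be sketched end to end as below. Because tips.csv is not reproduced here, the sketch builds a small stand-in DataFrame. Its values are invented, and the column names (`total_bill`, `tip`, `smoker`, `day`) are assumed to match the real file. The helper `peak_to_peak` and the labels `average` and `spread` are likewise illustrative.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 25.0, 50.0, 16.0],
    "tip":        [1.0, 4.0, 4.0, 5.0, 5.0, 4.0],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "No"],
    "day":        ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat"],
})

# Tip percentage as defined in the text: tip / total_bill
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

grouped = tips.groupby(["day", "smoker"])
grouped_pct = grouped["tip_pct"]


def peak_to_peak(arr):
    """Illustrative custom aggregation: range (max - min) of a group."""
    return arr.max() - arr.min()


# 1. Function name as a string
mean_pct = grouped_pct.agg("mean")

# 2. List of functions or names: one output column per function
multi = grouped_pct.agg(["mean", "std", peak_to_peak])

# 3. List of (name, function) tuples: the names become the column labels
named = grouped_pct.agg([("average", "mean"), ("spread", "std")])

# 4. The same three statistics for two columns: hierarchical columns
both = grouped[["tip_pct", "total_bill"]].agg(["count", "mean", "max"])

# 5. Dict mapping column names to function specifications
per_col = grouped.agg({"tip": "max", "total_bill": "sum"})
per_col_multi = grouped.agg({"tip_pct": ["min", "max"], "total_bill": "sum"})

print(multi)
print(both)
print(per_col_multi)
```

Groups with a single row report NaN for std, since the sample standard deviation is undefined for one observation.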

## 3. Returning Aggregated Data Without Row Indexes

• By default, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations.
• Since this isn’t always desirable, you can disable this behavior in most cases by passing as_index=False to groupby():
• Another way to get this result is to call the reset_index() method on the resulting DataFrame/Series.
• But setting as_index=False is a more efficient method.
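Both routes can be compared in a short sketch. The stand-in DataFrame is invented for illustration and only assumes the column names `total_bill`, `tip`, `smoker`, and `day`.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 25.0, 50.0, 16.0],
    "tip":        [1.0, 4.0, 4.0, 5.0, 5.0, 4.0],
    "smoker":     ["No", "No", "Yes", "Yes", "No", "No"],
    "day":        ["Sun", "Sun", "Sun", "Sat", "Sat", "Sat"],
})

# Group keys are returned as ordinary columns instead of an index
flat = tips.groupby(["day", "smoker"], as_index=False).mean()

# Same layout, obtained by flattening the hierarchical index afterwards
via_reset = tips.groupby(["day", "smoker"]).mean().reset_index()

print(flat)
```

In both results `day` and `smoker` appear as regular columns, and the rows carry a default integer index.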

## 4. Example: Filling Missing Values with Group-specific Values

• We have "manually" done this before.
• But now, we will use groupby() to help us simplify as well as generalize this process.
• Here, we will read in the raw Ramen Ratings dataset.
• Just like in HW 7, we will fill the NA values in the Stars column with the average rating of observations that share the same Brand and Country values.
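One possible way to do this is sketched below, using transform("mean") followed by fillna(). This is not necessarily the approach used in the lecture. The small DataFrame is an invented stand-in for the Ramen Ratings data, and the sketch assumes `Stars` has already been converted to a numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Ramen Ratings data (values invented)
ramen = pd.DataFrame({
    "Brand":   ["A", "A", "A", "B", "B", "B"],
    "Country": ["Japan", "Japan", "Japan", "USA", "USA", "USA"],
    "Stars":   [4.0, np.nan, 5.0, 2.0, 3.0, np.nan],
})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's (Brand, Country) group mean; NaN values are skipped
group_mean = ramen.groupby(["Brand", "Country"])["Stars"].transform("mean")

# Replace only the missing ratings with their group's mean
ramen["Stars"] = ramen["Stars"].fillna(group_mean)

print(ramen)
```

A group whose ratings are all missing has no mean to borrow, so its rows would stay NaN and need a separate fallback.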

