GroupBy Mechanics

Ha Khanh Nguyen (hknguyen)

1. GroupBy Mechanics

• Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations.
• Split:
• In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide.
• The splitting is performed on a particular axis of an object.
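• For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Note that recent pandas releases deprecate axis=1 in groupby(), so column-wise grouping is usually done on the transposed frame instead.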
• Apply:
• Once this is done, a function is applied to each group, producing a new value.
• Combine:
• Finally, the results of all those function applications are combined into a result object.
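The three stages above can be sketched by hand before handing the work to groupby(). This is only an illustration: the frame and its column names (key, value) are made up for the example and are not from the original notes.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Split: one sub-Series of "value" per distinct label in "key"
pieces = {k: df.loc[df["key"] == k, "value"] for k in df["key"].unique()}

# Apply: reduce each piece to a single number
means = {k: piece.mean() for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means).sort_index()
print(manual)

# groupby() performs the same three stages in one expression
shortcut = df.groupby("key")["value"].mean()
print(shortcut)
```

Both Series hold the same per-group means; the groupby() version simply hides the bookkeeping.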

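• Suppose you have a DataFrame df with key columns key1 and key2 and data columns data1 and data2, and you want to compute the mean of the data1 column using the labels from key1.
• One way to do this is to select data1 and call groupby() on it, passing the key1 column (a Series) as the key: grouped = df['data1'].groupby(df['key1']).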
• This grouped variable is now a GroupBy object.
• It has not actually computed anything yet except for some intermediate data about the group key df['key1'].
• The idea is that this object has all of the information needed to then apply some operation to each of the groups.
• For example, to compute group means we can call the GroupBy’s mean() method, grouped.mean().
• If instead we pass multiple arrays as a list, such as [df['key1'], df['key2']], the result is indexed differently.
• Because we grouped the data using two keys, the resulting Series has a hierarchical index consisting of the unique pairs of keys observed.
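The steps described in this section can be sketched as follows. The original notes' data is not reproduced here, so the values in df below are invented for illustration; only the column names key1, key2, data1, and data2 come from the text.

```python
import pandas as pd

# Illustrative values (assumed); column names follow the text
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Group the data1 Series by the labels in key1; nothing is aggregated yet
grouped = df["data1"].groupby(df["key1"])
print(type(grouped).__name__)

# Apply + combine: one mean per distinct key1 label
means = grouped.mean()
print(means)

# Two keys passed as a list give a Series with a two-level index
means2 = df["data1"].groupby([df["key1"], df["key2"]]).mean()
print(means2)

# unstack() pivots the inner level into columns for easier reading
print(means2.unstack())
```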

2. Iterating Over Groups

• The GroupBy object supports iteration, generating a sequence of 2-tuples, each containing the group name along with the corresponding chunk of data.
• In the case of multiple keys, the first element in each tuple is itself a tuple of key values.
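A minimal sketch of iterating over groups, using an invented df with the column names from the earlier section (the values are assumptions, not the original data):

```python
import pandas as pd

# Illustrative values (assumed); column names follow the text
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Single key: each item is (group name, sub-DataFrame)
for name, group in df.groupby("key1"):
    print(name)
    print(group)

# Multiple keys: the group name is a tuple of key values
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print((k1, k2), len(group))

# A common recipe: materialize the groups as a dict of DataFrames
pieces = {name: group for name, group in df.groupby("key1")}
pairs = [name for name, _ in df.groupby(["key1", "key2"])]
```

Building a dict like pieces is handy when the individual chunks need to be inspected or processed one at a time.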

3. Selecting a Column or Subset of Columns

• Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.
• Especially for large datasets, it may be desirable to aggregate only a few columns.
• For example, to compute means for just the data2 column and get the result as a DataFrame, we could write df.groupby('key1')[['data2']].mean().
• The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar.
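The difference between list and scalar indexing can be sketched as below. The values in df are invented for illustration; only the column names come from the text.

```python
import pandas as pd

# Illustrative values (assumed); column names follow the text
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A list of column names yields a grouped DataFrame
frame_groups = df.groupby("key1")[["data2"]]
as_frame = frame_groups.mean()
print(type(frame_groups).__name__, type(as_frame).__name__)

# A single scalar column name yields a grouped Series
series_groups = df.groupby("key1")["data2"]
as_series = series_groups.mean()
print(type(series_groups).__name__, type(as_series).__name__)

# The scalar form is shorthand for grouping the column directly
same = df["data2"].groupby(df["key1"]).mean()
```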

4. Grouping with Dicts and Series

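• Grouping information may exist in a form other than an array. Consider another example DataFrame, people, whose columns are labeled a through e and whose index holds first names.
• Now, suppose we have a group correspondence for the columns, stored as a dict that maps each column label to a group name, and want to sum the columns by group. Passing the dict to groupby() does this.
• The same functionality holds for Series, which can be viewed as a fixed-size mapping.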
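A sketch of grouping columns with a dict and with a Series. The values in people and the group names in mapping are invented for illustration. Because recent pandas releases deprecate axis=1 in groupby(), this sketch transposes the frame, groups its rows, and transposes back; that is a substitute chosen for this example rather than the form used in the referenced book.

```python
import pandas as pd

# Illustrative data (assumed values)
people = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0, 9.0, 10.0],
     [11.0, 12.0, 13.0, 14.0, 15.0]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes"],
)

# Column label -> group name; the unused key "f" is simply ignored
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Transpose so column labels become the row index, group, then transpose back
by_dict = people.T.groupby(mapping).sum().T
print(by_dict)

# A Series works the same way: its index is matched against the labels
map_series = pd.Series(mapping)
by_series = people.T.groupby(map_series).sum().T
print(by_series)
```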

5. Grouping with Functions

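• Any function passed as a group key is called once per index value, and its return values are used as the group names.
• Suppose you wanted to group by the length of the names in the index; while you could compute an array of string lengths, it’s simpler to just pass the len function to groupby().
• Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally.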
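A sketch of grouping by a function, alone and mixed with an array. The people frame and the key_array labels below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed values); the index holds the names
people = pd.DataFrame(
    {"a": [1.0, 6.0, 11.0, 16.0], "b": [2.0, 7.0, 12.0, 17.0]},
    index=["Joe", "Steve", "Wes", "Jim"],
)

# len is called on each index value; names of equal length share a group
by_len = people.groupby(len).sum()
print(by_len)

# Mix the function with an array of labels, one label per row
key_array = np.array(["one", "one", "one", "two"])
mixed = people.groupby([len, key_array]).min()
print(mixed)
```

The mixed result carries a two-level index: name length first, then the label from key_array.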

6. Grouping by Index Levels

• A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index.
• To group by level, pass the level number or name using the level keyword.
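A sketch of grouping by an index level. The level names cty and tenor and all values are assumptions made for this example; the hierarchical index is placed on the rows so the sketch does not rely on axis=1, which recent pandas releases deprecate in groupby().

```python
import pandas as pd

# Hypothetical two-level row index (names and values are assumed)
index = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# Group by level name
by_name = hier.groupby(level="cty").sum()
print(by_name)

# Group by level number (0 is the outermost level)
by_number = hier.groupby(level=1).mean()
print(by_number)
```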

These lecture notes reference material from Chapter 10 of Wes McKinney's Python for Data Analysis, 2nd Ed.