Ha Khanh Nguyen (hknguyen)
import pandas as pd
ramen = pd.read_csv('https://stat430.hknguyen.org/files/datasets/clean-ramen.csv')
ramen
| | Brand | Variety | Style | Country | Stars |
|---|---|---|---|---|---|
0 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 |
1 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.00 |
2 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 |
3 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 |
4 | Ching's Secret | Singapore Curry | Pack | India | 3.75 |
... | ... | ... | ... | ... | ... |
2570 | Vifon | Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty... | Bowl | Vietnam | 3.50 |
2571 | Wai Wai | Oriental Style Instant Noodles | Pack | Thailand | 1.00 |
2572 | Wai Wai | Tom Yum Shrimp | Pack | Thailand | 2.00 |
2573 | Wai Wai | Tom Yum Chili Flavor | Pack | Thailand | 2.00 |
2574 | Westbrae | Miso Ramen | Pack | USA | 0.50 |
2575 rows × 5 columns
# compute the average rating for each brand
ramen['Stars'].groupby(ramen['Brand']).mean()
Brand
1 To 3 Noodles      4.000000
7 Select            3.750000
7 Select/Nissin     3.500000
A-One               2.750000
A-Sha Dry Noodle    4.067308
                      ...
Yum Yum             3.750000
Yum-Mie             3.500000
Zow Zow             3.750000
iMee                3.500000
iNoodle             3.000000
Name: Stars, Length: 355, dtype: float64
ramen.groupby('Brand')['Stars'].mean()
Brand
1 To 3 Noodles      4.000000
7 Select            3.750000
7 Select/Nissin     3.500000
A-One               2.750000
A-Sha Dry Noodle    4.067308
                      ...
Yum Yum             3.750000
Yum-Mie             3.500000
Zow Zow             3.750000
iMee                3.500000
iNoodle             3.000000
Name: Stars, Length: 355, dtype: float64
Even though the quantile() function is not explicitly implemented for GroupBy, it is a Series method and thus available to use if each group is a Series. Internally, GroupBy slices up the Series, applies the quantile() function to each piece (smaller Series), then assembles those results together into the output object.

groups = ramen.groupby('Brand')
groups['Stars'].quantile(0.9)
Brand
1 To 3 Noodles      4.000
7 Select            3.950
7 Select/Nissin     3.500
A-One               3.425
A-Sha Dry Noodle    5.000
                     ...
Yum Yum             4.900
Yum-Mie             3.500
Zow Zow             3.750
iMee                4.200
iNoodle             3.600
Name: Stars, Length: 355, dtype: float64
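The split-apply-combine steps can be made explicit with a plain Python loop. The sketch below is purely illustrative (pandas does this much more efficiently internally), but it mirrors the description above:

# rough illustration of what GroupBy does for quantile():
# split the Stars column by Brand, apply quantile() to each piece,
# then combine the per-group results into a single Series
pieces = {}
for brand, piece in ramen.groupby('Brand')['Stars']:
    pieces[brand] = piece.quantile(0.9)
pd.Series(pieces)   # same values as groups['Stars'].quantile(0.9)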
To use your own aggregation functions, pass any function that aggregates an array to the aggregate() or agg() method:

def peak_to_peak(array):
    return array.max() - array.min()
groups.agg(peak_to_peak)
| Brand | Stars |
|---|---|
1 To 3 Noodles | 0.00 |
7 Select | 0.50 |
7 Select/Nissin | 0.00 |
A-One | 2.25 |
A-Sha Dry Noodle | 3.00 |
... | ... |
Yum Yum | 2.00 |
Yum-Mie | 0.00 |
Zow Zow | 0.00 |
iMee | 1.75 |
iNoodle | 1.50 |
355 rows × 1 columns
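Any callable that reduces an array to a single value can be used this way; for example, the same peak-to-peak aggregation can be written inline with a lambda (a minimal sketch, equivalent to the call above):

# same aggregation as peak_to_peak, written as an anonymous function
groups['Stars'].agg(lambda s: s.max() - s.min())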
Aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate() with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once. To illustrate this, we will use the tips.csv dataset provided by the textbook author.

tips = pd.read_csv('https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/tips.csv')
tips
| | total_bill | tip | smoker | day | time | size |
|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | No | Thur | Dinner | 2 |
244 rows × 6 columns
Let's add a column tip_pct, which is the tip percentage (tip/total_bill).

tips['tip_pct'] = tips['tip']/tips['total_bill']
tips
| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | No | Sat | Dinner | 3 | 0.203927 |
240 | 27.18 | 2.00 | Yes | Sat | Dinner | 2 | 0.073584 |
241 | 22.67 | 2.00 | Yes | Sat | Dinner | 2 | 0.088222 |
242 | 17.82 | 1.75 | No | Sat | Dinner | 2 | 0.098204 |
243 | 18.78 | 3.00 | No | Thur | Dinner | 2 | 0.159744 |
244 rows × 7 columns
groups = tips.groupby(['day', 'smoker'])
groups['tip_pct'].mean()
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
groups['tip_pct'].agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
groups['tip_pct'].agg(['mean', 'std', peak_to_peak])
| day | smoker | mean | std | peak_to_peak |
|---|---|---|---|---|
| Fri | No | 0.151650 | 0.028123 | 0.067349 |
| | Yes | 0.174783 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.039767 | 0.235193 |
| | Yes | 0.147906 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.042347 | 0.193226 |
| | Yes | 0.187250 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.038774 | 0.193350 |
| | Yes | 0.163863 | 0.039389 | 0.151240 |
As shown above, you can pass a list of functions or function names to agg() to evaluate independently on the data groups. If you pass a list of (name, function) tuples to agg(), the first element of each tuple will be used as the DataFrame column names (you can think of a list of 2-tuples as an ordered mapping):

import numpy as np
groups['tip_pct'].agg([('foo', 'mean'), ('bar', np.std)])
| day | smoker | foo | bar |
|---|---|---|---|
| Fri | No | 0.151650 | 0.028123 |
| | Yes | 0.174783 | 0.051293 |
| Sat | No | 0.158048 | 0.039767 |
| | Yes | 0.147906 | 0.061375 |
| Sun | No | 0.160113 | 0.042347 |
| | Yes | 0.187250 | 0.154134 |
| Thur | No | 0.160298 | 0.038774 |
| | Yes | 0.163863 | 0.039389 |
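Newer versions of pandas (0.25 and later) also support named aggregation via keyword arguments, which avoids the tuple syntax; a sketch of the equivalent call:

# equivalent named aggregation using keyword arguments (requires pandas >= 0.25)
groups['tip_pct'].agg(foo='mean', bar='std')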
Suppose we want to compute the same statistics for both the tip_pct and total_bill columns:

groups[['tip_pct', 'total_bill']].agg(['count', 'mean', 'max'])
| day | smoker | tip_pct count | tip_pct mean | tip_pct max | total_bill count | total_bill mean | total_bill max |
|---|---|---|---|---|---|---|---|
| Fri | No | 4 | 0.151650 | 0.187735 | 4 | 18.420000 | 22.75 |
| | Yes | 15 | 0.174783 | 0.263480 | 15 | 16.813333 | 40.17 |
| Sat | No | 45 | 0.158048 | 0.291990 | 45 | 19.661778 | 48.33 |
| | Yes | 42 | 0.147906 | 0.325733 | 42 | 21.276667 | 50.81 |
| Sun | No | 57 | 0.160113 | 0.252672 | 57 | 20.506667 | 48.17 |
| | Yes | 19 | 0.187250 | 0.710345 | 19 | 24.120000 | 45.35 |
| Thur | No | 45 | 0.160298 | 0.266312 | 45 | 17.113111 | 41.19 |
| | Yes | 17 | 0.163863 | 0.241255 | 17 | 19.190588 | 43.11 |
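The result above has hierarchical columns (one level for the original column, one for the statistic). You can index into it by the top-level name to pull out one block; a small sketch:

# the aggregated frame has hierarchical columns; select the tip_pct block
result = groups[['tip_pct', 'total_bill']].agg(['count', 'mean', 'max'])
result['tip_pct']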
You can also pass a dict to agg() that contains a mapping of column names to any of the function specifications listed so far:

groups.agg({
    'tip': np.max,
    'size': 'sum'
})
| day | smoker | tip | size |
|---|---|---|---|
| Fri | No | 3.50 | 9 |
| | Yes | 4.73 | 31 |
| Sat | No | 9.00 | 115 |
| | Yes | 10.00 | 104 |
| Sun | No | 6.00 | 167 |
| | Yes | 6.50 | 49 |
| Thur | No | 6.70 | 112 |
| | Yes | 5.00 | 40 |
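The dict values can themselves be lists of functions, so different columns can get a different number of statistics. A sketch mixing both forms:

# per-column function specifications: a list for tip_pct, a single name for size
groups.agg({
    'tip_pct': ['min', 'max', 'mean', 'std'],
    'size': 'sum'
})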
So far, the aggregated data has come back with an index (possibly hierarchical) made up of the unique group key combinations. To disable this behavior, pass as_index=False to groupby():

tips.groupby(['day', 'smoker'], as_index=False).mean()
| | day | smoker | total_bill | tip | size | tip_pct |
|---|---|---|---|---|---|---|
0 | Fri | No | 18.420000 | 2.812500 | 2.250000 | 0.151650 |
1 | Fri | Yes | 16.813333 | 2.714000 | 2.066667 | 0.174783 |
2 | Sat | No | 19.661778 | 3.102889 | 2.555556 | 0.158048 |
3 | Sat | Yes | 21.276667 | 2.875476 | 2.476190 | 0.147906 |
4 | Sun | No | 20.506667 | 3.167895 | 2.929825 | 0.160113 |
5 | Sun | Yes | 24.120000 | 3.516842 | 2.578947 | 0.187250 |
6 | Thur | No | 17.113111 | 2.673778 | 2.488889 | 0.160298 |
7 | Thur | Yes | 19.190588 | 3.030000 | 2.352941 | 0.163863 |
Alternatively, you can obtain the same result by calling the reset_index() function on the resulting DataFrame/Series, although using as_index=False is a more efficient method.
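For comparison, here is a sketch of the reset_index() route, which should produce the same table as the as_index=False call above:

# same result as as_index=False, but the flat index is rebuilt in a second step
tips.groupby(['day', 'smoker']).mean().reset_index()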
The apply() method of GroupBy splits the object into pieces, invokes the passed function on each piece, and then concatenates the results together. We can use it with groupby() to help us simplify as well as generalize the process of filling in the missing ratings in the raw ramen dataset.

raw_ramen = pd.read_csv('https://stat430.hknguyen.org/files/datasets/ramen-ratings.csv', na_values=['Unrated'])
raw_ramen
| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | NaN |
1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.00 | NaN |
2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | NaN |
3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | NaN |
4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
2575 | 5 | Vifon | Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty... | Bowl | Vietnam | 3.50 | NaN |
2576 | 4 | Wai Wai | Oriental Style Instant Noodles | Pack | Thailand | 1.00 | NaN |
2577 | 3 | Wai Wai | Tom Yum Shrimp | Pack | Thailand | 2.00 | NaN |
2578 | 2 | Wai Wai | Tom Yum Chili Flavor | Pack | Thailand | 2.00 | NaN |
2579 | 1 | Westbrae | Miso Ramen | Pack | USA | 0.50 | NaN |
2580 rows × 7 columns
raw_ramen.dtypes
Review #      int64
Brand        object
Variety      object
Style        object
Country      object
Stars       float64
Top Ten      object
dtype: object
raw_ramen[raw_ramen['Stars'].isnull()]
| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
32 | 2548 | Ottogi | Plain Instant Noodle No Soup Included | Pack | South Korea | NaN | NaN |
122 | 2458 | Samyang Foods | Sari Ramen | Pack | South Korea | NaN | NaN |
993 | 1587 | Mi E-Zee | Plain Noodles | Pack | Malaysia | NaN | NaN |
We want to fill the NA values in the Stars column with the average rating of observations that share the same Brand and Country values.

groups = raw_ramen.groupby(['Brand', 'Country'])
def avg_rating(group):
    mean = group['Stars'].mean()
    group['Stars'] = group['Stars'].fillna(mean)
    return group
raw_ramen = groups.apply(avg_rating)
raw_ramen
| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | NaN |
1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.00 | NaN |
2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | NaN |
3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | NaN |
4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
2575 | 5 | Vifon | Hu Tiu Nam Vang ["Phnom Penh" style] Asian Sty... | Bowl | Vietnam | 3.50 | NaN |
2576 | 4 | Wai Wai | Oriental Style Instant Noodles | Pack | Thailand | 1.00 | NaN |
2577 | 3 | Wai Wai | Tom Yum Shrimp | Pack | Thailand | 2.00 | NaN |
2578 | 2 | Wai Wai | Tom Yum Chili Flavor | Pack | Thailand | 2.00 | NaN |
2579 | 1 | Westbrae | Miso Ramen | Pack | USA | 0.50 | NaN |
2580 rows × 7 columns
raw_ramen.iloc[[32, 122, 993], :]
| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
32 | 2548 | Ottogi | Plain Instant Noodle No Soup Included | Pack | South Korea | 3.341463 | NaN |
122 | 2458 | Samyang Foods | Sari Ramen | Pack | South Korea | 4.102041 | NaN |
993 | 1587 | Mi E-Zee | Plain Noodles | Pack | Malaysia | 3.687500 | NaN |
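As an alternative to writing a custom function for apply(), the same fill-in can be sketched with transform(), which broadcasts each group's mean back to the original rows. The variable name fresh below is only for illustration; it re-reads the raw file so the example stands on its own:

# alternative sketch: fill missing Stars with the (Brand, Country) group mean
fresh = pd.read_csv('https://stat430.hknguyen.org/files/datasets/ramen-ratings.csv',
                    na_values=['Unrated'])
fresh['Stars'] = fresh['Stars'].fillna(
    fresh.groupby(['Brand', 'Country'])['Stars'].transform('mean')
)
fresh.iloc[[32, 122, 993], :]   # the previously missing ratings are now filled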
These lecture notes reference material from Chapter 10 of Wes McKinney's Python for Data Analysis, 2nd Edition.