11. GroupBy

What is the most common cause of wildfires in California? We can count the number of fires due to a particular cause, such as lightning, using a query:

calfire[calfire.get('cause') == '1 - Lightning'].shape[0]
2550

This query retrieves a table containing only those fires caused by lightning, and .shape[0] returns the number of rows in this table, effectively counting the number of such fires. To determine the number of fires due to each cause, we could write a query for each different type of fire:

calfire[calfire.get('cause') == '1 - Lightning'].shape[0]
calfire[calfire.get('cause') == '2 - Equipment Use'].shape[0]
calfire[calfire.get('cause') == '3 - Smoking'].shape[0]
...
calfire[calfire.get('cause') == '14 - Unknown'].shape[0]

But there is a better way. As we’ll see in this section, the powerful .groupby method allows to quickly group a table’s rows and perform the same analysis on each group.

11.1. Split, Aggregate, Combine

Let’s count the number of fires due to each cause using .groupby:

calfire.groupby('cause').count()
year month name acres county longitude latitude
cause
1 - Lightning 2550 2550 2550 2550 2550 2550 2550
10 - Vehicle 399 399 399 399 399 399 399
11 - Powerline 359 359 359 359 359 359 359
12 - Firefighter Training 2 2 2 2 2 2 2
13 - Non-Firefighter Training 5 5 5 5 5 5 5
... ... ... ... ... ... ... ...
5 - Debris 610 610 610 610 610 610 610
6 - Railroad 70 70 70 70 70 70 70
7 - Arson 763 763 763 763 763 763 763
8 - Playing with fire 151 151 151 151 151 151 151
9 - Miscellaneous 1603 1603 1603 1603 1603 1603 1603

17 rows × 7 columns

This result tells us that there were 2550 fires caused by lightning, 399 caused by vehicles, and so forth. But why do all of the columns contain exactly the same information? And what is going on with the column labels? Let’s dissect this piece of code to understand what, exactly, is going on.

Three things happen when we run calfire.groupby('cause').count():

  1. The table’s rows are split into groups according to cause. All of the rows whose cause is '1 - Lightning' are placed into one group, all of the rows whose cause is '2 - Equipment Use' are placed into another group and so on.

  2. Each group is aggregated into a single row by counting the number of entries in each of the group’s columns.

  3. The resulting rows are combined to form a new table, with one row for every group.

Fig. 11.1 below demonstrates this process on a small example table.

../_images/split-agg-combine.png

Fig. 11.1 caption: The split-aggregate-combine pattern.

11.1.1. .groupby

The “split” part of the “split-aggregate-combine” process is performed by the .groupby method. It accepts a single argument: the column whose values should be used to form groups. When we call .groupby('cause'), babypandas looks through our table and separates the rows into groups by their value in the cause column – all of the rows whose cause is '1 - Lightning' are placed into one group, all of the rows whose cause is '2 - Equipment Use' are placed into another group and so on.

The result of .groupby is something called a DataFrameGroupby object:

calfire.groupby('cause')
<babypandas.bpd.DataFrameGroupBy at 0x7f9b600a4460>

A DataFrameGroupBy object doesn’t look all that useful by itself. We need to tell babypandas what action should be taken on each group. We do this by specifying an aggregator.

11.1.2. Aggregators

After we have grouped the table’s rows using .groupby, we must next apply an aggregator. An aggregator takes each group, aggregates it into a single row, and finally combines these rows to form a new table, with one row for each group with the group’s name as its label. There are several aggregators to choose from, namely:

  • .mean()

  • .median()

  • .max()

  • .min()

  • .count()

We have already seen how the .count() aggregator can be used to count the number of fires due to each cause. If we instead wanted to find the average size of each type of fire, we could use the .mean() aggregator:

calfire.groupby('cause').mean()
year month acres longitude latitude
cause
1 - Lightning 1993.815294 7.480784 2436.783344 -120.810116 39.036871
10 - Vehicle 2008.689223 7.343358 1456.904712 -120.205692 36.708733
11 - Powerline 2006.729805 7.200557 1441.556715 -120.465738 36.601976
12 - Firefighter Training 2002.500000 8.000000 1082.385803 -120.401002 38.392531
13 - Non-Firefighter Training 2001.800000 3.800000 305.873876 -119.927214 34.978246
... ... ... ... ... ...
5 - Debris 1977.645902 7.457377 992.116907 -121.751834 39.450188
6 - Railroad 1997.042857 7.300000 1267.831665 -120.741341 38.573846
7 - Arson 1995.422018 7.546527 2509.532333 -120.137300 37.265328
8 - Playing with fire 1995.052980 7.039735 753.464156 -119.490352 36.122455
9 - Miscellaneous 1988.960075 7.471616 2901.573769 -119.527044 36.746385

17 rows × 5 columns

The average size of each kind of fire is contained in the 'acres' column.

When an aggregator is applied to a group, it performs its operation on each column independently. For example, the .mean() aggregator computes the mean of the 'year' column, the mean of the 'month' column, and so forth. So what we see in the table above is the mean year of all fires caused by lightning, the mean month, mean size in acres, and so on.

Notice that some columns are apparently missing. For instance, the original table has columns called 'name' and 'county', but the table above does not have these columns. This is because they contain strings, and it doesn’t make sense to find the mean of a collection of strings. As such, babypandas has automatically dropped these column from the result. Likewise, these columns will be dropped if we use the .sum() aggregator.

What if we use the .max() aggregator? You might be surprised to find the 'name' and 'county' columns in the result:

calfire.groupby('cause').max()
year month name acres county longitude latitude
cause
1 - Lightning 2019 12 ZIEGLER-Iron Complex 315511.500000 Yuba -114.987792 41.990849
10 - Vehicle 2019 12 ZACA 229651.406250 Yuba -115.569205 41.874351
11 - Powerline 2019 12 YUBA 153335.562500 Yuba -116.395032 41.925561
12 - Firefighter Training 2004 8 OREGON 1954.705933 Fresno -119.301603 39.597129
13 - Non-Firefighter Training 2003 6 UNIT #5 805.634644 Monterey -118.426778 35.824536
... ... ... ... ... ... ... ...
5 - Debris 2019 12 ZIEBRIGHT 161815.656250 Yuba -115.393351 41.995821
6 - Railroad 2016 11 WOODHOUSE 56076.332031 Yuba -116.193409 41.979014
7 - Arson 2019 12 YOLO 160833.109375 Yuba -116.072059 41.992000
8 - Playing with fire 2019 11 YORBA LINDA 38347.472656 Yuba -116.308956 41.997899
9 - Miscellaneous 2019 12 ZERMATT 281790.875000 Yuba -114.478988 42.002868

17 rows × 7 columns

Here, babypandas adopts the view that the maximum of a collection of strings is the last string in the alphabetical ordering. So for instance, the ZACA fire was the fire whose name was alphabetically last among all fires caused by vehicles. Likewise, the .min() aggregator will produce the name that is alphabetically first within the group.

Warning

The aggregator is applied to each column independently. A common mistake is to believe that, after running calfire.groupby('cause').max(), the 'acres' column will contain the size of the largest fire in each group, and the 'name' column contains its name. This is not the case: the 'name' column contains the name that appears last in the alphabetical ordering of names within the group.

Let’s take another look at the count aggregator. Recall the result of our first groupby:

calfire.groupby('cause').count()
year month name acres county longitude latitude
cause
1 - Lightning 2550 2550 2550 2550 2550 2550 2550
10 - Vehicle 399 399 399 399 399 399 399
11 - Powerline 359 359 359 359 359 359 359
12 - Firefighter Training 2 2 2 2 2 2 2
13 - Non-Firefighter Training 5 5 5 5 5 5 5
... ... ... ... ... ... ... ...
5 - Debris 610 610 610 610 610 610 610
6 - Railroad 70 70 70 70 70 70 70
7 - Arson 763 763 763 763 763 763 763
8 - Playing with fire 151 151 151 151 151 151 151
9 - Miscellaneous 1603 1603 1603 1603 1603 1603 1603

17 rows × 7 columns

Notice that all of the columns contain exactly the same information. We now know why this is: when we apply .count(), it counts the number of entries within the 'year' column, the 'month' column, 'name' column, etc. There are the name number of entries within each column, so we get the same number for each. Also notice that the column names are no longer very meaningful. The 'year' column no longer contains years – it contains the count of entries in the 'year' column within each group.

If our goal is to count the number of fires due to each cause, we could use any column in the result – I usually just pick the one that is easiest to type:

calfire.groupby('cause').count().get('year')
cause
1 - Lightning                    2550
10 - Vehicle                      399
11 - Powerline                    359
12 - Firefighter Training           2
13 - Non-Firefighter Training       5
                                 ... 
5 - Debris                        610
6 - Railroad                       70
7 - Arson                         763
8 - Playing with fire             151
9 - Miscellaneous                1603
Name: year, Length: 17, dtype: int64

11.2. Examples

Which year had the largest number of fires?

We can count the number of fires from each year with .groupby() and .count(). But this time we want to group by the year column, so that fires from the same year are placed within the same group.

calfire.groupby('year').count()
month name cause acres county longitude latitude
year
1898 3 3 3 3 3 3 3
1902 1 1 1 1 1 1 1
1903 1 1 1 1 1 1 1
1908 2 2 2 2 2 2 2
1909 2 2 2 2 2 2 2
... ... ... ... ... ... ... ...
2015 286 286 286 286 286 286 286
2016 342 342 342 342 342 342 342
2017 591 591 591 591 591 591 591
2018 397 397 397 397 397 397 397
2019 295 295 295 295 295 295 295

115 rows × 7 columns

Notice that the index now contains the different years, since we have grouped by year. To find the year with the largest number of fires, we’ll do what we’ve done before: sort the table and return the last entry in the index:

calfire.groupby('year').count().sort_values(by='month').index[-1]
2017

If you’re confused why we are sorting by the 'month' column, remember: all of the columns contain the exact same information after .count()! So we could have sorted by the 'county' column to get the same result:

calfire.groupby('year').count().sort_values(by='county').index[-1]
2017

What is the median fire size in each county?

We’ll group by the 'county' column, and apply the .median() aggregator:

calfire.groupby('county').median()
year month acres longitude latitude
county
Alameda 1988.0 7.0 346.514191 -121.624947 37.707623
Alpine 2012.0 8.0 186.335419 -119.822902 38.443338
Amador 1992.5 7.0 240.124207 -120.666797 38.438166
Butte 1999.0 8.0 471.250946 -121.559119 39.634084
Calaveras 1991.0 8.0 228.224197 -120.597127 38.180825
... ... ... ... ... ...
Tulare 1996.0 7.0 167.007103 -118.773436 36.336689
Tuolumne 2001.0 8.0 86.326668 -120.047074 37.946170
Ventura 1996.0 8.0 122.332626 -119.021737 34.333966
Yolo 1988.0 7.0 677.440491 -122.097986 38.744614
Yuba 2008.0 7.5 74.831673 -121.341850 39.201832

57 rows × 5 columns

If we want just the fire size, we could .get the 'acres' column:

calfire.groupby('county').median().get('acres')
county
Alameda      346.514191
Alpine       186.335419
Amador       240.124207
Butte        471.250946
Calaveras    228.224197
                ...    
Tulare       167.007103
Tuolumne      86.326668
Ventura      122.332626
Yolo         677.440491
Yuba          74.831673
Name: acres, Length: 57, dtype: float64

What was the size of the largest fire in 1995?

We will group by year and apply the max aggregator:

calfire.groupby('year').max()
month name cause acres county longitude latitude
year
1898 9 MATILIJA 14 - Unknown 20539.949219 Ventura -119.265380 34.488614
1902 8 FEROUD 14 - Unknown 731.481567 Ventura -119.320979 34.417515
1903 10 SAN ANTONIO 14 - Unknown 380.260590 Ventura -119.253422 34.430616
1908 9 REEVES 14 - Unknown 7433.901855 Ventura -119.179959 41.000197
1909 9 SHOEMAKER 14 - Unknown 3174.514648 Ventura -117.685816 34.482616
... ... ... ... ... ... ... ...
2015 12 ZEBRA 9 - Miscellaneous 151546.812500 Yuba -115.837487 41.989971
2016 12 YUCCA 9 - Miscellaneous 132104.281250 Yuba -116.557240 41.971890
2017 12 ZERMATT 9 - Miscellaneous 281790.875000 Yuba -115.268126 42.002868
2018 12 YOSEMITE 9 - Miscellaneous 410202.468750 Yuba -116.359866 41.979459
2019 12 WILLOWS 3 9 - Miscellaneous 77762.140625 Yuba -115.339065 42.007182

115 rows × 7 columns

Now we want to retrieve the size (in acres) of the fire from 1995. We’ll do this with our familiar .get().loc[] pattern:

calfire.groupby('year').max().get('acres').loc[1995]
21444.125

Tip

When method chaining, it is useful to keep in mind what type of object you are working with at each step. Any time you write ., you should be able to stop and say whether the piece of code to the left evaluates to a table, a series, a number, etc. For instance, calfire.groupby('year').max() evaluates to a table, so whatever follows should be a table method.

11.3. .groupby versus querying

In the last example above, we found the size of the largest fire from 1995. We did this with .groupby() and the .max() aggregator. While this worked, you might have noticed that we can obtain the same result with a simple query:

calfire[calfire.get('year') == 1995].sort_values(by='acres').get('acres').iloc[-1]
21444.125

Which is better? And how do we choose which method to use?

In many cases like this one, both approaches work equally well. In general, however, queries are most useful when asking a question about a single group (or subset) of a table, and .groupby is most useful when your question is about each group. As a rule of thumb, if you question contains the word “each” – like in “What was the size of the largest fire in each county?” – you probably want to use .groupby.

Queries have many advantages over .groupby. For one, queries can be performed with complex, compound conditions, while .groupby’s is limited to forming groups using the values in a particular column. For instance, suppose we want to find the size of the largest fire between 1990 and 2000. We can do this with a query, but not with .groupby directly. Second, a query results in a full DataFrame object, while .groupby must be followed by one of a limited number of available aggregators.

Jupyter Tip

If you’d like to know how efficient a particular piece of code is, you can use the %%timeit “magic function”. This runs a cell over and over, printing the average time it takes to execute. Create a new cell with %%timeit and the code you’d like to time, like so:

%%timeit
calfire.groupby('year').max().get('acres').loc[1995]

This will print something like the following:

62.4 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This tells us that the code took 62.4 milliseconds to run on average. Not bad!

11.4. Subgroups

Which month of which year had the most wildfires in California history? To answer this question, we’d like to count the number of wildfires for each month of every year. So for instance, the number of fires in January 1991, February 1991, …, January 1992, February 1992, …, and so on.

This looks like a job for .groupby with the .count aggregator. But what do we group by? If we group by year, we’ll get the number of fires in each year. If we just group by month, we’ll get the number of fires in each month over all years, which isn’t quite what we want:

calfire.groupby('month').count()
year name cause acres county longitude latitude
month
1 119 119 119 119 119 119 119
2 85 85 85 85 85 85 85
3 131 131 131 131 131 131 131
4 274 274 274 274 274 274 274
5 832 832 832 832 832 832 832
... ... ... ... ... ... ... ...
8 2945 2945 2945 2945 2945 2945 2945
9 2179 2179 2179 2179 2179 2179 2179
10 966 966 966 966 966 966 966
11 424 424 424 424 424 424 424
12 169 169 169 169 169 169 169

12 rows × 7 columns

What we’d like to do is to group rows which are in the same year and in the same month. It turns out that .groupby supports this via subgrouping. We can specify two columns to group by using a list. First, “outer” groups will be formed using the first column, then “inner” groups will be formed within each outer group using the values in the second column. Here is how it looks in action:

calfire.groupby(['year', 'month']).count()
name cause acres county longitude latitude
year month
1898 4 1 1 1 1 1 1
9 2 2 2 2 2 2
1902 8 1 1 1 1 1 1
1903 10 1 1 1 1 1 1
1908 7 1 1 1 1 1 1
... ... ... ... ... ... ... ...
2019 8 41 41 41 41 41 41
9 68 68 68 68 68 68
10 41 41 41 41 41 41
11 15 15 15 15 15 15
12 1 1 1 1 1 1

931 rows × 6 columns

We see something new in the result: both the year and the month are being used as the row label. This called a hierarchical index (or multi-index). Hierarchical indices are quite powerful, but introduce extra complexity. Instead of using multi-indexes, we’ll use the .reset_index() method to move each level of the index to its own column:

by_month = calfire.groupby(['year', 'month']).count().reset_index()
by_month
year month name cause acres county longitude latitude
0 1898 4 1 1 1 1 1 1
1 1898 9 2 2 2 2 2 2
2 1902 8 1 1 1 1 1 1
3 1903 10 1 1 1 1 1 1
4 1908 7 1 1 1 1 1 1
... ... ... ... ... ... ... ... ...
926 2019 8 41 41 41 41 41 41
927 2019 9 68 68 68 68 68 68
928 2019 10 41 41 41 41 41 41
929 2019 11 15 15 15 15 15 15
930 2019 12 1 1 1 1 1 1

931 rows × 8 columns

Which month of which year had the most wildfires? We’ll solve this by sorting, as usual:

by_month_sorted = by_month.sort_values(by='name')
by_month_sorted.get('year').iloc[-1]
2008
by_month_sorted.get('month').iloc[-1]
6