12. Applying Functions

Earlier, in Chapter 1, we saw that it is possible to create our own custom functions in Python. Such functions are very useful for repeatedly performing the same series of actions on different inputs. We have seen how to write functions that accept numbers and strings, but you’ll be glad to know that they can accept any type of data, including DataFrames!

For example, suppose we frequently want to retrieve only those rows of a table whose entries lie between some thresholds. We might want only those fires in calfire from 1995 up to, but not including, 2000, for instance. We can do so with a query:

calfire[(calfire.get("year") >= 1995) & (calfire.get("year") < 2000)]
      year  month  name                cause              acres        county           longitude    latitude
6374  1995  11     SEMINOLE            2 - Equipment Use  645.149780   Riverside        -116.772176  33.890971
6375  1995  10     SHINN               7 - Arson          13.694411    Los Angeles      -117.679804  34.181470
6376  1995  8      ECHO                14 - Unknown       372.055573   Riverside        -117.209657  33.868816
6377  1995  10     FREEWAY FIRE NO II  14 - Unknown       1233.456909  Los Angeles      -118.372160  34.446973
6378  1995  12     TOWSLEY FIRE        14 - Unknown       818.184509   Los Angeles      -118.559865  34.345437
...   ...   ...    ...                 ...                ...          ...              ...          ...
7333  1999  7      VINTAGE             2 - Equipment Use  34.438271    Kern             -119.521235  35.207022
7334  1999  8      41                  10 - Vehicle       167.177612   Kern             -120.176735  35.768621
7335  1999  8      WASHBURN            1 - Lightning      272.034943   San Luis Obispo  -119.806037  35.132310
7336  1999  8      WOODLAND            1 - Lightning      995.418335   Butte            -121.699525  39.859890
7337  1999  8      BLOOMER             1 - Lightning      2609.674805  Butte            -121.476863  39.644791

964 rows × 8 columns

By writing this query into a function accepting a table, a column, and the thresholds, we make it easy to repeat. Such a function definition may look like:

def between(table, column, start, stop):
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

Then we can call our function to get only those fires from between 1995 and 2000:

between(calfire, 'year', 1995, 2000)
      year  month  name                cause              acres        county           longitude    latitude
6374  1995  11     SEMINOLE            2 - Equipment Use  645.149780   Riverside        -116.772176  33.890971
6375  1995  10     SHINN               7 - Arson          13.694411    Los Angeles      -117.679804  34.181470
6376  1995  8      ECHO                14 - Unknown       372.055573   Riverside        -117.209657  33.868816
6377  1995  10     FREEWAY FIRE NO II  14 - Unknown       1233.456909  Los Angeles      -118.372160  34.446973
6378  1995  12     TOWSLEY FIRE        14 - Unknown       818.184509   Los Angeles      -118.559865  34.345437
...   ...   ...    ...                 ...                ...          ...              ...          ...
7333  1999  7      VINTAGE             2 - Equipment Use  34.438271    Kern             -119.521235  35.207022
7334  1999  8      41                  10 - Vehicle       167.177612   Kern             -120.176735  35.768621
7335  1999  8      WASHBURN            1 - Lightning      272.034943   San Luis Obispo  -119.806037  35.132310
7336  1999  8      WOODLAND            1 - Lightning      995.418335   Butte            -121.699525  39.859890
7337  1999  8      BLOOMER             1 - Lightning      2609.674805  Butte            -121.476863  39.644791

964 rows × 8 columns
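To check how between treats its two thresholds without loading the full data set, here is a small sketch. It assumes the pandas library, where .get and Boolean indexing behave as in the query above, and the four-row fires table is made up purely for illustration.

```python
import pandas as pd

def between(table, column, start, stop):
    # Keep rows where start <= value < stop: start is included, stop is not.
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

# Hypothetical miniature table; the values are invented for this example.
fires = pd.DataFrame({
    'year': [1994, 1995, 1999, 2000],
    'acres': [120.0, 15_000.0, 10_000.0, 20_000.0],
})

by_year = between(fires, 'year', 1995, 2000)
by_acres = between(fires, 'acres', 10_000, 20_000)

print(list(by_year.get('year')))    # [1995, 1999]
print(list(by_acres.get('acres')))  # [15000.0, 10000.0]
```

The row with year 2000 and the row with exactly 20,000 acres are both dropped, because the upper threshold is compared with < rather than <=.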

Because this function accepts the column name, it is very reusable. We can use it to get the fires whose size is between 10,000 and 20,000 acres:

between(calfire, 'acres', 10_000, 20_000)
       year  month  name                  cause              acres         county       longitude    latitude
16     1910  7      COYOTE CREEK          14 - Unknown       11226.824219  Ventura      -119.386960  34.421249
251    1924  8      UPPER DESOLATION VAL  14 - Unknown       10973.407227  El Dorado    -120.506799  38.571552
317    1926  7      FORT BIDWELL          9 - Miscellaneous  13100.943359  Modoc        -120.082147  41.934478
322    1927  8      LIEBRE                14 - Unknown       17957.339844  Los Angeles  -118.611143  34.719225
739    1938  8      RED CAP               14 - Unknown       14867.953125  Humboldt     -123.470725  41.174271
...    ...   ...    ...                   ...                ...           ...          ...          ...
12949  2018  7      CRANSTON              7 - Arson          13229.158203  Riverside    -116.696822  33.715582
12954  2018  6      LIONS                 1 - Lightning      13462.742188  Madera       -119.166248  37.577131
13290  2019  9      TABOOSE               1 - Lightning      10267.631836  Inyo         -118.348358  37.021613
13305  2019  7      TUCKER                9 - Miscellaneous  14184.661133  Modoc        -121.241082  41.803004
13365  2019  10     MARIA                 14 - Unknown       10042.458984  Ventura      -119.056671  34.314244

235 rows × 8 columns

Since the >= and < operators work on strings, too, we can get all of the fires whose names fall between 'A' and 'E' in alphabetical order, that is, names beginning with A, B, C, or D:

between(calfire, 'name', 'A', 'E')
       year  month  name              cause              acres         county      longitude    latitude
2      1898  9      COZY DELL         14 - Unknown       2974.585205   Ventura     -119.265380  34.482316
9      1910  8      CRAWFORD CREEK 2  7 - Arson          497.885071    Humboldt    -123.552471  41.300052
10     1910  7      BLUFF CREEK       4 - Campfire       298.716553    Del Norte   -123.760361  41.430391
16     1910  7      COYOTE CREEK      14 - Unknown       11226.824219  Ventura     -119.386960  34.421249
17     1910  8      BULL CREEK        4 - Campfire       56.897217     Humboldt    -123.621988  41.174766
...    ...   ...    ...               ...                ...           ...         ...          ...
13446  2019  9      ANTELOPE          9 - Miscellaneous  167.332794    San Benito  -120.831821  36.558925
13451  2019  9      COW               10 - Vehicle       15.383965     Shasta      -122.038452  40.612144
13456  2019  9      DEER              5 - Debris         9.367375      Santa Cruz  -122.085541  37.183180
13457  2019  10     CABRILLO          2 - Equipment Use  61.750446     San Mateo   -122.358885  37.171839
13460  2019  10     CROSS             14 - Unknown       289.151428    Monterey    -120.726245  35.793698

3620 rows × 8 columns
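String comparison in Python goes character by character, which has a few consequences worth seeing on plain strings. The sketch below uses a hypothetical helper, in_range, that applies the same test between applies to each entry. Most of the names come from the tables above, while 'E' and the lowercase 'deer' are made up to probe the edges.

```python
def in_range(name, start='A', stop='E'):
    # Hypothetical helper: the same comparison that between performs per entry.
    return start <= name < stop

names = ['ANTELOPE', 'DEER', 'E', 'ECHO', '41', 'deer']
kept = [name for name in names if in_range(name)]

print(kept)  # ['ANTELOPE', 'DEER']
```

'E' and 'ECHO' are excluded because neither sorts strictly before 'E'. The fire named '41' is excluded because digits sort before uppercase letters, and 'deer' is excluded because lowercase letters sort after uppercase ones.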

12.1. The .apply Series Method

DataFrames come equipped with many useful methods, but defining our own functions allows us to make tables even more powerful. One way to use tables with functions is to pass the table into the function as one of its inputs, as we saw in the example above. In some situations, however, we don’t want to apply the function to the entire table, but rather to each entry in one of the table’s columns. In these cases, we can use the .apply method.

For instance, suppose we have a table containing a 'year' column, such as the calfire table we have been using, and we want to convert each year into the corresponding decade. We have already written a function that converts a single year to a decade: decade_from_year. Recall how it works:

decade_from_year(1987)
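The definition of decade_from_year is not repeated in this section, and the call above only works once the function has been defined in the current session. Here is a minimal sketch, assuming the function rounds a year down to the first year of its decade. The original definition may differ.

```python
def decade_from_year(year):
    # Assumed behavior: integer-divide by 10 to drop the last digit,
    # then multiply by 10 to get the first year of the decade.
    return year // 10 * 10

print(decade_from_year(1987))  # 1980
print(decade_from_year(2000))  # 2000
```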

We’d like to apply this function to each entry in the 'year' column. To do so, we’ll use .apply:

calfire.get('year').apply(decade_from_year)

Notice the pattern here: we .get('year') to retrieve the column we wish to work with, and then .apply(decade_from_year) to the column. The result is a Series with the same number of entries as the column containing the years. Each entry is the result of applying the function to the corresponding entry of the original column.
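The pattern can be tried on a small Series. This sketch assumes the pandas library and an assumed definition of decade_from_year that rounds down to the start of the decade. The four years are taken from rows shown earlier.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: first year of the decade containing `year`.
    return year // 10 * 10

years = pd.Series([1898, 1995, 1999, 2019])
decades = years.apply(decade_from_year)

print(list(decades))               # [1890, 1990, 1990, 2010]
print(len(decades) == len(years))  # True
```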

Warning

Note that we pass the function into .apply without trailing parentheses. That is, we write .apply(decade_from_year) and not .apply(decade_from_year()) or .apply(decade_from_year(calfire.get('year'))). The .apply method accepts the function itself, referred to by name, rather than the result of calling it. It then calls the function once for each entry of the given Series.
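The difference between passing a function and calling it can be seen in plain Python. In the sketch below, apply_to_each is a hypothetical, simplified stand-in for what .apply does conceptually. It is not how the library implements it, and decade_from_year uses an assumed definition.

```python
def decade_from_year(year):
    # Assumed definition: first year of the decade containing `year`.
    return year // 10 * 10

def apply_to_each(values, function):
    # Hypothetical stand-in for .apply: call `function` once per entry.
    return [function(value) for value in values]

# Passing the function itself (no parentheses) works.
result = apply_to_each([1987, 1995], decade_from_year)
print(result)  # [1980, 1990]

# Writing decade_from_year() calls the function immediately with no
# argument, which fails before apply_to_each ever runs.
try:
    apply_to_each([1987, 1995], decade_from_year())
except TypeError as error:
    message = str(error)
    print(message)
```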

In many cases we’d like to add this new Series back to the table as a new column. We can do so with .assign:

with_decade = calfire.assign(
    decade=calfire.get('year').apply(decade_from_year)
)
with_decade
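As a quick check of what .assign returns, the following sketch builds a three-row table from fires shown earlier and adds a decade column. It assumes the pandas library and an assumed definition of decade_from_year.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: first year of the decade containing `year`.
    return year // 10 * 10

# Small table using three (year, name) pairs from the output above.
fires = pd.DataFrame({
    'year': [1995, 1999, 2019],
    'name': ['SEMINOLE', 'VINTAGE', 'MARIA'],
})

with_decade = fires.assign(
    decade=fires.get('year').apply(decade_from_year)
)

print(list(with_decade.columns))        # ['year', 'name', 'decade']
print(list(with_decade.get('decade')))  # [1990, 1990, 2010]
print('decade' in fires.columns)        # False
```

The last line shows that .assign returns a new table and leaves the original untouched, which is why the result is stored under a new name.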

The .apply method is very useful for data cleaning. Data rarely comes to us in the exact form we need or prefer. For instance, we might wish to convert a year to its decade, or remove the leading number code from a fire’s cause. A common approach to doing so is to write a function capable of converting or cleaning a single entry, then .applying this function to the entire column.

Example: clean the cause column

The cause column contains the cause of each fire as a string, such as '14 - Unknown'. The string begins with a numeric code unique to the cause of the fire, but this is redundant since the cause is described immediately after. Let’s get rid of the number, leaving only the description.

First, we’ll write a function that accepts a cause and returns only the description:

def cause_description(cause):
    return cause.split('-')[-1].strip()
cause_description('2 - Equipment Use')
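Tracing the function on a few inputs shows what each step contributes. The first two causes appear in the table. The third, '99 - Non-Structure Fire', is a made-up value used only to show a limitation, and cause_description_first_split is a hypothetical variant, not part of the original text.

```python
def cause_description(cause):
    # Split on every hyphen, keep the last piece, trim surrounding spaces.
    return cause.split('-')[-1].strip()

def cause_description_first_split(cause):
    # Hypothetical variant: split only at the first ' - ' separator,
    # so hyphens inside the description are preserved.
    return cause.split(' - ', 1)[-1].strip()

print(cause_description('2 - Equipment Use'))  # Equipment Use
print(cause_description('14 - Unknown'))       # Unknown

# Made-up cause with a hyphen in its description.
print(cause_description('99 - Non-Structure Fire'))              # Structure Fire
print(cause_description_first_split('99 - Non-Structure Fire'))  # Non-Structure Fire
```

For the causes shown in this chapter the two versions agree. The variant only matters if a description itself contains a hyphen.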

Now we .apply the function to the 'cause' column. We’ll save it back to the table using .assign:

calfire.assign(
    cause=calfire.get('cause').apply(cause_description)
)
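Because the keyword passed to .assign matches an existing column name, the cleaned values replace that column in the returned table. A small sketch, assuming the pandas library and using three rows from the earlier output:

```python
import pandas as pd

def cause_description(cause):
    return cause.split('-')[-1].strip()

fires = pd.DataFrame({
    'name': ['SEMINOLE', 'SHINN', 'ECHO'],
    'cause': ['2 - Equipment Use', '7 - Arson', '14 - Unknown'],
})

# Reusing the name 'cause' overwrites that column in the new table.
cleaned = fires.assign(
    cause=fires.get('cause').apply(cause_description)
)

print(list(cleaned.get('cause')))  # ['Equipment Use', 'Arson', 'Unknown']
print(list(fires.get('cause')))    # ['2 - Equipment Use', '7 - Arson', '14 - Unknown']
```

The original table still holds the coded causes. To keep the cleaned version, store the result under a name, as done with with_decade above.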