12. Applying Functions
Earlier, in Chapter 1, we saw that it is possible to create our own custom functions in Python. Such functions are very useful for repeatedly performing the same series of actions on different inputs. We have seen how to write functions that accept numbers and strings, but you’ll be glad to know that they can accept any type of data, including DataFrames!
For example, suppose we frequently want to retrieve only those rows of a table whose entries lie between some thresholds. We might want only those fires in calfire from between 1995 and 2000 (including 1995 but not 2000), for instance. We can do so with a query:
calfire[(calfire.get("year") >= 1995) & (calfire.get("year") < 2000)]
 | year | month | name | cause | acres | county | longitude | latitude |
---|---|---|---|---|---|---|---|---|
6374 | 1995 | 11 | SEMINOLE | 2 - Equipment Use | 645.149780 | Riverside | -116.772176 | 33.890971 |
6375 | 1995 | 10 | SHINN | 7 - Arson | 13.694411 | Los Angeles | -117.679804 | 34.181470 |
6376 | 1995 | 8 | ECHO | 14 - Unknown | 372.055573 | Riverside | -117.209657 | 33.868816 |
6377 | 1995 | 10 | FREEWAY FIRE NO II | 14 - Unknown | 1233.456909 | Los Angeles | -118.372160 | 34.446973 |
6378 | 1995 | 12 | TOWSLEY FIRE | 14 - Unknown | 818.184509 | Los Angeles | -118.559865 | 34.345437 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
7333 | 1999 | 7 | VINTAGE | 2 - Equipment Use | 34.438271 | Kern | -119.521235 | 35.207022 |
7334 | 1999 | 8 | 41 | 10 - Vehicle | 167.177612 | Kern | -120.176735 | 35.768621 |
7335 | 1999 | 8 | WASHBURN | 1 - Lightning | 272.034943 | San Luis Obispo | -119.806037 | 35.132310 |
7336 | 1999 | 8 | WOODLAND | 1 - Lightning | 995.418335 | Butte | -121.699525 | 39.859890 |
7337 | 1999 | 8 | BLOOMER | 1 - Lightning | 2609.674805 | Butte | -121.476863 | 39.644791 |
964 rows × 8 columns
By writing this query into a function accepting a table, a column, and the thresholds, we make it easy to repeat. Such a function definition may look like:
def between(table, column, start, stop):
    # Keep rows where start <= value < stop (start included, stop excluded).
    return table[(table.get(column) >= start) & (table.get(column) < stop)]
Then we can call our function to get only those fires from between 1995 and 2000:
between(calfire, 'year', 1995, 2000)
 | year | month | name | cause | acres | county | longitude | latitude |
---|---|---|---|---|---|---|---|---|
6374 | 1995 | 11 | SEMINOLE | 2 - Equipment Use | 645.149780 | Riverside | -116.772176 | 33.890971 |
6375 | 1995 | 10 | SHINN | 7 - Arson | 13.694411 | Los Angeles | -117.679804 | 34.181470 |
6376 | 1995 | 8 | ECHO | 14 - Unknown | 372.055573 | Riverside | -117.209657 | 33.868816 |
6377 | 1995 | 10 | FREEWAY FIRE NO II | 14 - Unknown | 1233.456909 | Los Angeles | -118.372160 | 34.446973 |
6378 | 1995 | 12 | TOWSLEY FIRE | 14 - Unknown | 818.184509 | Los Angeles | -118.559865 | 34.345437 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
7333 | 1999 | 7 | VINTAGE | 2 - Equipment Use | 34.438271 | Kern | -119.521235 | 35.207022 |
7334 | 1999 | 8 | 41 | 10 - Vehicle | 167.177612 | Kern | -120.176735 | 35.768621 |
7335 | 1999 | 8 | WASHBURN | 1 - Lightning | 272.034943 | San Luis Obispo | -119.806037 | 35.132310 |
7336 | 1999 | 8 | WOODLAND | 1 - Lightning | 995.418335 | Butte | -121.699525 | 39.859890 |
7337 | 1999 | 8 | BLOOMER | 1 - Lightning | 2609.674805 | Butte | -121.476863 | 39.644791 |
964 rows × 8 columns
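To see exactly which boundary values the function keeps, here is a minimal sketch on a tiny made-up table. It assumes plain pandas, whose DataFrames also provide .get; the table fires and its values are invented for illustration and are not part of calfire.

```python
import pandas as pd

# A tiny, made-up table standing in for calfire (values are illustrative only).
fires = pd.DataFrame({
    'year': [1994, 1995, 1999, 2000],
    'name': ['ALPHA', 'BRAVO', 'CHARLIE', 'DELTA'],
})

def between(table, column, start, stop):
    # Keep rows where start <= value < stop (start included, stop excluded).
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

result = between(fires, 'year', 1995, 2000)
print(result)
# The rows for 1995 and 1999 are kept; 1994 and 2000 are dropped.
```

The lower threshold is included and the upper threshold is excluded, so asking for 1995 to 2000 returns the years 1995 through 1999.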
Because this function accepts the column name, it is very reusable. We can use it to get the fires whose size is between 10,000 and 20,000 acres:
between(calfire, 'acres', 10_000, 20_000)
 | year | month | name | cause | acres | county | longitude | latitude |
---|---|---|---|---|---|---|---|---|
16 | 1910 | 7 | COYOTE CREEK | 14 - Unknown | 11226.824219 | Ventura | -119.386960 | 34.421249 |
251 | 1924 | 8 | UPPER DESOLATION VAL | 14 - Unknown | 10973.407227 | El Dorado | -120.506799 | 38.571552 |
317 | 1926 | 7 | FORT BIDWELL | 9 - Miscellaneous | 13100.943359 | Modoc | -120.082147 | 41.934478 |
322 | 1927 | 8 | LIEBRE | 14 - Unknown | 17957.339844 | Los Angeles | -118.611143 | 34.719225 |
739 | 1938 | 8 | RED CAP | 14 - Unknown | 14867.953125 | Humboldt | -123.470725 | 41.174271 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12949 | 2018 | 7 | CRANSTON | 7 - Arson | 13229.158203 | Riverside | -116.696822 | 33.715582 |
12954 | 2018 | 6 | LIONS | 1 - Lightning | 13462.742188 | Madera | -119.166248 | 37.577131 |
13290 | 2019 | 9 | TABOOSE | 1 - Lightning | 10267.631836 | Inyo | -118.348358 | 37.021613 |
13305 | 2019 | 7 | TUCKER | 9 - Miscellaneous | 14184.661133 | Modoc | -121.241082 | 41.803004 |
13365 | 2019 | 10 | MARIA | 14 - Unknown | 10042.458984 | Ventura | -119.056671 | 34.314244 |
235 rows × 8 columns
Since the >= and < operators work on strings, too, we can get all of the fires whose name falls alphabetically between 'A' and 'E', that is, names beginning with A, B, C, or D:
between(calfire, 'name', 'A', 'E')
 | year | month | name | cause | acres | county | longitude | latitude |
---|---|---|---|---|---|---|---|---|
2 | 1898 | 9 | COZY DELL | 14 - Unknown | 2974.585205 | Ventura | -119.265380 | 34.482316 |
9 | 1910 | 8 | CRAWFORD CREEK 2 | 7 - Arson | 497.885071 | Humboldt | -123.552471 | 41.300052 |
10 | 1910 | 7 | BLUFF CREEK | 4 - Campfire | 298.716553 | Del Norte | -123.760361 | 41.430391 |
16 | 1910 | 7 | COYOTE CREEK | 14 - Unknown | 11226.824219 | Ventura | -119.386960 | 34.421249 |
17 | 1910 | 8 | BULL CREEK | 4 - Campfire | 56.897217 | Humboldt | -123.621988 | 41.174766 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13446 | 2019 | 9 | ANTELOPE | 9 - Miscellaneous | 167.332794 | San Benito | -120.831821 | 36.558925 |
13451 | 2019 | 9 | COW | 10 - Vehicle | 15.383965 | Shasta | -122.038452 | 40.612144 |
13456 | 2019 | 9 | DEER | 5 - Debris | 9.367375 | Santa Cruz | -122.085541 | 37.183180 |
13457 | 2019 | 10 | CABRILLO | 2 - Equipment Use | 61.750446 | San Mateo | -122.358885 | 37.171839 |
13460 | 2019 | 10 | CROSS | 14 - Unknown | 289.151428 | Monterey | -120.726245 | 35.793698 |
3620 rows × 8 columns
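String comparison in Python is lexicographic: strings are compared character by character, and a longer string is greater than any of its own prefixes. The sketch below uses a plain list of made-up names, not a table, to show which names pass the same >= 'A' and < 'E' test; the comparison inside between should follow the same rules entry by entry.

```python
# Made-up names chosen to probe the boundaries of the test 'A' <= name < 'E'.
names = ['ANTELOPE', 'DEER', 'E', 'ECHO', '41', 'apple']

kept = [name for name in names if 'A' <= name < 'E']
print(kept)  # ['ANTELOPE', 'DEER']

# 'E' and 'ECHO' are not less than 'E', so they are excluded.
# '41' sorts before 'A' because digits come before uppercase letters.
# 'apple' sorts after 'E' because lowercase letters come after uppercase ones.
```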
12.1. The .apply Series Method
DataFrames come equipped with many useful methods, but defining our own functions allows us to make tables even more powerful. One way to use tables with functions is to pass the table into the function as one of its inputs, as we saw in the example above. In some situations, however, we don’t want to apply the function to the entire table, but rather to each entry in one of the table’s columns. In these cases, we can use the .apply method.
For instance, suppose we have a table containing a 'year' column, such as the calfire table we have been using, and we want to convert each year into the corresponding decade. We have already written a function that converts a single year to a decade: decade_from_year. Recall how it works:
decade_from_year(1987)
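The definition of decade_from_year from the earlier chapter is not repeated in this section. As a minimal sketch, assuming the function simply rounds a year down to the start of its decade, it might look like the following; the body is an assumption and not necessarily the original definition.

```python
def decade_from_year(year):
    # Assumed definition: round the year down to a multiple of 10,
    # so that 1987 belongs to the decade starting in 1980.
    return year // 10 * 10

print(decade_from_year(1987))  # 1980
```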
We’d like to apply this function to each entry in the 'year' column. To do so, we’ll use .apply:
calfire.get('year').apply(decade_from_year)
Notice the pattern here: we .get('year') to retrieve the column we wish to work with, and then .apply(decade_from_year) to the column. The result is a Series with the same number of entries as the column containing the years. Each entry is the result of applying the function to the corresponding entry of the original column.
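The pattern can be tried on a small Series without the full table. This sketch assumes plain pandas and the assumed round-down definition of decade_from_year; the years are made up.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: round the year down to the start of its decade.
    return year // 10 * 10

# A small, made-up column of years.
years = pd.Series([1898, 1910, 1995, 2019])

decades = years.apply(decade_from_year)
print(decades.tolist())  # [1890, 1910, 1990, 2010]
print(len(decades) == len(years))  # True
```

The result has one entry per input entry, in the same order and with the same index as the original Series.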
Warning
Note that we pass the function into .apply without trailing parentheses. That is, we write .apply(decade_from_year) and not .apply(decade_from_year()) or .apply(decade_from_year(calfire.get('year'))). The .apply method accepts the function itself, referred to by its name, and then calls that function once for each entry of the given Series.
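The difference between passing a function and calling it can be seen directly. The sketch below assumes plain pandas and the assumed round-down definition of decade_from_year. Writing decade_from_year() calls the function immediately with no argument, so Python raises a TypeError before .apply ever runs.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: round the year down to the start of its decade.
    return year // 10 * 10

years = pd.Series([1987, 2004])

# Correct: pass the function itself; .apply calls it once per entry.
print(years.apply(decade_from_year).tolist())  # [1980, 2000]

# Incorrect: the parentheses call the function right away, with no argument.
try:
    years.apply(decade_from_year())
    failed = False
except TypeError as error:
    failed = True
    print('TypeError:', error)
```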
In many cases we’d like to add this new Series back to the table as a new column. We can do so with .assign:
with_decade = calfire.assign(
    decade=calfire.get('year').apply(decade_from_year)
)
with_decade
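A small sketch of the same step, assuming plain pandas, a made-up two-row table, and the assumed round-down definition of decade_from_year. It also shows that .assign returns a new table and leaves the original one unchanged, which is why the result is stored under a new name.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: round the year down to the start of its decade.
    return year // 10 * 10

# A made-up table standing in for calfire.
fires = pd.DataFrame({'year': [1995, 2003], 'name': ['ECHO', 'CEDAR']})

with_decade = fires.assign(
    decade=fires.get('year').apply(decade_from_year)
)

print(list(with_decade.columns))  # ['year', 'name', 'decade']
print(list(fires.columns))        # ['year', 'name']
print(with_decade.get('decade').tolist())  # [1990, 2000]
```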
The .apply method is very useful for data cleaning. Data rarely comes to us in the exact form we need or prefer. For instance, we might wish to convert a year to its decade, or remove the leading number code from a fire’s cause. A common approach is to write a function that converts or cleans a single entry, and then to apply this function to the entire column with .apply.
Example: clean the cause column
The cause column contains the cause of each fire as a string, such as '14 - Unknown'. The string begins with a numeric code unique to the cause of the fire, but this is redundant since the cause is described immediately after. Let’s get rid of the number, leaving only the description.
First, we’ll write a function that accepts a cause and returns only the description:
def cause_description(cause):
    # Keep only the text after the last hyphen, without surrounding spaces.
    return cause.split('-')[-1].strip()
cause_description('2 - Equipment Use')
Now we .apply the function to the 'cause' column. We’ll put the cleaned column back into the table using .assign, which returns a new table in which the 'cause' column has been replaced:
calfire.assign(
    cause=calfire.get('cause').apply(cause_description)
)
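As a quick check of the cleaning function, the sketch below applies it to a few made-up entries in the same 'code - description' format, assuming plain pandas. It also probes one hypothetical edge case: because the function splits on every hyphen and keeps the last piece, a description that itself contains a hyphen would be cut short. The variant cause_description_first_hyphen is a hypothetical alternative, not part of the original text.

```python
import pandas as pd

def cause_description(cause):
    # Keep only the text after the last hyphen, without surrounding spaces.
    return cause.split('-')[-1].strip()

# Made-up sample entries in the same format as the 'cause' column.
causes = pd.Series(['2 - Equipment Use', '14 - Unknown', '1 - Lightning'])
cleaned = causes.apply(cause_description)
print(cleaned.tolist())  # ['Equipment Use', 'Unknown', 'Lightning']

# Hypothetical entry whose description contains a hyphen of its own.
tricky = '3 - Made-Up Cause'
print(cause_description(tricky))  # Up Cause

def cause_description_first_hyphen(cause):
    # Hypothetical variant: split only at the first hyphen.
    return cause.split('-', 1)[-1].strip()

print(cause_description_first_hyphen(tricky))  # Made-Up Cause
```

Whether the variant is needed depends on whether any description in the real data contains a hyphen, which is worth checking before relying on either version.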