12. Applying Functions

Earlier, in Chapter 1, we saw that it is possible to create our own custom functions in Python. Such functions are very useful for repeatedly performing the same series of actions on different inputs. We have seen how to write functions that accept numbers and strings, but you’ll be glad to know that they can accept any type of data, including DataFrames!

For example, suppose we frequently want to retrieve only those rows of a table whose entries lie between some thresholds. We might want only those fires in calfire from 1995 up to, but not including, 2000, for instance. We can do so with a query:

calfire[(calfire.get("year") >= 1995) & (calfire.get("year") < 2000)]

	year	month	name	cause	acres	county	longitude	latitude
6374	1995	11	SEMINOLE	2 - Equipment Use	645.149780	Riverside	-116.772176	33.890971
6375	1995	10	SHINN	7 - Arson	13.694411	Los Angeles	-117.679804	34.181470
6376	1995	8	ECHO	14 - Unknown	372.055573	Riverside	-117.209657	33.868816
6377	1995	10	FREEWAY FIRE NO II	14 - Unknown	1233.456909	Los Angeles	-118.372160	34.446973
6378	1995	12	TOWSLEY FIRE	14 - Unknown	818.184509	Los Angeles	-118.559865	34.345437
...	...	...	...	...	...	...	...	...
7333	1999	7	VINTAGE	2 - Equipment Use	34.438271	Kern	-119.521235	35.207022
7334	1999	8	41	10 - Vehicle	167.177612	Kern	-120.176735	35.768621
7335	1999	8	WASHBURN	1 - Lightning	272.034943	San Luis Obispo	-119.806037	35.132310
7336	1999	8	WOODLAND	1 - Lightning	995.418335	Butte	-121.699525	39.859890
7337	1999	8	BLOOMER	1 - Lightning	2609.674805	Butte	-121.476863	39.644791

964 rows × 8 columns

By writing this query into a function accepting a table, a column, and the thresholds, we make it easy to repeat. Such a function definition may look like:

def between(table, column, start, stop):
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

Then we can call our function to get only those fires from between 1995 and 2000:

between(calfire, 'year', 1995, 2000)

	year	month	name	cause	acres	county	longitude	latitude
6374	1995	11	SEMINOLE	2 - Equipment Use	645.149780	Riverside	-116.772176	33.890971
6375	1995	10	SHINN	7 - Arson	13.694411	Los Angeles	-117.679804	34.181470
6376	1995	8	ECHO	14 - Unknown	372.055573	Riverside	-117.209657	33.868816
6377	1995	10	FREEWAY FIRE NO II	14 - Unknown	1233.456909	Los Angeles	-118.372160	34.446973
6378	1995	12	TOWSLEY FIRE	14 - Unknown	818.184509	Los Angeles	-118.559865	34.345437
...	...	...	...	...	...	...	...	...
7333	1999	7	VINTAGE	2 - Equipment Use	34.438271	Kern	-119.521235	35.207022
7334	1999	8	41	10 - Vehicle	167.177612	Kern	-120.176735	35.768621
7335	1999	8	WASHBURN	1 - Lightning	272.034943	San Luis Obispo	-119.806037	35.132310
7336	1999	8	WOODLAND	1 - Lightning	995.418335	Butte	-121.699525	39.859890
7337	1999	8	BLOOMER	1 - Lightning	2609.674805	Butte	-121.476863	39.644791

964 rows × 8 columns
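To see exactly which rows between keeps, here is a minimal sketch on a small made-up table. It assumes plain pandas rather than babypandas (the .get method retrieves a column in the same way), and the toy table and its values are invented for illustration. The point to notice is that start is included while stop is excluded.

```python
import pandas as pd

def between(table, column, start, stop):
    # Keep rows where start <= value < stop (start included, stop excluded).
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

# A made-up table, used only for illustration.
toy = pd.DataFrame({
    'year': [1994, 1995, 1999, 2000],
    'name': ['A', 'B', 'C', 'D'],
})

result = between(toy, 'year', 1995, 2000)
kept_years = list(result.get('year'))
print(kept_years)  # only the rows for 1995 and 1999 remain
```

The row for 2000 is dropped because the function uses < stop rather than <= stop.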

Because this function accepts the column name, it is very reusable. We can use it to get the fires whose size is between 10,000 and 20,000 acres:

between(calfire, 'acres', 10_000, 20_000)

	year	month	name	cause	acres	county	longitude	latitude
16	1910	7	COYOTE CREEK	14 - Unknown	11226.824219	Ventura	-119.386960	34.421249
251	1924	8	UPPER DESOLATION VAL	14 - Unknown	10973.407227	El Dorado	-120.506799	38.571552
317	1926	7	FORT BIDWELL	9 - Miscellaneous	13100.943359	Modoc	-120.082147	41.934478
322	1927	8	LIEBRE	14 - Unknown	17957.339844	Los Angeles	-118.611143	34.719225
739	1938	8	RED CAP	14 - Unknown	14867.953125	Humboldt	-123.470725	41.174271
...	...	...	...	...	...	...	...	...
12949	2018	7	CRANSTON	7 - Arson	13229.158203	Riverside	-116.696822	33.715582
12954	2018	6	LIONS	1 - Lightning	13462.742188	Madera	-119.166248	37.577131
13290	2019	9	TABOOSE	1 - Lightning	10267.631836	Inyo	-118.348358	37.021613
13305	2019	7	TUCKER	9 - Miscellaneous	14184.661133	Modoc	-121.241082	41.803004
13365	2019	10	MARIA	14 - Unknown	10042.458984	Ventura	-119.056671	34.314244

235 rows × 8 columns

Since the >= and < operators work on strings, too, we can get all of the fires whose name falls alphabetically between 'A' and 'E', that is, the names beginning with A, B, C, or D:

between(calfire, 'name', 'A', 'E')

	year	month	name	cause	acres	county	longitude	latitude
2	1898	9	COZY DELL	14 - Unknown	2974.585205	Ventura	-119.265380	34.482316
9	1910	8	CRAWFORD CREEK 2	7 - Arson	497.885071	Humboldt	-123.552471	41.300052
10	1910	7	BLUFF CREEK	4 - Campfire	298.716553	Del Norte	-123.760361	41.430391
16	1910	7	COYOTE CREEK	14 - Unknown	11226.824219	Ventura	-119.386960	34.421249
17	1910	8	BULL CREEK	4 - Campfire	56.897217	Humboldt	-123.621988	41.174766
...	...	...	...	...	...	...	...	...
13446	2019	9	ANTELOPE	9 - Miscellaneous	167.332794	San Benito	-120.831821	36.558925
13451	2019	9	COW	10 - Vehicle	15.383965	Shasta	-122.038452	40.612144
13456	2019	9	DEER	5 - Debris	9.367375	Santa Cruz	-122.085541	37.183180
13457	2019	10	CABRILLO	2 - Equipment Use	61.750446	San Mateo	-122.358885	37.171839
13460	2019	10	CROSS	14 - Unknown	289.151428	Monterey	-120.726245	35.793698

3620 rows × 8 columns
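The string case can be checked without a table at all. The sketch below applies the same comparison that between performs to a handful of names. Python compares strings character by character, so the first letter usually decides the outcome. The list is a mix of names from the output above and invented ones (the lowercase 'cabrillo' is made up to show that case matters).

```python
# Strings are compared character by character, using each character's code point.
names = ['ANTELOPE', 'COZY DELL', 'DEER', 'E', 'ECHO', '41', 'cabrillo']

# The same test that between applies to each entry: 'A' <= name < 'E'.
kept = [name for name in names if 'A' <= name < 'E']
print(kept)  # ['ANTELOPE', 'COZY DELL', 'DEER']
```

Digits sort before uppercase letters and lowercase letters sort after them, which is why '41' and 'cabrillo' are both excluded.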

12.1. The `.apply` Series Method

DataFrames come equipped with many useful methods, but defining our own functions allows us to make tables even more powerful. One way to use tables with functions is to pass the table into the function as one of its inputs, as we saw in the example above. In some situations, however, we don’t want to apply the function to the entire table, but rather to each entry in one of the table’s columns. In these cases, we can use the .apply method.

For instance, suppose we have a table containing a 'year' column, such as the calfire table we have been using, and we want to convert each year into the corresponding decade. We have already written a function that converts a single year to a decade: decade_from_year. Recall how it works:

decade_from_year(1987)

We’d like to apply this function to each entry in the 'year' column. To do so, we’ll use .apply:

calfire.get('year').apply(decade_from_year)

Notice the pattern here: we .get('year') to retrieve the column we wish to work with, and then .apply(decade_from_year) to the column. The result is a Series with the same number of entries as the column containing the years. Each entry is the result of applying the function to the corresponding entry of the original column.
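The definition of decade_from_year is not repeated in this section. As a minimal sketch, assume it rounds a year down to the nearest multiple of ten; the actual function from the earlier chapter may differ. The example uses plain pandas and a short made-up Series of years in place of the calfire column.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: round the year down to a multiple of 10.
    return year // 10 * 10

# A made-up Series standing in for calfire.get('year').
years = pd.Series([1898, 1910, 1987, 2019])

# .apply calls decade_from_year once per entry and collects the results.
decades = years.apply(decade_from_year)
print(list(decades))  # [1890, 1910, 1980, 2010]
```

The result has one entry for each entry of the original Series, in the same order.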

Warning

Note that we pass the function into .apply without trailing parentheses. That is, we write .apply(decade_from_year) and not .apply(decade_from_year()) or .apply(decade_from_year(calfire.get('year'))). The .apply method accepts the function itself, referred to by name, and then calls it once on each entry of the Series.
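The difference between passing a function and calling it can be seen in a small experiment. The helper double below is hypothetical and exists only for this illustration; the sketch assumes plain pandas.

```python
import pandas as pd

def double(x):
    # Hypothetical helper, used only for this illustration.
    return 2 * x

numbers = pd.Series([1, 2, 3])

# Correct: pass the function itself; .apply calls it on each entry.
doubled = numbers.apply(double)
print(list(doubled))  # [2, 4, 6]

# Incorrect: double() is evaluated first, with no argument, so Python
# raises a TypeError before .apply ever runs.
try:
    numbers.apply(double())
    error_name = None
except TypeError as err:
    error_name = type(err).__name__
print(error_name)  # TypeError
```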

In many cases we’d like to add this new Series back to the table as a new column. We can do so with .assign:

with_decade = calfire.assign(
    decade=calfire.get('year').apply(decade_from_year)
)
with_decade
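A small sketch of the same pattern on a toy table is given below. It assumes plain pandas and the rounding-down definition of decade_from_year, which may differ from the one written earlier in the book; the three rows reuse a few year and name values from the output shown above. It also shows that .assign returns a new table and leaves the original untouched.

```python
import pandas as pd

def decade_from_year(year):
    # Assumed definition: round the year down to a multiple of 10.
    return year // 10 * 10

# A toy table standing in for calfire.
toy = pd.DataFrame({
    'year': [1995, 1999, 2018],
    'name': ['ECHO', 'VINTAGE', 'CRANSTON'],
})

with_decade = toy.assign(
    decade=toy.get('year').apply(decade_from_year)
)

print(list(with_decade.columns))  # ['year', 'name', 'decade']
print(list(toy.columns))          # ['year', 'name'] (the original is unchanged)
```

Because the original table is not modified, the result has to be stored in a variable, as with_decade is here, in order to be used later.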

The .apply method is very useful for data cleaning. Data rarely comes to us in the exact form we need or prefer. For instance, we might wish to convert a year to its decade, or remove the leading number code from a fire’s cause. A common approach to doing so is to write a function capable of converting or cleaning a single entry, then .applying this function to the entire column.

Example: clean the cause column

The cause column contains the cause of each fire as a string, such as '14 - Unknown'. The string begins with a numeric code unique to the cause of the fire, but this is redundant since the cause is described immediately after. Let’s get rid of the number, leaving only the description.

First, we’ll write a function that accepts a cause and returns only the description:

def cause_description(cause):
    return cause.split('-')[-1].strip()

cause_description('2 - Equipment Use')

Now we .apply the function to the 'cause' column. We’ll save it back to the table using .assign:

calfire.assign(
    cause=calfire.get('cause').apply(cause_description)
)
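One edge case is worth checking before relying on a cleaning function like this: a description that itself contains a hyphen. The sketch below compares the definition from the text with a hypothetical variant, cause_description_first_split, that splits at the first hyphen only. The hyphenated label is made up for illustration; whether the calfire data contains such a label is not shown here.

```python
def cause_description(cause):
    # Definition from the text: split on every hyphen, keep the last piece.
    return cause.split('-')[-1].strip()

def cause_description_first_split(cause):
    # Hypothetical variant: split at the first hyphen only, so any
    # hyphens inside the description are preserved.
    return cause.split('-', 1)[-1].strip()

plain = '2 - Equipment Use'
hyphenated = '13 - Non-Firefighter Training'  # made-up label for illustration

print(cause_description(plain))                   # Equipment Use
print(cause_description(hyphenated))              # Firefighter Training
print(cause_description_first_split(hyphenated))  # Non-Firefighter Training
```

Both versions agree on labels without an inner hyphen, so the choice only matters if such labels occur in the column being cleaned.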

Notes on (Baby)Pandas
