12. Defining and Applying Functions

We have seen that Python comes with a bunch of useful functions for performing common tasks. For instance, the built-in round function rounds a number to a specified number of decimal places.

round(3.1415, 2)

We have also seen that we can access even more functions by installing and importing a library, like NumPy or babypandas.

In some cases, however, there might not be a library providing the function that you need. Luckily, Python allows us to define our own functions. In this section, we’ll see how to create functions and apply them to tables.

12.1. Defining Functions

Suppose you are working with a dataset containing a bunch of street addresses, such as the following address of the University of California, San Diego:

ucsd = '9500 Gilman Dr, La Jolla, CA 92093'

Suppose we only care about the city and state. That is, we’d like to extract the string 'La Jolla, CA' from the full address. Python doesn’t come with a function that does exactly this, but we can write our own without too much work.

12.1.1. Splitting Strings

A typical address has several parts: the street address, the city name, the state, and the zip code. The parts are separated by commas (with the exception of the state and zip code). Python strings have a helpful .split method which will split a string into parts according to whatever delimiter we provide. To split by a comma, we write:

ucsd.split(',')
['9500 Gilman Dr', ' La Jolla', ' CA 92093']

The result is a list of strings, each of them a part of the original string.

If we do not provide a delimiter, the default behavior of .split is to split based on whitespace (such as spaces):

ucsd.split()
['9500', 'Gilman', 'Dr,', 'La', 'Jolla,', 'CA', '92093']

We can use .split to retrieve the city and state name. Notice that when we split by commas, the city name will always be the second-to-last entry of the resulting list. This is because the last comma separates the city from the state and zip code. Remember that we can retrieve the second-to-last element of a list using square bracket notation, combined with using -2 as the index:

city = ucsd.split(',')[-2]
' La Jolla'

The result has a leading space that we might want to get rid of – we’ll deal with that in a moment. For now, let’s retrieve the state name. To do this, it might be easiest to split based on whitespace – then the state abbreviation will again be the second-to-last element of the list:

state = ucsd.split()[-2]

We’d like to put the city and state together into a single string, like 'La Jolla, CA'. To do so, remember that the + operator concatenates strings:

city_and_state = city + ', ' + state
' La Jolla, CA'

This is almost perfect, but let’s get rid of the leading space. We can do this with the .strip() string method, which removes leading and trailing whitespace.

city_and_state.strip()
'La Jolla, CA'
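As a small sketch of what .strip() does and does not remove, the example below uses a made-up padded string (the name padded is ours, not part of the address data). Only whitespace at the two ends is removed; the space after the comma is left alone.

```python
# A hypothetical padded string, used only to illustrate .strip().
padded = '  La Jolla, CA \n'

stripped = padded.strip()
print(repr(stripped))      # whitespace at both ends is gone

# The space after the comma sits in the middle of the string, so it stays.
print(' ' in stripped)
```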

Great! Putting it all together, here’s the code we used to retrieve the city and state:

city = ucsd.split(',')[-2]
state = ucsd.split()[-2]
city_and_state = city + ', ' + state
city_and_state.strip()
'La Jolla, CA'

This code might seem simple enough, but suppose we have another address that we’d like to process:

lego = 'LEGOLAND California Resort 1 Legoland Dr, Carlsbad, CA 92008'

We could copy and paste the code above, but there is a better way: let’s define a function.

12.1.2. The def statement

In Python, new functions are created using the def statement. Here is an example of a function which retrieves the city and state name from an address:

def city_comma_state(address):
    """Return CITY, ST from an address string."""
    city = address.split(',')[-2]
    state = address.split()[-2]
    city_and_state = city + ', ' + state
    return city_and_state.strip()

There is a lot to say about this, but first let’s test the function to see if it works. We call user-defined functions just like any other function:

city_comma_state('9500 Gilman Dr, La Jolla, CA 92093')
'La Jolla, CA'
city_comma_state(ucsd)
'La Jolla, CA'
city_comma_state(lego)
'Carlsbad, CA'
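It is worth remembering that the function relies on the address ending in a state followed by a zip code. The sketch below repeats the definition and then tries a made-up address with no zip code (the variable no_zip is hypothetical) to show what happens when that assumption fails.

```python
def city_comma_state(address):
    """Return CITY, ST from an address string."""
    city = address.split(',')[-2]
    state = address.split()[-2]
    city_and_state = city + ', ' + state
    return city_and_state.strip()

lego = 'LEGOLAND California Resort 1 Legoland Dr, Carlsbad, CA 92008'
print(city_comma_state(lego))

# Hypothetical address with the zip code missing: the second-to-last
# "word" is no longer the state, so the result is not what we want.
no_zip = '9500 Gilman Dr, La Jolla, CA'
print(city_comma_state(no_zip))
```

The function still runs on the second address, but it silently returns the wrong string, so it is a good habit to test a function on a few different inputs.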

Let’s take a closer look at the anatomy of a function definition. Fig. 12.1 below shows all of the different parts.


Fig. 12.1 The anatomy of a function.

Name

A function definition starts with the def keyword, followed by the function's name. Above, we’ve named our function city_comma_state, but any valid variable name would do. A function’s name should be short but descriptive.

Arguments

Next come the function’s arguments. These are the “inputs” to the function. In this case, there is only one argument: the address that will be processed. We’ll see how to define functions with more than one argument in a moment. A function can also have zero arguments, in which case we would write def function_with_no_args():. The arguments can be named anything, as long as they are valid variable names. The arguments are surrounded by parentheses, and separated by commas.

Body

The body of the function contains the code that will be executed when the function is called. The arguments can be used within the body of the function. The body of the function must be indented; we usually do this with the tab key.

Docstring

The docstring is a piece of documentation that tells the reader what the function does. Including it is optional but recommended. If you ask Python for information on your function using help, the docstring will be displayed!

help(city_comma_state)
Help on function city_comma_state in module __main__:

city_comma_state(address)
    Return CITY, ST from an address string.

Return

A function should usually return some value. This is done using the return statement, followed by an expression whose value will be returned.
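To see why the return statement matters, here is a minimal sketch with two hypothetical functions (the names add_two_no_return and add_two are ours). A function whose body never reaches a return statement hands back the special value None.

```python
def add_two_no_return(number):
    """Compute number + 2, but forget to return it."""
    result = number + 2      # computed, then thrown away


def add_two(number):
    """Return number + 2."""
    return number + 2


missing = add_two_no_return(5)
present = add_two(5)

print(missing)    # None
print(present)    # 7
```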

12.1.3. Function Behavior

The code we include in a function behaves differently than the code we are used to writing in a couple of key ways.

Functions are “recipes”

The code inside of a function is not executed until we call the function. For instance, suppose we try to do something impossible inside of a function – like dividing by zero:

def foo():
    x = 1/0
    return x

If you run the cell defining this function, everything will be fine: you won’t see an error. But when you call the function, Python lets you know that you’re doing something that is mathematically impossible:

ZeroDivisionError                         Traceback (most recent call last)
/tmp/nix-shell.IKXEjZ/ipykernel_4688/3160684747.py in <module>
----> 1 foo()

/tmp/nix-shell.IKXEjZ/ipykernel_4688/1130574473.py in foo()
      1 def foo():
----> 2     x = 1/0
      3     return x

ZeroDivisionError: division by zero

This is because function definitions are like recipes, in the sense that handing someone a recipe is not the same as following the recipe and preparing the meal.

Scope

Variables defined within a function are available only inside of the function. We can define variables inside a function just as we normally would:

def foo():
    x = 42
    y = 5
    return x + y

If we run the function, we’ll see the number 47 displayed:

foo()
47

However, if we try to use the variable x, Python will yell at us:

NameError                                 Traceback (most recent call last)
/tmp/nix-shell.IKXEjZ/ipykernel_4688/32546335.py in <module>
----> 1 x

NameError: name 'x' is not defined

This is because variables defined within a function are accessible only within the function. If we want to use that variable outside of the function, we need to pass it back to the caller using a return statement.
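The sketch below shows both halves of this idea in one place. It uses hypothetical names (add_numbers, local_only, total) so that it does not clash with anything defined earlier, and it catches the NameError with try/except so that the cell keeps running.

```python
def add_numbers():
    local_only = 42          # exists only while the function is running
    other = 5
    return local_only + other


# The returned value can be saved under a new name by the caller.
total = add_numbers()
print(total)

# The variable defined inside the function is not visible out here.
try:
    local_only
    saw_error = False
except NameError as error:
    saw_error = True
    print('NameError:', error)
```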

Note that arguments count as “variables defined within a function”. For instance:

def foo(my_argument):
    return my_argument + 2

If we call the function, everything will act as expected:


But if we try to access my_argument outside of the function, Python tells us that we can’t:

NameError                                 Traceback (most recent call last)
/tmp/nix-shell.IKXEjZ/ipykernel_4688/2574116524.py in <module>
----> 1 my_argument

NameError: name 'my_argument' is not defined

On the other hand, variables defined outside of a function are available inside the function. Consider for instance:

x = 42
def foo():
    return x + 10
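As a rough illustration, the sketch below calls foo before and after changing the outer variable. The second function, foo_isolated, is a hypothetical alternative of ours that receives the value as an argument instead.

```python
x = 42

def foo():
    return x + 10            # x is looked up outside the function when foo runs

first = foo()
print(first)                 # 52

x = 100                      # changing the outer variable changes the result
second = foo()
print(second)                # 110

def foo_isolated(value):
    return value + 10        # depends only on what is passed in

third = foo_isolated(42)
print(third)                 # 52
```

The result of foo depends on whatever x happens to be at the moment of the call, which can be hard to keep track of in a long notebook.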

Use this behavior sparingly; it is usually better to “isolate” a function from the outside world by passing in all of the variables that it needs.

return exits the function

As soon as Python encounters a return statement, it stops executing the function and returns the corresponding value. As an example, consider the code below which has three returns. Only the first return statement will ever run:

def foo():
    print('Starting execution.')
    return 1
    print('Hey, I made it!')
    return 2
    print('On to number three...')
    return 3
foo()
Starting execution.
1

printing versus returning

As we saw above, functions are somewhat isolated from the rest of the world in the sense that variables defined within them cannot be used outside of the function. The “correct” way of transmitting values back to the world is to use a return statement. However, a common mistake is to think that print does the same thing. This is understandable, since printing and returning look similar in a Jupyter notebook. For example, let’s define a function that both prints and returns:

def foo():
    x = 42
    y = 52
    print(y)
    return x

When we run this function, we’ll see both values:

foo()
52
42

Only 42 is the output of the cell and can be “saved” to a variable. 52, on the other hand, is simply displayed to the screen and is afterwards lost forever. This can be checked by saving the result to a variable z and displaying its value (the call prints 52 again, but z holds only 42):

z = foo()
z
52
42


Nevertheless, using print inside of a function can be helpful in “debugging” – more on that in a moment. Lastly, if you truly want to return two values from a function, the right way to do so is by separating them with a comma, as follows:

def foo():
    x = 42
    y = 52
    return x, y

When the function is run, it will return a tuple of two things:

foo()
(42, 52)

A tuple is like a list, so we can use square bracket notation to retrieve each element:
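A minimal sketch of retrieving the two values, using the same foo as above. The names pair, first, and second are ours; the last line shows an alternative that assigns both values to names in one step.

```python
def foo():
    x = 42
    y = 52
    return x, y

pair = foo()
print(pair[0])       # the first element of the tuple
print(pair[1])       # the second element of the tuple

# Alternatively, give each returned value its own name right away.
first, second = foo()
print(first + second)
```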


We won’t usually need to return more than one thing from a function, though.

12.2. Examples

Given a year, produce the decade

Given a year, such as 1994, we’d like to retrieve the decade; in this case, 1990. At first we might think that round is useful:

round(1994, -1)

But it won’t work for years like 1997, since it will round up:

round(1997, -1)

There are a few approaches that do work. One way is to use the % operator. Remember that x % y returns the remainder upon dividing x by y. For example:

1992 % 10

To find the decade, we can simply subtract the remainder obtained by dividing by ten:

1992 - (1992 % 10)
1997 - (1997 % 10)
2000 - (2000 % 10)

Placing this code in a function makes it so we don’t have to remember this trick, and makes our code more readable:

def decade_from_year(year):
    return year - year % 10
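A quick check of the function on the three years used above might look like the sketch below. The last lines show one alternative way to get the same answer using the integer division operator //; this is our addition, not the approach used in the rest of this section.

```python
def decade_from_year(year):
    return year - year % 10

decades = []
for year in [1992, 1997, 2000]:
    decades.append(decade_from_year(year))
print(decades)

# Alternative: drop the last digit with integer division, then scale back up.
alternative = 1997 // 10 * 10
print(alternative)
```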

Given base and height, compute the area of a triangle

We need to define a function with two arguments. We do so by separating the argument names with a comma, like so:

def area_of_triangle(base, height):
    return 1/2 * base * height
area_of_triangle(10, 5)

Note that the order of the arguments matters. When area_of_triangle(10, 5) is executed, Python assigns the value of 10 to base and assigns the value of 5 to height. If you wish, you can use the keyword argument form to call the function, in which case arguments can be provided in any order. This is slightly more readable, too:

area_of_triangle(height=4, base=10)
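Because the area formula multiplies base and height together, swapping them does not change the answer, which hides the effect of argument order. The sketch below uses a hypothetical function, describe_triangle, whose output makes the assignment of values to argument names visible.

```python
def describe_triangle(base, height):
    return 'base=' + str(base) + ', height=' + str(height)

positional = describe_triangle(10, 5)          # 10 goes to base, 5 to height
swapped = describe_triangle(5, 10)             # now 5 goes to base
keyword = describe_triangle(height=5, base=10) # names decide, not position

print(positional)
print(swapped)
print(keyword)
```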

Perform a frequent query

Suppose we frequently want to retrieve only those rows of a table whose entries lie between some thresholds. For instance, we might want only those fires in calfire from between 1995 and 2000. By writing this query into a function accepting a table, a column, and the thresholds, we make it easy to repeat:

def between(table, column, start, stop):
    return table[(table.get(column) >= start) & (table.get(column) < stop)]

For instance, to get only those fires from 1995 up to, but not including, 2000:

between(calfire, 'year', 1995, 2000)
      year  month  name                cause              acres        county           longitude    latitude
6374  1995  11     SEMINOLE            2 - Equipment Use  645.149780   Riverside        -116.772176  33.890971
6375  1995  10     SHINN               7 - Arson          13.694411    Los Angeles      -117.679804  34.181470
6376  1995  8      ECHO                14 - Unknown       372.055573   Riverside        -117.209657  33.868816
6377  1995  10     FREEWAY FIRE NO II  14 - Unknown       1233.456909  Los Angeles      -118.372160  34.446973
6378  1995  12     TOWSLEY FIRE        14 - Unknown       818.184509   Los Angeles      -118.559865  34.345437
...   ...   ...    ...                 ...                ...          ...              ...          ...
7333  1999  7      VINTAGE             2 - Equipment Use  34.438271    Kern             -119.521235  35.207022
7334  1999  8      41                  10 - Vehicle       167.177612   Kern             -120.176735  35.768621
7335  1999  8      WASHBURN            1 - Lightning      272.034943   San Luis Obispo  -119.806037  35.132310
7336  1999  8      WOODLAND            1 - Lightning      995.418335   Butte            -121.699525  39.859890
7337  1999  8      BLOOMER             1 - Lightning      2609.674805  Butte            -121.476863  39.644791

964 rows × 8 columns

Because this function accepts the column name, it is very reusable. We can use it to get the fires whose size is between 10,000 and 20,000 acres:

between(calfire, 'acres', 10_000, 20_000)
       year  month  name                  cause              acres         county       longitude    latitude
16     1910  7      COYOTE CREEK          14 - Unknown       11226.824219  Ventura      -119.386960  34.421249
251    1924  8      UPPER DESOLATION VAL  14 - Unknown       10973.407227  El Dorado    -120.506799  38.571552
317    1926  7      FORT BIDWELL          9 - Miscellaneous  13100.943359  Modoc        -120.082147  41.934478
322    1927  8      LIEBRE                14 - Unknown       17957.339844  Los Angeles  -118.611143  34.719225
739    1938  8      RED CAP               14 - Unknown       14867.953125  Humboldt     -123.470725  41.174271
...    ...   ...    ...                   ...                ...           ...          ...          ...
12949  2018  7      CRANSTON              7 - Arson          13229.158203  Riverside    -116.696822  33.715582
12954  2018  6      LIONS                 1 - Lightning      13462.742188  Madera       -119.166248  37.577131
13290  2019  9      TABOOSE               1 - Lightning      10267.631836  Inyo         -118.348358  37.021613
13305  2019  7      TUCKER                9 - Miscellaneous  14184.661133  Modoc        -121.241082  41.803004
13365  2019  10     MARIA                 14 - Unknown       10042.458984  Ventura      -119.056671  34.314244

235 rows × 8 columns

Since the >= and < operators work on strings, too, we can get all of the fires whose name is between A and E:

between(calfire, 'name', 'A', 'E')
       year  month  name              cause              acres         county      longitude    latitude
2      1898  9      COZY DELL         14 - Unknown       2974.585205   Ventura     -119.265380  34.482316
9      1910  8      CRAWFORD CREEK 2  7 - Arson          497.885071    Humboldt    -123.552471  41.300052
10     1910  7      BLUFF CREEK       4 - Campfire       298.716553    Del Norte   -123.760361  41.430391
16     1910  7      COYOTE CREEK      14 - Unknown       11226.824219  Ventura     -119.386960  34.421249
17     1910  8      BULL CREEK        4 - Campfire       56.897217     Humboldt    -123.621988  41.174766
...    ...   ...    ...               ...                ...           ...         ...          ...
13446  2019  9      ANTELOPE          9 - Miscellaneous  167.332794    San Benito  -120.831821  36.558925
13451  2019  9      COW               10 - Vehicle       15.383965     Shasta      -122.038452  40.612144
13456  2019  9      DEER              5 - Debris         9.367375      Santa Cruz  -122.085541  37.183180
13457  2019  10     CABRILLO          2 - Equipment Use  61.750446     San Mateo   -122.358885  37.171839
13460  2019  10     CROSS             14 - Unknown       289.151428    Monterey    -120.726245  35.793698

3620 rows × 8 columns
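The string comparison that between performs on each name can be tried out on single values in plain Python. The helper name_is_between below is a hypothetical stand-in of ours that applies the same two comparisons to one string; the fire names are taken from the tables above.

```python
def name_is_between(name, start, stop):
    # The same two comparisons that between applies to a column.
    return (name >= start) and (name < stop)

names = ['ANTELOPE', 'COZY DELL', 'DEER', 'ECHO', 'SEMINOLE']

kept = []
for name in names:
    if name_is_between(name, 'A', 'E'):
        kept.append(name)

print(kept)
```

Notice that 'ECHO' is left out: strings are compared character by character, and any name that starts with E and has more letters comes after the one-letter string 'E'.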

12.3. The .apply Series Method

DataFrames come equipped with many useful methods, but defining our own functions allows us to make tables even more powerful. One way to use tables with functions is to pass the table into the function as one of its inputs, as we saw in the example above. In some situations, however, we don’t want to apply the function to the entire table, but rather to each entry in one of the table’s columns. In these cases, we can use the .apply method.

For instance, suppose we have a table containing a 'year' column, such as the calfire table we have been using, and we want to convert each year into the corresponding decade. We have already written a function that converts a single year to a decade: decade_from_year. Recall how it works:


We’d like to apply this function to each entry in the 'year' column. To do so, we’ll use .apply:

calfire.get('year').apply(decade_from_year)
0        1890
1        1890
2        1890
3        1900
4        1900
         ...
13459    2010
13460    2010
13461    2010
13462    2010
13463    2010
Name: year, Length: 13464, dtype: int64

Notice the pattern here: we .get('year') to retrieve the column we wish to work with, and then .apply(decade_from_year) to the column. The result is a Series with the same number of entries as the column containing the years. Each entry is the result of applying the function to the corresponding entry of the original column.


Note that we pass the function into .apply without trailing parentheses. That is, we write .apply(decade_from_year) and not .apply(decade_from_year()) or .apply(decade_from_year(calfire.get('year'))). The .apply method accepts the name of a function. It will then call the function many times on the given Series.
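To get a feel for why the parentheses are left off, here is a plain-Python sketch. The helper apply_to_each is a hypothetical stand-in of ours that works on a list; it is not how .apply is actually implemented, but it follows the same idea of receiving a function and calling it once per entry.

```python
def decade_from_year(year):
    return year - year % 10

def apply_to_each(values, function):
    """Hypothetical stand-in for .apply: call function on every value."""
    results = []
    for value in values:
        results.append(function(value))
    return results

years = [1898, 1902, 2019]

# Pass the function itself, with no parentheses after its name.
decades = apply_to_each(years, decade_from_year)
print(decades)

# Writing decade_from_year() tries to call it right away with no year.
try:
    apply_to_each(years, decade_from_year())
    call_failed = False
except TypeError as error:
    call_failed = True
    print('TypeError:', error)
```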

In many cases we’d like to add this new Series back to the table as a new column. We can do so with .assign:

with_decade = calfire.assign(
    decade=calfire.get('year').apply(decade_from_year)
)
with_decade
       year  month  name          cause              acres         county       longitude    latitude   decade
0      1898  9      LOS PADRES    14 - Unknown       20539.949219  Ventura      -119.367830  34.446830  1890
1      1898  4      MATILIJA      14 - Unknown       2641.123047   Ventura      -119.299625  34.488614  1890
2      1898  9      COZY DELL     14 - Unknown       2974.585205   Ventura      -119.265380  34.482316  1890
3      1902  8      FEROUD        14 - Unknown       731.481567    Ventura      -119.320979  34.417515  1900
4      1903  10     SAN ANTONIO   14 - Unknown       380.260590    Ventura      -119.253422  34.430616  1900
...    ...   ...    ...           ...                ...           ...          ...          ...        ...
13459  2019  9      STAGE         7 - Arson          13.019149     Monterey     -121.599207  36.764065  2010
13460  2019  10     CROSS         14 - Unknown       289.151428    Monterey     -120.726245  35.793698  2010
13461  2019  9      FRUDDEN       2 - Equipment Use  11.789393     Monterey     -120.908061  35.908627  2010
13462  2019  9      JOLON         11 - Powerline     61.592369     Monterey     -121.010025  35.910750  2010
13463  2019  10     SADDLE RIDGE  14 - Unknown       8799.325195   Los Angeles  -118.516473  34.321859  2010

13464 rows × 9 columns

The .apply method is very useful for data cleaning. Data rarely comes to us in the exact form we need or prefer. For instance, we might wish to convert a year to its decade, or remove the leading number code from a fire’s cause. A common approach to doing so is to write a function capable of converting or cleaning a single entry, then .applying this function to the entire column.

Example: clean the cause column

The cause column contains the cause of each fire as a string, such as '14 - Unknown'. The string begins with a numeric code unique to the cause of the fire, but this is redundant since the cause is described immediately after. Let’s get rid of the number, leaving only the description.

First, we’ll write a function that accepts a cause and returns only the description:

def cause_description(cause):
    return cause.split('-')[-1].strip()
cause_description('2 - Equipment Use')
'Equipment Use'
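One thing to keep in mind is that cause_description keeps only the text after the last hyphen. That works for the causes shown in this section, but it would cut short a description that itself contained a hyphen. The sketch below uses a made-up cause string (not from the calfire data) and a hypothetical variant of ours, cause_description_first_dash, that splits only at the first ' - '.

```python
def cause_description(cause):
    return cause.split('-')[-1].strip()

def cause_description_first_dash(cause):
    # Hypothetical variant: split once, at the first ' - ' separator.
    return cause.split(' - ', 1)[-1].strip()

# A cause from the table behaves the same either way.
print(cause_description('14 - Unknown'))
print(cause_description_first_dash('14 - Unknown'))

# A made-up cause whose description contains a hyphen.
made_up = '99 - Non-Fire Cause'
print(cause_description(made_up))
print(cause_description_first_dash(made_up))
```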

Now we .apply the function to the 'cause' column. We’ll save it back to the table using .assign:

calfire.assign(
    cause=calfire.get('cause').apply(cause_description)
)
       year  month  name          cause          acres         county       longitude    latitude
0      1898  9      LOS PADRES    Unknown        20539.949219  Ventura      -119.367830  34.446830
1      1898  4      MATILIJA      Unknown        2641.123047   Ventura      -119.299625  34.488614
2      1898  9      COZY DELL     Unknown        2974.585205   Ventura      -119.265380  34.482316
3      1902  8      FEROUD        Unknown        731.481567    Ventura      -119.320979  34.417515
4      1903  10     SAN ANTONIO   Unknown        380.260590    Ventura      -119.253422  34.430616
...    ...   ...    ...           ...            ...           ...          ...          ...
13459  2019  9      STAGE         Arson          13.019149     Monterey     -121.599207  36.764065
13460  2019  10     CROSS         Unknown        289.151428    Monterey     -120.726245  35.793698
13461  2019  9      FRUDDEN       Equipment Use  11.789393     Monterey     -120.908061  35.908627
13462  2019  9      JOLON         Powerline      61.592369     Monterey     -121.010025  35.910750
13463  2019  10     SADDLE RIDGE  Unknown        8799.325195   Los Angeles  -118.516473  34.321859

13464 rows × 8 columns