10. Querying

The 2018 wildfire season was the worst in California history. How many fires were in 2018? What was their combined size? How many were in San Diego County? These questions could be easily answered if we had a table consisting only of fires from 2018, but our dataset has fires from all years. Creating a new table by selecting only certain rows from an existing table which satisfy some condition is called a query. In this section, we’ll see how to perform queries.

10.1. Querying with Boolean Arrays

Let’s start with a simple example. Below, we have created a table containing the population of the five largest cities in California. The table is arranged in alphabetical order:

import babypandas as bpd
populations = (
    bpd.DataFrame()
    .assign(
        City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
        Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000]
        )
    .set_index('City')
)
populations
Population
City
Fresno 502000
LA 4094000
SD 1376000
SF 856000
SJ 1023000

Tip

To create a table by hand, start by creating an empty DataFrame with bpd.DataFrame(), then use .assign to add columns to the table.

Suppose we want to know which cities have populations larger than 1 million. We can do so with the following query, the result of which is a new table consisting only of the cities whose population is greater than 1 million.

populations[populations.get('Population') > 1_000_000]
Population
City
LA 4094000
SD 1376000
SJ 1023000

Let’s take a deeper look at what is going on insider of a query. At it’s heart is a comparison: populations.get('Population') > 1_000_000. Recall from Booleans that a comparison between two numbers results in a Boolean value (that is, the result is either True or False). For instance, is we ask if 2 million is larger than 1 million, we get:

2_000_000 > 1_000_000
True

Now, in this case of populations.get('Population') > 1_000_000, we are comparing an entire column of the table to a single number, 1_000_000. What will be the result? As with arithmetic on a table column, comparisons are performed elementwise:

populations.get('Population') > 1_000_000
City
Fresno    False
LA         True
SD         True
SF        False
SJ         True
Name: Population, dtype: bool

The result is a Series with one entry for every row of the original table. The Series is True where the comparison is true and False where the comparison is not true. Because each entry of the Series is a Boolean, the result is called a Boolean array (or, more precisely, a Boolean series). The entries of the Boolean array tell us exactly which rows to keep and which rows to discard. If an entry is True, we want to keep the corresponding row of the table – if it is False, we should discard that row.

We can carry out the query and produce a new table by passing the Boolean array into the table using square bracket notation:

populations[populations.get('Population') > 1_000_000]
Population
City
LA 4094000
SD 1376000
SJ 1023000

Tip

Using square brackets on a table can be read aloud as “where”.

So the expression

populations[populations.get('Population') > 1_000_000]

is read as “the rows where the populations is greater than 1 million”.

It should be mentioned that any Boolean array whose length matches the number of rows in the table can be used to select a subset of the table’s rows. For instance, we could construct our own Boolean array by hand and get the same result:

populations[[False, True, True, False, True]]
Population
City
LA 4094000
SD 1376000
SJ 1023000

Of course, it’s usually much more convenient to create the Boolean arrays with comparisons, like populations.get('Population') > 1_000_000.

Lastly, note that if no row satisfies your condition, an empty table will be returned. For instance, no cities have more than 10 million people:

populations[populations.get('Population') > 10_000_000]
Population
City

10.2. Examples

Let’s get some practice with querying on a larger data set. For this, we’ll use the full dataset of California wildfires. While the wildfire dataset we saw previously contained only the 50 largest fires in California history, this dataset will contain all named fires since the late 1800’s. It is in a file called calfire-full.csv in the data directory:

calfire = bpd.read_csv('data/calfire-full.csv')
calfire
year month name cause acres county longitude latitude
0 1898 9 LOS PADRES 14 - Unknown 20539.949219 Ventura -119.367830 34.446830
1 1898 4 MATILIJA 14 - Unknown 2641.123047 Ventura -119.299625 34.488614
2 1898 9 COZY DELL 14 - Unknown 2974.585205 Ventura -119.265380 34.482316
3 1902 8 FEROUD 14 - Unknown 731.481567 Ventura -119.320979 34.417515
4 1903 10 SAN ANTONIO 14 - Unknown 380.260590 Ventura -119.253422 34.430616
... ... ... ... ... ... ... ... ...
13459 2019 9 STAGE 7 - Arson 13.019149 Monterey -121.599207 36.764065
13460 2019 10 CROSS 14 - Unknown 289.151428 Monterey -120.726245 35.793698
13461 2019 9 FRUDDEN 2 - Equipment Use 11.789393 Monterey -120.908061 35.908627
13462 2019 9 JOLON 11 - Powerline 61.592369 Monterey -121.010025 35.910750
13463 2019 10 SADDLE RIDGE 14 - Unknown 8799.325195 Los Angeles -118.516473 34.321859

13464 rows × 8 columns

Notice that we haven’t set the index of the table. The natural choice of index would be the 'name' column, but it turns out that several fires have the same name. Remember that row labels should be unique!

10.2.1. How many fires were there in 2018?

We can determine how many fires there were in 2018 by selecting only those rows where the 'year' value is equal to 2018 and calculating the size of the resulting table. Remember: to ask if two values are equal, we use ==, not =.

calfire[calfire.get('year') == 2018]
year month name cause acres county longitude latitude
12772 2018 7 LAGUNA 2 - Equipment Use 1.116125 San Luis Obispo -120.632741 35.588189
12773 2018 12 OAK 11 - Powerline 2.180850 Los Angeles -118.692809 34.473468
12774 2018 11 COPCO 9 - Miscellaneous 6.980345 Los Angeles -118.797409 34.720091
12775 2018 10 SANTA MARIA 14 - Unknown 1.347234 Los Angeles -118.592681 34.123768
12776 2018 10 MUREAU 11 - Powerline 0.133597 Los Angeles -118.678461 34.151321
... ... ... ... ... ... ... ... ...
13164 2018 4 HARBOR 14 - Unknown 0.392633 Ventura -119.247964 34.214345
13165 2018 1 ERBES 14 - Unknown 0.028838 Ventura -118.846409 34.235447
13166 2018 10 GRIMES 2 - Equipment Use 0.202070 Ventura -118.904546 34.343482
13167 2018 6 BEACH 9 - Miscellaneous 0.521827 Ventura -119.341277 34.298223
13168 2018 7 ROLLING 14 - Unknown 0.007145 Ventura -118.868440 34.174372

397 rows × 8 columns

The resulting table has 397 rows, which means that there were 397 fires in 2018. We could also get this information with code by using the .shape attribute:

calfire[calfire.get('year') == 2018].shape[0]
397

In reality, there were probably many more wildfires than this: the dataset contains only those fires which were large enough to be named.

10.2.2. What was the combined size of all fires in 2018?

We know how to retrieve only those fires from 2018; we did this above with calfire[calfire.get('year') == 2018]. Now we simply ask for the sum of the 'acres' column:

area_burned_2018 = calfire[calfire.get('year') == 2018].get('acres').sum()
area_burned_2018
1586074.870139362

That certainly seems like a large number, but let’s put it in perspective. The island of Manhattan is 14,600 acres in area. Therefore, the area burned in 2018 was:

print(area_burned_2018 / 14_600, 'times the size of Manhattan.')
108.6352650780385 times the size of Manhattan.

10.2.3. What percentage of all fires are caused by “arson”?

We want to select the rows where the cause is due to someone deliberately setting the fire. The 'cause' column, however, contains strings in a special format, such as 1 - Lightning, 2 - Eqipment Use, and so on. To select the fires caused by arson we need to know the exact string to search for.

We can get this information by asking for the unique values that appear in a column with the .unique Series method:

calfire.get('cause').unique()
array(['14 - Unknown', '7 - Arson', '4 - Campfire', '1 - Lightning',
       '9 - Miscellaneous', '5 - Debris', '8 - Playing with fire',
       '3 - Smoking', '15 - Structure', '2 - Equipment Use',
       '10 - Vehicle', '6 - Railroad', '18 - Escaped Prescribed Burn',
       '11 - Powerline', '16 - Aircraft', '12 - Firefighter Training',
       '13 - Non-Firefighter Training'], dtype=object)

It looks like the right string for arson is 7 - Arson. Let’s perform the query:

arsons = calfire[calfire.get('cause') == '7 - Arson']
arsons
year month name cause acres county longitude latitude
9 1910 8 CRAWFORD CREEK 2 7 - Arson 497.885071 Humboldt -123.552471 41.300052
24 1911 7 ORCUTT FLAT 7 - Arson 94.598785 Humboldt -123.515171 41.319056
25 1911 7 PRIGMORE FLAT 7 - Arson 14.936650 Humboldt -123.520594 41.288310
29 1911 8 ULLATHORNE CREEK 7 - Arson 9.957536 Humboldt -123.576445 41.289008
31 1911 8 PEARCH CREEK 7 - Arson 398.317413 Humboldt -123.495326 41.302627
... ... ... ... ... ... ... ... ...
13387 2019 9 FRAME 7 - Arson 15.022148 Fresno -120.649290 36.163830
13405 2019 3 LINCOLN 7 - Arson 564.215393 Riverside -116.072059 33.534968
13409 2019 8 PATTERSON 7 - Arson 139.200714 San Joaquin -121.538773 37.715312
13425 2019 8 MARSH 5 7 - Arson 226.616867 Contra Costa -121.828352 37.888016
13459 2019 9 STAGE 7 - Arson 13.019149 Monterey -121.599207 36.764065

763 rows × 8 columns

To find the percentage of fires caused by arson, we divide the size of this table by the size of the original table:

arsons.shape[0] / calfire.shape[0]
0.05666963755199049

So about 5-6% of all wildfires are known to be caused by arson.

10.2.4. How many fires from 2018 occurred in San Diego County?

We can answer this question by first selecting only the rows from 2018. Using this smaller table, we’ll select the rows from San Diego County. This two step process is slightly cumbersome, however – in the next section, we’ll see a better way of performing the same query.

First, we’ll get the fires from 2018:

fires_from_2018 = calfire[calfire.get('year') == 2018]

Now we’ll select only the fires from San Diego County, making sure to use fires_from_2018, and not calfires:

fires_from_2018[fires_from_2018.get('county') == 'San Diego']
year month name cause acres county longitude latitude
12800 2018 7 PASQUAL 14 - Unknown 327.586121 San Diego -116.917288 33.078967
12819 2018 3 MCP BASILONE GATE 14 - Unknown 0.023540 San Diego -117.559119 33.440113
12821 2018 8 SAN MATEO CAMPGROUND 14 - Unknown 2.909111 San Diego -117.572772 33.419019
12889 2018 5 MONTE 14 - Unknown 13.088419 San Diego -116.767460 32.896771
12968 2018 7 WEST 2 - Equipment Use 505.399994 San Diego -116.745126 32.829008
... ... ... ... ... ... ... ... ...
13056 2018 7 ROCK 5 14 - Unknown 206.256805 San Diego -117.236669 33.419540
13069 2018 6 RECYCLE 4 - Campfire 248.380676 San Diego -116.443775 32.654185
13071 2018 5 AGUA 14 - Unknown 59.082077 San Diego -117.028463 33.363286
13141 2018 6 CHURCH 2 10 - Vehicle 11.045366 San Diego -116.359866 32.687008
13142 2018 7 BUILDING 9 - Miscellaneous 10.419848 San Diego -116.766340 32.620508

13 rows × 8 columns

It seems that there were 13 such fires.

10.3. Multiple Conditions

How many fires did San Diego County have in 2018? We saw above that this can be answered with two queries, one after the other. But a more direct and arguably better way is to construct a query with multiple conditions, as we’ll do now.

10.3.1. The & Operator

Let’s begin with another simple example. Below is the table of city populations that we saw at the beginning of this section, but with an additional column containing the region of California that the city is in:

with_regions = bpd.DataFrame().assign(
    City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
    Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000],
    Region=['Central Valley', 'SoCal', 'SoCal', 'NorCal', 'NorCal']
).set_index('City')
with_regions
Population Region
City
Fresno 502000 Central Valley
LA 4094000 SoCal
SD 1376000 SoCal
SF 856000 NorCal
SJ 1023000 NorCal

Suppose we want to select only those cities which 1) have population larger than 1 million, and 2) are in NorCal (Northern California).

We know how to make Boolean arrays for both queries individually:

is_more_than_a_million = with_regions.get('Population') > 1_000_000
is_more_than_a_million
City
Fresno    False
LA         True
SD         True
SF        False
SJ         True
Name: Population, dtype: bool
is_in_norcal = with_regions.get('Region') == 'NorCal'
is_in_norcal
City
Fresno    False
LA        False
SD        False
SF         True
SJ         True
Name: Region, dtype: bool

But we want a Boolean array where an entry is True if (and only if) both of the corresponding entries from is_more_than_a_million and is_in_norcal are True. In other words, an entry should be true if the population is above 1 million and the region is NorCal.

We can construct such an array using the binary and & operator. & takes two Boolean arrays of the same size and returns a new Boolean array that is True only when the input arrays are both True.

is_more_than_a_million & is_in_norcal
City
Fresno    False
LA        False
SD        False
SF        False
SJ         True
dtype: bool

You can check that each entry in this new Boolean array is True if and only if the corresponding entries in is_more_than_a_million and is_in_norcal are both True.

We can now pass this new array into the table to select only the desired rows:

with_regions[is_more_than_a_million & is_in_norcal]
Population Region
City
SJ 1023000 NorCal

In this case, we’ve assigned each of the two queries to intermediate variables, is_more_than_a_million and is_in_norcal. This is totally fine, but many times we’ll save ourselves the extra typing by writing the query in one line of code:

with_regions[
    (with_regions.get('Population') > 1_000_000) & (with_regions.get('Region') == 'NorCal')
]
Population Region
City
SJ 1023000 NorCal

Notice the () around each individual comparison. These are extremely important! If you don’t include these, you’ll get yelled at by Python:

with_regions[
    with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [24], line 2
      1 with_regions[
----> 2     with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
      3 ]

TypeError: unsupported operand type(s) for &: 'int' and 'Series'

Warning

Always include parentheses () around separate conditions in a query with multiple conditions. If you are performing a query, and you see a TypeError complaining about an “unsupported operand type”, it is likely due to forgetting the parentheses.

Jupyter Tip

If you forgot parentheses and want to add them quickly, you can select the section of code you want to surround in parentheses and then type (. Jupyter will wrap your entire selection in a single pair of parentheses.

Lastly, Python is all about readability and  a e s t h e t i c.

If we’re inside of parentheses or brackets, we can break up long lines of code to make them easier to read and understand:

with_regions[
    (with_regions.get('Population') > 1_000_000)
    &
    (with_regions.get('Region') == 'NorCal')
]
Population Region
City
SJ 1023000 NorCal

How would you use & to select all ‘Class E’ fires – fires which burned at least 300 acres (inclusive) but less than 1000 acres?

10.3.2. The | Operator

Suppose we wanted to select all fires which occurred in either San Diego County or Imperial County. In this situation, we use the binary or operator, |. This operator takes two Boolean arrays of the same size as input and creates a new array in which a particular entry is True if at least one of the corresponding entries in the input arrays is true.

in_san_diego = calfire.get('county') == "San Diego"
in_imperial = calfire.get('county') == "Imperial"

calfire[in_imperial | in_san_diego]
year month name cause acres county longitude latitude
1292 1950 10 TALBOTT 9 - Miscellaneous 522.326538 San Diego -116.732715 32.699059
1295 1950 6 PUEBLO SIDING 9 - Miscellaneous 315.018127 San Diego -116.356054 32.613788
1296 1950 10 BANNER 9 - Miscellaneous 2195.823242 San Diego -116.568905 33.057330
1322 1950 9 AGUA TIBIA 1 - Lightning 1941.808594 San Diego -116.992760 33.392911
1332 1950 8 CONEJOS 9 - Miscellaneous 62848.742188 San Diego -116.644504 32.900322
... ... ... ... ... ... ... ... ...
13348 2019 6 PAUMA 4 9 - Miscellaneous 22.021397 San Diego -116.986509 33.297972
13349 2019 8 AIRPORT 14 - Unknown 1.677516 San Diego -116.663357 33.285729
13350 2019 9 DEHESA 11 14 - Unknown 225.646027 San Diego -116.819500 32.777229
13351 2019 11 CRESTWOOD 10 - Vehicle 31.554823 San Diego -116.349084 32.701877
13371 2019 11 CAMINO 14 - Unknown 0.721892 San Diego -116.575401 33.256260

846 rows × 8 columns

10.3.3. Never use and / or!

You might know that and and or are valid Python keywords and might wonder why we use & and | instead. In fact, and and or do not perform the right type of comparison when working with arrays. Your code will run, but it will give you the wrong result. Here’s a quick example. Suppose we want to get all cities in NorCal with populations larger than 1 million. We write a compound query, but use and instead of &. The code runs, but is the result correct?

with_regions[
    (with_regions.get('Population') > 1_000_000) and (with_regions.get('Region') == 'NorCal')
]
Population Region
City
SF 856000 NorCal
SJ 1023000 NorCal

Notice that SF has been included, even though its population is less than 1 million! Why is this? The short answer is that and does not do the same thing as &. Remember that & works elementwise on arrays, returning a new array which is True only where both of the input arrays are True:

bpd.Series(data=[True, True, False]) & bpd.Series(data=[False, True, True])
0    False
1     True
2    False
dtype: bool

Now let’s try the same thing, but with and instead of &:

bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])
0    False
1     True
2     True
dtype: bool

That’s not what we expected!

For the purposes of this course, it’s enough to know that and doesn’t do the same thing as &. But if you’re curious as to why this happens, here’s the short version. We know that Python has Boolean literals True and False, and and works with them as you’d expect:

True and False
False
True and True
True

But Python also assigns truth value to other things, like lists and strings. For instance, it is often useful to treat an empty list as if is is False, and a non-empty list as if it is True. We can see the truth value Python assigns to something by using the bool function to convert it to a Boolean:

bool([1, 2, 3])
True
bool([])
False

Because of this, we can use and between things that aren’t True or False. For instance, we can put it between two lists to make sure that both are non-empty:

[1, 2, 3] and [4, 5]
[4, 5]
[1, 2, 3] and []
[]

Note that the output isn’t True or False, it is a list! If we really wanted True or False for some reason, we could do so with bool():

bool([1, 2, 3] and [4, 5])
True
bool([1, 2, 3] and [])
False

But this is the key: apparently, and will produce the second list if both are non-empty. The same thing happens if we use and between Boolean series. Take another look at our example from above:

bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])
0    False
1     True
2     True
dtype: bool

Notice that what is returned is actually the array on the right hand side of and! This almost never what we want when comparing arrays, so remember the following warning:

Warning

Never use and when writing comparisons in a query. Always use &. Likewise, never use or; always use |.

We will use and and or later on when writing conditionals.

10.4. Searching for a Substring

Wildfires are typically named after the place where they started. For example, fires that start near the border between California and another state or Mexico are often named something like “BORDER #6”, or “BORDER #12”, etc. We know how to write a query to select all fires with a specific name. For instance, it turns out that there are multiple fires called “BORDER#2”.

calfire[calfire.get('name') == 'BORDER#2']
year month name cause acres county longitude latitude
7802 2002 2 BORDER#2 4 - Campfire 270.632538 San Diego -116.692025 32.644564
8093 2003 8 BORDER#2 4 - Campfire 112.706635 San Diego -116.915079 32.602055

But how do we retrieve all fires with “BORDER” somewhere in their names? It turns out that babypandas includes a helpful Series method called .str.contains which can help us do exactly this. It accepts one argument – a string – and searches for it within each entry of the Series, returning a Boolean array. For instance, to find the fires with “BORDER” in their name, we write:

calfire[calfire.get('name').str.contains('BORDER')]
year month name cause acres county longitude latitude
606 1934 6 BORDER MT. 1 - Lightning 9.332158 Siskiyou -121.467036 41.489674
824 1939 9 BORDER MT. 9 - Miscellaneous 48.154457 Siskiyou -121.466297 41.487778
7187 1999 7 BORDER #70 4 - Campfire 118.480690 San Diego -116.808309 32.567800
7418 2000 7 BORDER #6 14 - Unknown 74.557014 San Diego -116.335658 32.657144
7802 2002 2 BORDER#2 4 - Campfire 270.632538 San Diego -116.692025 32.644564
... ... ... ... ... ... ... ... ...
12092 2016 6 BORDER 3 14 - Unknown 7606.026367 San Diego -116.557240 32.604410
12710 2017 9 BORDER 9 14 - Unknown 22.502857 San Diego -116.668979 32.574468
12759 2017 9 BORDER 6 14 - Unknown 135.044418 San Diego -116.750203 32.585912
13317 2019 8 BORDER 10 14 - Unknown 11.862469 San Diego -116.866295 32.557500
13318 2019 8 BORDER 9 7 - Arson 67.372292 San Diego -116.741779 32.569219

45 rows × 8 columns

10.4.1. A Trick to Avoid Spurious Matches

Many fires are named after roads. Let’s count how many:

calfire[calfire.get('name').str.contains('ROAD')]
year month name cause acres county longitude latitude
467 1930 7 ROAD #1 5 - Debris 6.970353 Trinity -123.528351 40.239767
497 1931 10 NIELSON ROAD 4 - Campfire 2.489452 Trinity -123.364262 40.019498
791 1939 8 CUTOFF ROAD 7 - Arson 1057.719482 Siskiyou -122.865814 41.825530
1106 1946 9 RAILROAD #13 14 - Unknown 240.841370 Mendocino -123.319311 39.373807
1303 1950 7 SCULLY ROAD 14 - Unknown 416.051544 Amador -120.925246 38.300486
... ... ... ... ... ... ... ... ...
12237 2017 8 RAILROAD 9 - Miscellaneous 12371.148438 Madera -119.612630 37.447218
12516 2017 9 RAILROAD 14 - Unknown 2.216443 Siskiyou -122.520362 41.526095
12567 2017 7 ROADRUNNER 7 - Arson 2284.595947 Tulare -118.933014 36.015821
13147 2018 11 ROADRUNNER 14 - Unknown 0.613522 Ventura -118.921408 34.206380
13218 2019 8 R4 RAILROAD 1 - Lightning 24.105974 Lassen -120.375015 40.629286

132 rows × 8 columns

We see several “good” matches, like “NIELSON ROAD” and “CUTOFF ROAD”, but several “spurious” matches, like “ROADRUNNER”, and “RAILROAD”. How can we exclude these fires?

Here’s a trick: instead of searching for strings containing 'ROAD', we’ll search for strings containing ' ROAD', where we have added a space to the beginning of the search string. This will match strings like “CUTOFF ROAD”, but not “RAILROAD”.

calfire[calfire.get('name').str.contains(' ROAD')]
year month name cause acres county longitude latitude
497 1931 10 NIELSON ROAD 4 - Campfire 2.489452 Trinity -123.364262 40.019498
791 1939 8 CUTOFF ROAD 7 - Arson 1057.719482 Siskiyou -122.865814 41.825530
1303 1950 7 SCULLY ROAD 14 - Unknown 416.051544 Amador -120.925246 38.300486
1441 1951 8 SMITH ROAD 3 - Smoking 629.024658 Calaveras -120.314476 38.354487
1760 1953 5 RICE ROAD 14 - Unknown 638.207275 Ventura -119.296153 34.441824
... ... ... ... ... ... ... ... ...
5176 1985 7 GAS POINT ROAD 14 - Unknown 93.247581 Shasta -122.477858 40.391840
5177 1985 7 GAS POINT ROAD 14 - Unknown 78.217186 Shasta -122.439950 40.392093
5178 1985 7 GAS POINT ROAD 14 - Unknown 106.863319 Shasta -122.430627 40.394094
6579 1996 7 MINES ROAD 2 - Equipment Use 174.051270 Santa Clara -121.523671 37.475377
8870 2006 7 JOHNSON ROAD FIRE 14 - Unknown 110.659088 Los Angeles -118.334695 34.661443

48 rows × 8 columns

Strings have other useful string methods, as well. In this case, all of the fire names are capitalized, but it is common to find that datasets are inconsistent in their capitalization. For instance:

cities = bpd.DataFrame().assign(names=[
    'San Diego', 'los angeles', 'san luis obispo', 'Oakland', 'Stockton'
])

Note that some of the strings are not capitalized, while others are. If we want to retrieve all strings that contain San, we’ll have to write two queries: one for San and another for san. But there’s an easier way.

A common practice is to make sure all of your text data for a given column is in the same capitalization before selecting on substrings. We can chain together the .str.lower method with the .str.contains method to quickly achieve this:

cities.get('names').str.lower()
0          san diego
1        los angeles
2    san luis obispo
3            oakland
4           stockton
Name: names, dtype: object
cities[cities.get('names').str.lower().str.contains('san')]
names
0 San Diego
2 san luis obispo

It’s worth noting that if you try using the .str methods on a Series that doesn’t contain text, you’ll encounter a helpful error (once you scroll to the bottom).

calfire.get('year').str.contains('2019')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [46], line 1
----> 1 calfire.get('year').str.contains('2019')

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/babypandas/bpd.py:850, in Series.str(self)
    844 '''
    845 String methods on Series.
    846 '''
    847 # accessing the `.str` attribute of a pd.Series will raise an 
    848 # AttributeError if the series does not consist of string values. We
    849 # use a property here to replicate this behavior.
--> 850 return SeriesStringMethods(self._pd.str)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:177, in StringMethods.__init__(self, data)
    174 def __init__(self, data):
    175     from pandas.core.arrays.string_ import StringDtype
--> 177     self._inferred_dtype = self._validate(data)
    178     self._is_categorical = is_categorical_dtype(data.dtype)
    179     self._is_string = isinstance(data.dtype, StringDtype)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:231, in StringMethods._validate(data)
    228 inferred_dtype = lib.infer_dtype(values, skipna=True)
    230 if inferred_dtype not in allowed_types:
--> 231     raise AttributeError("Can only use .str accessor with string values!")
    232 return inferred_dtype

AttributeError: Can only use .str accessor with string values!

Tip

Often times, when reading an error message it’s most helpful to look at the two ends of the message – and don’t get too worried about the middle bits.

The very top line points to where the error occurred, and the very bottom lines explain why the error occurred.

10.5. More Examples

10.5.1. What percentage of fires occurred in September or October?

We’ll use | to find those fires whose 'month' is either 9 or 10:

sept_or_oct = calfire[(calfire.get('month') == 9) | (calfire.get('month') == 10)]
sept_or_oct.shape[0] / calfire.shape[0]
0.2335858585858586

10.5.2. How many fires occurred between San Francisco and LA in terms of latitude?

The latitude of San Francisco is 37.7749, while the latitude of LA is 34.0522. Selecting all fires within this range can be done with &:

calfire[
    (calfire.get('latitude') <= 37.7749)
    &
    (calfire.get('latitude') >= 34.0522)
].shape[0]
5519

10.5.3. How many fires larger than 1000 acres did San Diego have in the 1990s?

Now we’re having fun! There are three conditions here: 1) Larger than 1000 acres, 2) in San Diego County, and 3) in the 1990s. The third condition, however, is actually two conditions: after 1990 and before 2000.

With a complex query like this, it’s probably best to make the three Boolean arrays separately and save them in intermediate variables, like this:

is_gt_1000_acres = calfire.get('acres') > 1000
is_in_sd = calfire.get('county') == 'San Diego'
is_in_1990s = (calfire.get('year') >= 1990) & (calfire.get('year') < 2000)

Now we can perform the query. All three of these conditions must be true simultaneously, so we use &:

calfire[is_gt_1000_acres & is_in_sd & is_in_1990s]
year month name cause acres county longitude latitude
5799 1990 6 MORRETTIS 7 - Arson 3314.026611 San Diego -116.667315 33.196718
5805 1990 9 PAINT 9 - Miscellaneous 2760.890869 San Diego -117.144556 33.058681
5811 1990 10 CASE 9 - Miscellaneous 2062.641357 San Diego -117.441422 33.432863
5924 1991 6 BARRETT 2 - Equipment Use 1609.205811 San Diego -116.670060 32.618218
5985 1992 11 BRAVO 14 - Unknown 2023.847290 San Diego -117.543498 33.374327
... ... ... ... ... ... ... ... ...
6999 1998 7 SUMMIT 14 - Unknown 1123.004883 San Diego -116.732161 32.604883
7107 1999 8 LAGUNA 100 7 - Arson 4411.434570 San Diego -116.406902 32.900014
7188 1999 9 PROCTOR 7 - Arson 1460.064453 San Diego -116.918494 32.692336
7217 1999 10 LA JOLLA 5 - Debris 7845.939941 San Diego -116.789519 33.273254
7331 1999 10 SHOCKEY 14 - Unknown 3284.517578 San Diego -116.405917 32.607751

32 rows × 8 columns