10. Querying¶

The 2018 wildfire season was the worst in California history. How many fires were in 2018? What was their combined size? How many were in San Diego County? These questions could be easily answered if we had a table consisting only of fires from 2018, but our dataset has fires from all years. Creating a new table by selecting only certain rows from an existing table which satisfy some condition is called a query. In this section, we’ll see how to perform queries.

10.1. Querying with Boolean Arrays¶

Let’s start with a simple example. Below, we have created a table containing the population of the five largest cities in California. The table is arranged in alphabetical order:

import babypandas as bpd
populations = (
    bpd.DataFrame()
    .assign(
        City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
        Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000]
        )
    .set_index('City')
)
populations

	Population
City
Fresno	502000
LA	4094000
SD	1376000
SF	856000
SJ	1023000

Tip

To create a table by hand, start by creating an empty DataFrame with bpd.DataFrame(), then use .assign to add columns to the table.

Suppose we want to know which cities have populations larger than 1 million. We can do so with the following query, the result of which is a new table consisting only of the cities whose population is greater than 1 million.

populations[populations.get('Population') > 1_000_000]

	Population
City
LA	4094000
SD	1376000
SJ	1023000

Let’s take a deeper look at what is going on insider of a query. At it’s heart is a comparison: populations.get('Population') > 1_000_000. Recall from Booleans that a comparison between two numbers results in a Boolean value (that is, the result is either True or False). For instance, is we ask if 2 million is larger than 1 million, we get:

2_000_000 > 1_000_000

True

Now, in this case of populations.get('Population') > 1_000_000, we are comparing an entire column of the table to a single number, 1_000_000. What will be the result? As with arithmetic on a table column, comparisons are performed elementwise:

populations.get('Population') > 1_000_000

City
Fresno    False
LA         True
SD         True
SF        False
SJ         True
Name: Population, dtype: bool

The result is a Series with one entry for every row of the original table. The Series is True where the comparison is true and False where the comparison is not true. Because each entry of the Series is a Boolean, the result is called a Boolean array (or, more precisely, a Boolean series). The entries of the Boolean array tell us exactly which rows to keep and which rows to discard. If an entry is True, we want to keep the corresponding row of the table – if it is False, we should discard that row.

We can carry out the query and produce a new table by passing the Boolean array into the table using square bracket notation:

populations[populations.get('Population') > 1_000_000]

	Population
City
LA	4094000
SD	1376000
SJ	1023000

Tip

Using square brackets on a table can be read aloud as “where”.

So the expression

populations[populations.get('Population') > 1_000_000]

is read as “the rows where the populations is greater than 1 million”.

It should be mentioned that any Boolean array whose length matches the number of rows in the table can be used to select a subset of the table’s rows. For instance, we could construct our own Boolean array by hand and get the same result:

populations[[False, True, True, False, True]]

	Population
City
LA	4094000
SD	1376000
SJ	1023000

Of course, it’s usually much more convenient to create the Boolean arrays with comparisons, like populations.get('Population') > 1_000_000.

Lastly, note that if no row satisfies your condition, an empty table will be returned. For instance, no cities have more than 10 million people:

populations[populations.get('Population') > 10_000_000]

	Population
City

10.2. Examples¶

Let’s get some practice with querying on a larger data set. For this, we’ll use the full dataset of California wildfires. While the wildfire dataset we saw previously contained only the 50 largest fires in California history, this dataset will contain all named fires since the late 1800’s. It is in a file called calfire-full.csv in the data directory:

calfire = bpd.read_csv('data/calfire-full.csv')
calfire

	year	month	name	cause	acres	county	longitude	latitude
0	1898	9	LOS PADRES	14 - Unknown	20539.949219	Ventura	-119.367830	34.446830
1	1898	4	MATILIJA	14 - Unknown	2641.123047	Ventura	-119.299625	34.488614
2	1898	9	COZY DELL	14 - Unknown	2974.585205	Ventura	-119.265380	34.482316
3	1902	8	FEROUD	14 - Unknown	731.481567	Ventura	-119.320979	34.417515
4	1903	10	SAN ANTONIO	14 - Unknown	380.260590	Ventura	-119.253422	34.430616
...	...	...	...	...	...	...	...	...
13459	2019	9	STAGE	7 - Arson	13.019149	Monterey	-121.599207	36.764065
13460	2019	10	CROSS	14 - Unknown	289.151428	Monterey	-120.726245	35.793698
13461	2019	9	FRUDDEN	2 - Equipment Use	11.789393	Monterey	-120.908061	35.908627
13462	2019	9	JOLON	11 - Powerline	61.592369	Monterey	-121.010025	35.910750
13463	2019	10	SADDLE RIDGE	14 - Unknown	8799.325195	Los Angeles	-118.516473	34.321859

13464 rows × 8 columns

Notice that we haven’t set the index of the table. The natural choice of index would be the 'name' column, but it turns out that several fires have the same name. Remember that row labels should be unique!

10.2.1. How many fires were there in 2018?¶

We can determine how many fires there were in 2018 by selecting only those rows where the 'year' value is equal to 2018 and calculating the size of the resulting table. Remember: to ask if two values are equal, we use ==, not =.

calfire[calfire.get('year') == 2018]

	year	month	name	cause	acres	county	longitude	latitude
12772	2018	7	LAGUNA	2 - Equipment Use	1.116125	San Luis Obispo	-120.632741	35.588189
12773	2018	12	OAK	11 - Powerline	2.180850	Los Angeles	-118.692809	34.473468
12774	2018	11	COPCO	9 - Miscellaneous	6.980345	Los Angeles	-118.797409	34.720091
12775	2018	10	SANTA MARIA	14 - Unknown	1.347234	Los Angeles	-118.592681	34.123768
12776	2018	10	MUREAU	11 - Powerline	0.133597	Los Angeles	-118.678461	34.151321
...	...	...	...	...	...	...	...	...
13164	2018	4	HARBOR	14 - Unknown	0.392633	Ventura	-119.247964	34.214345
13165	2018	1	ERBES	14 - Unknown	0.028838	Ventura	-118.846409	34.235447
13166	2018	10	GRIMES	2 - Equipment Use	0.202070	Ventura	-118.904546	34.343482
13167	2018	6	BEACH	9 - Miscellaneous	0.521827	Ventura	-119.341277	34.298223
13168	2018	7	ROLLING	14 - Unknown	0.007145	Ventura	-118.868440	34.174372

397 rows × 8 columns

The resulting table has 397 rows, which means that there were 397 fires in 2018. We could also get this information with code by using the .shape attribute:

calfire[calfire.get('year') == 2018].shape[0]

In reality, there were probably many more wildfires than this: the dataset contains only those fires which were large enough to be named.

10.2.2. What was the combined size of all fires in 2018?¶

We know how to retrieve only those fires from 2018; we did this above with calfire[calfire.get('year') == 2018]. Now we simply ask for the sum of the 'acres' column:

area_burned_2018 = calfire[calfire.get('year') == 2018].get('acres').sum()
area_burned_2018

1586074.870139362

That certainly seems like a large number, but let’s put it in perspective. The island of Manhattan is 14,600 acres in area. Therefore, the area burned in 2018 was:

print(area_burned_2018 / 14_600, 'times the size of Manhattan.')

108.6352650780385 times the size of Manhattan.

10.2.3. What percentage of all fires are caused by “arson”?¶

We want to select the rows where the cause is due to someone deliberately setting the fire. The 'cause' column, however, contains strings in a special format, such as 1 - Lightning, 2 - Eqipment Use, and so on. To select the fires caused by arson we need to know the exact string to search for.

We can get this information by asking for the unique values that appear in a column with the .unique Series method:

calfire.get('cause').unique()

array(['14 - Unknown', '7 - Arson', '4 - Campfire', '1 - Lightning',
       '9 - Miscellaneous', '5 - Debris', '8 - Playing with fire',
       '3 - Smoking', '15 - Structure', '2 - Equipment Use',
       '10 - Vehicle', '6 - Railroad', '18 - Escaped Prescribed Burn',
       '11 - Powerline', '16 - Aircraft', '12 - Firefighter Training',
       '13 - Non-Firefighter Training'], dtype=object)

It looks like the right string for arson is 7 - Arson. Let’s perform the query:

arsons = calfire[calfire.get('cause') == '7 - Arson']
arsons

	year	month	name	cause	acres	county	longitude	latitude
9	1910	8	CRAWFORD CREEK 2	7 - Arson	497.885071	Humboldt	-123.552471	41.300052
24	1911	7	ORCUTT FLAT	7 - Arson	94.598785	Humboldt	-123.515171	41.319056
25	1911	7	PRIGMORE FLAT	7 - Arson	14.936650	Humboldt	-123.520594	41.288310
29	1911	8	ULLATHORNE CREEK	7 - Arson	9.957536	Humboldt	-123.576445	41.289008
31	1911	8	PEARCH CREEK	7 - Arson	398.317413	Humboldt	-123.495326	41.302627
...	...	...	...	...	...	...	...	...
13387	2019	9	FRAME	7 - Arson	15.022148	Fresno	-120.649290	36.163830
13405	2019	3	LINCOLN	7 - Arson	564.215393	Riverside	-116.072059	33.534968
13409	2019	8	PATTERSON	7 - Arson	139.200714	San Joaquin	-121.538773	37.715312
13425	2019	8	MARSH 5	7 - Arson	226.616867	Contra Costa	-121.828352	37.888016
13459	2019	9	STAGE	7 - Arson	13.019149	Monterey	-121.599207	36.764065

763 rows × 8 columns

To find the percentage of fires caused by arson, we divide the size of this table by the size of the original table:

arsons.shape[0] / calfire.shape[0]

0.05666963755199049

So about 5-6% of all wildfires are known to be caused by arson.

10.2.4. How many fires from 2018 occurred in San Diego County?¶

We can answer this question by first selecting only the rows from 2018. Using this smaller table, we’ll select the rows from San Diego County. This two step process is slightly cumbersome, however – in the next section, we’ll see a better way of performing the same query.

First, we’ll get the fires from 2018:

fires_from_2018 = calfire[calfire.get('year') == 2018]

Now we’ll select only the fires from San Diego County, making sure to use fires_from_2018, and not calfires:

fires_from_2018[fires_from_2018.get('county') == 'San Diego']

	year	month	name	cause	acres	county	longitude	latitude
12800	2018	7	PASQUAL	14 - Unknown	327.586121	San Diego	-116.917288	33.078967
12819	2018	3	MCP BASILONE GATE	14 - Unknown	0.023540	San Diego	-117.559119	33.440113
12821	2018	8	SAN MATEO CAMPGROUND	14 - Unknown	2.909111	San Diego	-117.572772	33.419019
12889	2018	5	MONTE	14 - Unknown	13.088419	San Diego	-116.767460	32.896771
12968	2018	7	WEST	2 - Equipment Use	505.399994	San Diego	-116.745126	32.829008
...	...	...	...	...	...	...	...	...
13056	2018	7	ROCK 5	14 - Unknown	206.256805	San Diego	-117.236669	33.419540
13069	2018	6	RECYCLE	4 - Campfire	248.380676	San Diego	-116.443775	32.654185
13071	2018	5	AGUA	14 - Unknown	59.082077	San Diego	-117.028463	33.363286
13141	2018	6	CHURCH 2	10 - Vehicle	11.045366	San Diego	-116.359866	32.687008
13142	2018	7	BUILDING	9 - Miscellaneous	10.419848	San Diego	-116.766340	32.620508

13 rows × 8 columns

It seems that there were 13 such fires.

10.3. Multiple Conditions¶

How many fires did San Diego County have in 2018? We saw above that this can be answered with two queries, one after the other. But a more direct and arguably better way is to construct a query with multiple conditions, as we’ll do now.

10.3.1. The `&` Operator¶

Let’s begin with another simple example. Below is the table of city populations that we saw at the beginning of this section, but with an additional column containing the region of California that the city is in:

with_regions = bpd.DataFrame().assign(
    City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
    Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000],
    Region=['Central Valley', 'SoCal', 'SoCal', 'NorCal', 'NorCal']
).set_index('City')
with_regions

	Population	Region
City
Fresno	502000	Central Valley
LA	4094000	SoCal
SD	1376000	SoCal
SF	856000	NorCal
SJ	1023000	NorCal

Suppose we want to select only those cities which 1) have population larger than 1 million, and 2) are in NorCal (Northern California).

We know how to make Boolean arrays for both queries individually:

is_more_than_a_million = with_regions.get('Population') > 1_000_000
is_more_than_a_million

City
Fresno    False
LA         True
SD         True
SF        False
SJ         True
Name: Population, dtype: bool

is_in_norcal = with_regions.get('Region') == 'NorCal'
is_in_norcal

City
Fresno    False
LA        False
SD        False
SF         True
SJ         True
Name: Region, dtype: bool

But we want a Boolean array where an entry is True if (and only if) both of the corresponding entries from is_more_than_a_million and is_in_norcal are True. In other words, an entry should be true if the population is above 1 million and the region is NorCal.

We can construct such an array using the binary and & operator. & takes two Boolean arrays of the same size and returns a new Boolean array that is True only when the input arrays are both True.

is_more_than_a_million & is_in_norcal

City
Fresno    False
LA        False
SD        False
SF        False
SJ         True
dtype: bool

You can check that each entry in this new Boolean array is True if and only if the corresponding entries in is_more_than_a_million and is_in_norcal are both True.

We can now pass this new array into the table to select only the desired rows:

with_regions[is_more_than_a_million & is_in_norcal]

	Population	Region
City
SJ	1023000	NorCal

In this case, we’ve assigned each of the two queries to intermediate variables, is_more_than_a_million and is_in_norcal. This is totally fine, but many times we’ll save ourselves the extra typing by writing the query in one line of code:

with_regions[
    (with_regions.get('Population') > 1_000_000) & (with_regions.get('Region') == 'NorCal')
]

	Population	Region
City
SJ	1023000	NorCal

Notice the () around each individual comparison. These are extremely important! If you don’t include these, you’ll get yelled at by Python:

with_regions[
    with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [24], line 2
      1 with_regions[
----> 2     with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
      3 ]

TypeError: unsupported operand type(s) for &: 'int' and 'Series'

Warning

Always include parentheses () around separate conditions in a query with multiple conditions. If you are performing a query, and you see a TypeError complaining about an “unsupported operand type”, it is likely due to forgetting the parentheses.

Jupyter Tip

If you forgot parentheses and want to add them quickly, you can select the section of code you want to surround in parentheses and then type (. Jupyter will wrap your entire selection in a single pair of parentheses.

Lastly, Python is all about readability and a e s t h e t i c.

If we’re inside of parentheses or brackets, we can break up long lines of code to make them easier to read and understand:

with_regions[
    (with_regions.get('Population') > 1_000_000)
    &
    (with_regions.get('Region') == 'NorCal')
]

	Population	Region
City
SJ	1023000	NorCal

How would you use & to select all ‘Class E’ fires – fires which burned at least 300 acres (inclusive) but less than 1000 acres?

calfire[(fires.get('acres') >= 300) & (fires.get('acres') < 300)]

10.3.2. The `|` Operator¶

Suppose we wanted to select all fires which occurred in either San Diego County or Imperial County. In this situation, we use the binary or operator, |. This operator takes two Boolean arrays of the same size as input and creates a new array in which a particular entry is True if at least one of the corresponding entries in the input arrays is true.

in_san_diego = calfire.get('county') == "San Diego"
in_imperial = calfire.get('county') == "Imperial"

calfire[in_imperial | in_san_diego]

	year	month	name	cause	acres	county	longitude	latitude
1292	1950	10	TALBOTT	9 - Miscellaneous	522.326538	San Diego	-116.732715	32.699059
1295	1950	6	PUEBLO SIDING	9 - Miscellaneous	315.018127	San Diego	-116.356054	32.613788
1296	1950	10	BANNER	9 - Miscellaneous	2195.823242	San Diego	-116.568905	33.057330
1322	1950	9	AGUA TIBIA	1 - Lightning	1941.808594	San Diego	-116.992760	33.392911
1332	1950	8	CONEJOS	9 - Miscellaneous	62848.742188	San Diego	-116.644504	32.900322
...	...	...	...	...	...	...	...	...
13348	2019	6	PAUMA 4	9 - Miscellaneous	22.021397	San Diego	-116.986509	33.297972
13349	2019	8	AIRPORT	14 - Unknown	1.677516	San Diego	-116.663357	33.285729
13350	2019	9	DEHESA 11	14 - Unknown	225.646027	San Diego	-116.819500	32.777229
13351	2019	11	CRESTWOOD	10 - Vehicle	31.554823	San Diego	-116.349084	32.701877
13371	2019	11	CAMINO	14 - Unknown	0.721892	San Diego	-116.575401	33.256260

846 rows × 8 columns

10.3.3. Never use `and` / `or`!¶

You might know that and and or are valid Python keywords and might wonder why we use & and | instead. In fact, and and or do not perform the right type of comparison when working with arrays. Your code will run, but it will give you the wrong result. Here’s a quick example. Suppose we want to get all cities in NorCal with populations larger than 1 million. We write a compound query, but use and instead of &. The code runs, but is the result correct?

with_regions[
    (with_regions.get('Population') > 1_000_000) and (with_regions.get('Region') == 'NorCal')
]

	Population	Region
City
SF	856000	NorCal
SJ	1023000	NorCal

Notice that SF has been included, even though its population is less than 1 million! Why is this? The short answer is that and does not do the same thing as &. Remember that & works elementwise on arrays, returning a new array which is True only where both of the input arrays are True:

bpd.Series(data=[True, True, False]) & bpd.Series(data=[False, True, True])

  False
   True
  False
dtype: bool

Now let’s try the same thing, but with and instead of &:

bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])

  False
   True
   True
dtype: bool

That’s not what we expected!

For the purposes of this course, it’s enough to know that and doesn’t do the same thing as &. But if you’re curious as to why this happens, here’s the short version. We know that Python has Boolean literals True and False, and and works with them as you’d expect:

True and False

False

True and True

True

But Python also assigns truth value to other things, like lists and strings. For instance, it is often useful to treat an empty list as if is is False, and a non-empty list as if it is True. We can see the truth value Python assigns to something by using the bool function to convert it to a Boolean:

bool([1, 2, 3])

True

bool([])

False

Because of this, we can use and between things that aren’t True or False. For instance, we can put it between two lists to make sure that both are non-empty:

[1, 2, 3] and [4, 5]

[4, 5]

[1, 2, 3] and []

[]

Note that the output isn’t True or False, it is a list! If we really wanted True or False for some reason, we could do so with bool():

bool([1, 2, 3] and [4, 5])

True

bool([1, 2, 3] and [])

False

But this is the key: apparently, and will produce the second list if both are non-empty. The same thing happens if we use and between Boolean series. Take another look at our example from above:

bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])

  False
   True
   True
dtype: bool

Notice that what is returned is actually the array on the right hand side of and! This almost never what we want when comparing arrays, so remember the following warning:

Warning

Never use and when writing comparisons in a query. Always use &. Likewise, never use or; always use |.

We will use and and or later on when writing conditionals.

10.4. Searching for a Substring¶

Wildfires are typically named after the place where they started. For example, fires that start near the border between California and another state or Mexico are often named something like “BORDER #6”, or “BORDER #12”, etc. We know how to write a query to select all fires with a specific name. For instance, it turns out that there are multiple fires called “BORDER#2”.

calfire[calfire.get('name') == 'BORDER#2']

	year	month	name	cause	acres	county	longitude	latitude
7802	2002	2	BORDER#2	4 - Campfire	270.632538	San Diego	-116.692025	32.644564
8093	2003	8	BORDER#2	4 - Campfire	112.706635	San Diego	-116.915079	32.602055

But how do we retrieve all fires with “BORDER” somewhere in their names? It turns out that babypandas includes a helpful Series method called .str.contains which can help us do exactly this. It accepts one argument – a string – and searches for it within each entry of the Series, returning a Boolean array. For instance, to find the fires with “BORDER” in their name, we write:

calfire[calfire.get('name').str.contains('BORDER')]

	year	month	name	cause	acres	county	longitude	latitude
606	1934	6	BORDER MT.	1 - Lightning	9.332158	Siskiyou	-121.467036	41.489674
824	1939	9	BORDER MT.	9 - Miscellaneous	48.154457	Siskiyou	-121.466297	41.487778
7187	1999	7	BORDER #70	4 - Campfire	118.480690	San Diego	-116.808309	32.567800
7418	2000	7	BORDER #6	14 - Unknown	74.557014	San Diego	-116.335658	32.657144
7802	2002	2	BORDER#2	4 - Campfire	270.632538	San Diego	-116.692025	32.644564
...	...	...	...	...	...	...	...	...
12092	2016	6	BORDER 3	14 - Unknown	7606.026367	San Diego	-116.557240	32.604410
12710	2017	9	BORDER 9	14 - Unknown	22.502857	San Diego	-116.668979	32.574468
12759	2017	9	BORDER 6	14 - Unknown	135.044418	San Diego	-116.750203	32.585912
13317	2019	8	BORDER 10	14 - Unknown	11.862469	San Diego	-116.866295	32.557500
13318	2019	8	BORDER 9	7 - Arson	67.372292	San Diego	-116.741779	32.569219

45 rows × 8 columns

10.4.1. A Trick to Avoid Spurious Matches¶

Many fires are named after roads. Let’s count how many:

calfire[calfire.get('name').str.contains('ROAD')]

	year	month	name	cause	acres	county	longitude	latitude
467	1930	7	ROAD #1	5 - Debris	6.970353	Trinity	-123.528351	40.239767
497	1931	10	NIELSON ROAD	4 - Campfire	2.489452	Trinity	-123.364262	40.019498
791	1939	8	CUTOFF ROAD	7 - Arson	1057.719482	Siskiyou	-122.865814	41.825530
1106	1946	9	RAILROAD #13	14 - Unknown	240.841370	Mendocino	-123.319311	39.373807
1303	1950	7	SCULLY ROAD	14 - Unknown	416.051544	Amador	-120.925246	38.300486
...	...	...	...	...	...	...	...	...
12237	2017	8	RAILROAD	9 - Miscellaneous	12371.148438	Madera	-119.612630	37.447218
12516	2017	9	RAILROAD	14 - Unknown	2.216443	Siskiyou	-122.520362	41.526095
12567	2017	7	ROADRUNNER	7 - Arson	2284.595947	Tulare	-118.933014	36.015821
13147	2018	11	ROADRUNNER	14 - Unknown	0.613522	Ventura	-118.921408	34.206380
13218	2019	8	R4 RAILROAD	1 - Lightning	24.105974	Lassen	-120.375015	40.629286

132 rows × 8 columns

We see several “good” matches, like “NIELSON ROAD” and “CUTOFF ROAD”, but several “spurious” matches, like “ROADRUNNER”, and “RAILROAD”. How can we exclude these fires?

Here’s a trick: instead of searching for strings containing 'ROAD', we’ll search for strings containing ' ROAD', where we have added a space to the beginning of the search string. This will match strings like “CUTOFF ROAD”, but not “RAILROAD”.

calfire[calfire.get('name').str.contains(' ROAD')]

	year	month	name	cause	acres	county	longitude	latitude
497	1931	10	NIELSON ROAD	4 - Campfire	2.489452	Trinity	-123.364262	40.019498
791	1939	8	CUTOFF ROAD	7 - Arson	1057.719482	Siskiyou	-122.865814	41.825530
1303	1950	7	SCULLY ROAD	14 - Unknown	416.051544	Amador	-120.925246	38.300486
1441	1951	8	SMITH ROAD	3 - Smoking	629.024658	Calaveras	-120.314476	38.354487
1760	1953	5	RICE ROAD	14 - Unknown	638.207275	Ventura	-119.296153	34.441824
...	...	...	...	...	...	...	...	...
5176	1985	7	GAS POINT ROAD	14 - Unknown	93.247581	Shasta	-122.477858	40.391840
5177	1985	7	GAS POINT ROAD	14 - Unknown	78.217186	Shasta	-122.439950	40.392093
5178	1985	7	GAS POINT ROAD	14 - Unknown	106.863319	Shasta	-122.430627	40.394094
6579	1996	7	MINES ROAD	2 - Equipment Use	174.051270	Santa Clara	-121.523671	37.475377
8870	2006	7	JOHNSON ROAD FIRE	14 - Unknown	110.659088	Los Angeles	-118.334695	34.661443

48 rows × 8 columns

Strings have other useful string methods, as well. In this case, all of the fire names are capitalized, but it is common to find that datasets are inconsistent in their capitalization. For instance:

cities = bpd.DataFrame().assign(names=[
    'San Diego', 'los angeles', 'san luis obispo', 'Oakland', 'Stockton'
])

Note that some of the strings are not capitalized, while others are. If we want to retrieve all strings that contain San, we’ll have to write two queries: one for San and another for san. But there’s an easier way.

A common practice is to make sure all of your text data for a given column is in the same capitalization before selecting on substrings. We can chain together the .str.lower method with the .str.contains method to quickly achieve this:

cities.get('names').str.lower()

        san diego
      los angeles
  san luis obispo
          oakland
         stockton
Name: names, dtype: object

cities[cities.get('names').str.lower().str.contains('san')]

	names
0	San Diego
2	san luis obispo

It’s worth noting that if you try using the .str methods on a Series that doesn’t contain text, you’ll encounter a helpful error (once you scroll to the bottom).

calfire.get('year').str.contains('2019')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [46], line 1
----> 1 calfire.get('year').str.contains('2019')

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/babypandas/bpd.py:850, in Series.str(self)
    844 '''
    845 String methods on Series.
    846 '''
    847 # accessing the `.str` attribute of a pd.Series will raise an 
    848 # AttributeError if the series does not consist of string values. We
    849 # use a property here to replicate this behavior.
--> 850 return SeriesStringMethods(self._pd.str)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:177, in StringMethods.__init__(self, data)
    174 def __init__(self, data):
    175     from pandas.core.arrays.string_ import StringDtype
--> 177     self._inferred_dtype = self._validate(data)
    178     self._is_categorical = is_categorical_dtype(data.dtype)
    179     self._is_string = isinstance(data.dtype, StringDtype)

File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:231, in StringMethods._validate(data)
    228 inferred_dtype = lib.infer_dtype(values, skipna=True)
    230 if inferred_dtype not in allowed_types:
--> 231     raise AttributeError("Can only use .str accessor with string values!")
    232 return inferred_dtype

AttributeError: Can only use .str accessor with string values!

Tip

Often times, when reading an error message it’s most helpful to look at the two ends of the message – and don’t get too worried about the middle bits.

The very top line points to where the error occurred, and the very bottom lines explain why the error occurred.

10.5. More Examples¶

10.5.1. What percentage of fires occurred in September or October?¶

We’ll use | to find those fires whose 'month' is either 9 or 10:

sept_or_oct = calfire[(calfire.get('month') == 9) | (calfire.get('month') == 10)]
sept_or_oct.shape[0] / calfire.shape[0]

0.2335858585858586

10.5.2. How many fires occurred between San Francisco and LA in terms of latitude?¶

The latitude of San Francisco is 37.7749, while the latitude of LA is 34.0522. Selecting all fires within this range can be done with &:

calfire[
    (calfire.get('latitude') <= 37.7749)
    &
    (calfire.get('latitude') >= 34.0522)
].shape[0]

10.5.3. How many fires larger than 1000 acres did San Diego have in the 1990s?¶

Now we’re having fun! There are three conditions here: 1) Larger than 1000 acres, 2) in San Diego County, and 3) in the 1990s. The third condition, however, is actually two conditions: after 1990 and before 2000.

With a complex query like this, it’s probably best to make the three Boolean arrays separately and save them in intermediate variables, like this:

is_gt_1000_acres = calfire.get('acres') > 1000
is_in_sd = calfire.get('county') == 'San Diego'
is_in_1990s = (calfire.get('year') >= 1990) & (calfire.get('year') < 2000)

Now we can perform the query. All three of these conditions must be true simultaneously, so we use &:

calfire[is_gt_1000_acres & is_in_sd & is_in_1990s]

	year	month	name	cause	acres	county	longitude	latitude
5799	1990	6	MORRETTIS	7 - Arson	3314.026611	San Diego	-116.667315	33.196718
5805	1990	9	PAINT	9 - Miscellaneous	2760.890869	San Diego	-117.144556	33.058681
5811	1990	10	CASE	9 - Miscellaneous	2062.641357	San Diego	-117.441422	33.432863
5924	1991	6	BARRETT	2 - Equipment Use	1609.205811	San Diego	-116.670060	32.618218
5985	1992	11	BRAVO	14 - Unknown	2023.847290	San Diego	-117.543498	33.374327
...	...	...	...	...	...	...	...	...
6999	1998	7	SUMMIT	14 - Unknown	1123.004883	San Diego	-116.732161	32.604883
7107	1999	8	LAGUNA 100	7 - Arson	4411.434570	San Diego	-116.406902	32.900014
7188	1999	9	PROCTOR	7 - Arson	1460.064453	San Diego	-116.918494	32.692336
7217	1999	10	LA JOLLA	5 - Debris	7845.939941	San Diego	-116.789519	33.273254
7331	1999	10	SHOCKEY	14 - Unknown	3284.517578	San Diego	-116.405917	32.607751

32 rows × 8 columns

Notes on (Baby)Pandas

Querying

Contents

10. Querying¶

10.1. Querying with Boolean Arrays¶

10.2. Examples¶

10.2.1. How many fires were there in 2018?¶

10.2.2. What was the combined size of all fires in 2018?¶

10.2.3. What percentage of all fires are caused by “arson”?¶

10.2.4. How many fires from 2018 occurred in San Diego County?¶

10.3. Multiple Conditions¶

10.3.1. The `&` Operator¶

10.3.2. The `|` Operator¶

10.3.3. Never use `and` / `or`!¶

10.4. Searching for a Substring¶

10.4.1. A Trick to Avoid Spurious Matches¶

10.5. More Examples¶

10.5.1. What percentage of fires occurred in September or October?¶

10.5.2. How many fires occurred between San Francisco and LA in terms of latitude?¶

10.5.3. How many fires larger than 1000 acres did San Diego have in the 1990s?¶

Notes on (Baby)Pandas

Querying

Contents

10. Querying¶

10.1. Querying with Boolean Arrays¶

10.2. Examples¶

10.2.1. How many fires were there in 2018?¶

10.2.2. What was the combined size of all fires in 2018?¶

10.2.3. What percentage of all fires are caused by “arson”?¶

10.2.4. How many fires from 2018 occurred in San Diego County?¶

10.3. Multiple Conditions¶

10.3.1. The & Operator¶

10.3.2. The | Operator¶

10.3.3. Never use and / or!¶

10.4. Searching for a Substring¶

10.4.1. A Trick to Avoid Spurious Matches¶

10.5. More Examples¶

10.5.1. What percentage of fires occurred in September or October?¶

10.5.2. How many fires occurred between San Francisco and LA in terms of latitude?¶

10.5.3. How many fires larger than 1000 acres did San Diego have in the 1990s?¶

10.3.1. The `&` Operator¶

10.3.2. The `|` Operator¶

10.3.3. Never use `and` / `or`!¶