Querying
Contents
10. Querying¶
The 2018 wildfire season was the worst in California history. How many fires were in 2018? What was their combined size? How many were in San Diego County? These questions could be easily answered if we had a table consisting only of fires from 2018, but our dataset has fires from all years. Creating a new table by selecting only certain rows from an existing table which satisfy some condition is called a query. In this section, we’ll see how to perform queries.
10.1. Querying with Boolean Arrays¶
Let’s start with a simple example. Below, we have created a table containing the population of the five largest cities in California. The table is arranged in alphabetical order:
import babypandas as bpd
populations = (
bpd.DataFrame()
.assign(
City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000]
)
.set_index('City')
)
populations
Population | |
---|---|
City | |
Fresno | 502000 |
LA | 4094000 |
SD | 1376000 |
SF | 856000 |
SJ | 1023000 |
Tip
To create a table by hand, start by creating an empty DataFrame with bpd.DataFrame()
, then use .assign
to add columns to the table.
Suppose we want to know which cities have populations larger than 1 million. We can do so with the following query, the result of which is a new table consisting only of the cities whose population is greater than 1 million.
populations[populations.get('Population') > 1_000_000]
Population | |
---|---|
City | |
LA | 4094000 |
SD | 1376000 |
SJ | 1023000 |
Let’s take a deeper look at what is going on insider of a query. At it’s heart is a comparison: populations.get('Population') > 1_000_000
. Recall from Booleans that a comparison between two numbers results in a Boolean value (that is, the result is either True
or False
). For instance, is we ask if 2 million is larger than 1 million, we get:
2_000_000 > 1_000_000
True
Now, in this case of populations.get('Population') > 1_000_000
, we are comparing an entire column of the table to a single number, 1_000_000
. What will be the result? As with arithmetic on a table column, comparisons are performed elementwise:
populations.get('Population') > 1_000_000
City
Fresno False
LA True
SD True
SF False
SJ True
Name: Population, dtype: bool
The result is a Series with one entry for every row of the original table. The Series is True
where the comparison is true and False
where the comparison is not true. Because each entry of the Series is a Boolean, the result is called a Boolean array (or, more precisely, a Boolean series). The entries of the Boolean array tell us exactly which rows to keep and which rows to discard. If an entry is True
, we want to keep the corresponding row of the table – if it is False
, we should discard that row.
We can carry out the query and produce a new table by passing the Boolean array into the table using square bracket notation:
populations[populations.get('Population') > 1_000_000]
Population | |
---|---|
City | |
LA | 4094000 |
SD | 1376000 |
SJ | 1023000 |
Tip
Using square brackets on a table can be read aloud as “where”.
So the expression
populations[populations.get('Population') > 1_000_000]
is read as “the rows where the populations is greater than 1 million”.
It should be mentioned that any Boolean array whose length matches the number of rows in the table can be used to select a subset of the table’s rows. For instance, we could construct our own Boolean array by hand and get the same result:
populations[[False, True, True, False, True]]
Population | |
---|---|
City | |
LA | 4094000 |
SD | 1376000 |
SJ | 1023000 |
Of course, it’s usually much more convenient to create the Boolean arrays with comparisons, like populations.get('Population') > 1_000_000
.
Lastly, note that if no row satisfies your condition, an empty table will be returned. For instance, no cities have more than 10 million people:
populations[populations.get('Population') > 10_000_000]
Population | |
---|---|
City |
10.2. Examples¶
Let’s get some practice with querying on a larger data set. For this, we’ll use the full dataset of California wildfires. While the wildfire dataset we saw previously contained only the 50 largest fires in California history, this dataset will contain all named fires since the late 1800’s. It is in a file called calfire-full.csv
in the data
directory:
calfire = bpd.read_csv('data/calfire-full.csv')
calfire
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
0 | 1898 | 9 | LOS PADRES | 14 - Unknown | 20539.949219 | Ventura | -119.367830 | 34.446830 |
1 | 1898 | 4 | MATILIJA | 14 - Unknown | 2641.123047 | Ventura | -119.299625 | 34.488614 |
2 | 1898 | 9 | COZY DELL | 14 - Unknown | 2974.585205 | Ventura | -119.265380 | 34.482316 |
3 | 1902 | 8 | FEROUD | 14 - Unknown | 731.481567 | Ventura | -119.320979 | 34.417515 |
4 | 1903 | 10 | SAN ANTONIO | 14 - Unknown | 380.260590 | Ventura | -119.253422 | 34.430616 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13459 | 2019 | 9 | STAGE | 7 - Arson | 13.019149 | Monterey | -121.599207 | 36.764065 |
13460 | 2019 | 10 | CROSS | 14 - Unknown | 289.151428 | Monterey | -120.726245 | 35.793698 |
13461 | 2019 | 9 | FRUDDEN | 2 - Equipment Use | 11.789393 | Monterey | -120.908061 | 35.908627 |
13462 | 2019 | 9 | JOLON | 11 - Powerline | 61.592369 | Monterey | -121.010025 | 35.910750 |
13463 | 2019 | 10 | SADDLE RIDGE | 14 - Unknown | 8799.325195 | Los Angeles | -118.516473 | 34.321859 |
13464 rows × 8 columns
Notice that we haven’t set the index of the table. The natural choice of index would be the 'name'
column, but it turns out that several fires have the same name. Remember that row labels should be unique!
10.2.1. How many fires were there in 2018?¶
We can determine how many fires there were in 2018 by selecting only those rows where the 'year'
value is equal to 2018 and calculating the size of the resulting table. Remember: to ask if two values are equal, we use ==
, not =
.
calfire[calfire.get('year') == 2018]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
12772 | 2018 | 7 | LAGUNA | 2 - Equipment Use | 1.116125 | San Luis Obispo | -120.632741 | 35.588189 |
12773 | 2018 | 12 | OAK | 11 - Powerline | 2.180850 | Los Angeles | -118.692809 | 34.473468 |
12774 | 2018 | 11 | COPCO | 9 - Miscellaneous | 6.980345 | Los Angeles | -118.797409 | 34.720091 |
12775 | 2018 | 10 | SANTA MARIA | 14 - Unknown | 1.347234 | Los Angeles | -118.592681 | 34.123768 |
12776 | 2018 | 10 | MUREAU | 11 - Powerline | 0.133597 | Los Angeles | -118.678461 | 34.151321 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13164 | 2018 | 4 | HARBOR | 14 - Unknown | 0.392633 | Ventura | -119.247964 | 34.214345 |
13165 | 2018 | 1 | ERBES | 14 - Unknown | 0.028838 | Ventura | -118.846409 | 34.235447 |
13166 | 2018 | 10 | GRIMES | 2 - Equipment Use | 0.202070 | Ventura | -118.904546 | 34.343482 |
13167 | 2018 | 6 | BEACH | 9 - Miscellaneous | 0.521827 | Ventura | -119.341277 | 34.298223 |
13168 | 2018 | 7 | ROLLING | 14 - Unknown | 0.007145 | Ventura | -118.868440 | 34.174372 |
397 rows × 8 columns
The resulting table has 397 rows, which means that there were 397 fires in 2018. We could also get this information with code by using the .shape
attribute:
calfire[calfire.get('year') == 2018].shape[0]
397
In reality, there were probably many more wildfires than this: the dataset contains only those fires which were large enough to be named.
10.2.2. What was the combined size of all fires in 2018?¶
We know how to retrieve only those fires from 2018; we did this above with calfire[calfire.get('year') == 2018]
. Now we simply ask for the sum of the 'acres'
column:
area_burned_2018 = calfire[calfire.get('year') == 2018].get('acres').sum()
area_burned_2018
1586074.870139362
That certainly seems like a large number, but let’s put it in perspective. The island of Manhattan is 14,600 acres in area. Therefore, the area burned in 2018 was:
print(area_burned_2018 / 14_600, 'times the size of Manhattan.')
108.6352650780385 times the size of Manhattan.
10.2.3. What percentage of all fires are caused by “arson”?¶
We want to select the rows where the cause is due to someone deliberately setting the fire. The 'cause'
column, however, contains strings in a special format, such as 1 - Lightning
, 2 - Eqipment Use
, and so on. To select the fires caused by arson we need to know the exact string to search for.
We can get this information by asking for the unique values that appear in a column with the .unique
Series method:
calfire.get('cause').unique()
array(['14 - Unknown', '7 - Arson', '4 - Campfire', '1 - Lightning',
'9 - Miscellaneous', '5 - Debris', '8 - Playing with fire',
'3 - Smoking', '15 - Structure', '2 - Equipment Use',
'10 - Vehicle', '6 - Railroad', '18 - Escaped Prescribed Burn',
'11 - Powerline', '16 - Aircraft', '12 - Firefighter Training',
'13 - Non-Firefighter Training'], dtype=object)
It looks like the right string for arson is 7 - Arson
. Let’s perform the query:
arsons = calfire[calfire.get('cause') == '7 - Arson']
arsons
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
9 | 1910 | 8 | CRAWFORD CREEK 2 | 7 - Arson | 497.885071 | Humboldt | -123.552471 | 41.300052 |
24 | 1911 | 7 | ORCUTT FLAT | 7 - Arson | 94.598785 | Humboldt | -123.515171 | 41.319056 |
25 | 1911 | 7 | PRIGMORE FLAT | 7 - Arson | 14.936650 | Humboldt | -123.520594 | 41.288310 |
29 | 1911 | 8 | ULLATHORNE CREEK | 7 - Arson | 9.957536 | Humboldt | -123.576445 | 41.289008 |
31 | 1911 | 8 | PEARCH CREEK | 7 - Arson | 398.317413 | Humboldt | -123.495326 | 41.302627 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13387 | 2019 | 9 | FRAME | 7 - Arson | 15.022148 | Fresno | -120.649290 | 36.163830 |
13405 | 2019 | 3 | LINCOLN | 7 - Arson | 564.215393 | Riverside | -116.072059 | 33.534968 |
13409 | 2019 | 8 | PATTERSON | 7 - Arson | 139.200714 | San Joaquin | -121.538773 | 37.715312 |
13425 | 2019 | 8 | MARSH 5 | 7 - Arson | 226.616867 | Contra Costa | -121.828352 | 37.888016 |
13459 | 2019 | 9 | STAGE | 7 - Arson | 13.019149 | Monterey | -121.599207 | 36.764065 |
763 rows × 8 columns
To find the percentage of fires caused by arson, we divide the size of this table by the size of the original table:
arsons.shape[0] / calfire.shape[0]
0.05666963755199049
So about 5-6% of all wildfires are known to be caused by arson.
10.2.4. How many fires from 2018 occurred in San Diego County?¶
We can answer this question by first selecting only the rows from 2018. Using this smaller table, we’ll select the rows from San Diego County. This two step process is slightly cumbersome, however – in the next section, we’ll see a better way of performing the same query.
First, we’ll get the fires from 2018:
fires_from_2018 = calfire[calfire.get('year') == 2018]
Now we’ll select only the fires from San Diego County, making sure to use fires_from_2018
, and not calfires
:
fires_from_2018[fires_from_2018.get('county') == 'San Diego']
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
12800 | 2018 | 7 | PASQUAL | 14 - Unknown | 327.586121 | San Diego | -116.917288 | 33.078967 |
12819 | 2018 | 3 | MCP BASILONE GATE | 14 - Unknown | 0.023540 | San Diego | -117.559119 | 33.440113 |
12821 | 2018 | 8 | SAN MATEO CAMPGROUND | 14 - Unknown | 2.909111 | San Diego | -117.572772 | 33.419019 |
12889 | 2018 | 5 | MONTE | 14 - Unknown | 13.088419 | San Diego | -116.767460 | 32.896771 |
12968 | 2018 | 7 | WEST | 2 - Equipment Use | 505.399994 | San Diego | -116.745126 | 32.829008 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13056 | 2018 | 7 | ROCK 5 | 14 - Unknown | 206.256805 | San Diego | -117.236669 | 33.419540 |
13069 | 2018 | 6 | RECYCLE | 4 - Campfire | 248.380676 | San Diego | -116.443775 | 32.654185 |
13071 | 2018 | 5 | AGUA | 14 - Unknown | 59.082077 | San Diego | -117.028463 | 33.363286 |
13141 | 2018 | 6 | CHURCH 2 | 10 - Vehicle | 11.045366 | San Diego | -116.359866 | 32.687008 |
13142 | 2018 | 7 | BUILDING | 9 - Miscellaneous | 10.419848 | San Diego | -116.766340 | 32.620508 |
13 rows × 8 columns
It seems that there were 13 such fires.
10.3. Multiple Conditions¶
How many fires did San Diego County have in 2018? We saw above that this can be answered with two queries, one after the other. But a more direct and arguably better way is to construct a query with multiple conditions, as we’ll do now.
10.3.1. The &
Operator¶
Let’s begin with another simple example. Below is the table of city populations that we saw at the beginning of this section, but with an additional column containing the region of California that the city is in:
with_regions = bpd.DataFrame().assign(
City=['Fresno', 'LA', 'SD', 'SF', 'SJ'],
Population=[502_000, 4_094_000, 1_376_000, 856_000, 1_023_000],
Region=['Central Valley', 'SoCal', 'SoCal', 'NorCal', 'NorCal']
).set_index('City')
with_regions
Population | Region | |
---|---|---|
City | ||
Fresno | 502000 | Central Valley |
LA | 4094000 | SoCal |
SD | 1376000 | SoCal |
SF | 856000 | NorCal |
SJ | 1023000 | NorCal |
Suppose we want to select only those cities which 1) have population larger than 1 million, and 2) are in NorCal (Northern California).
We know how to make Boolean arrays for both queries individually:
is_more_than_a_million = with_regions.get('Population') > 1_000_000
is_more_than_a_million
City
Fresno False
LA True
SD True
SF False
SJ True
Name: Population, dtype: bool
is_in_norcal = with_regions.get('Region') == 'NorCal'
is_in_norcal
City
Fresno False
LA False
SD False
SF True
SJ True
Name: Region, dtype: bool
But we want a Boolean array where an entry is True
if (and only if) both of the corresponding entries from is_more_than_a_million
and is_in_norcal
are True
. In other words, an entry should be true if the population is above 1 million and the region is NorCal.
We can construct such an array using the binary and &
operator. &
takes two Boolean arrays of the same size and returns a new Boolean array that is True only when the input arrays are both True.
is_more_than_a_million & is_in_norcal
City
Fresno False
LA False
SD False
SF False
SJ True
dtype: bool
You can check that each entry in this new Boolean array is True
if and only if the corresponding entries in is_more_than_a_million
and is_in_norcal
are both True
.
We can now pass this new array into the table to select only the desired rows:
with_regions[is_more_than_a_million & is_in_norcal]
Population | Region | |
---|---|---|
City | ||
SJ | 1023000 | NorCal |
In this case, we’ve assigned each of the two queries to intermediate variables, is_more_than_a_million
and is_in_norcal
. This is totally fine, but many times we’ll save ourselves the extra typing by writing the query in one line of code:
with_regions[
(with_regions.get('Population') > 1_000_000) & (with_regions.get('Region') == 'NorCal')
]
Population | Region | |
---|---|---|
City | ||
SJ | 1023000 | NorCal |
Notice the ()
around each individual comparison. These are extremely important! If you don’t include these, you’ll get yelled at by Python:
with_regions[
with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [24], line 2
1 with_regions[
----> 2 with_regions.get('Population') > 1_000_000 & with_regions.get('Region') == 'NorCal'
3 ]
TypeError: unsupported operand type(s) for &: 'int' and 'Series'
Warning
Always include parentheses ()
around separate conditions in a query with multiple conditions. If you are performing a query, and you see a TypeError
complaining about an “unsupported operand type”, it is likely due to forgetting the parentheses.
Jupyter Tip
If you forgot parentheses and want to add them quickly, you can select the section of code you want to surround in parentheses and then type (
. Jupyter will wrap your entire selection in a single pair of parentheses.
Lastly, Python is all about readability and a e s t h e t i c.
If we’re inside of parentheses or brackets, we can break up long lines of code to make them easier to read and understand:
with_regions[
(with_regions.get('Population') > 1_000_000)
&
(with_regions.get('Region') == 'NorCal')
]
Population | Region | |
---|---|---|
City | ||
SJ | 1023000 | NorCal |
How would you use &
to select all ‘Class E’ fires – fires which burned at least 300 acres (inclusive) but less than 1000 acres?
calfire[(fires.get('acres') >= 300) & (fires.get('acres') < 300)]
10.3.2. The |
Operator¶
Suppose we wanted to select all fires which occurred in either San Diego County or Imperial County. In this situation,
we use the binary or operator, |
. This operator takes two Boolean arrays of the same size as input and creates a new array in which a particular entry is True
if at least one of the corresponding entries in the input arrays is true.
in_san_diego = calfire.get('county') == "San Diego"
in_imperial = calfire.get('county') == "Imperial"
calfire[in_imperial | in_san_diego]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
1292 | 1950 | 10 | TALBOTT | 9 - Miscellaneous | 522.326538 | San Diego | -116.732715 | 32.699059 |
1295 | 1950 | 6 | PUEBLO SIDING | 9 - Miscellaneous | 315.018127 | San Diego | -116.356054 | 32.613788 |
1296 | 1950 | 10 | BANNER | 9 - Miscellaneous | 2195.823242 | San Diego | -116.568905 | 33.057330 |
1322 | 1950 | 9 | AGUA TIBIA | 1 - Lightning | 1941.808594 | San Diego | -116.992760 | 33.392911 |
1332 | 1950 | 8 | CONEJOS | 9 - Miscellaneous | 62848.742188 | San Diego | -116.644504 | 32.900322 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13348 | 2019 | 6 | PAUMA 4 | 9 - Miscellaneous | 22.021397 | San Diego | -116.986509 | 33.297972 |
13349 | 2019 | 8 | AIRPORT | 14 - Unknown | 1.677516 | San Diego | -116.663357 | 33.285729 |
13350 | 2019 | 9 | DEHESA 11 | 14 - Unknown | 225.646027 | San Diego | -116.819500 | 32.777229 |
13351 | 2019 | 11 | CRESTWOOD | 10 - Vehicle | 31.554823 | San Diego | -116.349084 | 32.701877 |
13371 | 2019 | 11 | CAMINO | 14 - Unknown | 0.721892 | San Diego | -116.575401 | 33.256260 |
846 rows × 8 columns
10.3.3. Never use and
/ or
!¶
You might know that and
and or
are valid Python keywords and might wonder why we use &
and |
instead. In fact, and
and or
do not perform the right type of comparison when working with arrays. Your code will run, but it will give you the wrong result. Here’s a quick example. Suppose we want to get all cities in NorCal with populations larger than 1 million. We write a compound query, but use and
instead of &
. The code runs, but is the result correct?
with_regions[
(with_regions.get('Population') > 1_000_000) and (with_regions.get('Region') == 'NorCal')
]
Population | Region | |
---|---|---|
City | ||
SF | 856000 | NorCal |
SJ | 1023000 | NorCal |
Notice that SF has been included, even though its population is less than 1 million! Why is this? The short answer is that and
does not do the same thing as &
. Remember that &
works elementwise on arrays, returning a new array which is True
only where both of the input arrays are True
:
bpd.Series(data=[True, True, False]) & bpd.Series(data=[False, True, True])
0 False
1 True
2 False
dtype: bool
Now let’s try the same thing, but with and
instead of &
:
bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])
0 False
1 True
2 True
dtype: bool
That’s not what we expected!
For the purposes of this course, it’s enough to know that and
doesn’t do the same thing as &
. But if you’re curious as to why this happens, here’s the short version. We know that Python has Boolean literals True
and False
, and and
works with them as you’d expect:
True and False
False
True and True
True
But Python also assigns truth value to other things, like lists and strings. For instance, it is often useful to treat an empty list as if is is False
, and a non-empty list as if it is True
. We can see the truth value Python assigns to something by using the bool
function to convert it to a Boolean:
bool([1, 2, 3])
True
bool([])
False
Because of this, we can use and
between things that aren’t True
or False
. For instance, we can put it between two lists to make sure that both are non-empty:
[1, 2, 3] and [4, 5]
[4, 5]
[1, 2, 3] and []
[]
Note that the output isn’t True
or False
, it is a list! If we really wanted True
or False
for some reason, we could do so with bool()
:
bool([1, 2, 3] and [4, 5])
True
bool([1, 2, 3] and [])
False
But this is the key: apparently, and
will produce the second list if both are non-empty. The same thing happens if we use and
between Boolean series. Take another look at our example from above:
bpd.Series(data=[True, True, False]) and bpd.Series(data=[False, True, True])
0 False
1 True
2 True
dtype: bool
Notice that what is returned is actually the array on the right hand side of and
! This almost never what we want when comparing arrays, so remember the following warning:
Warning
Never use and
when writing comparisons in a query. Always use &
. Likewise, never use or
; always use |
.
We will use and
and or
later on when writing conditionals.
10.4. Searching for a Substring¶
Wildfires are typically named after the place where they started. For example, fires that start near the border between California and another state or Mexico are often named something like “BORDER #6”, or “BORDER #12”, etc. We know how to write a query to select all fires with a specific name. For instance, it turns out that there are multiple fires called “BORDER#2”.
calfire[calfire.get('name') == 'BORDER#2']
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
7802 | 2002 | 2 | BORDER#2 | 4 - Campfire | 270.632538 | San Diego | -116.692025 | 32.644564 |
8093 | 2003 | 8 | BORDER#2 | 4 - Campfire | 112.706635 | San Diego | -116.915079 | 32.602055 |
But how do we retrieve all fires with “BORDER” somewhere in their names? It turns out that babypandas includes a helpful Series method called .str.contains
which can help us do exactly this. It accepts one argument – a string – and searches for it within each entry of the Series, returning a Boolean array. For instance, to find the fires with “BORDER” in their name, we write:
calfire[calfire.get('name').str.contains('BORDER')]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
606 | 1934 | 6 | BORDER MT. | 1 - Lightning | 9.332158 | Siskiyou | -121.467036 | 41.489674 |
824 | 1939 | 9 | BORDER MT. | 9 - Miscellaneous | 48.154457 | Siskiyou | -121.466297 | 41.487778 |
7187 | 1999 | 7 | BORDER #70 | 4 - Campfire | 118.480690 | San Diego | -116.808309 | 32.567800 |
7418 | 2000 | 7 | BORDER #6 | 14 - Unknown | 74.557014 | San Diego | -116.335658 | 32.657144 |
7802 | 2002 | 2 | BORDER#2 | 4 - Campfire | 270.632538 | San Diego | -116.692025 | 32.644564 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12092 | 2016 | 6 | BORDER 3 | 14 - Unknown | 7606.026367 | San Diego | -116.557240 | 32.604410 |
12710 | 2017 | 9 | BORDER 9 | 14 - Unknown | 22.502857 | San Diego | -116.668979 | 32.574468 |
12759 | 2017 | 9 | BORDER 6 | 14 - Unknown | 135.044418 | San Diego | -116.750203 | 32.585912 |
13317 | 2019 | 8 | BORDER 10 | 14 - Unknown | 11.862469 | San Diego | -116.866295 | 32.557500 |
13318 | 2019 | 8 | BORDER 9 | 7 - Arson | 67.372292 | San Diego | -116.741779 | 32.569219 |
45 rows × 8 columns
10.4.1. A Trick to Avoid Spurious Matches¶
Many fires are named after roads. Let’s count how many:
calfire[calfire.get('name').str.contains('ROAD')]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
467 | 1930 | 7 | ROAD #1 | 5 - Debris | 6.970353 | Trinity | -123.528351 | 40.239767 |
497 | 1931 | 10 | NIELSON ROAD | 4 - Campfire | 2.489452 | Trinity | -123.364262 | 40.019498 |
791 | 1939 | 8 | CUTOFF ROAD | 7 - Arson | 1057.719482 | Siskiyou | -122.865814 | 41.825530 |
1106 | 1946 | 9 | RAILROAD #13 | 14 - Unknown | 240.841370 | Mendocino | -123.319311 | 39.373807 |
1303 | 1950 | 7 | SCULLY ROAD | 14 - Unknown | 416.051544 | Amador | -120.925246 | 38.300486 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12237 | 2017 | 8 | RAILROAD | 9 - Miscellaneous | 12371.148438 | Madera | -119.612630 | 37.447218 |
12516 | 2017 | 9 | RAILROAD | 14 - Unknown | 2.216443 | Siskiyou | -122.520362 | 41.526095 |
12567 | 2017 | 7 | ROADRUNNER | 7 - Arson | 2284.595947 | Tulare | -118.933014 | 36.015821 |
13147 | 2018 | 11 | ROADRUNNER | 14 - Unknown | 0.613522 | Ventura | -118.921408 | 34.206380 |
13218 | 2019 | 8 | R4 RAILROAD | 1 - Lightning | 24.105974 | Lassen | -120.375015 | 40.629286 |
132 rows × 8 columns
We see several “good” matches, like “NIELSON ROAD” and “CUTOFF ROAD”, but several “spurious” matches, like “ROADRUNNER”, and “RAILROAD”. How can we exclude these fires?
Here’s a trick: instead of searching for strings containing 'ROAD'
, we’ll search for strings containing ' ROAD'
,
where we have added a space to the beginning of the search string. This will match strings like “CUTOFF ROAD”, but not “RAILROAD”.
calfire[calfire.get('name').str.contains(' ROAD')]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
497 | 1931 | 10 | NIELSON ROAD | 4 - Campfire | 2.489452 | Trinity | -123.364262 | 40.019498 |
791 | 1939 | 8 | CUTOFF ROAD | 7 - Arson | 1057.719482 | Siskiyou | -122.865814 | 41.825530 |
1303 | 1950 | 7 | SCULLY ROAD | 14 - Unknown | 416.051544 | Amador | -120.925246 | 38.300486 |
1441 | 1951 | 8 | SMITH ROAD | 3 - Smoking | 629.024658 | Calaveras | -120.314476 | 38.354487 |
1760 | 1953 | 5 | RICE ROAD | 14 - Unknown | 638.207275 | Ventura | -119.296153 | 34.441824 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
5176 | 1985 | 7 | GAS POINT ROAD | 14 - Unknown | 93.247581 | Shasta | -122.477858 | 40.391840 |
5177 | 1985 | 7 | GAS POINT ROAD | 14 - Unknown | 78.217186 | Shasta | -122.439950 | 40.392093 |
5178 | 1985 | 7 | GAS POINT ROAD | 14 - Unknown | 106.863319 | Shasta | -122.430627 | 40.394094 |
6579 | 1996 | 7 | MINES ROAD | 2 - Equipment Use | 174.051270 | Santa Clara | -121.523671 | 37.475377 |
8870 | 2006 | 7 | JOHNSON ROAD FIRE | 14 - Unknown | 110.659088 | Los Angeles | -118.334695 | 34.661443 |
48 rows × 8 columns
Strings have other useful string methods, as well. In this case, all of the fire names are capitalized, but it is common to find that datasets are inconsistent in their capitalization. For instance:
cities = bpd.DataFrame().assign(names=[
'San Diego', 'los angeles', 'san luis obispo', 'Oakland', 'Stockton'
])
Note that some of the strings are not capitalized, while others are. If we want to retrieve all strings that contain San
, we’ll have to write two queries: one for San
and another for san
. But there’s an easier way.
A common practice is to make sure all of your text data for a given column is in the same capitalization before selecting on substrings. We can chain together the .str.lower
method with the .str.contains
method to quickly achieve this:
cities.get('names').str.lower()
0 san diego
1 los angeles
2 san luis obispo
3 oakland
4 stockton
Name: names, dtype: object
cities[cities.get('names').str.lower().str.contains('san')]
names | |
---|---|
0 | San Diego |
2 | san luis obispo |
It’s worth noting that if you try using the .str
methods on a Series that doesn’t contain text, you’ll encounter a helpful error (once you scroll to the bottom).
calfire.get('year').str.contains('2019')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [46], line 1
----> 1 calfire.get('year').str.contains('2019')
File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/babypandas/bpd.py:850, in Series.str(self)
844 '''
845 String methods on Series.
846 '''
847 # accessing the `.str` attribute of a pd.Series will raise an
848 # AttributeError if the series does not consist of string values. We
849 # use a property here to replicate this behavior.
--> 850 return SeriesStringMethods(self._pd.str)
File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # https://www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:177, in StringMethods.__init__(self, data)
174 def __init__(self, data):
175 from pandas.core.arrays.string_ import StringDtype
--> 177 self._inferred_dtype = self._validate(data)
178 self._is_categorical = is_categorical_dtype(data.dtype)
179 self._is_string = isinstance(data.dtype, StringDtype)
File ~/.cache/pypoetry/virtualenvs/dsc10-notes-5yRYveSa-py3.10/lib/python3.10/site-packages/pandas/core/strings/accessor.py:231, in StringMethods._validate(data)
228 inferred_dtype = lib.infer_dtype(values, skipna=True)
230 if inferred_dtype not in allowed_types:
--> 231 raise AttributeError("Can only use .str accessor with string values!")
232 return inferred_dtype
AttributeError: Can only use .str accessor with string values!
Tip
Often times, when reading an error message it’s most helpful to look at the two ends of the message – and don’t get too worried about the middle bits.
The very top line points to where the error occurred, and the very bottom lines explain why the error occurred.
10.5. More Examples¶
10.5.1. What percentage of fires occurred in September or October?¶
We’ll use |
to find those fires whose 'month'
is either 9 or 10:
sept_or_oct = calfire[(calfire.get('month') == 9) | (calfire.get('month') == 10)]
sept_or_oct.shape[0] / calfire.shape[0]
0.2335858585858586
10.5.2. How many fires occurred between San Francisco and LA in terms of latitude?¶
The latitude of San Francisco is 37.7749, while the latitude of LA is 34.0522. Selecting all fires within this range can be done with &
:
calfire[
(calfire.get('latitude') <= 37.7749)
&
(calfire.get('latitude') >= 34.0522)
].shape[0]
5519
10.5.3. How many fires larger than 1000 acres did San Diego have in the 1990s?¶
Now we’re having fun! There are three conditions here: 1) Larger than 1000 acres, 2) in San Diego County, and 3) in the 1990s. The third condition, however, is actually two conditions: after 1990 and before 2000.
With a complex query like this, it’s probably best to make the three Boolean arrays separately and save them in intermediate variables, like this:
is_gt_1000_acres = calfire.get('acres') > 1000
is_in_sd = calfire.get('county') == 'San Diego'
is_in_1990s = (calfire.get('year') >= 1990) & (calfire.get('year') < 2000)
Now we can perform the query. All three of these conditions must be true simultaneously, so we use &
:
calfire[is_gt_1000_acres & is_in_sd & is_in_1990s]
year | month | name | cause | acres | county | longitude | latitude | |
---|---|---|---|---|---|---|---|---|
5799 | 1990 | 6 | MORRETTIS | 7 - Arson | 3314.026611 | San Diego | -116.667315 | 33.196718 |
5805 | 1990 | 9 | PAINT | 9 - Miscellaneous | 2760.890869 | San Diego | -117.144556 | 33.058681 |
5811 | 1990 | 10 | CASE | 9 - Miscellaneous | 2062.641357 | San Diego | -117.441422 | 33.432863 |
5924 | 1991 | 6 | BARRETT | 2 - Equipment Use | 1609.205811 | San Diego | -116.670060 | 32.618218 |
5985 | 1992 | 11 | BRAVO | 14 - Unknown | 2023.847290 | San Diego | -117.543498 | 33.374327 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
6999 | 1998 | 7 | SUMMIT | 14 - Unknown | 1123.004883 | San Diego | -116.732161 | 32.604883 |
7107 | 1999 | 8 | LAGUNA 100 | 7 - Arson | 4411.434570 | San Diego | -116.406902 | 32.900014 |
7188 | 1999 | 9 | PROCTOR | 7 - Arson | 1460.064453 | San Diego | -116.918494 | 32.692336 |
7217 | 1999 | 10 | LA JOLLA | 5 - Debris | 7845.939941 | San Diego | -116.789519 | 33.273254 |
7331 | 1999 | 10 | SHOCKEY | 14 - Unknown | 3284.517578 | San Diego | -116.405917 | 32.607751 |
32 rows × 8 columns