Accessing and Sorting
Contents
9. Accessing and Sorting¶
The Cedar Fire left an indelible mark on San Diego county. When it happened in 2003, it was the largest wildfire in California history. Just how big was the fire? And what caused it? The data set we saw in the previous section has these answers, but we still need to learn how to retrieve this information.
9.1. Accessing with .loc
¶
Let’s begin by reading our dataset from a CSV and setting the 'name'
column to be the index:
calfire = bpd.read_csv('data/calfire.csv').set_index('name')
calfire
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
How big was the Cedar Fire? We know that we can retrieve the 'acres'
column using the .get
method:
calfire.get('acres')
name
CAMP 153335.562500
BUTTE 70846.531250
KING 97684.546875
ROUGH 151546.812500
MEGRAM 125072.531250
...
DAY 161815.656250
MATILIJA 219999.281250
THOMAS 281790.875000
SIMI FIRE 107570.398438
COUNTY 89831.148438
Name: acres, Length: 50, dtype: float64
The result is a series. We don’t see the Cedar Fire listed; it is there, just hidden within the ...
.
We can retrieve an element from a Series using its row label and the special .loc
attribute. Since it looks like all the fires have their names capitalized, the row label for the Cedar Fire will be CEDAR
:
calfire.get('acres').loc['CEDAR']
270686.0
Notice the square brackets in .loc['CEDAR']
. This is by analogy with using square brackets to retrieve elements from an array.
To repeat, we can retrieve a particular element in a table, such as the size of the Cedar Fire, by first .get
ting the column containing the fire sizes, then using .loc
to retrieve the size corresponding to the Cedar Fire. This is a pattern that we’ll be using over and over:
.get()
with the column label; then.loc[]
with the row label
Let’s try it again: what was the cause of the Cedar Fire?
calfire.get('cause').loc['CEDAR']
'2 - Equipment Use'
Originally, the Cedar Fire was to believed to be caused by a lost hunter shooting their rifle to get the attention of rescuers – thus the cause is listed as “Equipment Use”. However, the hunter later admitted that they had started a signal fire to alert rescuers, and that it quickly spread out of control.
What code would you write to determine the county in which the Frying Pan Fire occurred?
calfire.get('county').loc['FRYING PAN']
9.1.1. Example¶
The Cedar Fire was truly huge, stretching from the Cuyamaca Mountains in eastern San Diego County, almost to the Pacific Ocean. But most of us don’t have a good sense of what an “acre” is. Instead, let’s frame the size of the fire in terms of other things we might be more familiar with. For instance, a soccer field is 1.32 acres. Therefore, the Cedar Fire burned:
print(calfire.get('acres').loc['CEDAR'] / 1.32, 'soccer fields.')
205065.15151515152 soccer fields.
Other comparisons are perhaps more surprising. The Cedar Fire burned:
print(calfire.get('acres').loc['CEDAR'] / 1_976, 'times the size of UCSD\'s campus.')
print(calfire.get('acres').loc['CEDAR'] / 14_600, 'times the size of Manhattan.')
print(calfire.get('acres').loc['CEDAR'] / 382_000, 'times the size of the island of Oahu.')
print(calfire.get('acres').loc['CEDAR'] / 2.9e6 * 100, 'percent of San Diego county.')
136.98684210526315 times the size of UCSD's campus.
18.54013698630137 times the size of Manhattan.
0.7086020942408376 times the size of the island of Oahu.
9.334000000000001 percent of San Diego county.
9.2. Sorting¶
While the Cedar Fire was the largest fire in California history when it occurred in 2003, it has since been surpassed. If it wasn’t the largest fire, what was?
One way to answer this question is to sort the table by the size of the fires. We can do this with the .sort_values
DataFrame method. The method requires one keyword argument, by=
, which should be given a string with the name of the column to sort by:
calfire.sort_values(by='acres')
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
HAPPY | 2015 | 1 - Lightning | 68095.242188 | Trinity |
ROCKY | 2015 | 9 - Miscellaneous | 69438.164062 | Lake |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
RAVENNA | 1919 | 14 - Unknown | 71373.148438 | Los Angeles |
VALLEY | 2015 | 14 - Unknown | 76084.835938 | Lake |
... | ... | ... | ... | ... |
RIM | 2013 | 4 - Campfire | 256175.531250 | Tuolumne |
CEDAR | 2003 | 2 - Equipment Use | 270686.000000 | San Diego |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
RUSH | 2012 | 1 - Lightning | 315511.500000 | Lassen |
RANCH | 2018 | 14 - Unknown | 410202.468750 | Lake |
50 rows × 4 columns
Notice that the table has been sorted in ascending order (from smallest to largest) using the values in the 'acres'
column. Therefore, we see that the largest fire in California history was the Ranch Fire. At 410,202 acres, the Ranch fire was double the size of the Cedar Fire. It was part of the Mendocino Complex Fire, which burned in 2019. Although the cause of the fire is listed here as undetermined, it was later discovered that it began when a person tried to plug the entrance of wasp’s nest with a hammer and a metal stake. When they hammered the stake, sparks flew and caught the ground on fire. The fire resulted in $250 million in damages.
What if we want to come to the same conclusion programatically, without having to look at the table? After sorting, the last row of the table corresponds to the largest fire, and we want the fire’s name. This is stored in the table index. Remember that the index is essentially an array, and so we can get the last element of the array using square bracket notation:
calfire.sort_values(by='acres').index[-1]
'RANCH'
Here, we are using method chaining to save typing.
9.2.1. Sorting in descending order¶
We could have also made the same discovery by sorting the table in descending order. This is done by passing a second keyword argument: ascending=False
:
calfire.sort_values(by='acres', ascending=False)
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
RANCH | 2018 | 14 - Unknown | 410202.468750 | Lake |
RUSH | 2012 | 1 - Lightning | 315511.500000 | Lassen |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
CEDAR | 2003 | 2 - Equipment Use | 270686.000000 | San Diego |
RIM | 2013 | 4 - Campfire | 256175.531250 | Tuolumne |
... | ... | ... | ... | ... |
VALLEY | 2015 | 14 - Unknown | 76084.835938 | Lake |
RAVENNA | 1919 | 14 - Unknown | 71373.148438 | Los Angeles |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
ROCKY | 2015 | 9 - Miscellaneous | 69438.164062 | Lake |
HAPPY | 2015 | 1 - Lightning | 68095.242188 | Trinity |
50 rows × 4 columns
If this keyword argument isn’t given, it defaults to ascending=True
.
When sorting in descending order, we could get the fire’s name from the index by asking for the first row label:
calfire.sort_values(by='acres', ascending=False).index[0]
'RANCH'
We can also sort by columns containing strings. In this case, the rows are sorted lexicographically – that is, in dictionary order:
calfire.sort_values(by='county')
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
9.2.2. Sorting by multiple columns¶
If you look at the help message for .sort_values
, you’ll see that it can also accept a list of column names to sort by, like this:
calfire.sort_values(by=['year', 'acres'])
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
GLASS MOUNTAIN | 1910 | 1 - Lightning | 146967.562500 | Siskiyou |
RAVENNA | 1919 | 14 - Unknown | 71373.148438 | Los Angeles |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
REFUGIO | 1955 | 9 - Miscellaneous | 79428.710938 | Santa Barbara |
CLAMPITT FIRE | 1970 | 14 - Unknown | 115537.421875 | Ventura |
... | ... | ... | ... | ... |
WOOLSEY | 2018 | 14 - Unknown | 96949.000000 | Los Angeles |
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
CARR | 2018 | 10 - Vehicle | 229651.406250 | Shasta |
RANCH | 2018 | 14 - Unknown | 410202.468750 | Lake |
KINCADE | 2019 | 11 - Powerline | 77762.140625 | Sonoma |
50 rows × 4 columns
In this case, the resulting table will be sorted by the 'year'
column, but within each year the fires will be grouped by their size in the 'acres'
column. Take a look: the fires occurring in 2018 are organized from smallest to largest. Sorting like this is mostly used for presentation purposes.
9.3. Accessing with .iloc
¶
How large was the oldest fire in the data set? There are two approaches to answering this question. The first approach uses tools that we already know, but it requires a little extra work. The second approach uses something new: .iloc
.
9.3.1. Approach #1: with .loc
.¶
In the first approach, we’ll first find the name of the oldest fire in the data set. We’ll then use this name to ask for the size of the fire using our .get().loc[]
pattern.
To get the name of the oldest fire, we’ll sort by year and look at the first element of the index:
oldest_fire_name = calfire.sort_values(by='year').index[0]
oldest_fire_name
'GLASS MOUNTAIN'
Now we’ll use this as the row label in looking up the acreage:
calfire.get('acres').loc[oldest_fire_name]
146967.5625
This approach works, but there is a faster way.
9.3.2. Approach #2: with .iloc
¶
After we sort the table by year, we know that the row we’re interested in is the first row in the table. But to access the entries of this row with .loc
, we need to first find out that row’s label. Instead of performing this intermediate step, we can get the first row’s information directly with .iloc
.
Whereas .loc
looks up a row by its label, .iloc
looks up a row by its integer position (thus the i in iloc
). For instance, suppose we have sorted the table by year and retrieved the acreage column:
sizes = calfire.sort_values(by='year').get('acres')
sizes
name
GLASS MOUNTAIN 146967.562500
RAVENNA 71373.148438
MATILIJA 219999.281250
REFUGIO 79428.710938
LAGUNA 174161.546875
...
FERGUSON 96900.039062
WOOLSEY 96949.000000
RANCH 410202.468750
COUNTY 89831.148438
KINCADE 77762.140625
Name: acres, Length: 50, dtype: float64
To get the size of the first fire, we can use .iloc
directly, without knowing the fire’s name.
sizes.iloc[0]
146967.5625
This is equivalent to using .loc
with the column name:
sizes.loc['GLASS MOUNTAIN']
146967.5625
To summarize, .loc
retrieves information using the row label. .iloc
retrieves information using integer position. It is typically more convenient to use .loc
with the row label, but sometimes .iloc
is preferable.