(Baby)Pandas DataFrames
Contents
8. (Baby)Pandas DataFrames¶
Most of the data sets that we are interested in are more complex that simple lists of numbers. For instance, consider a data set containing information about California wildfires. It might contain multiple pieces of data about each fire, including its name, size, location, and cause.
How would we store and analyze such a data set? While we could use NumPy arrays – one holding the fire’s name, another holding the size, etc. – there is a much better way: We will use a table to contain the data, like this:
name | year | cause | acres | county | |
---|---|---|---|---|---|
0 | CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
1 | BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
2 | KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
3 | ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
4 | MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... | ... |
45 | DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
46 | MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
47 | THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
48 | SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
49 | COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 5 columns
8.1. Pandas DataFrames¶
Like NumPy arrays, tables are provided by a third-party extension. The Python package which provides tables is called pandas. Pandas is the tool for doing data science in Python, and it is immensely popular – as of Summer 2020, it was downloaded nearly 1 million times per day. It is without a doubt a powerful tool, and you’ll need to know how to use it if you want to do serious data science. But there’s a problem: pandas is complicated. There are numerous ways to do even the simplest tasks. This makes it hard to learn, especially if you’re new to programming.
This leaves us in an interesting situation. On one hand, we want to learn pandas, because it is the tool used by actual data scientists. On the other hand, we don’t want to be thrown into the deep end. The solution? We’ll take pandas and remove everything that isn’t absolutely necessary, resulting in something simpler and easier to learn. What’s left is still pandas – just not all of it. Because this new package is a smaller (and cuter) version of pandas, we’re calling it babypandas.
To get access to the functionality that babypandas provides, we’ll need to import it:
import babypandas as bpd
Note
We’re going to be using babypandas in the rest of this book, but it should be stressed that babypandas is pandas, just a smaller version of it. So if someone asks if you have experience working with pandas (during a job interview, for instance), you’ll be able to say “yes!”.
In babypandas (and pandas), a table is called a DataFrame (though we’ll use the two terms interchangeably). Since DataFrames are often used to store very large data sets, they are not typically created by typing their entries one by one – instead, they are usually read from a file. We’ll see how to do that in a moment, but for now we assume that we have already loaded a DataFrame into a variable called fires
. If we type fires
in our Jupyter notebook cell and execute it, it will display the table with nice formatting:
fires
name | year | cause | acres | county | |
---|---|---|---|---|---|
0 | CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
1 | BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
2 | KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
3 | ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
4 | MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... | ... |
45 | DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
46 | MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
47 | THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
48 | SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
49 | COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 5 columns
If we ask for the type
of fires
, Python will tell us that it is a DataFrame:
type(fires)
babypandas.bpd.DataFrame
A DataFrame consists of columns and rows. Almost always, a row represents a single thing – in this case, a fire – and the columns provide different pieces of information about that thing. In this case, we have a column describing the name of the fire, another describing the cause, and so on.
We can get the number of rows and columns in a DataFrame by asking for its shape:
fires.shape
(50, 5)
This tells us that there are 50 rows and 9 columns. If for whatever reason we just wanted the number of rows, we could ask for the first element of this pair:
fires.shape[0]
50
Every row and column in a DataFrame has a label. We will use the row and column labels to refer to particular parts of the table and retrieve information from within it. The columns of the above DataFrame are labeled “year”, “name”, “cause”, and so on. The rows of the above table are simply labeled “0”, “1”, “2”, and so forth.
8.2. The Index¶
Together, the row labels are called the table index.
By default, a table’s rows are labeled by numbering them. However, in many cases it makes more sense to label the rows in some other way. For example, each row in our current data set is a single fire. Perhaps it makes more sense to use the fire’s name as its row label. We can ask babypandas to use a particular column as the row labels with the .set_index
method:
fires.set_index('name')
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
The .set_index
method accepts one argument: the label of the column that should be used as the index. It then creates a new DataFrame in which the index has been replaced with the information from this column; the old DataFrame is not changed. In order to save the results, we’ll need to assign the new table to a variable, like so:
fires_by_name = fires.set_index('name')
fires_by_name
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
Notice that the fire names have been moved all the way to the left, and have been made bold – this is babypandas’ way of showing that these names are now the index.
Warning
The index is not a column – it is it’s own separate thing. When we use .set_index
, the old index is thrown out and number of columns decreases by one.
Since we will later use row labels to refer to specific rows by name, the labels should be unique. In this case, that means that every fire should have a unique name. In this case, every fire is uniquely named, and it is fine to use the fire name as the index. Later, we’ll see a larger version of this data set in which there are multiple fires with the same name. In that case, the name should probably not be used as the index.
A table’s index is essentially an array. We can get the index by writing:
fires_by_name.index
Index(['CAMP', 'BUTTE', 'KING', 'ROUGH', 'MEGRAM', 'RANCH', 'VALLEY ',
'ROCKY ', 'FORK', 'RUSH', 'RAVENNA', 'WOOLSEY', 'STATION', 'FERGUSON',
'DETWILER', 'SCARFACE', 'SOBERANES', 'INDIANS', 'KIRK', 'BASIN COMPLEX',
'MARBLE-CONE', 'CHIPS', 'OLD', 'HARRIS 2', 'CEDAR', 'WITCH', 'LAGUNA',
'LAS PILITAS', 'HIGHWAY 58', 'ZACA', 'REFUGIO', 'LA BREA', 'CARR ',
'OAK', 'FRYING PAN', 'GLASS MOUNTAIN', 'KINCADE', 'CAMPBELL',
'SKINNER MILL', 'HAPPY', 'MANTER', 'MCNALLY', 'RIM', 'CLAMPITT FIRE',
'WHEELER #2', 'DAY', 'MATILIJA', 'THOMAS', 'SIMI FIRE', 'COUNTY'],
dtype='object', name='name')
We can then access individual elements of the index using the same notation as used with arrays, remembering that Python starts counting from zero:
# the first element
fires_by_name.index[0]
'CAMP'
# the second element
fires_by_name.index[1]
'BUTTE'
# the last element
fires_by_name.index[-1]
'COUNTY'
8.3. Series¶
8.3.1. Getting a column with .get
¶
We can retrieve a particular column from the table with the .get
method. For instance, to get the column labeled “acres”, we would write:
fires_by_name.get('acres')
name
CAMP 153335.562500
BUTTE 70846.531250
KING 97684.546875
ROUGH 151546.812500
MEGRAM 125072.531250
...
DAY 161815.656250
MATILIJA 219999.281250
THOMAS 281790.875000
SIMI FIRE 107570.398438
COUNTY 89831.148438
Name: acres, Length: 50, dtype: float64
The result might look like a DataFrame with one column, but it’s actually a new type of object called a Series:
type(fires_by_name.get('acres'))
babypandas.bpd.Series
A Series is basically an array, but with an index. A Series represents a column in a DataFrame. This means that we can think of the columns of a DataFrame as being arrays (more or less).
8.3.2. Arithmetic¶
Because a Series is like an array, we can do similar things with it. For instance, we can perform elementwise arithmetic with a Series. Let’s try it out by converting the fire sizes from acres to square miles. Each acre is 0.0015625 square miles, so we can do the conversion with a simple multiplication:
fires_by_name.get('acres') * 0.0015625
name
CAMP 239.586816
BUTTE 110.697705
KING 152.632104
ROUGH 236.791895
MEGRAM 195.425830
...
DAY 252.836963
MATILIJA 343.748877
THOMAS 440.298242
SIMI FIRE 168.078748
COUNTY 140.361169
Name: acres, Length: 50, dtype: float64
We can also perform arithmetic with two series, assuming that they are the same size.
8.3.3. Series methods: .mean
, .max
, .describe
, etc.¶
Series objects also come with a bunch of useful methods attached, like .mean
and .max
. For example, the average size of a fire in this data set is:
fires_by_name.get('acres').mean()
135919.636875
The largest fire burned this many acres:
fires_by_name.get('acres').max()
410202.46875
And the earliest fire in the data set was in the year:
fires_by_name.get('year').min()
1910
A very useful Series method is .describe
. It gives us a quick look at the basic statistics of the data in a particular column:
fires_by_name.get('year').describe()
count 50.000000
mean 1998.840000
std 25.220141
min 1910.000000
25% 1996.000000
50% 2007.500000
75% 2015.000000
max 2019.000000
Name: year, dtype: float64
From this, we can see that there are 50 fires in the data set, the earliest from 1910 and the latest from 2019. The 25%, 50%, and 75% refer to percentiles. That is, 25% of the fires occurred during or before 1996, and half occurred during or before 2007. This also means that half occurred between 2007 and 2019!
We will see more Series methods throughout these notes, but only when we need to use them.
Jupyter Tip
You can ask Jupyter for some information on all of the Series methods available by writing help(bpd.Series)
. The methods starting with two underscores (__
) are called “dunder” methods, and implement special behavior. You’re not meant to call them direcly, so you can pretty much ignore them.
8.4. Adding and removing columns¶
Above, we saw that we could convert the 'acres'
column to square miles using a little bit of array math. But doing so doesn’t change the table. What if we want to add this column to our table?
8.4.1. Adding a column with .assign
¶
Adding a column can be done with the .assign
method, like this:
fires_by_name.assign(sqmiles=fires_by_name.get('acres') * 0.0015625)
year | cause | acres | county | sqmiles | |
---|---|---|---|---|---|
name | |||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte | 239.586816 |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras | 110.697705 |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado | 152.632104 |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno | 236.791895 |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt | 195.425830 |
... | ... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura | 252.836963 |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura | 343.748877 |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura | 440.298242 |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura | 168.078748 |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo | 140.361169 |
50 rows × 5 columns
There’s a lot going on here, so let’s break it down. First, the assign method takes a single argument: a series that will become the new column. But the way that we pass this argument is new. Instead of simply passing the argument itself, we also give the argument a name by writing sqmiles=
. This will be the column’s label. Arguments written in the form argument_name=argument_value
are called keyword arguments.
We can call the column anything we like, as long as it is a valid python variable name. This means that the variable name cannot contain spaces, or start with a number. If you try, you’ll get a SyntaxError
:
fires_by_name.assign(square miles=fires_by_name.get('acres') * 0.0015625)
Cell In [21], line 1
fires_by_name.assign(square miles=fires_by_name.get('acres') * 0.0015625)
^
SyntaxError: invalid syntax. Perhaps you forgot a comma?
Instead of spaces, we can use underscores:
fires_by_name.assign(square_miles=fires_by_name.get('acres') * 0.0015625)
year | cause | acres | county | square_miles | |
---|---|---|---|---|---|
name | |||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte | 239.586816 |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras | 110.697705 |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado | 152.632104 |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno | 236.791895 |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt | 195.425830 |
... | ... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura | 252.836963 |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura | 343.748877 |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura | 440.298242 |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura | 168.078748 |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo | 140.361169 |
50 rows × 5 columns
The second thing to note is that .assign
creates an entirely new table containing the new column. It does not change the old table, as we can verify by recalling the value of fires_by_name
:
fires_by_name
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
Note
Wherever possible, DataFrame and Series methods return new objects instead of modifying existing ones. Creating copies like this results in code that is easier to reason about helps prevent strange bugs in your code.
In order to permanently add the column to the table, we need to save the result of .assign
to a variable.
fires_with_sqmiles = fires_by_name.assign(
sqmiles=fires_by_name.get('acres') * 0.0015625
)
fires_with_sqmiles
year | cause | acres | county | sqmiles | |
---|---|---|---|---|---|
name | |||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte | 239.586816 |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras | 110.697705 |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado | 152.632104 |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno | 236.791895 |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt | 195.425830 |
... | ... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura | 252.836963 |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura | 343.748877 |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura | 440.298242 |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura | 168.078748 |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo | 140.361169 |
50 rows × 5 columns
8.4.2. Removing a column with .drop
¶
Columns can be removed using the .drop
method. It accepts one keyword argument: columns
. The argument can either be the label of a single column as a string, or a list of column labels. As with .assign
, the result is a new DataFrame (a copy).
For example, to get rid of the 'sqmiles'
column:
fires_with_sqmiles.drop(columns='sqmiles')
year | cause | acres | county | |
---|---|---|---|---|
name | ||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 4 columns
If we didn’t want the cause or the county:
fires_with_sqmiles.drop(columns=['cause', 'county'])
year | acres | sqmiles | |
---|---|---|---|
name | |||
CAMP | 2018 | 153335.562500 | 239.586816 |
BUTTE | 2015 | 70846.531250 | 110.697705 |
KING | 2014 | 97684.546875 | 152.632104 |
ROUGH | 2015 | 151546.812500 | 236.791895 |
MEGRAM | 1999 | 125072.531250 | 195.425830 |
... | ... | ... | ... |
DAY | 2006 | 161815.656250 | 252.836963 |
MATILIJA | 1932 | 219999.281250 | 343.748877 |
THOMAS | 2017 | 281790.875000 | 440.298242 |
SIMI FIRE | 2003 | 107570.398438 | 168.078748 |
COUNTY | 2018 | 89831.148438 | 140.361169 |
50 rows × 3 columns
Note that the argument name (columns
) is not something we can change, unlike the keyword argument name used in .assign
. We must use columns=...
, or else Python will not understand us. And if you don’t use the keyword name, Python will be upset:
fires_with_sqmiles.drop('county')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [27], line 1
----> 1 fires_with_sqmiles.drop('county')
TypeError: DataFrame.drop() takes 1 positional argument but 2 were given
8.4.3. Renaming columns¶
How do we rename a column? Suppose we want to rename sqmiles
to square_miles
. To do so, we:
Add a new column with the desired name by copying the old column.
Drop the old column
For instance:
fires_with_new_name = fires_with_sqmiles.assign(
square_miles=fires_with_sqmiles.get('sqmiles')
)
fires_with_new_name.drop(columns='sqmiles')
year | cause | acres | county | square_miles | |
---|---|---|---|---|---|
name | |||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte | 239.586816 |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras | 110.697705 |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado | 152.632104 |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno | 236.791895 |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt | 195.425830 |
... | ... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura | 252.836963 |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura | 343.748877 |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura | 440.298242 |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura | 168.078748 |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo | 140.361169 |
50 rows × 5 columns
We can also do this in a single piece of code, without intermediate variables:
(
fires_with_sqmiles
.assign(square_miles=fires_with_sqmiles.get('sqmiles'))
.drop(columns='sqmiles')
)
year | cause | acres | county | square_miles | |
---|---|---|---|---|---|
name | |||||
CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte | 239.586816 |
BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras | 110.697705 |
KING | 2014 | 7 - Arson | 97684.546875 | El Dorado | 152.632104 |
ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno | 236.791895 |
MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt | 195.425830 |
... | ... | ... | ... | ... | ... |
DAY | 2006 | 5 - Debris | 161815.656250 | Ventura | 252.836963 |
MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura | 343.748877 |
THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura | 440.298242 |
SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura | 168.078748 |
COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo | 140.361169 |
50 rows × 5 columns
Tip
You can break up long expressions by surrounding the whole expression with parentheses and inserting line breaks wherever makes sense. We’ll often break right at a method call.
This trick of applying two methods, one after the other, in one line of code is called method chaining. It works because the result of .assign
is itself a table. When Python evalautes the expression, it first evaluates the .assign
, then uses this table during the call to .drop
.
Method chaining is useful and can save us some typing, but it can be overused. It is sometimes better to save intermediate results.
Tip
If your method-chaining code isn’t working as you’d expect, break apart the code and save intermediate variables. Print out the values of these variables to do some debugging.
8.5. Reading CSV files¶
As mentioned above, DataFrames are not typically created by typing their entries by hand, one-by-one. Instead, we usually download a data set in a standard format and read it from disk. One such standard format is CSV, or comma-separated values.
A CSV file is simply a text file in a certain format. Here are the first few lines of the CSV file containing our wildfire data:
name,year,cause,acres,county
CAMP,2018,11 - Powerline,153335.5625,Butte
BUTTE,2015,14 - Unknown,70846.53125,Calaveras
KING,2014,7 - Arson,97684.546875,El Dorado
ROUGH,2015,1 - Lightning,151546.8125,Fresno
MEGRAM,1999,1 - Lightning,125072.53125,Humboldt
RANCH,2018,14 - Unknown,410202.46875,Lake
VALLEY ,2015,14 - Unknown,76084.8359375,Lake
ROCKY ,2015,9 - Miscellaneous,69438.1640625,Lake
FORK,1996,7 - Arson,83056.9453125,Lake
As it’s name suggests, a CSV file consists of values, separated by commas. The first line of the file usually contains the column labels. CSV is a widely-used format, and can be read by many pieces of software, including Excel and Google Sheets.
We can read a CSV file into a babypandas DataFrame using the bpd.read_csv
function. We give this function a string containing the filepath to the CSV file we want to read. For example, our wildfire data exists in a file called calfire.csv
contained in the data/
directory. We can read it into a DataFrame as follows:
calfire = bpd.read_csv('data/calfire.csv')
calfire
name | year | cause | acres | county | |
---|---|---|---|---|---|
0 | CAMP | 2018 | 11 - Powerline | 153335.562500 | Butte |
1 | BUTTE | 2015 | 14 - Unknown | 70846.531250 | Calaveras |
2 | KING | 2014 | 7 - Arson | 97684.546875 | El Dorado |
3 | ROUGH | 2015 | 1 - Lightning | 151546.812500 | Fresno |
4 | MEGRAM | 1999 | 1 - Lightning | 125072.531250 | Humboldt |
... | ... | ... | ... | ... | ... |
45 | DAY | 2006 | 5 - Debris | 161815.656250 | Ventura |
46 | MATILIJA | 1932 | 9 - Miscellaneous | 219999.281250 | Ventura |
47 | THOMAS | 2017 | 9 - Miscellaneous | 281790.875000 | Ventura |
48 | SIMI FIRE | 2003 | 14 - Unknown | 107570.398438 | Ventura |
49 | COUNTY | 2018 | 14 - Unknown | 89831.148438 | Yolo |
50 rows × 5 columns
Modifying the DataFrame will not affect the data on disk in any way.