5.2. Text

Think about the type of data that you’re likely to see as a data scientist. Does it only contain numbers? Of course not! Much of the data you’ll see comes in the form of text. Most programming languages have a special data type for storing text, called strings. In this section, we’ll move beyond using Python as a simple calculator and use it to do some simple text analysis.

5.2.1. Creating Strings

To create a string in Python, write some text surrounded by quotation marks:

"Word"
'Word'
"More than one word, and punc-tu-a-tion!"
'More than one word, and punc-tu-a-tion!'
# not a number, since we're using quotes!
"3"
'3'

Notice that Python displays our string with single quotes around it – this is its way of letting us know that it is indeed a string. As before, we can use the type function to check what type Python believes these pieces of data to be:

type("Word")
str
type("3")
str

Of course, str is short for “string”.

As with numbers, we can use variables to store strings:

s = "Hello world!"
s
'Hello world!'

If you don’t like double-quotes, you’re in luck: you can write strings using single quote marks just the same:

'Word'
'Word'
'More than one word, and punc-tu-a-tion!'
'More than one word, and punc-tu-a-tion!'

Unlike in some other languages, Python makes no distinction between single quotes and double quotes: both produce exactly the same string.
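As a quick check of this claim, the sketch below stores the same text with each quote style and compares the results. The variable names double_quoted and single_quoted are made up for illustration.

```python
# The same text, written once with each quote style.
double_quoted = "Word"
single_quoted = 'Word'

# == asks whether the two values are equal.
print(double_quoted == single_quoted)  # True
```

The quotation marks only tell Python where the string begins and ends; they are not stored as part of the string.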

What happens if you don’t put anything inside of the quotes?

''
''

This is fine! We call it an empty string.

5.2.2. More on single vs. double quotes

We said above that there is no real difference between using single quotes and double quotes to create strings, but there are instances where one is preferable over the other.

For instance, what if we wanted to turn the following piece of text into a string?

JavaScript is a “real” programming language.

Notice that the text itself includes quotation marks. If we try to wrap the whole piece of text with double-quotes, Python will get upset:

"JavaScript is a "real" programming language.
  Cell In [10], line 1
    "JavaScript is a "real" programming language.
                          ^
SyntaxError: unterminated string literal (detected at line 1)

When Python sees the first ", it thinks to itself: OK, this is a string. It continues reading until it finds the second ", right before real. It then thinks the string is over – but it isn’t.

To avoid confusing Python, we can instead use single-quotes to delineate the string:

'JavaScript is a "real" programming language'
'JavaScript is a "real" programming language'

In this case, single-quotes were preferable, but that’s not always the case. For instance, suppose we want to represent the string:

Python: a data scientist’s best friend.

We can’t use single quotes here because of the apostrophe in “scientist’s”. So we surround the string with double quotes:

"Python: a data scientist's best friend"
"Python: a data scientist's best friend"

There is another way we can handle strings containing single- and double-quotes: We can “escape” the character by prefixing it with a backslash \. This will tell Python to treat the character differently than it normally would – in this case, it tells Python not to end the string. This is very helpful when both single quotes and double quotes appear in the string!

'They said, "escaping isn\'t so bad," and I believe them!'
'They said, "escaping isn\'t so bad," and I believe them!'
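To see that the backslash is only an instruction to Python and not part of the text, here is a small sketch that writes the same sentence two ways and prints it. The names single_version and double_version are hypothetical.

```python
# Wrapped in single quotes, so the apostrophe must be escaped.
single_version = 'They said, "escaping isn\'t so bad," and I believe them!'

# Wrapped in double quotes, so the double quotes must be escaped instead.
double_version = "They said, \"escaping isn't so bad,\" and I believe them!"

print(single_version == double_version)  # True
print(single_version)
# They said, "escaping isn't so bad," and I believe them!
```

Both versions hold the same characters, and neither contains a backslash once Python has read it.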

5.2.3. String Methods

What are some things you might want to do with strings? For one, you might want to combine two strings into one longer string. Doing so in Python is easy: we can simply use +.

s1 = 'Data'
s2 = 'Science'
s1 + s2
'DataScience'

Combining two strings in this way is called concatenation.

Given the following variables, write an expression that concatenates the two strings with a space in between. The output should be 'red fish blue fish'.

string1 = "red fish"
string2 = "blue fish"

What else might you want to do with a string? Capitalizing a string seems like a common task. Luckily, Python provides a function to do just this. However, unlike the functions we have seen so far (like abs() and round()), the capitalize() function is attached to the string itself. We call it in a slightly different way:

'data science is awesome'.capitalize()
'Data science is awesome'

Functions that are attached to the things they operate on are called methods – but don’t worry too much about the difference in name. Just remember that methods are called by placing a dot after the string, and then writing the function name.

Methods can be called directly on a string, or on a variable that holds a string.

s = 'data science is awesome'
s.capitalize()
'Data science is awesome'

Note that the string method does not change the string itself, but rather creates a new string. We can see this by printing out s:

s
'data science is awesome'

We observe that this is the original value of s as it was before we called the method. If we wanted instead to save the result, we would need to give it a name:

t = s.capitalize()
t
'Data science is awesome'

We could also overwrite s with the new value:

s = s.capitalize()
s
'Data science is awesome'

Strings have plenty of methods. Here are a few examples:

"why am i yelling?".upper()
'WHY AM I YELLING?'
"THIS IS A LIBRARY PLEASE BE QUIET".lower()
'this is a library please be quiet'
"hitchhiker's guide to the galaxy".title()
"Hitchhiker'S Guide To The Galaxy"

Jupyter Tip

You can see all of Python’s string methods by typing "some string". then hitting Tab.
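One of the examples above has a surprising result: title() turned "hitchhiker's" into "Hitchhiker'S". The method treats any letter that follows a non-letter, including an apostrophe, as the start of a new word. If that is not what you want, one possible alternative is the capwords function from Python's standard string module, which splits on whitespace only. This is a sketch of one option, not the only fix; the variable name book is made up.

```python
import string

book = "hitchhiker's guide to the galaxy"

# title() restarts capitalization after the apostrophe.
print(book.title())           # Hitchhiker'S Guide To The Galaxy

# capwords() capitalizes only the first letter of each space-separated word.
print(string.capwords(book))  # Hitchhiker's Guide To The Galaxy
```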

The replace method is especially powerful, since it allows us to find and replace sections of a string. The string methods we have looked at so far took no arguments, but the replace method takes two: the text to find, and the text to replace it with (in that order).

'found you'.replace('you', 'Waldo')
'found Waldo'
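It is worth knowing that replace swaps every occurrence of the text it finds, not just the first one. A minimal sketch, using an illustrative word that is not from the examples above:

```python
# 'banana' and the variable names are illustrative choices.
word = 'banana'
swapped = word.replace('a', 'o')

print(swapped)  # bonono
print(word)     # banana (the original string is unchanged)
```

As with capitalize, the method returns a new string and leaves the original alone.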

Remember the empty string ''? It’s used a lot with replace in order to get rid of parts of text entirely! Notice that the text must match exactly, and is case sensitive.

'Hello, my name is **SNEEZE** Justin'.replace('**SNEEZE** ', '')
'Hello, my name is Justin'
'where\'s Waldo'.replace('w', '')
"here's Waldo"

Since the string methods we’ve looked at return more strings, we can even call more string methods on the result!

s = 'started with words'
t = s.replace('started', 'ended')
u = t.replace('words', 'a sentence')
v = u.capitalize()
v
'Ended with a sentence'

But here’s a shortcut: we don’t need to assign each result to intermediate variables. We can do it “all at once” with method chaining:

s = 'started with words'
s.replace('started', 'ended').replace('words', 'a sentence').capitalize()
'Ended with a sentence'

5.2.4. Conversion

Data scientists sometimes get their data by scraping a webpage. For instance, suppose we want to know the water temperature in La Jolla (either to study the effects of climate change, or to know whether the water is nice for surfing). We can find the information we want on the NOAA’s website. We can write some code to download this webpage and extract the data we need. There’s just one problem: webpages are just globs of text (that is, they are strings), but we want the temperature as a number (preferably a float). Luckily, converting back and forth between these types is easy.

Suppose we have isolated the current ocean temperature as a string:

ocean_temp = '72.8'
ocean_temp
'72.8'

While this might look like a number, notice the quotation marks: this means that Python thinks of it as a piece of text.

type(ocean_temp)
str

As a result, we can’t do math with the ocean temperature:

ocean_temp + 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [30], line 1
----> 1 ocean_temp + 2

TypeError: can only concatenate str (not "int") to str

Here, Python has complained with a TypeError. Python is saying that it doesn’t know how to combine objects of these types: str and int. To do the arithmetic, we need to convert the string to a float. We can do so with the float function:

float(ocean_temp)
72.8
float(ocean_temp) + 2
74.8
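Conversion works in the other direction, too. The built-in str function turns a number back into a string, which is handy when we want to glue a result onto some text with +. The sketch below assumes the same ocean_temp string as above; the names warmer_temp and message are made up for illustration.

```python
ocean_temp = '72.8'

# String -> float, so that arithmetic works.
warmer_temp = float(ocean_temp) + 2

# Float -> string, so that concatenation works.
message = 'Water temperature: ' + str(warmer_temp)
print(message)  # Water temperature: 74.8
```

Leaving out str here would raise the same kind of TypeError we saw earlier, since + cannot combine a str with a float.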

We can also convert a string to an integer using int:

days_since_last_beach_visit = '4'
days_since_last_beach_visit
'4'
int(days_since_last_beach_visit)
4

Be careful, though – if you try to convert a string that doesn’t look like an integer using int, Python will yell at you:

int('3.14')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [35], line 1
----> 1 int('3.14')

ValueError: invalid literal for int() with base 10: '3.14'

If we really wanted to convert '3.14' to an integer, we would need to convert it to a float first:

int(float('3.14'))
3
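Note that int does not round: it simply drops everything after the decimal point. If rounding is what we want, the round function we met earlier is a better fit. A short sketch with illustrative values:

```python
# int() truncates toward zero, for positive and negative numbers alike.
truncated = int(float('3.99'))
truncated_negative = int(float('-3.99'))

# round() goes to the nearest integer instead.
rounded = round(float('3.99'))

print(truncated)           # 3
print(truncated_negative)  # -3
print(rounded)             # 4
```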

5.2.5. Summary

  • Multiple strings can be glued together using +.

  • Strings come with many methods – functions that belong to the string data type.

  • String methods are called using dot notation, by placing a dot after a string or variable name of a string, then calling the function: my_string.function_name(arguments, ...)

  • Some string methods allow you to create new strings that change capitalization or find and replace snippets of text.