Text Mining

Tokenising Text

But first … importing packages

Python comes with a large selection of pre-written code. This comes in two forms: built-in functions, which are available as soon as you start Python (we have already used the built-in function print()), and a (software) library of modules containing further functions. Modules need to be imported before their functions can be used.
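To illustrate the difference, here is a short sketch using the standard-library math module (chosen purely as an example; it is not otherwise part of this lesson):

```python
# Built-in functions are always available, no import needed
length = len("Humpty")

# Functions that live in a module must be imported first
import math
root = math.sqrt(16)

print(length, root)
```

Trying to call math.sqrt() before the import would raise a NameError, which is the same kind of error you will see if one of the imports below fails.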

For this course we need to import several libraries, including the Natural Language Toolkit (NLTK). To import a module we use the import statement, for example import nltk. We will also use numpy to represent information in arrays and matrices, string to process strings, and matplotlib to visualise the output.

If there is a problem importing any of these modules, you may need to revisit the corresponding installation instructions above.

import nltk
import numpy
import string
import matplotlib.pyplot as plt

Tokenising a string

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation mark contained in a string.

To tokenise, we first import the word_tokenize function from NLTK's tokenize package, which lets us tokenise text without writing the code ourselves.

from nltk.tokenize import word_tokenize

We will also download the model that NLTK's default tokeniser relies on. There are different ways of tokenising text; today we will use NLTK's in-built punkt tokeniser by calling:

nltk.download('punkt')

Now we can assign text to a string variable and tokenise it. We save the tokenised output in a list called humpty_tokens, which we can then inspect.

humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
# Show first 10 entries of the tokens list
humpty_tokens[0:10]
['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had']

As you can see, some of the words are uppercase and some are lowercase. To further analyse the data, for example counting the occurrences of a word, we need to normalise the data and make it all lowercase.

You can lowercase the strings in the list by going through it and calling the .lower() method on each entry. A concise way to do this is a list comprehension, which loops through each word in the list and builds a new list from the results.

lower_humpty_tokens = [word.lower() for word in humpty_tokens]
# Show first 6 entries of the lowercased tokens list
lower_humpty_tokens[0:6]
['humpty', 'dumpty', 'sat', 'on', 'a', 'wall']
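If the list-comprehension syntax is new to you, the same result can be produced with an explicit for loop. This sketch uses a short example token list so it can be run on its own:

```python
# Example tokens (in the lesson these come from word_tokenize)
humpty_tokens = ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall']

# Build the lowercased list one entry at a time
lower_humpty_tokens = []
for word in humpty_tokens:
    lower_humpty_tokens.append(word.lower())

print(lower_humpty_tokens)
```

The list comprehension above is simply a more compact way of writing this loop.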

Task: Printing a token in a list

Print the 13th token of the nursery rhyme (remember that a list index starts with 0).

Answer

print(lower_humpty_tokens[12])
fall
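With the tokens lowercased, counting word occurrences (the motivation for normalising mentioned above) becomes straightforward. A sketch using the standard-library collections.Counter together with the string module we imported earlier, run here on a shortened token list for illustration:

```python
import string
from collections import Counter

lower_humpty_tokens = ['humpty', 'dumpty', 'sat', 'on', 'a', 'wall', ',',
                       'humpty', 'dumpty', 'had', 'a', 'great', 'fall', ';']

# Drop tokens that are only punctuation, using string.punctuation
word_tokens = [t for t in lower_humpty_tokens if t not in string.punctuation]

# Count how often each word occurs
counts = Counter(word_tokens)
print(counts['humpty'])
```

Because we lowercased first, 'Humpty' and 'humpty' are counted as the same word, which is exactly why the normalisation step matters.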
