Tokenisation is the process of breaking a string of text down into individual units, known as tokens. These tokens can be individual words, characters or phrases. Most text and data analysis methods require tokenisation as a first step.
In English it’s common to split your text up into individual words or 2–3-word phrases. Splitting your text into phrases is called “n-gram tokenisation”, where “n” is the number of words in the phrase.
Example
Sample sentence: “The cat sat on a mat. Then the cat saw a rat.”
This text can be tokenised as follows:
Word (sometimes called a “unigram”):
The
cat
sat
on
a
mat.
Then
the
cat
saw
a
rat.
2-word phrase (often called “bigrams” or “2-grams”):
The cat
cat sat
sat on
on a
a mat.
mat. Then
Then the
the cat
cat saw
saw a
a rat.
3-word phrase (often called “trigrams” or “3-grams”):
The cat sat
cat sat on
sat on a
on a mat.
a mat. Then
mat. Then the
Then the cat
the cat saw
cat saw a
saw a rat.
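The example above can be sketched in a few lines of Python, splitting on whitespace and then sliding a window of n words over the token list (the sample sentence is taken from the example):

```python
# A minimal sketch of whitespace tokenisation and n-gram generation.
text = "The cat sat on a mat. Then the cat saw a rat."

def ngrams(tokens, n):
    """Return the list of n-word phrases (n-grams), in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = text.split()          # unigrams: naive split on whitespace
bigrams = ngrams(tokens, 2)    # 2-word phrases
trigrams = ngrams(tokens, 3)   # 3-word phrases

print(tokens[:3])    # ['The', 'cat', 'sat']
print(bigrams[:2])   # ['The cat', 'cat sat']
print(trigrams[-1])  # 'saw a rat.'
```

Note that a naive whitespace split keeps punctuation attached to tokens ("mat.") and lets phrases cross sentence boundaries ("mat. Then"), which is why the pre-processing steps described below are usually applied as well.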
For languages that don’t separate words in their writing, such as Chinese, Thai or Vietnamese, tokenisation will require more thought to identify how the text should be split to enable the desired analysis.
Potential pitfalls
Splitting up words based on character spaces can change meaning or cause things to be grouped incorrectly in cases where multiple words are used to indicate a single thing. For example:
- “southeast” vs “south east” vs “south-east”
- place names like “New South Wales” or “Los Angeles”
- multi-word concepts like “global warming” and “social distancing”.
Use both phrase tokenisation and single-word tokenisation to mitigate this issue.
Converting text to lowercase
Computers often treat capitalised versions of words as different to their lowercase counterparts, which can cause problems during analysis. Make all text lowercase to avoid this problem.
Example
Uncorrected text contains: “The” and “the”, counted as two different words.
Convert all text to lowercase to get one number: a single count for “the”.
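A short sketch of the idea, counting word frequencies before and after lowercasing (the sample sentence is taken from the earlier tokenisation example):

```python
from collections import Counter

text = "The cat sat on a mat. Then the cat saw a rat."
tokens = text.split()

# Without lowercasing, "The" and "the" are counted separately.
counts = Counter(tokens)
print(counts["The"], counts["the"])  # 1 1

# Lowercase first to merge them into a single count.
lowered = [t.lower() for t in tokens]
print(Counter(lowered)["the"])  # 2
```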
Potential pitfalls
Sometimes capital letters help to distinguish between things that are different. For example, if your documents refer to both a person named “Rose” and the flower called “rose”, then converting the name to lowercase will result in these two different things being grouped together.
Other pre-processing techniques, such as named entity recognition, can help avoid this pitfall.
Word replacement
Variations in spelling can cause problems in text analysis as the computer will treat different spellings of the same word as different words. Choose a single spelling and replace any other variants in your text with that version.
For a large dataset, tokenise words first and then standardise the spelling. Alternatively, you can use tools such as VARD to do the work for you.
Example
Uncorrected text contains: “paediatric”, “pediatric”, and “pædiatric”.
Replace all variants with: “paediatric”.
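A minimal sketch of replacing variants after tokenising; the mapping below is illustrative only, a real project would build a fuller variant list (or use a tool such as VARD):

```python
# Illustrative variant map: every spelling points at the chosen standard form.
VARIANTS = {
    "pediatric": "paediatric",
    "pædiatric": "paediatric",
}

def standardise(tokens):
    """Replace known spelling variants; leave other tokens unchanged."""
    return [VARIANTS.get(t, t) for t in tokens]

print(standardise(["pediatric", "care", "pædiatric"]))
# ['paediatric', 'care', 'paediatric']
```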
Potential pitfalls
If you’re specifically looking at the use of different spellings or how spelling can change over time, using this method won’t be helpful.
Punctuation and non-alphanumeric character removal
Punctuation or special characters can clutter your data and make analysing the text difficult. Errors in optical character recognition (OCR) can also result in unusual non-alphanumeric characters being mistakenly added to your text.
Identify characters in your text that are neither letters nor numbers and remove them.
Example
Uncorrected text contains: “coastline” and “coastline;”.
Removing the punctuation allows them to be recognised as the same word.
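One simple way to do this is with a regular expression that deletes everything except letters, numbers and spaces, as in this sketch:

```python
import re

def strip_non_alphanumeric(text):
    """Remove every character that is not a letter, digit or whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

print(strip_non_alphanumeric("coastline;"))  # coastline
```

Note that the character class above only keeps unaccented ASCII letters, so it would also strip accented characters from French names or words; as the pitfalls below explain, mixed-language text needs a more targeted pattern.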
Potential pitfalls
If you’re specifically looking at how certain punctuation or special characters are used, this method will remove important information. This will also be the case when using data with mixed languages or text where punctuation is important (e.g. names or words in French). You will need to take a more targeted approach to any non-alphanumeric character removal.
Other pre-processing steps, such as tokenisation by sentence, may also rely on punctuation.
Stopwords
Stopwords are commonly used words, like “the”, “is”, “that”, “a”, etc., that don’t offer much insight into the text in your documents. It is best to filter stopwords out before analysing text.
You can use existing stopword lists to remove common words from numerous languages.
If there are specific words that are common in your documents, but aren’t relevant to your analysis, you can customise existing stopword lists by adding your own words to them.
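Filtering stopwords is a simple membership test; the list below is a small illustrative set, a real analysis would start from a published stopword list for the relevant language and extend it with domain-specific words:

```python
# Small illustrative stopword list; extend with your own common-but-irrelevant words.
STOPWORDS = {"the", "is", "that", "a", "on", "then"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitively)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "the cat sat on a mat".split()
print(remove_stopwords(tokens))  # ['cat', 'sat', 'mat']
```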
Potential pitfalls
Before using a stopword list, particularly one created by someone else, check to make sure that it doesn’t contain any words that you would like to analyse.