Text and data mining methods

Text and data mining (TDM) can be used to capture key concepts, trends and patterns in your research.

Common TDM methods include:

Topic modelling

Topic modelling is a text and data mining method that scans your texts to identify groups of words that often appear in the same documents as each other.

For more information on this method, visit Topic modeling made just simple enough.

See how researchers have used topic modelling in Examples of TDM in research.
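
The core intuition, words that "often appear in the same documents as each other", can be sketched by counting how many documents each pair of words shares. This is only an illustration of that intuition, not a topic model: real topic modelling tools use statistical models rather than raw pair counts. The toy corpus and the function name below are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one cleaned document.
documents = [
    "river water fish boat",
    "river boat water bridge",
    "bank loan money interest",
    "money loan bank credit",
]

def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for doc in docs:
        # Use a set so a word repeated within one document is counted once.
        words = sorted(set(doc.split()))
        counts.update(combinations(words, 2))
    return counts

pair_counts = cooccurrence_counts(documents)

# Keep the pairs that share at least two documents.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent_pairs)
```

In this toy corpus the surviving pairs fall into two groups, one around "river", "boat" and "water" and one around "bank", "loan" and "money". Deciding what each group is about is still left to the reader.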

Uses and applications

Topic modelling can be used to:

get an overview of the discourses or topics that appear in your texts
discover overarching themes or concepts that might be missed if you were reading each text individually, or if a corpus is too large for close reading
identify gaps or trends in existing research.

Limitations

Topic modelling can’t be used:

for short documents or small corpora. A corpus needs enough data to establish defined themes and concepts
to name the topic that unites a group of words. The method returns groups of words, and you need to interpret and label them yourself
to make prescriptive judgements about texts, as a closer reading is necessary to validate any theories about them
without cleaning the documents. Removing stopwords and removing non-alphanumeric characters are both important cleaning techniques for topic modelling.
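
The two cleaning techniques mentioned above might look something like the following sketch. The stopword list is a deliberately tiny, made-up sample, and the pattern assumes English text in plain ASCII. Accented or non-Latin letters would need a different pattern.

```python
import re

# Hypothetical, deliberately tiny stopword list; real projects use a longer one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def clean_document(text):
    """Lowercase the text, strip non-alphanumeric characters, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a-z, 0-9 or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOPWORDS]

tokens = clean_document("The history of the river: trade, and travel in 1850!")
print(tokens)  # ['history', 'river', 'trade', 'travel', '1850']
```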

Sentiment analysis

Sentiment analysis is used to determine the emotional tone of a text.

You can use 2 types of sentiment analysis:

1. Lexicon-based sentiment analysis

The algorithm analyses the text and calculates a score for each sentence based on the presence of words from a dictionary of opinion words with predefined scores (e.g. -1 for bad, -3 for horrendous, +1 for nice, +2 for excellent).
Depending on the algorithm and word list, it may also detect negations (such as don’t, isn’t) or intensifiers (such as very, really) and adjust the score accordingly. For example, the sentence “It was a really good game overall” might be scored higher than the sentence “It was a good game overall”.
The algorithm then averages the sentence scores to determine the overall sentiment score for the document.
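
The steps above can be sketched as follows. The scores for "bad", "horrendous", "nice" and "excellent" come from the example above. Everything else is an assumption made for illustration: the score for "good", the 1.5 multiplier for intensifiers, and the rule that a negation flips the sign of the word immediately after it. Published lexicons and tools handle these cases in their own ways.

```python
import re

# Scores for bad, horrendous, nice and excellent follow the example in the text;
# the score for "good" is an assumption.
LEXICON = {"bad": -1, "horrendous": -3, "nice": 1, "good": 1, "excellent": 2}
INTENSIFIERS = {"very", "really"}      # assumed to multiply a score by 1.5
NEGATIONS = {"not", "don't", "isn't"}  # assumed to flip the sign of a score

def score_sentence(sentence):
    """Sum the lexicon scores in one sentence, adjusted by the preceding word."""
    normalised = sentence.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", normalised)
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = float(LEXICON[token])
        previous = tokens[i - 1] if i > 0 else ""
        if previous in INTENSIFIERS:
            value *= 1.5
        elif previous in NEGATIONS:
            value = -value
        score += value
    return score

def score_document(text):
    """Average the sentence scores to get a document-level score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(score_sentence(s) for s in sentences) / len(sentences)

print(score_sentence("It was a really good game overall"))  # 1.5
print(score_sentence("It was a good game overall"))         # 1.0
print(score_document("The food was excellent. The service was bad."))  # 0.5
```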

2. Machine learning sentiment analysis

A large set of documents is scored by humans for different types of sentiment.
This set of documents is used as a dataset to train a machine learning model.
The model is then used to identify sentiment in other documents.

Machine learning approaches may identify specific emotions (e.g. sadness, anger, fear, joy, surprise), rather than just an overall positive or negative sentiment.
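
As a rough sketch of the train-then-predict workflow, the example below fits a small Naive Bayes classifier, one of many possible machine learning models and chosen here only because it fits in a few lines. The four labelled sentences are invented stand-ins for a human-scored dataset, which in practice would be far larger.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a large set of documents scored by humans.
TRAINING_DATA = [
    ("what a wonderful happy day", "positive"),
    ("i love this great film", "positive"),
    ("this was a terrible sad ending", "negative"),
    ("i hate this awful film", "negative"),
]

def train(examples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def predict(text, model):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing so unseen word/label pairs never give log(0).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_DATA)
print(predict("a great happy film", model))  # positive
print(predict("a sad awful ending", model))  # negative
```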

See how researchers have used both types of sentiment analysis in Examples of TDM in research.

Uses and applications

Sentiment analysis can be used to:

gauge the overall mood of a single text
examine social media posts and website comments to investigate emotional responses to topics
review texts that document significant events, such as newspapers, to determine public perceptions and potentially the evolution of these perceptions through time.

Limitations

Sentiment analysis doesn’t work well with:

sarcasm and satire
short pieces of text
text written with a deliberately neutral tone, such as some newspapers.

Term frequency and TF-IDF

Term frequency measures how often a word or phrase appears in a document or in your corpus. In its simplest form, term frequency is calculated by counting the number of times the term is used. This can provide insight into the topics most frequently discussed in your text.

Term frequency-inverse document frequency (TF-IDF) is a related method that can identify more meaningful frequent words. In TF-IDF, how often a term appears in one document is weighed against how many documents in the corpus contain that term. This differentiates terms that are common within particular documents from terms that are common across all or most documents in the corpus.
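
A minimal sketch of both calculations is shown below. TF-IDF has several variants. This one assumes the plain form, a raw count multiplied by the natural logarithm of (number of documents / number of documents containing the term), with no smoothing. The three short documents are invented for the example.

```python
import math
from collections import Counter

# Hypothetical corpus of three short, already tokenised documents.
documents = {
    "doc1": "the whale and the sea".split(),
    "doc2": "the garden and the rose".split(),
    "doc3": "the whale hunt".split(),
}

def term_frequency(tokens):
    """Simplest form of term frequency: a raw count per term."""
    return Counter(tokens)

def inverse_document_frequency(term, docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc_id, docs):
    """Term frequency in one document, weighted by rarity across the corpus."""
    tf = term_frequency(docs[doc_id])[term]
    return tf * inverse_document_frequency(term, docs)

for term in ("the", "whale", "sea"):
    count = term_frequency(documents["doc1"])[term]
    print(term, count, round(tf_idf(term, "doc1", documents), 3))
```

Here "the" is the most frequent term in doc1 but scores 0 under TF-IDF because it appears in every document, while "sea", which appears only in doc1, scores highest.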

Uses and applications

Term frequency and TF-IDF can be used to:

gain insights into how language is used across a sample of documents (for instance, how a word or term falls in and out of use over time)
visualise frequent terms – such as in a word cloud – to get an idea of the overall content of a document or groups of documents
index documents and retrieve information, where documents with a higher instance of a term are shown before documents with lower instances.

Limitations

Term frequency and TF-IDF do not account for:

synonyms, like “run” and “sprint”
homographs, like “tear” (noun, liquid produced when crying) and “tear” (verb, to rip something).

Both methods can also produce unhelpful results if your corpus hasn’t been appropriately cleaned, e.g. frequencies dominated by stopwords such as “of” and “the”.

Collocation analysis

A collocation is a group of 2 or more words that appear close together more often than would be expected by chance.

Collocations can be:

multi-word phrases, such as “middle management” or “crystal clear”
words that appear near each other, but not always directly together. For example, “door” and “knock” are likely to appear in close proximity, such as in the phrase “a knock came at the door”. However, they don’t necessarily form a distinct phrase.
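
The second kind of collocation, words that appear near each other without being adjacent, can be found by counting pairs within a sliding window. The sketch below assumes a window of 4 tokens and an invented sentence. It produces raw counts only, so judging whether a pair occurs "more often than would be expected by chance" would need a statistical measure applied on top.

```python
from collections import Counter

def window_cooccurrences(tokens, window=4):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forwards so each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Hypothetical example text.
text = "a knock came at the door and then a second knock shook the door"
tokens = text.split()
counts = window_cooccurrences(tokens, window=4)

print(counts[("door", "knock")])  # 2
```

"knock" and "door" are never directly next to each other in this text, yet they are counted together twice.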

See how researchers have used collocation analysis in Examples of TDM in research.

Uses and applications

Collocation analysis can be used to:

understand the contexts in which words are used and the associated meanings they gain due to regularly occurring with other words (e.g. the collocation “illegal immigrant” can reinforce negative ideas around immigration and migrants)
distinguish subtle differences in meaning and use in near synonyms (e.g. “strong” and “powerful” have similar meanings, but we would use “strong tea” and “powerful computer” rather than the other way around)
identify idioms and understand how native speakers of a language construct phrases.

Limitations

Different statistical methods can identify different sets of collocations from the same text. You may get different results depending on the tool or settings that you use.
It's important to understand your method and what information it gives you so that you can use the most appropriate method for the question you want to answer.
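
To illustrate how the choice of method changes the result, the sketch below ranks the adjacent word pairs in one invented sentence in two ways: by raw frequency, and by pointwise mutual information (PMI), one commonly used association measure. Both are assumptions chosen for the example, and the tools you use may offer other measures.

```python
import math
from collections import Counter

# Hypothetical example text.
text = ("the end of the day and the top of the hill and "
        "the end of the road was crystal clear")
tokens = text.split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total_words = len(tokens)
total_bigrams = sum(bigram_counts.values())

def pmi(pair):
    """log2 of how much more often the pair occurs than chance would predict."""
    first, second = pair
    p_pair = bigram_counts[pair] / total_bigrams
    p_first = word_counts[first] / total_words
    p_second = word_counts[second] / total_words
    return math.log2(p_pair / (p_first * p_second))

# Rank the same pairs with each measure; ties are broken alphabetically.
by_frequency = sorted(bigram_counts, key=lambda pair: (-bigram_counts[pair], pair))
by_pmi = sorted(bigram_counts, key=lambda pair: (-pmi(pair), pair))

top_by_frequency = by_frequency[0]
top_by_pmi = by_pmi[0]
print(top_by_frequency)  # ('of', 'the')
print(top_by_pmi)        # ('crystal', 'clear')
```

Raw frequency favours pairs made of common words, while PMI favours pairs of rare words that occur only together, so PMI is often combined with a minimum frequency threshold.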

Named entity recognition

Named entity recognition (NER) is a process where software analyses text to locate words that a human would recognise as a distinct entity. These entities are then classified into categories, such as person, location, organisation, nationality, time, date, etc.

Some named entity recognisers, such as spaCy, come with a set of predefined categories that they have been trained to identify. Others, such as Stanford NER, allow you to define your own categories. Defining your own categories means you’ll need to train the recogniser to identify the entities you’re interested in. To do this, you’ll need to manually classify many documents, a time-consuming and laborious process.

For example, if you wanted to know all the people mentioned in your text, your computer wouldn’t know how to tell you that information before you’ve performed NER, as it doesn’t know what people are. After entities in your text have been classified, it’s easy for the computer to list all the entities with a “person” tag.
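
The sketch below shows what this looks like once entities carry category tags. It does not perform real NER: a hand-made lookup table stands in for a trained recogniser, and the names, categories and sentence are all invented for the example.

```python
# Hypothetical lookup table standing in for a trained recogniser.
# Real NER tools use trained models rather than a fixed list of names.
GAZETTEER = {
    "Ada Lovelace": "person",
    "Charles Babbage": "person",
    "London": "location",
    "Royal Society": "organisation",
}

def tag_entities(text):
    """Return (entity, category) pairs for each known name, in order of appearance."""
    found = []
    for name, category in GAZETTEER.items():
        position = text.find(name)
        if position != -1:
            found.append((position, name, category))
    return [(name, category) for _, name, category in sorted(found)]

text = "Ada Lovelace met Charles Babbage in London."
entities = tag_entities(text)

# Once entities are tagged, listing the people is a simple filter.
people = [name for name, category in entities if category == "person"]
print(people)  # ['Ada Lovelace', 'Charles Babbage']
```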

Uses and applications

Named entity recognition can be used to:

find which people are mentioned in the same documents as each other
identify places that are important in the text
find out whether the entities mentioned in the text change across time periods
and many more patterns.
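
As one example, finding which people are mentioned in the same documents can be sketched as a pair count over NER output. The tagged documents below are invented, and the (entity, category) format is an assumption; actual tools return their results in their own structures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: document id -> list of (entity, category) pairs.
tagged_documents = {
    "letter1": [("Ada Lovelace", "person"), ("Charles Babbage", "person"),
                ("London", "location")],
    "letter2": [("Ada Lovelace", "person"), ("Charles Babbage", "person")],
    "letter3": [("Ada Lovelace", "person"), ("Mary Somerville", "person"),
                ("Paris", "location")],
}

def people_pairs(tagged):
    """Count how many documents each pair of people appears in together."""
    counts = Counter()
    for entities in tagged.values():
        people = sorted({name for name, category in entities if category == "person"})
        counts.update(combinations(people, 2))
    return counts

pairs = people_pairs(tagged_documents)
print(pairs.most_common(1))  # [(('Ada Lovelace', 'Charles Babbage'), 2)]
```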

Limitations

Check that the NER tool you use has been trained on text similar to the kind of text you’re working with.

An NER tagger trained on American terms and locations may mislabel Fairfax as a geographic location if you run it over newspapers published by Australasian Fairfax.

A tagger is never going to get everything right, so you will likely end up with some missed or misclassified entities.

Related information

Text and data mining
Cleaning and preparing your data
Creating a dataset
Text and data mining databases

Contact

For more help finding and accessing theses, speak to our friendly library staff.