Text and data mining

Analyse large-scale text or datasets in your research

Data mining is the process of applying open-ended computational methods to large-scale datasets to discover insights that targeted, smaller-scale analyses may not reveal. When the datasets are bodies of text, the process is often termed text mining, and it can complement traditional close readings of texts. Text and data mining (TDM) approaches can open up new areas of scholarly enquiry.

Before you start

Before you get started with TDM make sure that you:

  • Understand and have considered any issues around copyright and licensing conditions for the content that you wish to use
  • Understand and have considered any ethical concerns that might arise from your use of the content, particularly when linking datasets or working with sensitive information
  • Comply with data providers’ preferences for how to access their content

Further information about these considerations can be found on the step by step guide to text and data mining.

Library licensed data sources and tools

Text and data mining is permitted in a number of the databases that the Library provides access to for University staff and students. Check the full list of databases available for text and data mining, including licence and access conditions, to see which might be useful for your project. Some data sources may require considerable time and work to apply for, access, and prepare before the data are ready for mining, so factor this into your project timelines. Please note that Factiva doesn’t allow text and data mining.

University staff and students can access a few different tools that can be used to perform TDM on specific Library-subscribed content:

  • Gale Digital Scholar Lab – The Digital Scholar Lab is a useful tool for analysing Gale Primary Sources content. No programming is required to use the Lab and it allows you to clean and standardise your content and perform several different TDM methods. External content can be uploaded to the Digital Scholar Lab for analysis, provided the content has a licence that allows for text mining.
  • ProQuest Text and Data Mining Studio – The Studio allows you to mine a wealth of ProQuest content, including newspapers, magazines, journals and books. The Studio has two interfaces. The Studio Workbench allows you to use R or Python to analyse up to two million documents and is best suited to researchers who have some coding experience and want to conduct large-scale analysis of texts. The Studio Visualizations interface allows you to analyse a smaller subset of content using pre-built text mining tools, so it doesn’t require coding experience and is best suited to researchers and students new to text mining, or teaching staff wanting to introduce text mining to their students.
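
As a rough illustration of the kind of analysis a coding-based approach supports, the Python sketch below counts the most frequent terms across a small set of documents. It uses only the Python standard library and does not represent the interface of the Gale or ProQuest tools. The sample documents, the stopword list and the function name `top_terms` are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical sample corpus; in practice this would be content
# that you are licensed and permitted to mine.
DOCUMENTS = [
    "The council approved the new library building.",
    "The library opened a new reading room for students.",
    "Students praised the library and the reading room.",
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "and", "for", "new"}


def top_terms(documents, n=3):
    """Return the n most frequent non-stopword terms across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and split it into alphabetic tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    for term, freq in top_terms(DOCUMENTS):
        print(f"{term}: {freq}")
```

Term counting is only one simple method. Real projects usually add further cleaning steps and work with far larger corpora, which is where the time needed to access and prepare data becomes significant.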

Help with TDM

Get started by checking out the step by step guide to text and data mining.

Library

The Library can support you with:

  • Understanding text and data mining concepts
  • Finding out which library licensed data sources can be mined
  • Advice on forming a search strategy for corpora creation
  • Using the Gale Digital Scholar Lab and ProQuest TDM Studio

Contact an Academic Liaison Librarian for assistance.

Sydney Informatics Hub

The Sydney Informatics Hub provides free training courses, from introductory to advanced level, including courses on programming and collecting web data for research. You can see all available training courses on the Sydney Informatics Hub website.

Aboriginal and Torres Strait Islander peoples are advised that this website may contain images, voices and names of people who have died.

The University of Sydney Library acknowledges that its facilities sit on the ancestral lands of Aboriginal and Torres Strait Islander peoples, who have for thousands of generations exchanged knowledge for the benefit of all.