Named entity recognition (NER) is a process in which software analyses text to locate words or phrases that a human would recognise as distinct entities. These entities are then classified into categories such as person, location, organisation, nationality, time and date.
Some named entity recognisers, such as spaCy, come with a set of predefined categories that they have been trained to identify. Others, such as Stanford NER, allow you to define your own categories. If you define your own categories, you’ll need to train the recogniser to identify the entities you’re interested in. To do this, you’ll need to manually classify many documents, which is a time-consuming and laborious process.
For example, if you wanted a list of all the people mentioned in your text, your computer couldn’t give you that information before NER has been performed, because it has no concept of what a person is. Once the entities in your text have been classified, it’s easy for the computer to list every entity with a “person” tag.
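As a minimal sketch of that last step, the Python below assumes a recogniser has already produced pairs of entity text and label. The sample entities, the label names and the `entities_with_label` helper are all hypothetical and chosen for illustration; real taggers use their own label sets and output formats.

```python
# Hypothetical output of an NER tagger: (entity text, label) pairs.
# Real taggers use their own label names; these are assumptions.
tagged_entities = [
    ("Ada Lovelace", "PERSON"),
    ("London", "LOCATION"),
    ("Charles Babbage", "PERSON"),
    ("Royal Society", "ORGANISATION"),
    ("Ada Lovelace", "PERSON"),
]


def entities_with_label(entities, label):
    """Return the unique entity texts carrying `label`, in first-seen order."""
    seen = []
    for text, entity_label in entities:
        if entity_label == label and text not in seen:
            seen.append(text)
    return seen


people = entities_with_label(tagged_entities, "PERSON")
print(people)
```

The same helper can list any other category by changing the label argument, for example `entities_with_label(tagged_entities, "LOCATION")`.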
Uses and applications
Named entity recognition can be used to:
- find which people are mentioned in the same documents as each other
- identify places that are important in the text
- find out whether the entities mentioned in the text change over time
- explore many other patterns in the text.
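The first use in the list, finding people who are mentioned in the same documents, could be sketched as follows. This assumes NER has already been run and that you have the set of people found in each document; the document ids, names and the `count_co_mentions` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagger output: document id -> people found in that document.
people_by_document = {
    "doc1": {"Ada Lovelace", "Charles Babbage"},
    "doc2": {"Ada Lovelace", "Charles Babbage", "Mary Somerville"},
    "doc3": {"Mary Somerville"},
}


def count_co_mentions(people_by_doc):
    """Count how many documents mention each pair of people together."""
    pair_counts = Counter()
    for people in people_by_doc.values():
        # Sort the names so each pair is always stored in the same order.
        for pair in combinations(sorted(people), 2):
            pair_counts[pair] += 1
    return pair_counts


co_mentions = count_co_mentions(people_by_document)
for pair, count in co_mentions.most_common():
    print(pair, count)
```

Pairs with high counts are candidates for a closer look, although a shared document does not by itself show that two people are connected.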
Limitations
Make sure that the NER tagger you use has been trained on text similar to the text you’re working with.
For example, a tagger trained on American terms and locations may mislabel Fairfax as a geographic location when it is run over newspapers published by the Australasian publisher Fairfax.
A tagger is never going to get everything right, so you will likely end up with some missed or misclassified entities.
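One way to get a feel for how often this happens is to check a small sample by hand and compare it with the tagger’s output. The sketch below assumes both are available as simple mappings from entity text to label. The sample data, label names and the `compare_to_hand_check` helper are hypothetical, and matching on entity text alone ignores where in the document each mention occurs.

```python
# Hypothetical hand-checked labels for a small sample of entities.
hand_checked = {
    "Fairfax": "ORGANISATION",
    "Sydney": "LOCATION",
    "John Curtin": "PERSON",
}

# Hypothetical tagger output for the same sample.
tagger_output = {
    "Fairfax": "LOCATION",
    "Sydney": "LOCATION",
}


def compare_to_hand_check(gold, predicted):
    """Sort hand-checked entities into correct, missed and misclassified."""
    correct = sorted(e for e in gold if predicted.get(e) == gold[e])
    missed = sorted(e for e in gold if e not in predicted)
    misclassified = sorted(
        e for e in gold if e in predicted and predicted[e] != gold[e]
    )
    return {"correct": correct, "missed": missed, "misclassified": misclassified}


report = compare_to_hand_check(hand_checked, tagger_output)
for outcome, entities in report.items():
    print(outcome, entities)
```

Even a rough count like this can help you decide whether the tagger is reliable enough for your question or whether its output needs manual correction.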