Extracting entities from text using Stanford NLP from scratch

This article is intended for those who have never worked with Stanford NLP and need to learn it and put it to use as quickly as possible.

This software is quite widely used; in particular, our company, BaltInfoCom, relies on it.

First, you need to understand one simple principle: Stanford NLP works by annotating words. One or more annotations are "hung" on each word, for example POS (a part-of-speech tag), NER (a named-entity label), and so on.

The first thing a newcomer sees in the "quick start" section of the Stanford NLP website is the following construction:

```java
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner,parse,depparse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create a document object
CoreDocument document = new CoreDocument(text);
// annotate the document
pipeline.annotate(document);
```

Here, StanfordCoreNLP is a pipeline that is fed our text, wrapped in a CoreDocument object. StanfordCoreNLP is the central and most frequently used object in the whole framework; all the main work goes through it.

First, we set the parameters of StanfordCoreNLP and specify which annotators we need. All possible combinations of these parameters can be found on the official website at this link.
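To make this concrete, here is a minimal sketch of the same pattern end to end: build the pipeline, annotate a document, and read the annotations back off each token (the example sentence and the reduced annotator set are my own choices, not from the quick-start snippet):

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineDemo {
    public static void main(String[] args) {
        // a lighter annotator set than the quick-start example: we only need
        // tokenization, sentence splitting, POS tags and NER labels here
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("Barack Obama was born in Hawaii.");
        pipeline.annotate(document);

        // each CoreLabel carries the annotations "hung" on the token:
        // word() is the surface form, tag() the POS tag, ner() the entity label
        for (CoreLabel token : document.tokens()) {
            System.out.printf("%s\t%s\t%s%n", token.word(), token.tag(), token.ner());
        }
    }
}
```

The convenience methods `word()`, `tag()`, and `ner()` on CoreLabel are shortcuts for looking up the corresponding annotation keys.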


Here is an example of how the annotators (parse and depparse) work together:

[Image: output of the parse and depparse annotators on an example sentence]

If the annotations above the tokens are not clear to you, their meanings can be found on these sites: the meanings of dependency relations in sentences, and the meanings of part-of-speech tags.

For each of these annotators, you can find additional flags for finer tuning here, in the "Annotators" section.

These settings apply when you want to use the built-in Stanford NLP models, but you can also set annotators manually via the addAnnotator(Annotator ...) method, or by adding parameters before creating the StanfordCoreNLP object.

Now, how to extract named entities from text. For this, Stanford NLP has three built-in classes based on regular expressions, and one class for tagging tokens with a model.

Classes based on regular expressions:


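As an illustration of the regular-expression route, the regexner annotator is driven by a plain tab-separated mapping file of patterns and entity labels. The file name and the entries below are illustrative, not from the original article:

```
BaltInfoCom	ORGANIZATION
Saint Petersburg	LOCATION
```

The file is wired in through the pipeline properties before the StanfordCoreNLP object is created, e.g. `props.setProperty("regexner.mapping", "my_entities.rules");`, with "regexner" included in the annotators list.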
Tagging text with a model using NERClassifierCombiner
To use this class, you must first have, or train, your own model.

How to do this can be found here;
after the model is trained, all that remains is to create a NERClassifierCombiner, pass it the model path, and call the classify method.

```java
NERClassifierCombiner classifier = new NERClassifierCombiner(false, false, serialized_model);
String text = "Some lucky people working in BaltInfoCom Org.";
List<List<CoreLabel>> out = classifier.classify(text);
```
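A sketch of how the result might be consumed. The model path `my-ner-model.ser.gz` is hypothetical (substitute your trained model); classify() returns one inner list per sentence and stores each token's predicted label in its AnswerAnnotation:

```java
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;

public class NerDemo {
    public static void main(String[] args) throws Exception {
        // "my-ner-model.ser.gz" is a placeholder path for your trained model;
        // the two booleans disable the numeric classifiers and SUTime
        NERClassifierCombiner classifier =
                new NERClassifierCombiner(false, false, "my-ner-model.ser.gz");

        List<List<CoreLabel>> out =
                classifier.classify("Some lucky people working in BaltInfoCom Org.");

        // outer list: sentences; inner list: tokens of one sentence
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel token : sentence) {
                // classify() writes the predicted label into AnswerAnnotation
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.AnswerAnnotation.class));
            }
        }
    }
}
```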

A complete list of annotators can be found here.

In addition to the above: if you need Stanford NLP for Russian, I can recommend going here. There you will find models for part-of-speech tagging (pos-tagger) and for identifying relationships in a sentence (dependency parser).

The taggers available there are:
- russian-ud-pos.tagger - a plain part-of-speech tagger,
- russian-ud-mfmini.tagger - with the main list of morphological features,
- russian-ud-mf.tagger - with the full list of morphological features; an example mapping for them can be found here.
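As a sketch of plugging one of these models into the pipeline (assuming the standard pos.model property and that the .tagger file has been downloaded locally; the path is illustrative):

```
# point the POS annotator at the downloaded Russian model
pos.model = russian-ud-pos.tagger
```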

Source: https://habr.com/ru/post/414175/

