Extracting entities from text using Stanford NLP from scratch

This article is intended for those who have never worked with Stanford NLP and need to learn it and put it to use as quickly as possible.

This software is quite widely used; in particular, our company, BaltInfoCom, relies on it.

First, you need to understand one simple principle: Stanford NLP works by annotating words. One or more annotations are "hung" on each word, for example POS (a part-of-speech tag), NER (a named-entity label), and so on.

The first thing a newcomer sees in the "quick start" section of the Stanford NLP website is the following construction:

```java
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner,parse,depparse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create a document object
CoreDocument document = new CoreDocument(text);
// annotate the document
pipeline.annotate(document);
```

Here, StanfordCoreNLP is a pipeline that is fed our text, wrapped in a CoreDocument object. StanfordCoreNLP is the central and most frequently used object in the whole framework; all the main work goes through it.

First, we set the parameters of StanfordCoreNLP and specify which annotators we need. All possible combinations of these parameters can be found on the official website at this link.
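To make this concrete, here is a minimal sketch of the same pattern end to end: build the pipeline, annotate a document, and read the annotations back off each token (the example sentence and the reduced annotator set are my own choices, not from the quick-start snippet):

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineDemo {
    public static void main(String[] args) {
        // a lighter annotator set than the quick-start example: we only need
        // tokenization, sentence splitting, POS tags and NER labels here
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("Barack Obama was born in Hawaii.");
        pipeline.annotate(document);

        // each CoreLabel carries the annotations "hung" on the token:
        // word() is the surface form, tag() the POS tag, ner() the entity label
        for (CoreLabel token : document.tokens()) {
            System.out.printf("%s\t%s\t%s%n", token.word(), token.tag(), token.ner());
        }
    }
}
```

The convenience methods `word()`, `tag()`, and `ner()` on CoreLabel are shortcuts for looking up the corresponding annotation keys.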


Here is an example of how the annotators (parse and depparse) work together:

[Image: output of the parse and depparse annotators on an example sentence]

If the annotations above the tokens are not clear to you, their meanings can be found on these sites: the meanings of dependency relations in sentences, and the meanings of part-of-speech tags.

For each of these annotators, you can find additional flags for finer tuning here, in the "Annotators" section.

These settings apply when you want to use the built-in Stanford NLP models, but you can also set annotators manually via the addAnnotator(Annotator ...) method, or by adding parameters before creating the StanfordCoreNLP object.

Now, how to extract named entities from text. For this, Stanford NLP has three built-in classes based on regular expressions, and one class for tagging tokens with a model.

Classes based on regular expressions:


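As an illustration of the regular-expression route, the regexner annotator is driven by a plain tab-separated mapping file of patterns and entity labels. The file name and the entries below are illustrative, not from the original article:

```
BaltInfoCom	ORGANIZATION
Saint Petersburg	LOCATION
```

The file is wired in through the pipeline properties before the StanfordCoreNLP object is created, e.g. `props.setProperty("regexner.mapping", "my_entities.rules");`, with "regexner" included in the annotators list.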
Tagging text with a model using NERClassifierCombiner
To use this class, you must first have, or train, your own model.

How to do this can be found here;
after the model is trained, all that remains is to create a NERClassifierCombiner, pass it the model path, and call the classify method.

```java
NERClassifierCombiner classifier = new NERClassifierCombiner(false, false, serialized_model);
String text = "Some lucky people working in BaltInfoCom Org.";
List<List<CoreLabel>> out = classifier.classify(text);
```
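A sketch of how the result might be consumed. The model path `my-ner-model.ser.gz` is hypothetical (substitute your trained model); classify() returns one inner list per sentence and stores each token's predicted label in its AnswerAnnotation:

```java
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;

public class NerDemo {
    public static void main(String[] args) throws Exception {
        // "my-ner-model.ser.gz" is a placeholder path for your trained model;
        // the two booleans disable the numeric classifiers and SUTime
        NERClassifierCombiner classifier =
                new NERClassifierCombiner(false, false, "my-ner-model.ser.gz");

        List<List<CoreLabel>> out =
                classifier.classify("Some lucky people working in BaltInfoCom Org.");

        // outer list: sentences; inner list: tokens of one sentence
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel token : sentence) {
                // classify() writes the predicted label into AnswerAnnotation
                System.out.println(token.word() + "\t"
                        + token.get(CoreAnnotations.AnswerAnnotation.class));
            }
        }
    }
}
```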

A complete list of annotators can be found here.

In addition to the above: if you need Stanford NLP for Russian, I can recommend going here. There you will find models for part-of-speech tagging (pos-tagger) and for identifying relationships in a sentence (dependency parser).

The taggers available there are:
- russian-ud-pos.tagger - a plain part-of-speech tagger,
- russian-ud-mfmini.tagger - with the main list of morphological features,
- russian-ud-mf.tagger - with the full list of morphological features; an example mapping for them can be found here.
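As a sketch of plugging one of these models into the pipeline (assuming the standard pos.model property and that the .tagger file has been downloaded locally; the path is illustrative):

```
# point the POS annotator at the downloaded Russian model
pos.model = russian-ud-pos.tagger
```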

Source: https://habr.com/ru/post/414175/

