Low-level implementation of the trie prefix tree in PHP

Foreword


The implementation of trie in PHP described here makes a too-fat dictionary, which is correspondingly loaded into memory for quite a long time, which eliminates its rather good speed. Search speed is ~ 80 thousand words per second. The dictionary is made from the list of lecmas of the opencorpora.org dictionary and includes 389844 words. The uncompressed dictionary weighs ~ 150mb, and the compressed gzip ~ 6mb. However, quite good performance results prove that in pure PHP you can make a fully functional prefix tree trie.

I ask in advance programmers with the makings of literary critics not to write malicious comments. This article should be interesting to beginners, like myself. Those who are too lazy to read can immediately see the code on github .

How it all began


For quite some time now I have been carrying out the idea of ​​writing a morphological analyzer for my PHP projects, which will be able to quickly perform a morphological analysis of given words, as well as the transformation of words into the desired morphological form.

PHP already has a similar library called phpmorhy. It works quite quickly and I would use it and would not invent anything, but the dictionary compiler in it is made as a separate non-PHP application, which makes it impossible for me to use this library. The library itself is built on the basis of the AOT dictionary that has not been updated for a long time, which further reduces its usefulness.

Weeks and months went by, I read an article by a habrovchanin, then an article by I. Segalovich on a fast morphological algorithm for a search engine, then another bunch of different articles.

Little by little I was ripe for writing my own amusement park with blackjack and shmars of a morphological analyzer. I think: " Well, progress has stepped forward, you can parse the human genome in PHP! ".

I took the opencorpora.org dictionary, put it in mysql and got a search speed of 2 thousand words per second. I need to load the dictionary into memory, I think. And here it turns out that in order to be available in PHP by regular data structures like an array or an object, it is necessary to store a dictionary for 3 million. About 2 GB of RAM is needed. All trie implementations in php that I came across were suitable only as a tutorial for demonstrating the logic of the work, since they themselves were built on native PHP data storage structures.

Dictionary storage device and working principle


Everywhere, where I managed to read, listen or see about the prefix tree, trie is not explained exactly how the data will be stored in memory. Here we have a knot, but his heirs, but the flag is the end of the word that is not shown under the hood. Therefore, I had to invent a method of storage myself.

As is known, the prefix tree trie consists of nodes. The node stores the prefix, links to subsequent nodes (descendants) and a pointer to the fact that this prefix is ​​the last in the chain. Quite lucidly about trie tells a Hindu here .

The nodes in my trie implementation are fixed-length data blocks of 154 bytes. The first 6 bytes (48 bits) contain the bitmask of the heirs. The first 46 bits are the Russian alphabet, numerals, quotes, hyphens and apostrophes. An apostrophe has been added because the opencorpora.org dictionary contains the word “iv dévoire”, which uses the apostrophe sign. The 47th bit is used to store the word end flag. The following 148 bytes after the bitmask are used to store references to node heirs. 3 bytes for each character (46 * 3 = 148).

Nodes are stored as binary data in a row. Access to the desired site is carried out using the functions substr () and subsequent unpacking unpack ().

The use of fixed-length nodes simplifies the addressing process. To switch to the desired node, it is enough to know its sequence number (id) and the length of the node. We multiply the sequence number by the length and find out the offset relative to the beginning of the line - everything is very simple.

rice 1 Storage Scheme




disadvantages


The storage scheme used simplifies addressing, but it devours the place. To store 380 thousand words requires just over a million nodes. 154 bytes * 1,000,000 nodes = 146.86 megabytes. Those. approximately 400 bytes per word. If you write the words in a simple text file in utf8 encoding, then the same 380 thousand words can fit in 16 megabytes.

Plans


In order to use memory more rationally, I want to go to a variable length of nodes, then, as a reference, it is necessary to record not the node id, but its location in bytes. Determining where the link to the desired node will be stored will be as follows. For example, the word "abb".

The letter "a" is the first in the alphabet, so the node is also the first, respectively, the offset 0. We read 6 bytes, starting from 0.

$str = substr($dic, 0, 6); 

Unpack the line:

 $mask = (ord($str[5]) << 40) + (ord($str[4]) << 32) + (ord($str[3]) << 24) + (ord($str[2]) << 16) + (ord($str[1]) << 8) + ord($str[0]); 

We look in the mask of the 2nd bit (the letter "b")

 bit_get($mask, 2); 

The bit is set, now we count the number of raised bits in the mask to 2. Assume that at our node the bits of the letter “a” are also raised, then our bit of the letter “b” will be the second bit raised. We consider the offset to read the link

 $offset = 6 + 3; 

6 bytes mask + 3 bytes, which occupies the first link, it turns out 9 bytes. We read the necessary piece of a line.

 $str = substr($dic, $offset, 3); 

Unpack the link:

 $ref = (ord($str[2]) << 16) + (ord($str[1]) << 8) + ord($str[0]); 

Go to the next node and repeat everything again. In the last letter we check the presence of 47 bits in the mask, if it is installed - there is a word in our trie.

I hope that it will be possible to keep the speed not lower than 50 thousand words per second.

Thanks


I want to thank the forum participants nulled.cc and php.ru for their help with bitwise functions.

Source: https://habr.com/ru/post/416191/


All Articles