
Classic, academic data warehouse design recommends keeping everything in normalized form, with relationships between tables. Relational algebra then rolls changes forward and provides a reliable store with transaction support: Atomicity, Consistency, Isolation, Durability. In other words, such storage is built specifically to update data safely. But it is far from optimal for search, especially a sweeping search across many tables and fields. That calls for indexes, lots of indexes; volumes grow, writes slow down. SQL LIKE is not indexed, and JOIN with GROUP BY sends the query planner off to meditate.
The growing load on a single machine forces it to scale: either vertically, until you hit the ceiling, or horizontally, by buying more nodes. Resiliency requirements spread the data across multiple nodes, and the requirement of immediate recovery after a failure, without denial of service, forces the cluster to be set up so that at any moment any machine can handle both writes and reads. That is, every node either already is a master or can become one automatically and immediately.
The problem of fast search was solved by installing a second storage system alongside, optimized for indexing: full-text and faceted search, with stemming and blackjack. The second store takes records from the tables of the first, analyzes them, and builds an index. Thus the data storage cluster was supplemented by another cluster for searching it, with a similar master configuration to match the overall SLA. Everything is fine, the business is delighted, the admins sleep at night... until the machines in the master-master cluster number more than three.
Elastic
The NoSQL movement significantly expanded the scaling horizon for small and big data alike. NoSQL cluster nodes distribute data among themselves so that the failure of one or more nodes does not cause a denial of service for the whole cluster. The price for high availability of distributed data was the impossibility of guaranteeing full consistency on every write at any given moment. Instead, NoSQL speaks of eventual consistency: it is assumed that sooner or later the data will propagate across the cluster nodes and, in the end, become consistent.
Thus the relational model was supplemented by a non-relational one, giving rise to a multitude of database engines that solve the trade-offs of the CAP triangle with varying success. Developers got fashionable tools in their hands to build their own perfect persistence layer, for every taste, budget, and load profile.
ElasticSearch is a representative of clustered NoSQL: a RESTful JSON API on top of the Lucene engine, open source, written in Java. It can not only build a search index but also store the original document. This trick invites rethinking the role of a separate DBMS for keeping the originals, or even abandoning it altogether. So much for the introduction.
Mapping
Mapping in ElasticSearch is something like a schema (a table structure, in SQL terms) that specifies exactly how incoming documents (records, in SQL terms) are to be indexed. Mapping can be static, dynamic, or absent. A static mapping does not allow itself to be changed. A dynamic one allows new fields to be added. If no mapping is specified, ElasticSearch creates one itself upon receiving the first document to be written: it analyzes the structure of the fields, makes some assumptions about the types of data in them, runs them through the default settings, and writes the mapping. At first glance such carefree behavior seems very convenient, but in practice it is better suited to experiments than to production, where surprises are unwelcome.
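A minimal explicit mapping might look like the sketch below. The index and field names are made up for illustration, and the syntax assumes a recent ElasticSearch (7+, without mapping types); "dynamic": "strict" makes the mapping static, while "true" would make it dynamic:

```
PUT /products_v2
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name":          { "type": "text" },
      "speech_number": { "type": "long" }
    }
  }
}
```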
So, the data is indexed, and indexing is a one-way process. Once created, a mapping cannot be changed on the fly the way ALTER TABLE works in SQL. An SQL table stores the original document, onto which a search index can be bolted. In ElasticSearch it is the other way around: the index itself is the search structure, onto which the original document can be attached. That is why the index schema is static. In theory one could either add a field to the mapping or delete one, but in practice ElasticSearch only allows adding fields; an attempt to remove a field leads nowhere.
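Adding a field to an existing index goes through the _mapping endpoint. A sketch, with an illustrative index and field name, again assuming ElasticSearch 7+ without mapping types:

```
PUT /products_v2/_mapping
{
  "properties": {
    "brand": { "type": "keyword" }
  }
}
```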
Alias
An alias is an optional name for an ElasticSearch index. There can be several aliases for a single index, or one alias for multiple indexes; in the latter case the indexes are logically combined and look like one from the outside. An alias is very convenient for services that talk to an index throughout its life. For example, the alias products can hide either products_v2 or products_v25 behind it, with no need to change the name in the service. An alias is indispensable in data migration, once the data has been transferred from the old schema to the new one and the application needs to be switched to the new index. Switching an alias from one index to another is an atomic operation, performed in a single step without loss.
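Such a switch is done through the _aliases endpoint, which applies all listed actions as a single atomic operation. A sketch using the index names from the example above:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v2",  "alias": "products" } },
    { "add":    { "index": "products_v25", "alias": "products" } }
  ]
}
```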
Reindex API
The data schema, the mapping, tends to change from time to time: new fields are added, unneeded ones are removed. If ElasticSearch plays the role of the sole repository, a tool is needed to change the mapping on the fly. For this there is a dedicated command that transfers data from one index to another, the _reindex API. It works with a ready-made or empty mapping on the recipient index, runs on the server side, and indexes quickly, in batches of 1000 documents at a time.
Reindexing can do simple field-type conversions: for example, long to text and back to long, or boolean to text and back to boolean. But -9.99 to boolean it can no longer manage; this is not PHP. Then again, type conversion is an unsafe business, a sin perhaps forgivable in a service written in a dynamically typed language. If reindex cannot convert a type, the document simply is not written. In general, data migration should happen in three stages: add the new field, release the service that uses it, then clean up the old one.
A field is added like this: the source index schema is taken, the new property is inserted, and an empty index is created. Then reindexing is started:
POST _reindex
{
  "source": { "index": "test" },
  "dest":   { "index": "test_clone" }
}
A field is removed in a similar way: the source index schema is taken, the field is removed, and an empty index is created. Then reindexing is started with a list of the fields to copy:
POST _reindex
{
  "source": { "index": "test", "_source": ["field1", "field3"] },
  "dest":   { "index": "test_clone" }
}
For convenience, both cases are combined into a cloning function in Kaizen, a desktop client for ElasticSearch. Cloning can adjust itself to the mapping of the recipient index. The example below shows a partial clone being made from an index with three collections (types, in ElasticSearch terms): act, line, and scene. Only line remains, with two fields; static mapping is switched on; and the speech_number field turns from text into long.

Migration
The reindex API has one unpleasant trait: it does not track changes in the source index. If something changes there after reindexing has started, the changes are not reflected in the recipient index. To solve this problem, the ElasticSearch FollowUp Plugin was developed, which adds logging commands. The plugin can follow an index, returning, in JSON format, the actions performed on documents in chronological order. It remembers the index, type, document ID, and the operation on it, INDEX or DELETE. The FollowUp Plugin is published on GitHub and compiled for almost all versions of ElasticSearch.
So, for a lossless data migration you need FollowUp installed on the node where the reindexing will be launched. It is assumed that the index is already behind an alias and all applications work through it. Immediately before reindexing, the plugin is switched on. When reindexing completes, the plugin is switched off and the alias is moved to the new index. Then the recorded actions are replayed on the recipient index, catching it up to the current state. Despite the high speed of reindexing, two kinds of collisions can occur during replay:
- the new index no longer contains a document with that _id: the document was deleted after the alias was switched to the new index.
- the new index contains a document with the same _id but a higher version number than in the source index: the document was updated after the alias was switched to the new index.
In these cases the action should not be replayed on the recipient index; all other changes are replayed.
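The replay decision above can be sketched as a small filter. The FollowUp log format and the names below are assumptions for illustration, not the plugin's actual API:

```python
def should_replay(op, dest_versions):
    """Return True if a logged action may safely be replayed on the
    recipient index.

    op            -- dict with keys '_id' and 'version' from the action log
    dest_versions -- dict mapping _id -> version currently in the new index
    """
    doc_id = op["_id"]
    if doc_id not in dest_versions:
        # Collision 1: the document was deleted after the alias switch.
        return False
    if dest_versions[doc_id] > op["version"]:
        # Collision 2: the document was updated after the alias switch.
        return False
    return True

# Hypothetical action log, in chronological order.
log = [
    {"_id": "1", "version": 2, "operation": "INDEX"},
    {"_id": "2", "version": 1, "operation": "DELETE"},
    {"_id": "3", "version": 1, "operation": "INDEX"},
]
# Current state of the new index: "2" was deleted, "3" was updated later.
dest = {"1": 2, "3": 5}

replayable = [op for op in log if should_replay(op, dest)]
```

Here only the action on document "1" survives the filter: "2" hits collision 1 and "3" hits collision 2.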
Happy coding!