A small note, mostly for my own reference, about a few tricks for data recovery in Elasticsearch: how to fix a red index when there is no backup, and what to do when documents were deleted and no copies remain. Unfortunately, the official documentation is silent about these things.
Backups

The first thing to do is set up backups of important data. How to do this is described in the official documentation.
In general, nothing complicated. In the simplest setup, we create a network share on another server and mount it on all the Elasticsearch nodes in any convenient way (NFS, SMB, whatever). Then we use cron, your application, or anything else you like to send the requests that periodically create snapshots.
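A minimal sketch of the two requests involved, assuming the share is mounted at /mnt/es-backups and the repository is called yourbackuprepo (both are examples):

# Register a shared-filesystem repository; the location must also be whitelisted
# in path.repo in elasticsearch.yml on every node.
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_snapshot/yourbackuprepo' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es-backups", "compress": true }
}'

# Create a snapshot, e.g. from cron; wait_for_completion is optional.
curl -XPUT "localhost:9200/_snapshot/yourbackuprepo/snapshot_$(date +%Y_%m_%d)?wait_for_completion=true"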
The first snapshot takes a long time; subsequent ones contain only the delta between index states. Note that if you periodically run a forcemerge, the delta will be huge and, accordingly, the time to create a snapshot will be roughly the same as the first time.
What to consider:
- Check the status of your backups, for example via _cat: curl localhost:9200/_cat/snapshots/yourbackuprepo. Partial or Failed snapshots are not your friends.
- Starting with ES 6.x, Elasticsearch is very picky about request headers. If you build requests manually (not through an API client), make sure you send Content-Type: application/json, otherwise your requests simply fail and no backup is created.
- A snapshot cannot be restored into an open index; the index must be closed or deleted first. However, you can restore a snapshot side by side using rename_pattern and rename_replacement (see the example in the docs, and the sketch after this list). Also note that restoring a snapshot restores its settings as well, including aliases, the number of replicas, and so on. If you do not want this, add index_settings with the necessary overrides to the restore request (see the docs for an example).
- You can attach the snapshot repository (the share) to more than one cluster and restore snapshots from any cluster to any other. The main thing is that the Elasticsearch versions are compatible.
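A sketch of such a restore request; the repository, snapshot, and index names are examples:

curl -XPOST -H 'Content-Type: application/json' 'localhost:9200/_snapshot/yourbackuprepo/snapshot_2018_06_01/_restore' -d '{
  "indices": "my-index",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1",
  "index_settings": { "index.number_of_replicas": 0 }
}'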
In general, have a look at the documentation; this topic is covered there reasonably well.
Elasticdump
A small Node.js utility that lets you copy data from one index to another index, a cluster, a file, or stdout.
Output to a file or stdout can, by the way, be used as an alternative backup method: the output is ordinary valid JSON (something like an SQL dump) that you can reuse however you like. For example, you can pipe the output into a script that converts the data somehow and sends it to another store, such as ClickHouse. Simple JS transformations can be done by elasticdump itself via the --transform option. In short, use your imagination.
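For example, a minimal dump-and-copy sketch; the hosts, index name, and paths are assumptions:

# Dump the mapping and the data of one index into files.
elasticdump --input=http://localhost:9200/my-index --output=/backup/my-index-mapping.json --type=mapping
elasticdump --input=http://localhost:9200/my-index --output=/backup/my-index-data.json --type=data --limit=5000

# Or copy an index directly into another cluster.
elasticdump --input=http://old-cluster:9200/my-index --output=http://new-cluster:9200/my-index --type=data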
Some pitfalls:
- As a backup method it is much slower than snapshots. Also, the backup is stretched out in time, so the result on a frequently changing index may not be consistent. Keep that in mind.
- Do not use the Node.js from the Debian repository; the version there is too old, which hurts the stability of the tool.
- Stability varies, especially if one of the sides is overloaded. Do not try to copy from one server to another by running the tool on your office machine: all the traffic will flow through it.
- Mappings are copied poorly. If you have something complicated there, create the index manually first, and only then fill it with data.
- Sometimes it makes sense to change the chunk size (the --limit parameter); it directly affects copying speed.
To dump a large number of indices at once there is multielasticdump, which has a simplified set of options and dumps all the indices in parallel.
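Roughly like this, per the options described in the elasticdump README (the pattern and output path are examples, treat this as a sketch):

multielasticdump --direction=dump --match='^logstash-.*$' --input=http://localhost:9200 --output=/backup/all-indices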
Note: the author of the utility has said that he no longer has time to support it, so the program is looking for a new maintainer.
From personal experience: the utility is useful and has saved me more than once. Speed and stability are so-so; I would like a decent replacement, but so far there is nothing on the horizon.
CheckIndex
So, we are starting to approach the dark side. Situation: the index is in the red state. The logs say something went wrong, a checksum does not match, you probably have a memory or disk problem:

org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)

Of course, proper admins never end up in this situation, because they have top-end hardware with triple replication, super-ECC memory that corrects absolutely every class of error on the fly, and snapshots taken every second anyway.
But reality unfortunately sometimes throws up situations where the last backup is relatively old (if you index gigabytes per hour, is a backup from 2 hours ago already too old?), there is nowhere to restore the data from, it didn't have time to replicate, and so on.
Of course, if there is a snapshot, a backup or the like, great: restore it and don't worry. And if not? Fortunately, at least part of the data can still be saved.
First of all, close the index and/or shut down Elasticsearch, and make a backup copy of the failed shard.
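Something along these lines, assuming a default package layout (the shard path is discussed in more detail below); the index name and destination are hypothetical:

# Close the broken index (or stop the node entirely)...
curl -XPOST 'localhost:9200/my-index/_close'
systemctl stop elasticsearch
# ...and copy the shard directory somewhere safe.
cp -a /var/lib/elasticsearch/nodes/0/indices/str4ngEHashVa1uE/0/index /root/shard-backup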
Lucene (which is what works as the backend inside Elasticsearch) has a great CheckIndex tool. We just need to run it against the broken shard. Lucene will check all of its segments and remove the damaged ones. Yes, data will be lost, but at least not all of it. Although that depends on your luck.
There are at least 2 ways.
Method 1: Right on the node
A simple wrapper script will help us here.
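A minimal sketch of such a wrapper, assuming Elasticsearch is installed under /usr/share/elasticsearch (adjust the path to your setup):

#!/bin/bash
# Run Lucene's CheckIndex using the lucene-core jar that ships with Elasticsearch.
ES_HOME=/usr/share/elasticsearch
java -cp "$ES_HOME"/lib/lucene-core-*.jar org.apache.lucene.index.CheckIndex "$@"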
Calling it without parameters prints something like this:
ERROR: index path not specified

Usage: java org.apache.lucene.index.CheckIndex pathToIndex [-exorcise] [-crossCheckTermVectors] [-segment X] [-segment Y] [-dir-impl X]

  -exorcise: actually write a new segments_N file, removing any problematic segments
  -fast: just verify file checksums, omitting logical integrity checks
  -crossCheckTermVectors: verifies that term vectors match postings; THIS IS VERY SLOW!
  -codec X: when exorcising, codec to write the new segments_N file with
  -verbose: print additional details
  -segment X: only check the specified segments. This can be specified multiple times,
              to check more than one segment, eg '-segment _2 -segment _a'.
              You can't use this with the -exorcise option
  -dir-impl X: use a specific FSDirectory implementation. If no package is specified
               the org.apache.lucene.store package will be used.

**WARNING**: -exorcise *LOSES DATA*. This should only be used on an emergency basis as it will cause documents (perhaps many) to be permanently removed from the index. Always make a backup copy of your index before running this! Do not run this tool on an index that is actively being written to. You have been warned!

Run without -exorcise, this tool will open the index, report version information and report any exceptions it hits and what action it would take if -exorcise were specified. With -exorcise, this tool will remove any segments that have issues and write a new segments_N file. This means all documents contained in the affected segments will be removed.

This tool exits with exit code 1 if the index cannot be opened or has any corruption, else 0.
Essentially, we can either just run a check over the index, or force CheckIndex to "fix" it by cutting out everything that is damaged.
The Lucene index lives at approximately the following path: /var/lib/elasticsearch/nodes/0/indices/str4ngEHashVa1uE/0/index/, where the two zeros are the node number on the server and the shard number on the node. The scary value between them is the internal name of the index; it can be obtained from the output of curl localhost:9200/_cat/indices.
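For example, you can ask _cat for just the index name and its UUID (the h parameter selects the output columns):

curl 'localhost:9200/_cat/indices?v&h=index,uuid'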
I usually make a copy in a separate directory and fix the index in place, then restart Elasticsearch. As a rule, everything is picked up, albeit with data loss. Sometimes the index still refuses to load because of *corrupted* files in the shard folder. Move those aside to a safe place for the time being as well.
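Roughly like this, assuming the wrapper above was saved as checkindex.sh and using the example path (run it on a copy, or only after the shard has been backed up):

# Dry run: only report what -exorcise would remove.
./checkindex.sh /var/lib/elasticsearch/nodes/0/indices/str4ngEHashVa1uE/0/index/
# If you accept the data loss, actually cut out the broken segments.
./checkindex.sh /var/lib/elasticsearch/nodes/0/indices/str4ngEHashVa1uE/0/index/ -exorcise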
Method 2: Luke

(picture from the Internet)
To work with Lucene there is a wonderful utility called Luke.
With it, everything is even simpler. Find out the Lucene version from your Elasticsearch:
$ curl localhost:9200
{
  "name" : "node00",
  "cluster_name" : "main",
  "cluster_uuid" : "UCbEivvLTcyWSQElOipgTQ",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Take Luke of the same version. Open the index in it (a copy, of course) with the "Do not open IndexReader (when opening corrupted index)" checkbox enabled. Then click Tools / Check Index. I recommend a dry run first and only then repair mode. The rest is the same as before: copy the index back to Elasticsearch and restart it / open the index.
Recover deleted documents
Situation: you have executed a destructive request that deleted a lot of (or all of) the data you need, and there is nowhere to restore it from, or restoring would be very expensive. Of course, having no backups is your own fault, but it happens.
Unfortunately or fortunately, Lucene never deletes anything right away. Its philosophy is closer to copy-on-write, so deleted data is not really removed but only marked as deleted. The actual deletion happens during index optimization: live data from the old segments is copied into newly created segments, and the old segments are simply deleted. In general, as long as docs.deleted for the index is not 0, there is a chance to pull the data out.
$ curl localhost:9200/_cat/indices?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   data.0 R0fgvfPnTUaoI2KKyQsgdg   5   1    7238685      1291566     45.1gb         22.6gb
After the forcemerge there is no chance.
So, first of all we close the index, stop Elasticsearch, and copy the index files to a safe place. Pulling out an individual deleted document is impossible; you can only restore all the deleted documents in a given segment.
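To see which segments still hold deleted documents, _cat gives per-segment counters (the index name is taken from the example above):

curl 'localhost:9200/_cat/segments/data.0?v&h=index,shard,segment,docs.count,docs.deleted'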
For Lucene versions below 4 everything is very simple: the Lucene API has a function called undeleteAll, and you can call it directly from the Luke mentioned in the previous section.
In newer versions, alas, this functionality has been removed. Still, there is a way. Information about "live" documents is stored in *.liv files. However, simply deleting them would make the index unreadable; instead, the segments_N file has to be edited so that it forgets about their existence entirely.
Open the segments_N file (N is an integer) in your favourite hex editor. The official documentation will help us navigate it:
segments_N: Header, LuceneVersion, Version, NameCounter, SegCount, MinSegmentLuceneVersion, <SegName, SegID, SegCodec, DelGen, DeletionCount, FieldInfosGen, DocValuesGen, UpdatesFiles>^SegCount, CommitUserData, Footer
Out of all this we need the values DelGen (Int64) and DeletionCount (Int32). The first must be set to -1 and the second to 0.

They are not hard to find: they sit right after SegCodec, which is a very conspicuous string like Lucene62. In this screenshot DelGen has the value 3 and DeletionCount is 184614. Replace the first with 0xFFFFFFFFFFFFFFFF and the second with 0x00000000. Repeat for all the segments you need and save.
However, the patched index refuses to load, complaining about a checksum error. Here it is even simpler. Take Luke, load the index with the IndexReader disabled, Tools / Check Index. Do a test run and immediately see that segments_N is reported as damaged: such-and-such checksum expected, such-and-such received.
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=51fbdb5c actual=6e964d17
No big deal! Take the expected checksum and write it into the last 4 bytes of the file.
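If you prefer the command line to a hex editor, a minimal sketch of the same edit, assuming the file is called segments_5 and using the expected checksum from the error above (51fbdb5c); it overwrites the last 4 bytes of the file:

FILE=segments_5                      # the actual N is whatever your index has
SIZE=$(stat -c%s "$FILE")
printf '\x51\xfb\xdb\x5c' | dd of="$FILE" bs=1 seek=$((SIZE - 4)) conv=notrunc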

Save, run CheckIndex again to make sure everything is OK, and the index loads.
Et voilà!