B + tree in a real project

In this article we will take a closer look at how a B + tree is made in an Apache Ignite distributed database.

On Habré there are already a couple of articles on B-trees ( one , two ), but they are rather theoretical and even if they contain implementation, it is not production ready . From this and there is interest to look at the implementation, which is used in life.

Before reading the article further, I advise you to watch the lecture of Maxim Babenko , if you still do not know what a B-tree is in theory. But you don’t need to know Java or the Apache Ignite project deeply - I’ll either write the details explicitly or hide it under the spoilers.

When reading Ignite sources, I advise you to skip the arguments of methods in your mind, the meaning of which is not very clear and to read the names of functions - reading the function body is much easier if you know in advance what it does.

Consider that I am not the leading Apache Ignite developer and could misunderstand something from the code. So for brackets I take out the phrase "as I understand it," which should be added mentally before each affirmative sentence.

Why B + tree in Apache Ignite

Apache Ignite is an in-memory database, but since version 2.1 it has a persistent data store , a functionality for storing data on disk (it has nothing to do with the persistent data structure ) . Therefore, it is clear why we need a structure for external memory, it remains to figure out why they did not choose another one.

To begin with, the B + tree is an optimization of the B-tree, where the values are stored only in the leaves. In this optimization, more keys fit into the node, which increases the degree of branching. Therefore, there is little reason to use the classic B-tree.

B * and B + * - lie more tightly on the disk, but they have worse performance, because we store data from RAM, performance is more important to us.

LSM-tree judging by the benchmarks faster to write, but slower to read. And the reading loss is greater than the gain on the record, so for a hypothetical general case, I would also take a B + tree.

There are also strange fractal trees , however, they, apparently, are patented and implemented only in TokuDB .

Personally, I am more interested in why it was impossible to take a ready disk backend, like LevelDB ? A partial answer is that PDS supports third-party repositories.

The implementation of a large cell

Initially, GridGain developed Apache Ignite, but before it went to open-source, it bore the name of the company, so the names of some components begin with the Grid , and others with Ignite . Therefore GridCursor , but IgniteTree . There is no other logic here.

The Apache Ignite code is written according to the Java best practices pattern and each class inherits an interface, or even more than one. From the interface IgniteTree and start the dance. I give the code without javadoc, for short.

 public interface IgniteTree<L, T> { public T put(T val) throws IgniteCheckedException; public void invoke(L key, Object x, InvokeClosure<T> c) throws IgniteCheckedException; public T findOne(L key) throws IgniteCheckedException; public GridCursor<T> find(L lower, L upper) throws IgniteCheckedException; public GridCursor<T> find(L lower, L upper, Object x) throws IgniteCheckedException; public T findFirst() throws IgniteCheckedException; public T findLast() throws IgniteCheckedException; public T remove(L key) throws IgniteCheckedException; public long size() throws IgniteCheckedException; interface InvokeClosure<T> { void call(@Nullable T row) throws IgniteCheckedException; T newRow(); OperationType operationType(); } enum OperationType { NOOP, REMOVE, PUT } }

The IgniteTree interface describes a standard set of operations. Please note that having a search by range requires you to knit the leaves into a list. The bonus supports arbitrary operation on the record - invoke .

The put operation takes only one argument of type T without a key. You will not find an explanation for this in IgniteTree , but the answer is hidden in the BPlusTree header.

 public abstract class BPlusTree<L, T extends L> extends DataStructure implements IgniteTree<L, T>

As you can see, the value inherits the key, so in the put operation the key is calculated from the value. This does not limit the functionality of the tree. My guess is that it’s more compact to store set-s.

Usually, a map is made from set by putting a garbage constant into the value. However, in the B + tree only the keys are stored in the nodes, but if the value also stores the key, then only the values should be stored in the leaves. And if the tree is a set, then it will automatically turn out that in the leaves there will be only keys without garbage values.

Now look at the item search code.

 /** * @param g Get. * @throws IgniteCheckedException If failed. */ private void doFind(Get g) throws IgniteCheckedException { for (;;) { // Go down with retries. g.init(); switch (findDown(g, g.rootId, 0L, g.rootLvl)) { case RETRY: case RETRY_ROOT: checkInterrupted(); continue; default: return; } } } /** * @param g Get. * @param pageId Page ID. * @param fwdId Expected forward page ID. * @param lvl Level. * @return Result code. * @throws IgniteCheckedException If failed. */ private Result findDown(final Get g, final long pageId, final long fwdId, final int lvl) throws IgniteCheckedException { long page = acquirePage(pageId); try { for (;;) { // Init args. g.pageId = pageId; g.fwdId = fwdId; Result res = read(pageId, page, search, g, lvl, RETRY); switch (res) { case GO_DOWN: case GO_DOWN_X: assert g.pageId != pageId; assert g.fwdId != fwdId || fwdId == 0; // Go down recursively. res = findDown(g, g.pageId, g.fwdId, lvl - 1); switch (res) { case RETRY: checkInterrupted(); continue; // The child page got split, need to reread our page. default: return res; } case NOT_FOUND: assert lvl == 0 : lvl; g.row = null; // Mark not found result. return res; default: return res; } } } finally { if (g.canRelease(pageId, lvl)) releasePage(pageId, page); } }

The basis of the B-tree traversal algorithm remains unchanged: descend recursively through the tree to the desired sheet: if the value is present, then the result is returned, and if not, then null . Recursion apparently left for convenience, anyway, B-trees are not deep.

I was surprised, because there was a clear statement in my head: in a real project, recursion is always removed ( in Java there is no tail recursion optimization , projects in other languages allow recursion). But really, the height of a B-tree is measured in units, because the block size is of the order $2 ^ {10}$ , and the number of data order $2 ^ {40}$ and that height will be $\ frac {log (2 ^ {40})} {log (2 ^ {10})} = 4$ .

Apache Ignite loves concurrency . Therefore, the tree supports competitive modification. At the time of the operation, one page is blocked, but another thread may modify the rest of the tree so that it will need to be re-read. And so the procedure can reach the root.

At first, I did not understand the meaning of such functionality, because the disk is slow and one thread will quietly process all I / O operations. It is clear that the search in the node loaded from the disk costs a lot and during this time you can load the disk with another operation, but repeated attempts will eat up the gain. But then it dawned on me that, in this implementation, the nodes are not immediately flushed to disk after modification, but hang in memory for a while, in order to then apply many modifications at once. Data will not be lost thanks to Write Ahead Log. More about it will be at the end of the article.

Now let's look at the code for adding an element.

 private T doPut(T row, boolean needOld) throws IgniteCheckedException { checkDestroyed(); Put p = new Put(row, needOld); try { for (;;) { // Go down with retries. p.init(); Result res = putDown(p, p.rootId, 0L, p.rootLvl); switch (res) { case RETRY: case RETRY_ROOT: checkInterrupted(); continue; case FOUND: // We may need to insert split key into upper level here. if (!p.isFinished()) { // It must be impossible to have an insert higher than the current root, // because we are making decision about creating new root while keeping // write lock on current root, so it can't concurrently change. assert p.btmLvl <= getRootLevel(); checkInterrupted(); continue; } return p.oldRow; default: throw new IllegalStateException("Result: " + res); } } } catch (IgniteCheckedException e) { throw new IgniteCheckedException("Runtime failure on row: " + row, e); } catch (RuntimeException e) { throw new IgniteException("Runtime failure on row: " + row, e); } catch (AssertionError e) { throw new AssertionError("Assertion error on row: " + row, e); } finally { checkDestroyed(); } } /** * @param p Put. * @param pageId Page ID. * @param fwdId Expected forward page ID. * @param lvl Level. * @return Result code. * @throws IgniteCheckedException If failed. */ private Result putDown(final Put p, final long pageId, final long fwdId, final int lvl) throws IgniteCheckedException { assert lvl >= 0 : lvl; final long page = acquirePage(pageId); try { for (;;) { // Init args. p.pageId = pageId; p.fwdId = fwdId; Result res = read(pageId, page, search, p, lvl, RETRY); switch (res) { case GO_DOWN: case GO_DOWN_X: assert lvl > 0 : lvl; assert p.pageId != pageId; assert p.fwdId != fwdId || fwdId == 0; res = p.tryReplaceInner(pageId, page, fwdId, lvl); if (res != RETRY) // Go down recursively. res = putDown(p, p.pageId, p.fwdId, lvl - 1); if (res == RETRY_ROOT || p.isFinished()) return res; if (res == RETRY) checkInterrupted(); continue; // We have to insert split row to this level or it is a retry. case FOUND: // Do replace. assert lvl == 0 : "This replace can happen only at the bottom level."; return p.tryReplace(pageId, page, fwdId, lvl); case NOT_FOUND: // Do insert. assert lvl == p.btmLvl : "must insert at the bottom level"; assert p.needReplaceInner == FALSE : p.needReplaceInner + " " + lvl; return p.tryInsert(pageId, page, fwdId, lvl); default: return res; } } } finally { if (p.canRelease(pageId, lvl)) releasePage(pageId, page); } }

The only difference is that after finding a position, the code forks into replace and insert . The remove code can no longer be viewed. The basic mechanism is to go through the tree recursively along with a special object, depending on the operation: Get , Put or Remove .

Invoke works in the same way, only the operation takes place with a copy of the record, then its type is determined: NOOP for reading, REMOVE for deletion and PUT for updating or adding, and then a corresponding Put or Remove object is generated that is applied to the entry in the tree.

Using

Below I will consider in detail two heirs of BPlusTree : CacheDataTree and PendingEntriesTree . Overboard is the implementation of indices - this is a topic for a separate discussion, for which I am not ready yet.

Before going further, I’ll clarify that the local distributed map has a displacement functionality and is called IgniteCache - then just a cache.

`CacheDataTree`

CacheDataTree is the mapping of several IgniteCache to disk. Multilevel sorting: first, sort by cache id, in order to group keys in one cache, and then by hashes.

CacheDataTree does not iterate over a range, since indexes are implemented in the heirs of H2Tree extends BPlusTree , therefore, the specific sort is not important: any will be enough for put and get operations. Comparing hashes is faster than objects. But much more important is that a uniform hash fills the tree more closely.

The trees are balancing so that they do not degenerate into the list. But if you add evenly distributed keys to the search tree, it will automatically be balanced. Since B-trees start the balancing procedure as problems arise, and hashes evenly shuffle the keys, sorting by hashes reduces the frequency of balancing.

Using hashes in the search tree is not such a strange idea as it seems, its logical development will lead to a Hash array mapped trie .

`PendingEntriesTree`

PendingEntriesTree stores keys for data with a lifetime and is used as a set. Multi-level sorting: first, sort by cache id, to group keys in one cache, and then by lifetime. Next is another round of comparison - apparently, purely technical, in order to distinguish elements. By sorting it is easy to guess that this class is used to search for ranges. This tree duplicates the key entries in the cache for crowding out.

Understand how this separate adventure works.

Adventure

In IgniteCacheOffheapManagerImpl.expire() take the cursor and delete the entries from the PendingEntriesTree . Entries in CacheDataTree are removed in the closure of c , which is passed in the parameters.

 @Override public boolean expire( GridCacheContext cctx, IgniteInClosure2X<GridCacheEntryEx, GridCacheVersion> c, int amount )

Apache Ignite has only recently stopped supporting Java 7, so closures are created via an anonymous class.

 private final IgniteInClosure2X<GridCacheEntryEx, GridCacheVersion> expireC = new IgniteInClosure2X<GridCacheEntryEx, GridCacheVersion>() { @Override public void applyx(GridCacheEntryEx entry, GridCacheVersion obsoleteVer) { boolean touch = !entry.isNear(); while (true) { try { if (log.isTraceEnabled()) log.trace("Trying to remove expired entry from cache: " + entry); if (entry.onTtlExpired(obsoleteVer)) touch = false; break; } catch (GridCacheEntryRemovedException ignore) { entry = entry.context().cache().entryEx(entry.key()); touch = true; } } if (touch) entry.context().evicts().touch(entry, null); } };

What we are looking for is in the GridCacheMapEntry.onTtlExpired() method, where there is a treasured line in the finally block.

 cctx.cache().removeEntry(this);

Implementation details

Work with pages in Offheap

Offheap is a manual memory optimization technique with a garbage collector.

It is a myth that because of the garbage collector everything slows down, usually the collectors are worth a tiny percentage of performance . Even large heaps are not a problem in themselves, because Collectors are competitive (for example, CMS and G1 in Java), and dozens of cores for servers. Of course, the collector adds unpredictable pauses to the application, which is important for games, but tolerable for the database.

But what will really be a problem is a violation of the hypothesis of generations on a big pile.

Generational hypotheses

GC optimizations use the generational hypothesis. This hypothesis exists in two versions: strong and weak.

Weak hypothesis about generations: most objects die young.

A strong hypothesis about generations: the older the object, the longer it will live.

A strong hypothesis implies a weak one, but not vice versa. Ideally, the GC should be content to perform a weak hypothesis, but in practice this is not the case.

Check out Alexey Shipilev's reports on the new GC in Java, if you want to understand the topic better: once and twice .

And here such a thing is that before the advent of PDS, Apache Ignite was positioned mainly as a cache between services and a disk database (for example, Hadoop ). Therefore, the maps in Ignite are called IgniteCache and have the corresponding displacement functionality. And caches just violate the hypothesis of generations - in them the probability of removing an object increases with the lifetime. That's why in this case Offheap for storing user data improves performance.

More bonuses:

Offheap makes it easier to implement structures that hold more than Integer.MAX_VALUE elements.
If you keep a bunch of less than 32GB, then the links will weigh 4 bytes , which saves several gigabytes.
Since the collector builds a graph of objects, it consumes memory in proportion to their number. And the number of objects is proportional to the heap. The same collector consumes CPU, which is better spent on data compression for example.
On very large heaps, even if the hypothesis of generations is not violated as a whole, there is still quite a lot of the absolute value of old objects that will violate it.

Since the data is then sent to disk, the objects are serialized into memory directly through unsafe , and then this memory area is used as an input / output buffer.

Write ahead log

Write Ahead Log is a log of operations with a linear structure, adding to it is a constant, unlike a tree. The tree is updated less frequently, and if the data is lost due to a fall, they are restored from the log using the operations since the last saved state. The result is improved performance without sacrificing reliability. I advise you to look at the interface IgniteWriteAheadLogManager - there is detailed documentation.

Node bypass

Since the nodes in the B-trees are not small they bypass binary search. To do this, use the descendants of the class BPlusTree.GetPageHandler , and for different operations are different.

Implementation of binary search

 private int findInsertionPoint(int lvl, BPlusIO<L> io, long buf, int low, int cnt, L row, int shift) throws IgniteCheckedException { assert row != null; int high = cnt - 1; while (low <= high) { int mid = (low + high) >>> 1; int cmp = compare(lvl, io, buf, mid, row); if (cmp == 0) cmp = -shift; // We need to fix the case when search row matches multiple data rows. //noinspection Duplicates if (cmp < 0) low = mid + 1; else if (cmp > 0) high = mid - 1; else return mid; // Found. } return -(low + 1); // Not found. }

The compare method is different for different descendants of BPlusTree . The negative index is encoded for the absence of an element with a position where it could be. It also does Collections.binarySearch from the standard library.

Pay attention to the following lines.

 if (cmp == 0) cmp = -shift;

For the findOne operation, this code does nothing, since shift is set to zero, i.e. if the same keys are in the tree, they will find an arbitrary one.

However, if you search for a range like this, then the elements will be lost, so there shift is set to 1 . As a result, the search does not find the element even if it is there , but it does not matter for the range search.

List of sheets

To bypass the range effectively, the sheets are knitted into a list. As a search result, a BPlusTree.ForwardCursor returned, which goes from sheet to sheet. Apparently, the passage through the cursor is not isolated from other operations in the tree, since when you pass the lock is taken only on one page. The synchronization mechanism that protects access to cursor methods is not found.

Conclusion

Since B + trees in Apache Ignite are young compared to other relational databases, we get the necessary set for production ready B + trees:

WAL gives cheap security, as a result the tree is rarely updated on disk.
Offheap with data in serialized form.
Concurrency - for loaded into the memory of parts of the tree.

Source: https://habr.com/ru/post/413749/

All Articles