Hello! Below is a transcript of a talk given at the Apache Ignite community meetup in St. Petersburg on June 20. The slides are available for download via the link.
There is a whole class of problems that novice users run into. They have just downloaded Apache Ignite, launched it the first two, three, ten times, and come to us with questions that are all solved in similar ways. So I propose creating a checklist that will save you a lot of time and nerves while building your first applications on Apache Ignite. We will talk about preparing for launch; how to get a cluster assembled; how to run computations in the compute grid; and how to prepare your data model and code so that you can write data to Ignite and then read it back successfully. And, most importantly: how not to break anything from the very start.
Preparing for launch - configure logging
We need logs. If you have ever asked a question on the Apache Ignite mailing list or on StackOverflow, such as "why is everything hanging?", you were most likely asked first to send all the logs from all the nodes.
Naturally, Apache Ignite has logging enabled by default. But there are nuances. First, Apache Ignite writes relatively little to stdout: by default it runs in the so-called quiet mode. In stdout you will see only the most severe errors, while everything else is stored in a file whose path Apache Ignite prints at the very beginning (by default - ${IGNITE_HOME}/work/log). Do not erase it, and keep the logs for as long as possible - they can be very useful.
stdout of Ignite at default startup

To learn about problems more easily, without digging through separate files and without setting up separate monitoring for Apache Ignite, you can start it in verbose mode with the command

ignite.sh -v

and then the system will write about all events to stdout, along with the rest of the application's logging.
Check the logs! You can very often find solutions to your problems in them. If the cluster has fallen apart, the log very often contains messages like "Increase such-and-such a timeout in such-and-such a configuration. We dropped off the cluster because of it. It is too small. The network is not good enough."
Cluster assembly
Unwelcome guests
The first problem many people face is uninvited guests in the cluster. Or you yourself are the uninvited guest: you start a fresh cluster and suddenly see that in the very first topology snapshot, instead of one node, there are two servers right away. How so? You started only one.
A message saying that there are two nodes in the cluster

The thing is that by default Apache Ignite uses multicast, and at startup it searches for all other Apache Ignite nodes in the same subnet and the same multicast group. If it finds any, it tries to connect to them - and if the connection fails, it does not start at all. Because of this, extra nodes from clusters on my colleagues' laptops regularly appeared in the cluster on my work laptop, which of course is not very convenient.
How to protect yourself from this? The easiest way is to configure static IPs. Instead of TcpDiscoveryMulticastIpFinder, which is used by default, there is TcpDiscoveryVmIpFinder. List in it all the IPs and ports you connect to. This is much more convenient and will protect you from a large number of problems, especially in development and test environments.
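For illustration, here is a minimal sketch of such a configuration in Java. The hosts and ports are made up, and, anticipating the next section, each host lists a single discovery port rather than a range:

import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class StaticDiscoveryStartup {
    public static void main(String[] args) {
        // Static list of addresses instead of the default multicast lookup.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("10.0.0.1:47500", "10.0.0.2:47500"));

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoverySpi);

        Ignite ignite = Ignition.start(cfg);
    }
}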
Too many addresses
The next problem. You have disabled multicast and start the cluster, and in a single config file you have listed a decent number of IPs from different environments. And it turns out that the first node of a fresh cluster takes 5-10 minutes to start, while all subsequent ones connect to it in 5-10 seconds.
Take a list of three IP addresses and specify a range of 10 ports for each - 30 TCP addresses in total. Since Apache Ignite must try to connect to an existing cluster before creating a new one, it will check each address in turn. On your laptop this may be harmless, but in some cloud environments port-scan protection is often enabled: when you connect to a closed port on some IP address, you get no response at all until the timeout expires. By default it is 10 seconds. And with 3 addresses of 10 ports each, you get 3 * 10 * 10 = 300 seconds of waiting - the same 5 minutes it takes to connect.
The solution is obvious: do not list extra ports. If you have three IPs, you hardly need the default range of 10 ports. That is handy when you are testing something on a local machine and running 10 nodes, but in real systems one port is usually enough. Alternatively, disable port-scan protection on the internal network, if you have that option.
The third common problem is IPv6. You may see strange network error messages: could not connect, could not send a message, node segmented. This means you have been dropped from the cluster. Very often such problems are caused by a mixed IPv4 and IPv6 environment. It is not that Apache Ignite does not support IPv6, but at the moment there are certain problems with it.
The simplest solution is to give the JVM the option
-Djava.net.preferIPv4Stack=true
Then Java and Apache Ignite will not use IPv6. This solves a significant part of the problems with collapsing clusters.
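If you embed Ignite in your own application instead of launching it via ignite.sh, one possible approach is to set the property programmatically - a sketch, assuming it runs before any networking classes are initialized:

public class Ipv4OnlyStartup {
    public static void main(String[] args) {
        // Must happen before anything touches the networking stack,
        // so make it the very first line of main().
        System.setProperty("java.net.preferIPv4Stack", "true");

        org.apache.ignite.Ignite ignite = org.apache.ignite.Ignition.start();
    }
}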
Preparing the code base - serialize correctly
The cluster is assembled; now you need to run something on it. One of the most important points of interaction between your code and Apache Ignite code is the Marshaller, that is, serialization. To write anything to memory or persistence, or send it over the network, Apache Ignite first serializes your objects. And you may see messages beginning with the words "cannot be written in binary format" or "cannot be serialized using BinaryMarshaller". There will be only one such warning in the log, but a noticeable one. It means you need to tweak your code a little more so that it gets along with Apache Ignite.
Apache Ignite uses three serialization mechanisms:

- JdkMarshaller - regular Java serialization;
- OptimizedMarshaller - slightly optimized Java serialization, but with the same mechanisms;
- BinaryMarshaller - serialization written specifically for Apache Ignite and used everywhere under its hood. It has a number of advantages. In some places it lets us avoid extra serialization and deserialization, and in others we can even get a non-deserialized object from the API and work with it directly in the binary format, somewhat like JSON (see the sketch right after this list).
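As an illustration of that last point, here is a hedged sketch of reading a field from a non-deserialized object via the binary API; the cache name and field are made up:

// Ask the cache to return values as BinaryObjects instead of POJOs.
IgniteCache<Long, BinaryObject> people = ignite.<Long, Object>cache("people").withKeepBinary();

BinaryObject person = people.get(1L);

// Read a single field without deserializing the whole object.
Long orgId = person.field("orgId");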
BinaryMarshaller can serialize and deserialize your POJOs that contain nothing but fields and simple methods. But if you have custom serialization via readObject() and writeObject(), or if you use Externalizable, then BinaryMarshaller cannot cope. It sees that your object cannot be serialized by simply writing out its non-transient fields, gives up, and falls back to OptimizedMarshaller.
To make such objects work with Apache Ignite, implement the Binarylizable interface. It is very simple.
For example, take the standard TreeMap from Java. It has custom serialization and deserialization via readObject() and writeObject(): it first writes out a few fields, then the length, and then the data itself to the OutputStream.
Implementing TreeMap.writeObject()
private void writeObject(java.io.ObjectOutputStream s) throws java.io.IOException {
    // Write out the comparator and any other default fields
    s.defaultWriteObject();

    // Write out size (number of mappings)
    s.writeInt(size());

    // Write out keys and values (alternating)
    for (Map.Entry<K, V> e : entrySet()) {
        s.writeObject(e.getKey());
        s.writeObject(e.getValue());
    }
}
writeBinary() and readBinary() from Binarylizable work in much the same way: BinaryTreeMap wraps a regular TreeMap and writes it to the OutputStream. This method is easy to write, and it gives a decent performance boost.
Implementing BinaryTreeMap.writeBinary()
public void writeBinary(BinaryWriter writer) throws BinaryObjectException {
    BinaryRawWriter rawWriter = writer.rawWriter();

    rawWriter.writeObject(map.comparator());

    int size = map.size();
    rawWriter.writeInt(size);

    for (Map.Entry<Object, Object> entry : ((TreeMap<Object, Object>)map).entrySet()) {
        rawWriter.writeObject(entry.getKey());
        rawWriter.writeObject(entry.getValue());
    }
}
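For symmetry, here is a sketch of what the matching readBinary() could look like, under the same assumption of a map field inside BinaryTreeMap:

public void readBinary(BinaryReader reader) throws BinaryObjectException {
    BinaryRawReader rawReader = reader.rawReader();

    // Read back the comparator, then the size, then the entries.
    Comparator<Object> comparator = rawReader.readObject();
    TreeMap<Object, Object> result = comparator == null ? new TreeMap<>() : new TreeMap<>(comparator);

    int size = rawReader.readInt();

    for (int i = 0; i < size; i++)
        result.put(rawReader.readObject(), rawReader.readObject());

    map = result;
}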
Run in Compute Grid
Ignite lets you not only store data but also run distributed computations. How do we run a lambda so that it spreads to all the servers and executes?
First, what is the problem with these code examples?
What is the problem?
Foo foo = …;
Bar bar = ...;

ignite.compute().broadcast(
    () -> doStuffWithFooAndBar(foo, bar)
);
And like this?
Foo foo = …;
Bar bar = ...;

ignite.compute().broadcast(new IgniteRunnable() {
    @Override public void run() {
        doStuffWithFooAndBar(foo, bar);
    }
});
As many who know the pitfalls of lambdas and anonymous classes will guess, the problem is capturing variables from the enclosing scope. For example, we send a lambda that uses a couple of variables declared outside of it. This means those variables will travel with it over the network to all the servers. And then the same questions arise: do these objects get along with BinaryMarshaller? How big are they? Do we even want them to be transferred anywhere, or are these objects so large that it is better to send some kind of ID and recreate the objects inside the lambda on the other side?
An anonymous class is even worse. While a lambda can leave this behind and not drag it along if it is unused, an anonymous class will capture it no matter what, and that usually leads to nothing good.
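A hedged sketch of a common way around this: capture only small identifiers and rebuild the heavy objects on the receiving side (getId(), loadFoo() and loadBar() here are hypothetical helpers):

// Capture only the IDs - cheap to serialize and marshaller-friendly.
long fooId = foo.getId();
long barId = bar.getId();

ignite.compute().broadcast(() -> {
    // Recreate or look up the heavy objects locally on each node.
    Foo localFoo = loadFoo(fooId);
    Bar localBar = loadBar(barId);

    doStuffWithFooAndBar(localFoo, localBar);
});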
The next example: again a lambda, but one that uses the Apache Ignite API a little.
Use Ignite inside a compute closure incorrectly
ignite.compute().broadcast(() -> {
    IgniteCache foo = ignite.cache("foo");

    String sql = "where id = 42";
    SqlQuery qry = new SqlQuery("Foo", sql).setLocal(true);

    return foo.query(qry);
});
In the original version, it takes the cache and runs some SQL query locally against it. This is a pattern for when you need to send remote nodes a task that works only with their local data.
What is the problem? The lambda again captures a reference, only now not to a data object but to the local Ignite on the node from which we send it. And it even works, because the Ignite object has a readResolve() method that, during deserialization, replaces the Ignite that arrived over the network with the local one on the node we sent it to. But this, too, sometimes leads to undesirable consequences.
Mostly, you simply transfer more data over the network than you would like. If you need to reach Apache Ignite or one of its interfaces from code whose launch you do not control, the simplest thing is to use the Ignition.localIgnite() method. You can call it from any thread that was created by Apache Ignite and get a reference to the local instance. If you have lambdas, services, anything at all, and you realize you need Ignite there, I recommend this method.
Use Ignite inside compute closure correctly - via localIgnite()
ignite.compute().broadcast(() -> {
    IgniteCache foo = Ignition.localIgnite().cache("foo");

    String sql = "where id = 42";
    SqlQuery qry = new SqlQuery("Foo", sql).setLocal(true);

    return foo.query(qry);
});
And the last example in this part. Apache Ignite has a Service Grid, with which you can deploy microservices directly in the cluster, and Apache Ignite will help keep the right number of instances online at all times. Suppose that in such a service we also need a reference to Apache Ignite. How do we get it? We could use localIgnite(), but then we would have to save that reference into a field manually.
Service stores Ignite in a field the wrong way - taking it as a constructor argument
MyService s = new MyService(ignite);

ignite.services().deployClusterSingleton("svc", s);

...

public class MyService implements Service {
    private Ignite ignite;

    public MyService(Ignite ignite) {
        this.ignite = ignite;
    }

    ...
}
There is a simpler way. We still have full-fledged classes, not lambdas, so we can annotate a field with @IgniteInstanceResource. When the service is created, Apache Ignite will inject itself into that field, and it can be used freely. I strongly advise doing this rather than trying to pass Apache Ignite and its children through the constructor.
Service uses @IgniteInstanceResource
public class MyService implements Service {
    @IgniteInstanceResource
    private Ignite ignite;

    public MyService() {
    }

    ...
}
Write and read data
Watching the baseline
We now have an Apache Ignite cluster and prepared code.
Let's imagine this scenario:
- One REPLICATED cache - copies of the data are available on all nodes;
- Native persistence is enabled - we write to disk.
We start one node. Since native persistence is enabled, we need to activate the cluster before working with it. We activate it. Then we start a few more nodes.
Everything seems to work: writes and reads go through fine. All nodes have copies of the data, and you can safely stop one node. But if you stop the very first node, the one from which the launch started, everything breaks: the data disappears and operations stop going through.
The reason is baseline topology - the set of nodes that store persistence data. All other nodes will not hold persistent data.
This set of nodes is determined for the first time at the moment of activation. Nodes that you add later are no longer included among the baseline nodes. That is, the baseline topology consists of just one node, the very first one, and when it is stopped everything breaks. To prevent this from happening, first start all the nodes and then activate the cluster. If you need to add or remove a node, the command

control.sh --baseline

shows which nodes are listed there. The same script can update the baseline to the current topology.
An example of using control.sh

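The same operations are also available from the Java API; a minimal sketch for Ignite 2.x as of the time of this talk:

Ignite ignite = Ignition.start();

// With native persistence the cluster starts deactivated;
// activating it here also records the initial baseline topology.
ignite.cluster().active(true);

// Later, after additional nodes have joined, reset the baseline
// to the current topology version.
ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());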
Data collocation
Now that we know the data is saved, let's try to read it. We have SQL support, so you can run SELECT - almost as in Oracle - while still scaling and running on any number of nodes, with the data stored in a distributed way. Let's look at this model:
public class Person {
    @QuerySqlField
    public Long id;

    @QuerySqlField
    public Long orgId;
}

public class Organization {
    @QuerySqlField
    private Long id;
}
The query

SELECT * FROM Person as p JOIN Organization as o ON p.orgId = o.id

will not return all the data. What's wrong?
A Person refers to an Organization by ID - a classic foreign key. But if we try to join the two tables and send such an SQL query, then with several nodes in the cluster we will not get all the data back.
The thing is that by default SQL JOIN works only within a single node. If SQL constantly traveled around the cluster collecting data to return a complete result, it would be unbearably slow, and we would lose all the advantages of a distributed system. Instead, Apache Ignite looks only at local data.
To get correct results, we need to place the data together (collocation). That is, for Person and Organization to be joined correctly, the data of both tables must be stored on the same node.
How to do this? The simplest solution is to declare an affinity key. This is the value that determines on which node, in which partition, in which group of records a given value will reside. If we declare the organization ID in Person as the affinity key, it means that persons with a given organization ID must be stored on the same node as the organization with that ID.
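In Java this is usually expressed on the cache key, for example with a composite key class and the @AffinityKeyMapped annotation. A sketch, where PersonKey is an illustrative name:

public class PersonKey {
    private Long id;

    // All persons with the same orgId land in the same partition,
    // and therefore on the same node as their Organization.
    @AffinityKeyMapped
    private Long orgId;

    public PersonKey(Long id, Long orgId) {
        this.id = id;
        this.orgId = orgId;
    }

    // A real key class also needs proper equals() and hashCode().
}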
If for some reason you cannot do this, there is another, less efficient solution: enable distributed joins. This is done through the API, and the exact procedure depends on whether you use Java, JDBC, or something else. JOINs will then execute more slowly, but they will return correct results.
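In the Java API, for example, this is a per-query flag; a sketch using the model above, where cache is an IgniteCache obtained elsewhere:

SqlFieldsQuery qry = new SqlFieldsQuery(
    "SELECT * FROM Person as p JOIN Organization as o ON p.orgId = o.id")
    .setDistributedJoins(true);

// Slower than a collocated join, but the results are complete.
cache.query(qry).getAll();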
Let's consider how to work with affinity keys. How do we decide that a given ID or field is suitable as an affinity key? If we say that everyone with the same orgId will be stored together, then one orgId is one indivisible group: we cannot spread it across multiple nodes. If there are 10 organizations in the database, there will be 10 indivisible groups that can be placed on 10 nodes. If there are more nodes in the cluster, all the "extra" nodes will be left without groups. This is very hard to determine at runtime, so think it over in advance.
If you have one big organization and 9 small ones, the groups will differ in size. But Apache Ignite does not look at the number of records in affinity groups when distributing them among nodes. So it will not put one group on one node and the other 9 on another to even out the distribution; rather, it will put them 5 and 5 (or 6 and 4, or even 7 and 3).
How do you make the data distribute evenly? Suppose we have

- K keys;
- A distinct affinity keys;
- P partitions, that is, the large groups of data that Apache Ignite distributes between nodes;
- N nodes.

Then you need to satisfy the condition

K >> A >> P >> N

where >> means "much greater than", and the data will be distributed relatively evenly.
By the way, the default is P = 1024.
Most likely, you will not achieve a perfectly even distribution. Apache Ignite 1.x, up to 1.9, had this: it was called FairAffinityFunction and did not work very well - it caused too much traffic between nodes. The current algorithm is called RendezvousAffinityFunction. It does not give an absolutely fair distribution; the skew between nodes will be within plus or minus 5-10%.
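If you ever need a different number of partitions, it can be set per cache. A minimal sketch, with illustrative names and the default value spelled out:

CacheConfiguration<PersonKey, Person> ccfg = new CacheConfiguration<>("persons");

// 1024 partitions is the default; shown here only to make the knob visible.
ccfg.setAffinity(new RendezvousAffinityFunction(false, 1024));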
Checklist for new Apache Ignite users
- Configure, read, and store logs
- Turn off multicast; list only the addresses and ports you actually use
- Disable IPv6
- Prepare your classes for BinaryMarshaller
- Watch your baseline topology
- Configure affinity collocation