Secure communication in distributed systems

Hi Habr!

My name is Alex Solodky, I am a PHP developer at Badoo. And today I will share the text version of my report for the first Badoo PHP Meetup. Video of this and other reports from the mitap can be found here .

Any system consisting of at least two components (and if you have both PHP and a database, then these are already two components), encounters entire classes of risks in the interaction between these components.

The platform department in which I work integrates new internal services with our application. And solving these problems, we have gained experience, which I want to share.

Our backend is a PHP monolith that interacts with many services (there are about fifty self-written ones). Services rarely interact with each other. But the problems that I am talking about in the article are also relevant for microservice architecture. Indeed, in this case, the services are very actively interacting with each other, and the more interaction you have, the more problems you have.

Consider what to do when the service crashes or fails, how to organize the collection of metrics and what to do when all of the above does not save you.

Service crash

Sooner or later the server on which your service is installed will fall. This will surely happen, and you will not be able to defend against it - only reduce the likelihood. You can fail the hardware, the network, the code, the bad deployment - anything. And the more servers you have, the more often this will happen.

How to make your services survive in a world in which servers are constantly falling? The general approach to solving this class of problems is redundancy.

Redundancy is used everywhere at different levels: from hardware to entire data centers. For example, RAID1 to protect against the failure of the hard drive or a backup power supply from your server, in case of failure of the first. Also this scheme is widely applied to databases. For example, for this you can use master-slave.

Consider typical problems with redundancy on the example of the simplest scheme:

The application communicates exclusively with the master, while in the background, asynchronously, data is transferred to the slave. When the master falls, we will switch to the slave and continue to work.

After the master is restored, we will simply make a new slave out of it, and the old one will turn into a master.

The scheme is simple, but even it has many nuances characteristic of any redundant schemes.

Load

Suppose that one server from the example above can handle approximately 100k RPS. Now the load is 60k RPS, and everything works like a clock.

But over time, the load on the application, and hence the load on the master, increases. You may want to balance it by transferring part of the reading to the slave.

Looks pretty good. Holds the load, the server is no longer idle. But this is a bad idea. It is important to remember why you initially raised the slave - to switch to it in case of problems with the main one. If you start to load both servers, then when your master falls - and he sooner or later falls - you have to switch the main traffic from the master to the backup server, and he is already loaded. Such an overload will either make your system terribly slow or completely disable it.

Data

The main problem when adding fault tolerance to a service is the local state. If your stateless service, i.e., does not store any changeable data, then its scaling is not a problem. Just raise as many instances as we need, and balance the requests between them.

In the case when the stateful service, we can no longer do that. We need to think about how to store the same data on all instances of our service so that they remain consistent.

To solve this problem, one of two approaches is used: either synchronous or asynchronous replication. In general, I advise you to use the asynchronous version, since it is generally easier and faster to write, and, depending on the circumstances, you should see if you need to switch to the synchronous one.

An important caveat that should be considered when working with asynchronous replication is eventual consistency . This means that at a specific point in time on different slaves, the data may lag behind the master at unpredictable and different time intervals.
Accordingly, you cannot read data every time from a random server, because then different answers may come to the same user requests. To circumvent this problem, the sticky-sessions mechanism is used, which ensures that all requests from one user go to one instance.

The advantages of the synchronous approach are that the data are always in a consistent state, and the risk of losing the data is lower (since they are considered recorded only after all the servers have done this). However, this has to be paid for by the write speed and complexity of the system itself (for example, various quorum algorithms for protection against split-brain ).

findings

Reserve. If the data itself and the availability of a particular service are important, then make sure that your service survives the fall of a particular machine.
When calculating the load, take into account the fall of some servers. If there are four servers in your cluster, make sure that when one drops, the three remaining ones will pull up.
Choose the type of replication depending on the tasks.
Do not put all your eggs in one basket. Make sure that you distribute your backup servers far enough. Depending on the criticality of service availability, your servers may be in different racks in the same data center, or in different data centers in different countries. It all depends on how much you want a global catastrophe and are ready to survive.

Stupid service

At some point, your service may start to work very slowly. This problem can occur for a variety of reasons: excessive load, network lags, problems with hardware, or errors in the code. It looks like a not too terrible problem, but in fact it is more insidious than it seems.

Imagine: the user requests a certain page. We synchronously and consistently turn to four demons to draw it. They quickly respond, everything works well.

Suppose this case is handled using nginx with a fixed number of PHP FPM workers (with ten, for example). If each request is processed approximately 20 ms, then with the help of simple calculations it can be understood that our system is able to process about five hundred requests per second.

What happens when one of these four services starts to blunt, and the processing of requests to it will increase from 20 ms to 1000 ms timeout? Here it is important to remember that when we work with a network, the delay can be infinitely long. Therefore, you must always set a timeout (in this case it is equal to a second).

It turns out that the backend is forced to wait for the timeout to expire and get the error from the daemon. This means that the user receives the page in one second instead of ten milliseconds. Slowly, but not fatally.

But what is really the problem here? The fact is that when we have each request processed a second, the bandwidth tragically drops to ten requests per second. And the eleventh user will not be able to get an answer, even if he requested a page that is not related to the blunt service. Just because all ten workers are busy waiting for a timeout and cannot process new requests.

It is important to understand that by increasing the number of workers, this problem is not solved. After all, every worker needs a certain amount of RAM for his work, even if he doesn’t do the actual work, but just hangs waiting for a timeout. Therefore, if you do not limit the number of workers in accordance with the capabilities of your server, then raising more and more new workers will put the server entirely. This case is an example of a cascade failure, when the fall of a single, even if not critical for the service user, causes the entire system to fail.

Decision

There is a pattern called circuit breaker . His task is quite simple: he must at some point cut down a stupid service. To do this, between the service and the workers put some proksya. This can be either a PHP code with a repository or a daemon on the local host. It is important to note that if you have several instances (your service is replicated), then this proxy should track each one separately.

We wrote our implementation of this pattern. But not because we love to write code, but because when we solved this problem many years ago, there were no ready-made solutions.

Now I will talk in general terms about our implementation and how it helps to avoid this problem. And more about her and her differences from other solutions can be heard in the report of Mikhail Kurmaev on Highload Siberia in late June. Decoding his report will also be in this blog.

It looks like this:

There is an abstract service Sphinx, before which stands the circuit breaker. Circuit breaker stores the number of active connections to a particular daemon. As soon as this value reaches the threshold, which we set as a percentage of the available FPM workers on the machine, we consider that the service has started to slow down. When the first threshold is reached, we send a notification to the person responsible for the service. Such a situation is either a sign that the limits need to be revised, or a precursor of problems with dullness.

If the situation worsens, and the number of slowing down workers reaches the second threshold value - we have about 10% in production - we cut down this host completely. More precisely, the service actually continues to work, but we stop sending requests to it. Circuit browser discards them and immediately gives the workers an error, as if the service is down.

From time to time we automatically skip the request from some worker to see if the service has come to life anyway. If he answers adequately, then we again include him in the work.

All this is done in order to reduce the situation to the previous scheme with replication. Instead of waiting a second, before we realize that the host is unavailable, we immediately get an error and go to the backup host.

Implementations

Fortunately, Open Source does not stand still, and today you can take a ready-made solution on Github.

There are two main approaches to the implementation of the circuit breaker: a library running at the code level, and a standalone daemon that proxies requests through itself.

The library option is more suitable if you have one main monolith in PHP that interacts with several services, and the services with each other barely communicate. Here are a few implementations available:

If you have many services in different languages, and they all interact with each other, then the variant at the code level will have to be duplicated in all these languages. This is inconvenient in support, and ultimately leads to discrepancies in implementations.

Putting one demon in this case is much easier. In this case, you do not have to specifically edit the code. The demon tries to make the interaction transparent. However, this option is much more complex architectural .

Here are a few options (there is richer functionality, but there is also a circuit designer):

findings

Do not rely on the network.
All network requests must have a timeout, because the network can render indefinitely.
Use circuit breaker if you want to avoid cascading application failure due to slowing down one small service.

Monitoring and telemetry

What does it give

Predictability. It is important to predict what the load is now and what will be in a month in order to increase the number of service instances in a timely manner. This is especially true if you are dealing with an iron infrastructure, since ordering new servers takes time.
Investigation of incidents. Sooner or later something will go wrong anyway, and you will have to investigate it. And it is important to have enough data to understand the problem and be able to prevent such situations in the future.
Accident prevention. Ideally, you should understand which patterns lead to accidents. It is important to track these patterns and promptly notify the team about them.

What to measure

Integration metrics

Since we are talking about the interaction between services, we monitor everything that is possible in relation to the communication service with the application. For example:

number of requests;
request processing time (including percentiles);
number of logic errors;
number of system errors.

It is important to distinguish between logic errors and system errors. If the service crashes, this is a regular situation: just switch to the second. But it is not so scary. If you start some kind of logic error, for example, strange data enters the service or leaves it, then this should be investigated. Most likely, the error is related to a bug in the code. She herself will not pass.

Internal metrics

By default, the service is a black box that does its job as it is not clear how. It is advisable to still understand and collect the maximum data that the service can provide. If the service is a specialized database that stores some data of your business logic, keep track of exactly how much data, what type they are, and other content metrics. If you have asynchronous interaction, it is also important to keep track of the queues through which your service communicates: their arrival and departure speed, time at different stages (if you have several intermediate points), the number of events in the queue.

Let's see what metrics can be collected using memcached as an example:

hit / miss ratio;
response time for different operations;
RPS different operations;
breakdown of the same data by different keys;
top loaded keys;
all internal metrics that the stats command gives.

How to do it

If you have a small company, a small project and few servers, then a good solution would be to connect some kind of SaaS for collecting and viewing - it is easier and cheaper. In this case, usually SaaS have extensive functionality, and do not have to take care of many things. Examples of such services:

Alternatively, you can always install on your own Zabbix, Grafana or any other self-hosted solution.

findings

Collect all the metrics you can. Data is not superfluous. When you have to investigate something, you will thank yourself for your foresight.
Do not forget about asynchronous interaction. If you have any queues that arrive gradually, it is important to understand how quickly they arrive, what happens to your events at the interface between services.
If you write your service, teach him to give statistics on the work. Part of the data can be measured at the integration layer when we communicate with this service. The rest of the service should be able to give the conditional command stats. For example, in all of our services on Go, this functionality is standard.
Customize triggers. Charts are good, but only while you look at them. It is important that you have a customized system that will tell you if something goes wrong.

Memento mori

And now a little about sad things. It may feel that the above is a panacea, and now nothing will ever fall. But even if you apply everything described above, something will ever fall. It is important to take this into account.
The reasons for the fall are many. For example, you could choose an insufficiently paranoid replication scheme. A meteorite fell into your data center, and then into the second. Or you just unrolled the code with a cunning error that unexpectedly surfaced.

For example, in Badoo there is a page "People nearby". There, users are looking for other people nearby to chat with them.

Now for rendering the page, the backend makes synchronous calls to about seven services. For clarity, reduce this number to two. One service is responsible for rendering the central unit with photos. The second - for the block of advertising on the left below. There can get those who want to become more visible. If we have a service that displays this advertisement, the block simply disappears.

Most users do not even know about this fact: our team responds quickly, and soon the unit simply appears again.

But not every functionality we can quietly remove. If we drop the service responsible for the central part of the page, it can’t be hidden. Therefore, it is important to tell the user in his language what is happening.

It is also desirable that the failure of one service does not lead to a cascade failure. For each service, a code must be written that handles its crash, otherwise the application may fall entirely.

But that's not all. Sometimes something falls without which you cannot live at all. For example, a central database or session service. It is important to correctly work out and show the user something adequate, somehow entertain him, say that everything is under control. At the same time, it is important that everything is really under control, and monitor workers are notified of the problem.

Die right

Get ready for the fall. There is no silver bullet, so always lay a straw in case of a complete drop in service, even if your reservation is used.
Do not allow cascade failures when problems with one of the services kill the entire application.
Disable noncritical user functionality. This is normal. Many services are used only for internal needs and do not affect the functionality provided. For example, statistics service. It doesn’t matter to the user whether statistics are collected from you or not. It is important to him that the site worked.

Results

To reliably integrate a new service into the system, we write a special API wrapper around Badoo, which takes on the following tasks:

load balancing;
timeouts;
logic failover;
circuit breaker;
monitoring and telemetry;
authorization logic;
data serialization and deserialization.

It is better to make sure that all these items are also covered in your integration layer. Especially, if you use a ready Open-Source API client. It is important to remember that the integration layer is a source of increased risk of cascading failure of your application.

Thanks for attention!

Literature

Source: https://habr.com/ru/post/413555/

All Articles

Secure communication in distributed systems

Service crash

Load

Data

findings

Stupid service

Decision

Implementations

findings

Monitoring and telemetry

What does it give

What to measure

How to do it

findings

Memento mori

Die right

Results

More articles: