UPD: the second part of the article is ready. Hi, Habr! My name is Alexander Izmailov. At Badoo, I lead the team of release engineers. I know that in many companies you can send code changes to a specially trained person who looks at them and puts them where they belong (for example, this is exactly how the code of Git itself is handled). I want to tell you how we automated this process at our company.
My story consists of two parts. In this part, I will talk about what we wanted from the new system, what its first version looked like, and why we eventually had to redo it. In the second part, we will discuss how we reworked the system and what unexpected bonuses that brought us.
Once upon a time, everyone in our company could make their changes directly to the master branch and deploy them personally. For this we had written a special utility, MSCP, which worked quite primitively: it copied the changed files from one machine to another and set the necessary permissions.
As time went on, the company grew, and we had to think about automating our processes, including the deployment of small changes. That is how we got the idea of a patch system. First of all, we wanted it to let any developer send their changes to Git and distribute them to the servers. For our part, we required that the changes be reviewed by another developer and be tied to a task in our bug tracking system (we use Jira).
Patch collector 6000
I must say that not all developers liked these requirements. To us, spending a couple of minutes on creating a task did not seem like a big deal, and it would mean more deliberate use of the system. But the developers pushed back, arguing that deploying a change takes several times less time than creating a new ticket. As a result, we still have “universal” tasks with hundreds of patches attached to them.
In addition, the system was designed to fix urgent problems, and finding a reviewer for a patch at three in the morning can be difficult.
What do we want?
Yes, just the light in the window... Our problem could be divided into two parts: we needed a way of getting changes into the repository and a way of deploying those changes.
We settled the first question rather quickly: we made a form where you attach your changes and specify the details (the reviewer and the task).
On a second page, you could review the changes and accept or reject them.

After confirmation, the changes landed in master.
The second question was how to deliver these changes to the servers quickly. Today many use continuous integration, and it could have coped with the task well if our “honest” build and deployment did not take so much time.
Honest build
Our build has always been quite complicated. The general principle was this: in a separate directory, we laid out the files exactly as they would lie on the destination servers; then we saved that state in a file-system snapshot and deployed it.
Into that directory we put the PHP code, taken from the repository as is, and added generated files to it (for example, templates and translations). Static assets we deployed separately; that is a rather complicated process that deserves an article of its own, but the result was a version map used to generate file links for users, and this map shipped together with the main code.
Next, the state of the directory had to be saved somewhere. For this we used a block device, which we called a loop. The entire directory was copied onto an empty device, which was then archived and delivered to dedicated “main” servers; during deployment the destination machines fetched the archives from them. Each archive was about 200 MB in size, and an unpacked loop weighed about 1 GB. A build without statics took us about five minutes.
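To make the idea more concrete, here is a minimal sketch of what building such a loop image could look like, assuming a Linux host with root privileges; the paths, sizes, and filesystem type are illustrative, not necessarily what we actually used.

```python
import os
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

IMAGE = "/tmp/build.img"   # hypothetical image path
MOUNT = "/mnt/build"       # hypothetical mount point

os.makedirs(MOUNT, exist_ok=True)
run(["truncate", "-s", "1G", IMAGE])            # empty 1 GB image file
run(["mkfs.ext4", "-F", "-q", IMAGE])           # create a filesystem on it
run(["mount", "-o", "loop", IMAGE, MOUNT])      # attach it as a loop device
run(["rsync", "-a", "/srv/build_dir/", MOUNT])  # copy the prepared code tree
run(["umount", MOUNT])                          # detach the loop device
run(["gzip", "-k", IMAGE])                      # archive it for delivery
```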
Honest deployment
First, we had to deliver the archive to the destination servers. We have thousands of them, so delivery has always been a big issue for us: we have several platforms (data centers), and the largest of them has on the order of a thousand servers with code. Trying to get the best performance (minimum time and resources), we tried various methods, from plain SCP to torrents, and eventually settled on UFTP. The method was fast (in good weather, about a minute), but unfortunately not trouble-free: every now and then something broke, and we had to run to the admins and network engineers.
After the archive (somehow) got to the servers, it had to be unpacked, which is also not free. This procedure looks especially expensive when you remember that it is performed thousands of times, albeit in parallel on different machines.
No build
So, honest deployment of changes took a lot of time, while for the patch system delivery speed was crucial, because we assumed it would be used when something was already broken. Therefore, we went back to the idea of using MSCP: fast and simple to implement. After the changes had landed in master, the changed files could be deployed one by one on a separate page.
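Roughly speaking, an MSCP-style deployment boils down to copying a file to each destination machine and setting permissions on it. Below is a simplified sketch of that idea, assuming key-based SSH access; the host names and paths are made up for illustration.

```python
import subprocess

HOSTS = ["web1.example.com", "web2.example.com"]  # hypothetical destinations

def deploy_file(local_path: str, remote_path: str) -> None:
    for host in HOSTS:
        # copy the changed file to the destination machine
        subprocess.run(["scp", "-q", local_path, f"{host}:{remote_path}"], check=True)
        # set the permissions the web server expects
        subprocess.run(["ssh", host, "chmod", "0644", remote_path], check=True)

deploy_file("build/App/Feature.php", "/local/www/App/Feature.php")
```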

It is alive
The system was up and running. Despite some grumbling over trifles, developers could do their work, and for that they needed neither access to master nor access to the servers.
But, of course, this deployment method had its problems. Some were predictable, and some we even managed to solve. Most of them were related to parallel editing of files.
One patch for multiple files
An example of a predictable problem: files were deployed one at a time. What do you do if you need to change several files and the changes in them are related? For example, I want to add a new method in one file and immediately use it in others. As long as there is no circular use of methods (see mutual recursion), it is enough to remember the correct order in which to deploy the files.
An honest solution
To solve the problem properly, we would have needed to replace several files atomically. For a single file, the solution is well known: use the rename operation. Suppose we have a file F and we need to replace its contents. We create a file TMP, write the new contents into it, and then rename TMP to F.
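As a minimal sketch of this single-file case (the path handling and the fsync call are my own additions, assuming TMP ends up on the same filesystem as F):

```python
import os

def replace_file(f_path: str, new_contents: bytes) -> None:
    tmp_path = f_path + ".tmp"
    with open(tmp_path, "wb") as tmp:
        tmp.write(new_contents)     # write the new version next to F
        tmp.flush()
        os.fsync(tmp.fileno())      # make sure the data is on disk first
    os.rename(tmp_path, f_path)     # atomically replace F with TMP
```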
Let's complicate the task. Suppose we have a directory D and we need to replace its contents. The rename operation does not help here, because it cannot replace a non-empty directory. However, there is a workaround: you can turn the directory D into a so-called symbolic link (symlink) in advance. The actual content then lives elsewhere, say in a directory D_1, and D is a link to D_1. When a replacement is needed, the new content is written to a directory D_2, a new link TMP pointing to it is created, and rename TMP to D works, because this operation can be applied to links.
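A rough sketch of this symlink switch, assuming D is already a symlink (for example, D -> D_1) and the new tree has been prepared in D_2; the names and paths are illustrative:

```python
import os

def switch_directory(d_link: str, new_dir: str) -> None:
    tmp_link = d_link + ".tmp"
    if os.path.lexists(tmp_link):   # clean up a leftover link, if any
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)   # TMP -> D_2, created next to D
    os.rename(tmp_link, d_link)     # atomically repoint D from D_1 to D_2

# usage (hypothetical paths): switch_directory("/local/www/code", "/local/www/code_2")
```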
This solution looks reasonable: you could swap the entire code directory by copying the old files and writing the new ones over them. The problem is that copying all of the code is slow and expensive. You could replace only the subdirectory where the files have changed, but then every subdirectory with code would have to be a link, because we cannot replace a populated directory during deployment. Besides looking very complicated, this scheme also needs extra restrictions so that two processes cannot simultaneously change the same directory, or a directory and one of its subdirectories.
In the end we did not find a proper technical solution, but we did figure out how to make life a little easier: we allowed several files to be deployed in one action in the interface. The developer specified the order in which to deploy the files, and the system delivered them.
Several patches per file
Things get harder when there is one file and several developers who want to change it. Say the first patch has been applied but not yet deployed; at this point a second patch arrives, and we are asked to deploy it. What do we do? It gets even more interesting if the second patch has been applied and at that moment we are asked to deploy the first.
It is probably worth clarifying that we always deployed only the latest version from master. Otherwise there could be other problems, such as deploying an old version on top of a new one.
We did not come up with a really good solution to this problem. We showed developers a diff between what they were deploying and what was on the machines at that moment, but this did not always help. For example, there could simply be a lot of changes, and the developer could be in a hurry or just lazy (anything can happen).
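The diff check itself is straightforward; here is a minimal sketch of the idea, comparing the file being deployed with the copy currently on one of the destination servers (the host and paths are, again, made up for illustration):

```python
import difflib
import subprocess

def deploy_diff(local_path: str, host: str, remote_path: str) -> str:
    # fetch the copy that is currently live on the destination machine
    deployed = subprocess.run(
        ["ssh", host, "cat", remote_path],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines(keepends=True)
    with open(local_path) as fh:
        incoming = fh.readlines()
    # show the developer what their deployment would actually change
    return "".join(difflib.unified_diff(
        deployed, incoming,
        fromfile=f"{host}:{remote_path}", tofile=local_path,
    ))
```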
Many patches, all changing the same files
This is the worst case, and one we do not even want to dwell on. If changes from several developers affected the same set of files, our patch system could not help much: all we could do was rely on the developers' attentiveness and their ability to communicate with each other. In theory, though, it was entirely possible to end up in a situation where, whatever the order of deployment, at some point the servers would be running partially non-working code.
Hardware problems
Another problem arose when one of the servers became unavailable for some reason. We had a mechanism for excluding such servers from deployment, and it worked quite well; the difficulties appeared when they came back into service. The thing is, the versions of configs and code on working servers are checked (we have a whole monitoring department!), and we make sure that when a server returns to the rotation, all its versions are up to date. But we had no versioning for patches at all: we simply copied new files over the current code.
We never came up with a neat way to version the deployed patches, and instead tried workarounds, such as running rsync from a neighboring machine at the end of deployment. But we still had no way to verify that everything was actually fine.
We went through several possible solutions to this problem; for example, we wanted to apply patches on the “main” servers (remember that we deploy a packed version, so we would have to apply the patch and then pack the version back up), but that turned out to be quite difficult to implement.
A spoonful of honey
But, apart from the problems, there were upsides too.
First, developers quickly figured out that besides fixing things, the patch system could sometimes be used to roll out new functionality, for example, when it was needed urgently. Like any company, we have force majeure situations. Previously they meant an unscheduled build that pulled in testers and release engineers; now a developer could deploy some changes on their own.
Second, we no longer needed a special person with the right permissions to fix something: any developer could deploy their own edits. And that is not all: deploying full builds became simpler overall, because problems were now divided into critical ones and those that could be fixed with patches. This allowed us to roll back less often and to decide more quickly whether a build had been deployed successfully.
Simply put, we liked the system, and it gained popularity. We kept trying to improve it, but we had to live with the problems described above for several more years. How we solved them, how the system works now, and how we almost ruined the New Year holidays while updating it, I will tell you in the second part of the article.