Patch me if you can: how we debug in production. Part 2

In the first part of this article, I described how we at Badoo built the first version of our patch system. In short, we needed a way to fix serious errors directly in production, and it had to be available to every developer. The first version was not without flaws, though: the deployment method we used guaranteed neither atomic patch deployment nor consistency of the code.

In this part, I will describe the new deployment method we came up with while trying to solve those problems, and how our patch system changed along with it.



Universal solution - Multiversional Deployment Kit


After yet another review of our system, Yura (youROCK) Nasretdinov said he had an idea for solving all our problems at once. All he asked for was a lot of time to rework the deployment system. That is how the concept of the Multiversional Deployment Kit, or MDK for short, was born (Yura compared it with other ways of deploying code in his talk at HighLoad++).

The new system was designed to change the way we deploy. For those who have not read the first part, here is a brief description of the deployment process: first we collect all the necessary files in one directory, then we save the state of that directory and deliver it to the servers.

Before the MDK era, we used block devices (that is, filesystem images), which we called loops, for storage and delivery. The directory was copied into an empty loop, which was then archived and sent to the servers.

In the new system, we version each file separately rather than the directory as a whole, so that a file's version uniquely corresponds to its contents. For directories there are maps: special files that record the versions of all the files in a directory. These maps are versioned as well.



Looks familiar? This is how objects in Git are arranged (the Git documentation covers this in detail, but you do not need it to follow this article).

As the version, we use the first eight characters of the MD5 hash of the file contents (of its hexadecimal representation, to be exact). The version is written at the end of a file name or at the beginning of a map name (so that a generated version map can be told apart from an ordinary file).
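A minimal sketch of this versioning scheme. Only the "first eight hex characters of MD5" rule and the position of the version (end of a file name, start of a map name) come from the text; the `.map` suffix, the map's line format, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Tuple


def file_version(content: bytes) -> str:
    """First eight hex characters of the MD5 hash of the contents."""
    return hashlib.md5(content).hexdigest()[:8]


def versioned_file_name(name: str, content: bytes) -> str:
    """The version goes at the end of an ordinary file name."""
    return f"{name}.{file_version(content)}"


def build_map(files: Dict[str, bytes]) -> Tuple[str, bytes]:
    """Build a directory map: one "<name> <version>" line per file.

    The line format and the "<version>.map" naming are hypothetical;
    the article only says the version comes first in a map name.
    """
    lines = [f"{name} {file_version(body)}" for name, body in sorted(files.items())]
    map_body = "\n".join(lines).encode()
    return f"{file_version(map_body)}.map", map_body
```

Because the map's own version is derived from the versions it lists, changing any file in the directory produces a map with a different name.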



The version of the code is the version of the map of the www root directory. To locate the current map, we keep a symbolic link (symlink) named current.map.

Why not use Git?
Although MDK borrows some ideas from Git, there are differences. The most important one is how files are stored in the working directory (that is, on the machines). Git keeps only the single current version there, whereas MDK keeps every available version of every file. At the same time, a single symlink, current.map, points to the current version of the code; autoload relies on it, and it can be switched atomically. By comparison, Git changes versions with git-checkout, which updates files one by one and is not atomic.
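The atomic switch can be illustrated with a common POSIX technique: create the new symlink under a temporary name, then rename it over the old one. This is a sketch of the general approach, not MDK's actual code; the function name and the `.tmp` suffix are assumptions.

```python
import os


def switch_current(root: str, new_map: str, link_name: str = "current.map") -> None:
    """Point <root>/<link_name> at new_map without a moment where it is missing."""
    link = os.path.join(root, link_name)
    tmp = link + ".tmp"  # hypothetical temporary name
    if os.path.lexists(tmp):
        os.remove(tmp)  # leftover from an interrupted switch
    os.symlink(new_map, tmp)
    # rename(2) replaces the old symlink in a single step on POSIX systems,
    # so readers see either the old target or the new one, never neither.
    os.replace(tmp, link)
```

A reader that resolves current.map at any moment gets one complete version of the code, which is exactly what a file-by-file checkout cannot promise.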

Build with MDK


At the end of a build, MDK is what saves the state of the directory. For this we have a dedicated place that we call the repository: a store of every file version that is valuable to us (that is, every version we may want to deploy). When the new contents of the directory are ready, we compute the versions of all files in it and upload the missing ones to the repository.
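The "upload only what is missing" step might look roughly like this. The repository is modeled here as a local directory and the `<path>.<version>` storage layout is an assumption; a real repository would more likely sit behind a network service.

```python
import hashlib
import os
import shutil
from typing import List


def file_version(path: str) -> str:
    """First eight hex characters of the MD5 of the file contents."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()[:8]


def publish_build(build_dir: str, repo_dir: str) -> List[str]:
    """Copy file versions the repository lacks; return their stored names."""
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(build_dir):
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            rel = os.path.relpath(src, build_dir)
            stored = f"{rel}.{file_version(src)}"  # hypothetical layout
            dst = os.path.join(repo_dir, stored)
            if os.path.exists(dst):
                continue  # this version is already in the repository
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copyfile(src, dst)
            uploaded.append(stored)
    return sorted(uploaded)
```

When only one file changes between builds, only one new object is stored, which is why a build that differs by a few files can be cheap.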

Deployment with MDK


During deployment, we run a script on each target server that checks whether all the required files are present and requests the missing ones from the repository. After that, all that remains is to switch to the new version by changing the current.map symlink.

How should this solve our problems?


The assumption was that if only a few files change in a new version, building and deploying it with the new system should take roughly as long as deploying a patch file by file. If so, we could simply generate a new version for every patch.

MDK implementation


MDK had one drawback: on the target machines, the name of every file must contain its version. That is what makes it possible to keep many versions of the same file in one directory, but it also means you cannot simply include include.php from code: you have to name a specific version. Add to this the bugs that could well remain in the deployment code and a new deployment algorithm that was more complex than the old one, and it becomes clear why we decided to roll the new system out in small steps. We started with literally one or two servers and gradually expanded the list, fixing problems as they came up.

Since the switch to the new system was going to take a long time, we had to think about how patches would work during the transition. At that time we deployed patches with mscp, a home-grown utility that copied files to the servers one by one. We taught it in advance to replace existing files on servers running MDK, but it could not add a new file to such servers (because that would require changing the file map). We did not want to build a very complex intermediate solution, since we were heading toward a bright future in which mscp would not be needed at all. In the end, we had to live with this problem. Developers had their share of suffering during the transition, but it now seems to us that it was worth it.

Do not trust anyone




A natural question is whether version collisions can happen in MDK, that is, whether two files with different contents can end up with the same version.

In fact, we are fairly well protected from such errors. We name files like this: {file name}.{version}, which means that much more than eight characters have to match for an error to occur.

But one day something went wrong. After one of the deployments, we noticed a rise in the number of HTTP 404 (file not found) errors. A short investigation showed that some static files were missing. It turned out that we had deployed a very old static map and were handing out links to files that were no longer supposed to be on the servers. But where did that map come from? In the first part of the article I mentioned that static files are deployed by a separate process and only the version map ships with the PHP code. When we generate a new MDK version, we upload the missing file versions to the repository, and nothing is ever deleted from it (there is plenty of space, so we do not mind). We also deploy to staging often, so the static version map is one of the files that changes more often than any other. All of this led to a collision. After checking the version, MDK decided that everything was fine, because a file with that version was already in the repository, and deployed it to the servers. Luckily, we spotted the error quickly.

Now, in addition to the version, we check the file size: if it matches the size in the repository, it is most likely the same file. In the worst case, we will have a story for a new article.
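The extra check can be sketched as follows. The repository is modeled as a dictionary from stored name to size; raising an exception on a mismatch is an assumption, since the article does not say what happens when the sizes differ.

```python
import hashlib
from typing import Dict


class VersionCollision(Exception):
    """Same name and version in the repository, but a different size."""


def versioned_name(name: str, content: bytes) -> str:
    return f"{name}.{hashlib.md5(content).hexdigest()[:8]}"


def needs_upload(repo_sizes: Dict[str, int], name: str, content: bytes) -> bool:
    """Decide whether this file version still has to be uploaded."""
    key = versioned_name(name, content)
    if key not in repo_sizes:
        return True  # unknown version: upload it
    if repo_sizes[key] == len(content):
        return False  # same version and same size: most likely the same file
    # Hypothetical handling: stop instead of silently reusing the old file.
    raise VersionCollision(f"{key}: {repo_sizes[key]} bytes stored, {len(content)} bytes built")
```

A size check does not rule out collisions between files of equal length; it only makes an already rare event rarer, which matches the "most likely" wording above.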

MDK - Christmas Thief




There is one more bug I want to tell you about, if only because it is amusing. As you might guess, we had a process that cleaned up old file versions on the target servers. In an attempt to fix one of our problems quickly, we made a fateful decision: we set the retention period to one day (instead of seven, as it had been before). It worked, and the problem went away. We even lived like that for a while.

At about five o'clock one Sunday morning, the phone rang in my bedroom. It was the on-call monitoring engineer: "Our scripts are not working. They say you know what's going on." To me it sounded like "The juicer in the office has burned out. They say you know what's going on." I knew how our script framework worked only from articles and stories, I had no "personal relationship" with it, and I had certainly never repaired it. But I logged in to the servers to see what was happening and found that the problem really was on our side: there was simply no code on the servers.

I deployed the code again, and everything started working. The bug turned out to be primitive: no new MDK version had been deployed on Saturday, and the cleanup script, as it turned out, never checked that it was not removing the current version. So at five in the morning it deleted the code from all the servers, right on schedule. Only after this story did we realize that, with the old settings, the same thing would have surfaced during any holiday seven days long, for example the New Year holidays, right on Christmas Eve. "Christ was born, the code was gone": we kept hearing that joke for a long time.
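The missing safeguard amounts to an "in use" check before deletion. Below is a minimal sketch; the article does not describe the actual fix, so the function, its parameters, and the flat directory layout are assumptions.

```python
import os
import time
from typing import Iterable, List, Optional


def cleanup_old_versions(
    directory: str,
    in_use: Iterable[str],
    max_age_days: float = 7,
    now: Optional[float] = None,
) -> List[str]:
    """Remove stale files, but never those referenced by the current version."""
    keep = set(in_use)
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip the current version's files, symlinks such as current.map,
        # and anything that is not a regular file.
        if name in keep or os.path.islink(path) or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

With a check like this, a quiet weekend or a week of holidays only delays cleanup; it cannot remove the code that is currently live.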

New patch system


Eventually the new deployment system was fully rolled out, and it was time to rework the patch system. We no longer needed mscp, and there was no longer any reason to avoid generating new versions. First, we changed the life cycle of a patch. Now, once the changes are approved, the patch goes back to the developer, who decides when it is ready to be deployed. The developer presses the Deploy button, after which we merge the patch into master, then generate and deploy a new MDK version. The developer does not need to be involved in this stage any further.

We achieved a very good deployment speed: changes reach the servers in about a minute. To get there, we had to resort to a couple of tricks. For example, we still do not generate static files or translations; instead we take their versions from the last deployed build. Because of this, the restriction on patching JS and CSS files remains.

Experiments


We really did manage to solve all the problems we had before. You no longer need to think about how to shape your changes so that file-by-file deployment does not break anything: it is enough not to touch static files, and everything will work.

But a new difficulty appeared. Previously, we let developers deploy their changes to one or a few servers just to make sure that everything worked. With the new system this capability disappeared, because master became the current version on every server without exception.



This gave rise to a new requirement for the patch system: a way to test changes on a small number of servers without merging them into master. We called this new functionality experiments.

For a developer, the process looks like this: once a patch is approved, a new page becomes available in the patch system interface where you can choose the servers to experiment on. This can be a server group, a single server, or any combination our system understands. The system deploys the patch and notifies the developer. The same page then shows a log of the latest errors from the affected servers.

We do not restrict developers: several of them can run experiments on the same servers. One can experiment on 10% of the cluster while another experiments on the whole cluster.

This became possible because we finally got the "patch version" we had been missing so badly. It is a version that, in theory, can be unique for each server. It looks like a string of comma-separated identifiers, for example "32,45,79". This means that the server must have all the changes from master plus patches 32, 45 and 79. For each such version we generate a separate MDK version: we take the latest changes from the main branch and then apply each of the patches in turn. If a conflict arises while generating any of the versions, we simply cancel the experiment for the most recent patch and notify the developer.
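A toy model of how such a version could be assembled. Everything here is hypothetical: patches are represented as "expected old content, new content" pairs per file, a conflict means the expected content no longer matches, and "the most recent patch" is taken to be the last one in the list.

```python
from typing import Dict, List, Tuple

# Hypothetical patch model: path -> (content expected before, content after).
Patch = Dict[str, Tuple[str, str]]


class PatchConflict(Exception):
    pass


def patch_version(patch_ids: List[int]) -> str:
    """Comma-separated identifiers, e.g. "32,45,79"."""
    return ",".join(str(i) for i in sorted(patch_ids))


def apply_patches(master: Dict[str, str], patches: Dict[int, Patch],
                  patch_ids: List[int]) -> Dict[str, str]:
    """Apply patches in order on top of a copy of master."""
    tree = dict(master)
    for pid in patch_ids:
        for path, (before, after) in patches[pid].items():
            if tree.get(path) != before:
                raise PatchConflict(f"patch {pid} conflicts on {path}")
            tree[path] = after
    return tree


def build_experiment(master: Dict[str, str], patches: Dict[int, Patch],
                     patch_ids: List[int]) -> Tuple[str, Dict[str, str], List[int]]:
    """patch_ids are ordered oldest experiment first.

    On a conflict, the most recent patch is dropped and the build is retried.
    Returns (patch version, resulting tree, cancelled patch ids).
    """
    active = list(patch_ids)
    cancelled: List[int] = []
    while True:
        try:
            return patch_version(active), apply_patches(master, patches, active), cancelled
        except PatchConflict:
            cancelled.append(active.pop())
```

In this model, the cancelled identifiers are what the system would report back to the affected developers.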

Generated files


From the very first day of the patch system, we relied on one trick: we skipped generating static files so that changes could reach the servers as quickly as possible. Of course, we really wanted to be able to change JS code the same way we change PHP code, but every attempt to set up such a process failed.

About six months ago we came back to this question. The goal: static files must be patchable, but the deployment speed of PHP code must not suffer. The main problem: a full build takes eight minutes. What can be done?

Compromises were necessary. We started by deciding that JS code cannot be deployed as part of experiments. This saves a lot of time: it is enough to keep a single version of the static files up to date instead of generating dozens of versions for different groups of machines. But it is still slow. What else could we cut? We did not find a way to reduce the build time itself, but we decided that there would be no problem as long as the static build did not block the deployment of PHP code.

So we started generating static files asynchronously. When JS or CSS files change, we launch a separate process that creates a new static version map. At the start of its run, the PHP build process checks whether a new static map exists and, if it does, picks it up and deploys it to all servers. Problem solved? Almost. This approach brought a new restriction: you cannot change JS and PHP code in the same patch, because the two kinds of changes are deployed asynchronously and we cannot guarantee that they will reach the machines at the same time.
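The two restrictions described here (no static files in experiments, no mixing of static and PHP changes in one patch) can be expressed as a small check. The extension list and function names are assumptions for illustration only.

```python
import os
from typing import Iterable

# Assumed set of "static" extensions; the article mentions JS and CSS files.
STATIC_EXTENSIONS = {".js", ".css"}


def patch_kind(paths: Iterable[str]) -> str:
    """Return "code" or "static"; reject patches that mix the two."""
    files = list(paths)
    static = [p for p in files if os.path.splitext(p)[1].lower() in STATIC_EXTENSIONS]
    if not static:
        return "code"
    if len(static) == len(files):
        return "static"
    raise ValueError("static and PHP changes are deployed asynchronously; split the patch")


def can_run_as_experiment(paths: Iterable[str]) -> bool:
    """Static files are not deployed as part of experiments."""
    return patch_kind(paths) == "code"
```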

Summary


We are very pleased with the update. It was not easy, but it made our system much more reliable. Developers have also found an alternative use for experiments: they make it easy to collect specific logs from a couple of servers without merging any changes into master.

We still have ideas for improving the system that we have not yet had time to implement. For example, we want to rework the process of creating a patch and make it possible to modify JS files together with the main code, to get rid of the last remaining restrictions.

We deploy about 60 patches a day, and sometimes several times more, for example during the development of functionality that is so far available only to testers. About a third of the patches go through experiments before being deployed. In total, over the lifetime of the system, about 46,000 patches have been merged into master.

Source: https://habr.com/ru/post/413991/

