The Sysadmin's Gentleman's Kit

The admin is the person without whom nothing in an IT company will work. With a happy and productive admin, things move better and faster, so a comfortable working atmosphere is in the company's interest. Which tools help make a team productive was the subject of a talk by Anton Turetsky (banuchka) at Highload++ 2017.

Anton loves infrastructure tasks and automating everything that can be automated, so his story is built on the example of setting up infrastructure in a data center and the related technologies (Docker, Consul, Puppet ...). But the things that get in the way of good work, and the ways to deal with them, are as universal as possible and apply to almost any team of engineers. Read on for the transcript of the talk.



Badoo is growing every year. Here are a few numbers that reflect this: 350 million messages per day, 364 million registered users worldwide, 300 thousand new users per day. But that is far from the most important thing. For a person who works at Badoo, what matters most is the way of thinking and the team. Badoo is a family, it's about people, and that's cool!

I want to start with a provocation that perhaps not everyone will support:

The admin is the main person in the company!

I think you will agree with me: the admin is the person without whom nothing in the company will work. The equipment arrives to him, he installs the system, and new equipment is allocated by him as well. That is why I think he is the main one.



I will give an example from my own practice at Badoo. Judge the situation for yourself: we had a new project called ReThink. We refreshed our logo: changed the font, changed the color of the letters from multicolored to purple, and added a heart shape. Single-colored and cool. But the administrators were warned that ReThink was about to happen (we just go and switch over) only the evening before, almost as they were leaving for home. And then a somewhat unpredictable load hit one of the clusters. Thanks to the person who was on duty and helped the rest of the team simply find additional servers and set them up. The project did take off, we did not go down, the rollout went normally, and everyone was happy.

In support of my words, I want to say that a happy and productive administrator is, among other things, beneficial to the company. I would like to ask all companies to make their admins happy. Then everything will be fine for you!



Let's think about what makes an admin sad. Many will think that an admin is saddened by a crashed server or lost backups. That is all true, but if an admin fell into sadness every time he did something wrong, and he does something wrong every day, his nerves would not hold out.

So I will single out a problem that lies in the human factor, namely context switching.

Context switch


There is a fairly large body of research on what happens when a person is interrupted, and why that is bad. One of the more recent good studies is the work of Chris Parnin of the Georgia Institute of Technology. He collected a lot of data on this topic and drew quite a few conclusions, the main one being:

A person who has been interrupted while working on a task takes 10–15 minutes to get back into it.

This is an average figure. For some it is more, for others less, depending on the kind of switch. By simple addition, if you were distracted 4-5 times within one hour, a whole hour of working time is likely to be lost, and you are unlikely to finish your task.
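
The addition above can be sketched in a few lines. This is only an illustration: the function name `lost_minutes` is made up here, and it assumes that every interruption costs the same recovery time and that the costs simply add up, which an average of 10–15 minutes does not guarantee.

```python
def lost_minutes(interruptions: int, recovery_minutes: int) -> int:
    """Estimate the minutes lost to getting back into a task.

    Assumption (not from the study): each interruption costs the same
    fixed recovery time, and the costs simply add up.
    """
    return interruptions * recovery_minutes


# 4 interruptions at the upper bound of 15 minutes each
print(lost_minutes(4, 15))  # 60
# 5 interruptions at the lower bound of 10 minutes each
print(lost_minutes(5, 10))  # 50
```

Either way, most of the hour goes to recovering rather than to the task itself.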

That is the theory: a person did the research and came to conclusions. In practice you have probably been in this situation: you come to work and spend the whole working day there. You were doing something all day, skipped lunch, did not reply in messengers or to mail. By the end of the day you are worn out and it seems you have done a great deal. But in the evening you realize that, at best, you did not do even half of what you had planned for the day. It is worse when a manager or a colleague comes to you and asks: “What did you actually do today?” And you understand that you ran, ran, ran, and there is nothing to show for it.

In many ways this comes from context switching and the inability to concentrate on a task. For the admin, a simple individual contributor, this is how it is.

But there are also managers and team leads, the other side. The trick with team leads is that, like maniacs, they not only manage to survive this context switching, they sometimes even increase it in order to reduce it. That is, they pack many meetings, with all the switching, into a few hours, and then relax in the evening working on a single task. Their switching skill is developed to the point that it takes only 5 minutes to dive into a new task. This is very cool, and for the mere fact that they can do it, managers deserve to be valued and respected. But for the admin and the individual contributor it is better to get rid of switching.

Process opacity


The second important problem is the opacity of processes, which can be divided into two zones:

  1. the opacity of processes within the team;
  2. the opacity of processes outside the team.

Inside the team is something we can influence: a lack of communication or a lack of agreement between team members. The worst thing that opacity within a team can lead to is duplication of work. In principle this is not so bad, apart from the fact that you most likely lose the working time of one of the employees.

You can even find advantages here and say: “Perhaps Vasya did it better than Petya! Let's take his solution.” But they could have talked to each other, and only one of them would have done it. That matters.

If the opaque processes are outside the team, for example when something incomprehensible is going on in the company as a whole, then inside the team this can lead to incorrect prioritization of tasks.



For example, a developer from the mobile web team came to me and said it is important for him that a certain service, which will serve something for the new API, is brought up today. I have many other tasks, and it does not seem to me at all that his task is a priority. He has been waiting for his release for a week, so he can wait two more days, and I will do it later. For the business, that is not always the right call. If word comes to us from above that the current task has high priority because it is part of a much larger upcoming task, then it is important not just that the manager reported it, but that every team member understands it without further ado.

I would like to build my story today around solving these two main problems within the team, from the point of view of the individual contributor and the admin. I will talk about how we found a few rules to minimize context switching and make processes as transparent as possible.

How to solve the context switch problem


The admin came to work, drank a cup of coffee, read the mail, the backups are running, nothing has crashed. Sit down and work, what could get in the way?

Consider a typical situation. The person arrived fresh, everything is fine, he opened his working tools, got messages in chat and in the mail, and then the phone rang: someone asked what had gone down during the night. He is distracted. Then his wife or girlfriend posted a cool picture and he needs to go and like it, and Facebook keeps moving too. Then friends come by to discuss yesterday's football match and invite him for a beer in the evening, or for tea right now. And all of this comes at the person from all sides, little by little.



What to do about this problem? We have a person, there is his social life, and there is the work side. Here we can consider and optimize only the part that concerns his working tools. We cannot forbid him to go for a beer after work or to use social networks, because we are not in prison after all.

So we decided to look at which working tools the administrator has, which of them pull at him most often, and what we can do to reduce that.

The first idea was rather strange, but we tried it: allow the admin simply not to use chat, because people write to chat a lot. You are working on a task, and one person writes that his thing matters, another that his thing matters. So we allowed admins not to use chat: not to respond and not to write anything there.

The idea, of course, did not take off. Apart from the fact that there are things in chat you really need to read, chat is the fastest way to communicate, and sometimes you simply need to write there. After just a week it became clear that the idea was utopian, so we abandoned it and moved on.



We made a somewhat strange decision: we singled out one member of the team, the Arbitrator, and told him: “Dude, you will be a nominal leader! This is not a promotion. You simply know quite well which of your colleagues is good in which area, you know the general flow of tasks and, more or less, the priorities. So you will work according to the following scenario. There is a pool of tasks that falls on all the admins in the team. You can see who is doing what, you know the deadlines, and you can always give a task to the person who will cope with it fastest; or, if there is plenty of time, you can assign it to a junior. The junior will need the basics explained, but you know that if he gets help he will level up and everything will be cool.” In principle, the idea is quite sensible.

One of the reasons it did not fully work is that all of our admins like to work on what they like. When everything is on fire and things must be done, we do not pick and choose: we take the task and do it, no matter whose it is. It is another matter when you have a choice: “I am working on one task now and I want to set up replication in MySQL, I don't want to touch Puppet, let someone else do it.”

People started to grumble: some had few tasks, some had many, some got the uninteresting ones. It all felt incomprehensible and inexplicable. Perhaps it was our miscalculation, but this approach did not work.

At about the same time, we tried to load the Arbitrator with one more duty. Other teams file tasks for the admin team: do a backup, a restore, and so on. A person with such a request is, in effect, a client, and he is always waiting for feedback. Having filed a task, he sees it move in the general pool from “not assigned” to “assigned” to a specific person. Then 2-3 hours pass, one working day, another, and the task shows no signs of life, so it is not clear whether anyone is working on it at all.



There are admins who do not like to keep their tasks updated in writing. So the Arbitrator now had to hold one-to-one meetings with each member of the team, follow almost every task, ask whether there were difficulties and how to help, and write up the gathered information every 1-2 days.

Tasks began to be tracked, after a fashion. But everything stalled, because our Arbitrator was simply buried under that much knowledge. To summarize something, you need to understand each subject area, work out what stage the employee has reached and what is blocking him, and write it all down. When there are many such tasks, the Arbitrator simply stops writing, and the tasks stop being tracked just as before. So we had to move on and change something again.

Eisenhower Matrix




Perhaps you have already seen this matrix and just did not know its name. The idea is that we divide the sheet of tasks into 4 parts according to two parameters:

  1. urgent / not urgent;
  2. important / not important.

We simply scatter all of our tasks across this wonderful table and start working.

It is worth noting right away that the most productive and comfortable cell for the individual contributor is B: important but not urgent tasks. Importance is a great motivator, whether the task matters to the team, to the project, or just to you. You understand that you are working not on some nonsense but on something people will use, and that adds incentive. The advantage of non-urgency is that you are left to yourself. You have time to read, test, and make some calculations.

We sat down, thought, and came up with the idea of dividing all the tasks that come into the operations department, and moving the “not very important and not very urgent” ones into a separate project, which we called ITGROOVE. This is where we put tasks that may someday actually become a problem but are not a problem now, and which it would be nice to do in the foreseeable future: a week or two.

After that, we introduced the role of day duty administrator, which works as follows. We have a first line of support that responds to alarms, triggers, and monitoring. If it cannot cope with a problem and decides to escalate, then the first person to get involved during the daytime is the day duty administrator.

So far I have been telling you that we are getting rid of context switching, and here we seem to be throwing a person into the line of fire and telling him to do everything at once and switch as fast as he can.

In fact, this is not entirely so, because the day duty administrator does one of two things: either he escalates the problem and passes it to the best specialist in that subject area who is available at the moment, or he fixes the problem himself almost on autopilot. That is not mental work: wake the person up at night and he will go and fix it.
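
The duty administrator's choice can be pictured as a small routing function. This is a hypothetical sketch, not Badoo's tooling: the names `Specialist` and `route_escalation`, the `routine_areas` set, and the example areas are all invented for illustration. "Best available specialist" is simplified to the first available person in the matching area, and the fallback when nobody is available is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Specialist:
    name: str
    area: str        # subject area the person knows best
    available: bool  # reachable right now


def route_escalation(area: str, routine_areas: set[str],
                     team: list[Specialist]) -> str:
    """Decide what the day duty admin does with an escalated problem."""
    # Routine problems are fixed by the duty admin "on autopilot".
    if area in routine_areas:
        return "duty admin fixes it"
    # Otherwise hand over to an available specialist in that area.
    for person in team:
        if person.area == area and person.available:
            return f"escalate to {person.name}"
    # Assumption: if nobody suitable is available, the duty admin keeps it.
    return "duty admin fixes it"


team = [Specialist("Vasya", "mysql", True), Specialist("Petya", "puppet", False)]
print(route_escalation("disk", {"disk"}, team))   # duty admin fixes it
print(route_escalation("mysql", {"disk"}, team))  # escalate to Vasya
```

In both branches the duty administrator spends little thought on the problem itself, which is why the role does not contradict the goal of reducing context switching for everyone else.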

As an added bonus, we offered the day duty officer, if he has nothing to do and is bored, to work on the ITGROOVE project. So not only does one person cover for the rest of the team, he also closes unimportant and non-urgent tasks!

By introducing the role of day duty officer and splitting tasks into the completely unimportant ones and project ones, we allowed the rest of the team to work in the most comfortable zone, B, on non-urgent but important tasks. People simply surfaced from cell A, looked around, and there was cell B: comfortable, all is well, cool! Let's work!

I will not ignore the tasks from cell C. It sounds a bit crazy: “urgent but not important”. Surely it is either urgent or not important. In our case there is usually no work in this segment. Tasks labeled “not important, but urgent” either become “not important and not urgent” or simply disappear, and we do not work on them.
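
The sorting described above can be written down as a tiny sketch. The cell letters A, B and C follow the talk; the letter D for "not urgent and not important", the function names, and the "main queue" label are assumptions made for this example.

```python
def quadrant(urgent: bool, important: bool) -> str:
    """Return the Eisenhower matrix cell for a task."""
    if important:
        return "A" if urgent else "B"
    # "D" is the conventional name for the fourth cell; the talk does not name it.
    return "C" if urgent else "D"


def destination(urgent: bool, important: bool) -> str:
    """Say where a task ends up under the rules described in the talk."""
    cell = quadrant(urgent, important)
    if cell == "D":
        return "ITGROOVE"
    if cell == "C":
        # Such tasks either lose their urgency or disappear.
        return "ITGROOVE or dropped"
    return "main queue"  # hypothetical name for regular team work


print(quadrant(urgent=False, important=True))      # B
print(destination(urgent=False, important=False))  # ITGROOVE
```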



Since I have mentioned the role of day duty administrator, let's briefly look at what kinds of administrators we have in general:

  1. The ordinary admin. In principle, everyone always does everything, but the ordinary administrator mainly works on tasks in Jira.
  2. The day duty administrator mainly responds to phone calls and to escalations from monitoring.
  3. The night duty administrator, a kind of mixture of the ordinary and the daytime admin, answers calls and escalations at night, and during the day works as an ordinary administrator.

How to make processes transparent


The difficulty of our particular team is that one part of it is in London and the other in Moscow, which is a fairly large time zone shift. In Moscow the guys start work much earlier: people in London are only arriving at work, and Moscow has already done something. In turn, we in the London office, staying on in the evening, do things that people in Moscow, having already gone home, do not know about. To coordinate processes within the team, we hold a weekly meeting on Mondays.



It looks like this:


But the problem is that by Tuesday evening or Wednesday morning, coordination starts to slip. For example, I started on a task and went off to work on it, I have my own tasks for this week, and my colleague in Moscow is going through something similar. We will stay out of sync until next Monday, until the next agenda. Something has to be done about it.

Status Hero


There is a cool tool called Status Hero. The idea is that when you come to work, you plan certain tasks for yourself. In Status Hero there are 3 fields to fill in. It is not a mandatory tool: we are free not to fill it in and not to use it.



The trick is this: I have come to work fresh, and I know that today I want to fix some DNS, set up reset of metrics in Prometheus, see how the new graphs will work, and maybe close current tasks. I put all of this in the plan for today.

But above the plan for today there is a line looming over me that says: yesterday you promised yourself to do this, so first write what you actually did yesterday out of what you promised, and only then what you will do today.



There is also a wonderful third item. This field is used to note external events that block the execution of tasks. For example, someone from another team has not given you something you need to do the work (a patch, a fix, the necessary data), and you are a shy guy and cannot call and demand it. Now you can write it here, it will be highlighted in red, and either the manager or the guys from the team will help you. That is, you voice your problem instead of sitting silently and waiting for a problem outside your control to be solved so that you can do your job.



The team sees this too. We have a special room in HipChat where, after someone has filled out the form, it is shown to the whole team. A quick glance at the chat is enough to understand what your colleagues are going to do. If there is a blocker that someone can resolve and thereby help a colleague, he does it. That's cool!

Why does Status Hero work?




  1. The most important aspect is that you make a promise to yourself. From practice I can say that if you keep promising yourself from Monday to Friday, then most likely by Thursday you will have done at least one of the items you wrote down on Monday. Every day Status Hero will be an eyesore, telling you: “You promised and did not deliver!” And your colleagues also know what you promised, so you go and do it, even if through sheer force of will.


  2. The next positive point is that the resulting transparency allows us to help each other. When I see that, for example, my colleague is about to take on a task in which I probably have more knowledge, or where I can simply help with something, I say: “Come on, I will help you. I know which documentation to send you and what to read, or I can do it right away so that you don't lose a couple of days of work. It will be better for you.”



  3. Now the quiet ones, who used to sit and not say that something was preventing them from working, can also quietly write it down, and they will be helped. Perhaps a problem gets solved that they could not have coped with otherwise.

Status Hero revealed the problem


But Status Hero not only helped us organize this activity, it also revealed one rather strange problem. At that time operational documentation either did not exist or there was not enough of it.

We came to understand this roughly when we began to see what colleagues were working on, to help them, and to tell them how to do things. When you explain the same thing for the sixth time, you realize that if you had written it down once, a colleague would have gone through the procedure once, made edits and comments, and you would not have been distracted by explaining it six times. The person, in turn, would not have needed to ask about trivial things that can simply be read.

There was documentation, but not enough of it, as it turned out. As soon as we started using Status Hero, there really were more articles in the internal Wiki, articles began to be edited and commented on, likes even appeared in Confluence, and people also began to flesh out the triggers that fire in the monitoring systems. We began to write more clearly, in plain language, about what is actually happening, whom to call, and where to look.

And that is not all. There is another area where Status Hero helps us.

Team Contribution


Alexey Rybak spoke at HighLoad++ about the Review process at Badoo. It is a cool thing, mostly for managers, because they need to evaluate their staff: how we work, how the team works. From a manager's point of view, it is a cool tool that makes all the information structured.

From the point of view of the administrator, a simple employee, the opposite is true. It is almost like cramming for exams at the end of term. You have a week to complete the Review, in which you need to write what you have been doing for the last six months. But usually it comes down to the last day, almost all of which is spent re-reading your old tickets at length, getting back into them, and writing something about your achievements.

To make writing the Review less painful, we are invited to fill in snippets. This can be done either at the end of each working day or at the end of the working week.



But, as we have already discussed with context switching, on Friday it is obviously not always possible to recall what I did on Monday or Tuesday. At best I will write what I did on Thursday and Friday, at worst it will be the last 3 hours of Friday. As for daily snippets, working days vary, and in the evening I want to go home or to the pub, anything but write up what I did today.

Here Status Hero comes in again. Every day we wrote down in it what we promised to do and what we did. For any period we need, we can simply select the positive items: what we actually did.
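
To make the idea concrete, here is a minimal sketch of such a selection. It does not use Status Hero's real data model or API: the `CheckIn` record with its three fields and the `achievements` function are hypothetical, modeled only on the three fields described in the talk.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CheckIn:
    """One daily entry with the three fields described in the talk."""
    day: date
    done: list[str] = field(default_factory=list)      # done out of yesterday's promises
    planned: list[str] = field(default_factory=list)   # plan for today
    blockers: list[str] = field(default_factory=list)  # external events blocking work


def achievements(checkins: list[CheckIn], start: date, end: date) -> list[str]:
    """Collect everything marked as done between start and end, inclusive."""
    return [item
            for entry in sorted(checkins, key=lambda c: c.day)
            if start <= entry.day <= end
            for item in entry.done]


log = [
    CheckIn(date(2017, 11, 7), done=["closed current tasks"]),
    CheckIn(date(2017, 11, 6), done=["fixed DNS"], planned=["close current tasks"]),
    CheckIn(date(2017, 12, 1), done=["outside the period"]),
]
print(achievements(log, date(2017, 11, 1), date(2017, 11, 30)))
# ['fixed DNS', 'closed current tasks']
```

The dates and entries are made up; the point is only that daily "done" items, once recorded, can be filtered by period instead of being reconstructed from tickets.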



Not only is this selection useful in itself, there is one more plus: in Status Hero we wrote for ourselves, so when we make a selection for the semi-annual report, reading what you wrote for yourself plunges you straight back into the context. You do not need to open the ticket and remember what you did there and how long it took.

It is beautiful and wonderful, but

“Theory without practice is dead, practice without theory is blind”
A. Suvorov

One day in the life of the admin


, Status Hero , , Badoo. , .



, , . , . , , , . , -.

, , , , . , .



, , -, xCAT.



, , , Puppet — , Consul , Docker, glpi, . , .

- , .



, . -, . , , Raid, , .

xCAT , PXE dhcp . , dns , . , — — mac — IP , , .

, xCAT , . - Kernel Panic, . xCAT , -, , , , . - — 100 , -, . - , , SN . xCAT SN .

, , xCAT, -, , , dhcp , , , dhcp helper .

, , , , .

Docker


, Docker — . Docker , - .



Docker , , registry , , . , Docker , registry Badoo , . , Ceph Swift API .

, registry, Redis . HTTP , Docker distribution , , , docker-registry Redis endpoint Ceph.

HTTP nginx, SSL, basic Auth. , registry , pull push.

Consul


- Consul, , , service discovery Badoo, service discovery .

, Consul -, , . , 3 master- -.

, - Consul?

Puppet




Puppet.

Consul , ( ):


load balancing.

GLPI


, glpi, -. — .



:


GLPI FusionInventory , , , , , — , , . , .

5 , Wiki, 3-5 — . , : , , , — .



Badoo , . , .

, ( , ):


, , , , : «, , ».

References


context-switching, , , , , - Badoo.



Source: https://habr.com/ru/post/414749/

