Someone is going to fall soon

Because storage is the holy of holies. If the data becomes unavailable, things will start to smell burnt very soon. And if the space suddenly runs out, that is an unpleasant surprise too. That is why monitoring is mandatory, and it must cover the storage systems.
There are two main approaches to monitoring storage. Either you use a universal monitoring system such as Nagios or Icinga, which collects information over SNMP, or you buy specialized software from the storage manufacturers themselves. The second option, of course, gives a deeper view of the hardware: it shows specifics like cache state, IOPS, hit rate, controller load, and so on. This is the option most often chosen by our customers with large and expensive arrays.
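For the universal route, "collect over SNMP" boils down to reading the right OIDs from the array's management interface. Here is a minimal sketch with pysnmp; the host, community string, and OID are placeholders, since real arrays publish their health and performance counters in vendor-specific MIBs:

```python
# Minimal SNMP GET against a storage array's management address.
# Requires: pip install pysnmp. The community string and OID below
# are placeholders; substitute OIDs from your array's MIB.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity, getCmd,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData('public', mpModel=1),               # SNMP v2c
        UdpTransportTarget(('array-mgmt.example.com', 161)),
        ContextData(),
        ObjectType(ObjectIdentity('1.3.6.1.2.1.1.3.0')),  # sysUpTime
    )
)

if error_indication:
    print(f'SNMP error: {error_indication}')
else:
    for name, value in var_binds:
        print(f'{name} = {value}')
```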
That said, not everything is rosy with commercial monitoring software either, and I will go into detail below; this is, so to speak, first-hand experience. At one time I spent almost two years patching up one such system, worth many thousands of greenbacks, from a distinguished vendor, and I got to know it so well that even the vendor's own support began consulting me. But some software problems were merely replaced by others, just as some support engineers were replaced by new ones, and at some point a thought occurred to me: why not do something radically different? In short, that is how it all started.
What is wrong with vendor software?
As I have already said, a manufacturer's monitoring tool does an excellent job of monitoring that same manufacturer's storage. That is its main advantage, and its disadvantages grow from the same root: arrays from other manufacturers are supported poorly or not at all. It follows that if you have several different arrays in your fleet, you need several different monitoring tools, and you have to remember which one to open for which array next time. Ideally you would have a separate admin for each array.
It is no secret that vendor tools cost money, and quite a lot of it, and extending support afterwards costs a lot too. Some vendors have also mastered a new trick: they announce the end of their software's life cycle and offer you to simply buy another product, with no license migration. Exactly this kind of setup happened to one of our customers just a couple of months ago. There are no options: if you want to keep monitoring the hardware, you make a new purchase.
If you dig deeper into vendor software, other unpleasant features surface. For example, in some products you can see a snapshot of the current status but no history for previous periods; or the history is limited, with the log overwritten every 3 days. Accumulating long-term statistics is out of the question. Yet event history is often needed: for forecasts such as planning spare-parts purchases, for reporting, and for investigating incidents. Slowness in some business system can easily be blamed on the storage, and without actual data you have nothing to defend yourself with.
And finally, I cannot help complaining about how slowly vendor software gets updated and changed. In my long practice I ran into this trouble far too often! New array models come out, new firmware is released, new settings appear, and all of this easily breaks working monitoring: some data stops being collected, or whole arrays drop off entirely. There was a case when the manufacturer disabled old SSL versions in new microcode while the monitoring software still did not support TLS. At first nobody could find the cause; only after my own investigation did I send the findings to the manufacturer, and only then did they update their ancient libraries. The whole red-tape saga dragged on forever.
Once we even failed a pilot at a customer because of this. Vendor software had been proposed, and the customer liked everything about its functionality and interface. Unfortunately, their main production systems were not supported. They were even ready to wait a month or two, but the vendor said there were no plans to add support for those systems any time soon (and this was merely the update of the Hitachi AMS line to HUS).
All in all, plenty of inconvenience, and somehow for a lot of money too.
Time to take matters into my own hands...
Frustrated with this state of affairs, I started thinking about building my own monitoring for storage systems. If you know an array well and are fluent in its CLI, you can quickly pull the state information you need and get to the root of a problem. Of course, that first means shoveling through piles of documentation, poring over forums and vendor knowledge bases, and collecting information bit by bit. But once you know which command to run with which flag and what every column of its output means, you are already a guru. All that remained was to pack this knowledge into a convenient interface that would keep doing it all for you.
I admit that at first I intended to write the interface from scratch as well, but then I came across Zabbix: a mature tool with a large community that is also easy to extend. It had everything I needed: an interface, a role model, notifications, a trigger system, proxies and agents. All that remained was to feed it information about the storage systems, along with threshold values for the various parameters, in the right form. The work got going. We have a whole team of array experts; of course, no single person can know every array, so we split up by model and manufacturer.
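The simplest glue between an array's CLI and Zabbix is a script that runs a command, parses the output, and pushes the values into trapper items with zabbix_sender. A rough sketch of the idea; the vendor CLI, its output format, and the item key here are all hypothetical:

```python
# Sketch: poll an array CLI, parse one metric, push it to Zabbix.
# 'arraycli' and its output format stand in for a real vendor CLI;
# 'storage.pool.used_pct' is assumed to be a trapper item defined
# on host 'array-01' on the Zabbix server.
import subprocess

def get_pool_usage() -> float:
    # Suppose the hypothetical CLI prints: "pool0 used 73.5 %"
    out = subprocess.run(
        ['arraycli', 'show', 'pool'],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.split()[2])

usage = get_pool_usage()
subprocess.run(
    ['zabbix_sender', '-z', 'zabbix.example.com',
     '-s', 'array-01', '-k', 'storage.pool.used_pct',
     '-o', str(usage)],
    check=True,
)
```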
Another difficulty in developing your own monitoring is access to the hardware itself: you need arrays you are not afraid to load, break, and run all sorts of experiments on. Fortunately, the resources of our lab made all of that possible.
The first thing to monitor is the health of all hardware components. Some of it can be read via SNMP, but in most cases the arrays are polled over a dedicated protocol (SMI-S, REST API, SOAP API, and others). I should say that the arrays themselves let you configure notifications about failures on them, and all customers use this in one way or another. But what happens if the notification mechanism on the array itself breaks? This has happened more than once: the array stays silent for weeks, and since it is silent, everyone assumes all is well. Then it suddenly turns out that a critical number of disks has already failed, and by then it is too late.
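Where an array exposes a REST API, a health poll usually means hitting a status endpoint and collapsing each component's state into a number Zabbix can trigger on. Here is a sketch under an assumed endpoint and JSON layout; every vendor's API looks different in practice:

```python
# Sketch: poll a (hypothetical) array REST API for component health
# and reduce each component to 0 = OK / 1 = failed for Zabbix.
# The URL, credentials, and JSON structure are all assumptions.
import requests

BASE = 'https://array-01.example.com/api/v1'

resp = requests.get(f'{BASE}/components',
                    auth=('monitor', 'secret'),
                    verify=False,   # many arrays ship self-signed certs
                    timeout=10)
resp.raise_for_status()

for comp in resp.json()['components']:
    status = 0 if comp['state'] == 'normal' else 1
    print(f"storage.component[{comp['id']}] = {status}")
```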
The second important thing to monitor is performance, because when performance degrades and the storage write latency reaches a few seconds, Oracle can simply up and crash. Just like that. It is performance that is hardest to keep under control in large infrastructures with many storage systems. And Zabbix has very convenient predictive analysis: you can trigger on the forecast value a metric will reach in the future. For example, we made a trigger that fires if the forecast says that, at the current rate of consumption, free space will last only 3 more months. Or that the forecast response time two weeks from now will exceed 50 milliseconds. Monitoring gives us time to learn about future problems in advance and do something about them.
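In Zabbix these forecasts rest on the timeleft() and forecast() trigger functions. Below are two sketches in the classic trigger-expression syntax; the host name and item keys are hypothetical, latency is assumed to be stored in seconds, and the comment lines are annotations rather than part of the expressions:

```
# Free space will hit zero in less than 90 days at the current rate
# (trend fitted over the last 7 days of data):
{array-01:pool.free_bytes.timeleft(7d,,0)}<90d

# Read latency, extrapolated 2 weeks ahead from the last day of data,
# is forecast to exceed 50 ms (0.05 s):
{array-01:lun.read.latency.forecast(1d,,2w)}>0.05
```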
At some point we realized that knowing the state of the storage is good, of course, but it is much better to also understand what is happening on the network and on the server side. As a result, after several months of work, the servers, the network, and the storage systems themselves all became visible in a single interface. Beyond the plugins and connectors for storage systems, a useful addition appeared in the form of network topology maps. So far the plugin reflects our own experience and needs, but if you tell us what you would like to see in it, we will bolt it on.
End-to-End Topology for a VMware Cluster: From Virtual Machine to Storage Volume
Performance

On the array performance graph we can see that the system is badly overloaded. High utilization of the disk groups shows that the disks are saturated. There are many I/O operations on the storage ports, which means the IT systems are loading the array from their side. Add the characteristic response-time curve and processor utilization above the recommended values, and the verdict is clear: we have put too many workloads on this array, and some of them need to be migrated.
Storage Area Network Map: Finding Bottlenecks

To sum up
What did we end up with? We took Zabbix, a popular and very widespread monitoring system, and equipped it with new capabilities, including:
- Collecting status information on all hardware and logical components of disk arrays and storage network switches.
- Collecting performance statistics for absolutely every system we wrote a plugin for (vendors have gaps in this regard).
- Topology maps, both of the storage network and end-to-end from virtual machine to storage volume (currently VMware only).
- Collecting all inventory information.
- Tracking disk space reserves.
Zabbix itself lets you build very nice notifications: set thresholds and send informative emails about a problem. For example, if a port on a switch goes down (or the traffic on a port spikes), the email will contain not only the switch name and port number but also information about the connected device.
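Such an email is assembled from macros in a Zabbix action. A sketch of a message body; the exact macro set varies slightly across Zabbix versions, so check yours:

```
Problem:  {TRIGGER.NAME}
Host:     {HOST.NAME} ({HOST.IP})
Item:     {ITEM.NAME1} = {ITEM.VALUE1}
When:     {EVENT.DATE} {EVENT.TIME}
Severity: {TRIGGER.SEVERITY}
```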
Which systems do we support at the moment? Quite a few:
- All Hitachi arrays (AMS, HUS, VSP, VSP G).
- Dell EMC CLARiiON, VNX, Unity, Isilon, and Compellent arrays.
- HPE 3PAR, P9500, and XP7 arrays.
- IBM Storwize and DS5000 arrays.
- NetApp FAS arrays (7-mode and c-mode).
- HPE StoreOnce and EMC Data Domain disk libraries.
- Brocade SilkWorm and Cisco MDS switches.
We also have extensions for some operating systems (Windows, ESX) that collect FC HBA data so we can then draw the topology maps. Plugins for OpenStack and for virtualization systems are in active development.
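On Windows, for example, HBA and port WWNs can be read from the MSFC_* classes in the root\WMI namespace. Here is a sketch using the Python wmi package; exactly which attributes are exposed depends on the HBA driver, so treat this as an assumption to verify:

```python
# Sketch: list FC HBA adapters and their node WWNs on Windows via WMI.
# Requires: pip install wmi (Windows only). The MSFC_* classes are
# populated by FC HBA drivers that implement the Microsoft FC HBA API.
import wmi

c = wmi.WMI(namespace=r'root\WMI')
for hba in c.MSFC_FCAdapterHBAAttributes():
    wwn = ':'.join(f'{b:02x}' for b in hba.NodeWWN)
    print(hba.Manufacturer, hba.ModelDescription, wwn)
```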
When developing plugins we draw on the expertise of our engineers, who have accumulated plenty of real cases of solving problems on arrays, both hardware problems and performance ones. New plugins are developed on request in a short time, thanks to a large set of our own ready-made libraries.
Some of our customers have the system set up like this: notifications automatically arrive in our mailbox with the contract number, the contact persons, and all parameters of the faulty component. This cuts the reaction time and the time to order the necessary parts, because the duty engineer does not have to call around and clarify a pile of details, even at night. The ticket goes straight into work.
And how do you solve the monitoring problem for your infrastructure, and for storage in particular? Tell us in the comments or drop a line to VRyzhevsky@croc.ru.