"20 thousand IOPS per node - good performance, taking into account delays of 5 ms." For OLTP - no



The reason for writing this article was a very decent review, "How we tested VMware vSAN..." by CROC. The review is worthy, but it contains a phrase I have been fighting for more than a decade. Storage administrators, virtualization folks and integrators keep repeating: "Latency of 5 ms is an excellent result." Even the figure of 5 ms has not changed in ten years. I have heard it in person from highly respected admins at least a dozen times, from less respected ones dozens of times, and I have lost count of how often I have read it on the Internet... No, no and no. For OLTP loads, 5 ms, especially measured the way it is usually measured, is an epic fail. I have had to explain the reasons many times already; this time I decided to collect my thoughts in a reusable form.


Let me say right away that the article mentioned above contains no such errors; the phrase simply worked as a trigger.


Typical start


Everything described in this article applies to the common DBMSs used for typical business OLTP. I have the most experience with MS SQL Server, but at least for PostgreSQL, Oracle and Sybase most of the points and conclusions remain true as well.


Everyone is usually unhappy with DBMS performance. If a large system has a DBMS in it, and it almost always does, then that DBMS is the bottleneck. Or it immediately becomes the bottleneck as soon as you optimize everything else. And so the customer comes and says in a human voice: "Help! Save us! We paid $NNNNNNNN for servers and storage, but the speed does not grow! The administrator tuned it, the vendor advised, and still nothing moves." If the system's developers fit Lavrov's famous definition (we can do without the exact quote), and the operations staff "fight incidents by rebooting the server," then the problem is usually plain and simple: missing indexes, crooked queries, fatal configuration errors (the kind the documentation warns about in bold: "never do this!!!"), excessive locking, deadlocks and other simple and understandable nonsense. Such cases are the majority, but not all. When the system outgrows these problems, or the load crosses some invisible threshold, it either dies from them or moves on to the next level.


SQL Server Diagnostic Tips

IMHO, the best tool right now is the SQL Server First Responder Kit maintained by Brent Ozar. The tool is being developed very actively. There is also a decent set of diagnostic scripts by Glenn Berry, who has not abandoned his project either. Both sets are beautiful in their own way; reading the comments and queries for the first time reveals a lot of new things. I myself always start looking around with sys.dm_os_wait_stats, a quick glance at the error log, and finding out whether there is at least some working backup system.
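For reference, a minimal sketch of such a first look at the waits (nothing here beyond the standard DMV; real triage tools like the kits above also filter out a long list of benign system waits, which this sketch skips):

    -- Top waits since instance startup, with the average wait per occurrence.
    SELECT TOP (10)
        wait_type,
        waiting_tasks_count,
        wait_time_ms,
        wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;

If WRITELOG or PAGEIOLATCH_* waits dominate with avg_wait_ms of 5+ ms, you are looking at exactly the situation this article is about.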


At this level the server is no longer under the director's desk, the drives are no longer inside the server but in a storage system, the developers know about indexes, the administrators already know how to use PowerShell, and the IT managers are starting to say buzzwords like SLA and RPO/RTO. And at this level an interesting situation arises:



But no. The crocodile is not caught, the coconut does not grow, and system performance is the same or lower than on the old server. I look into sys.dm_os_wait_stats and see WRITELOG, PAGEIOLATCH_SH and PAGEIOLATCH_EX at the top, with an average wait time of 5+ ms. Well, typically: "Hey, admins and DBAs, your system is disk-bottlenecked," and here the old song about 5 ms begins:



By the way, all of the above applies not only to "your own" servers, but also to cloud services and virtualization. Each has its own specifics, but the typical clinical picture is about the same: a moderately optimized database, sensible development and maintenance staff, headroom on CPU and memory, and the "exhaust" from further investment is almost zero.


So. This whole song about "5 ms" is harmful nonsense. If you say it yourself, read this article. And if someone says it to you, prepare your arguments. Earlier, when I heard these words, I would get angry; I am not angry anymore. Like that bowl of petunias from The Hitchhiker's Guide to the Galaxy, I have only one thought: "Oh no, not again."


Who is to blame?


Why is the database so slow? It would seem that a typical server with 20-64 cores at 2-3 GHz is capable of performing 50-150 billion simple operations per second, yet even peak (synthetic) database benchmarks show only 10,000-50,000 transactions per second on such machines. Hey! That is from one million to over ten million simple operations per transaction. That is not just a lot, that is monstrously much.
The source of this overhead is the ACID requirements for transactions.
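To spell out that arithmetic, using the document's own numbers:

    50 * 10^9 ops/s  / 50,000 txn/s = 1,000,000  ops per transaction
    150 * 10^9 ops/s / 10,000 txn/s = 15,000,000 ops per transaction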



By the way, taken to the letter, these requirements are met almost nowhere and never, and in distributed systems simply never (the CAP theorem gets in the way). For our situation, the "D" (durability) requirement is probably the most expensive of all, and it is provided by the key mechanism of every common OLTP DBMS: the WAL, write-ahead log (PostgreSQL), also known as the transaction log (SQL Server), also known as the REDO log (Oracle). Here it is: the stone around performance's neck, and also the foundation of transaction durability.


What is WAL


Let's forget for a while about modern SSDs and fancy storage systems. Suppose we have a server with one or several disks in it.
Any transaction, even the insertion of a single record, is at least potentially, and in practice almost always, a non-atomic action. We almost always need to change not only the page where the record lives, but also index pages and possibly service pages. Within one transaction the same page can change many times. Plus we can execute other transactions in parallel, and transactions adjacent in time constantly "pull" the same pages. If we wait for each page to be written to disk before continuing, which is essentially what Durability demands, we will have to write many times more and wait for the guaranteed completion of every write to non-volatile media. No caches, no reordering of operations in the queue, otherwise there will be no integrity! Moreover, we somehow need to record which data belongs to committed transactions and which does not yet (and what the data was before). For scale: a typical single hard disk (HDD) in this mode gives 50-100 IOPS, and that has been a constant for 20 years. Even a small transaction requires 5-10 write operations. Oh yes, and to know what to write you first have to read; even very write-heavy OLTP systems read 3 times more than they write. So our transaction costs 20-40 IOs, which means 0.2-0.8 seconds on a single disk.
2 transactions per second. Not enough, is it? Shall we spread the load across disks? No use: we still have to wait until the previous write completes, so there is no parallelism in the end. What to do? Let's add a log file, into which we will sequentially record all write operations to the database plus transaction markers! Pros:

  1. Writing to the log is strictly sequential, and sequential writes are exactly what an HDD does quickly.
  2. At commit time only the log must reach non-volatile media; the data pages themselves can be written later, lazily and in batches.
  3. The log holds both the old and the new state, so after a crash we can roll completed transactions forward and unfinished ones back.



This log is called the write-ahead log / transaction log / REDO log.


Hooray! Super! We had 2 transactions per second, now we have 300: a 150x improvement. And at what cost? As it turns out, the price is considerable:

  1. The log is written strictly sequentially by a single thread; its operations cannot be reordered.
  2. Every commit must wait for a write-through to non-volatile media; no write cache is allowed to lie about completion.
  3. From the storage's point of view this is a queue-depth-1 workload, so total throughput is dictated by the latency of a single write.



This explanation is very simplified, a back-of-the-envelope one, but it is enough for our topic. The WAL is the key, fundamental mechanism for ensuring transactionality: mandatory write-through, single-threaded access, strictly sequential writes; in storage terms, queue depth 1.


If you are interested in this topic

Write-ahead logging in databases should be known, at least in outline, by anyone who administers a DBMS or DBMS infrastructure, or develops databases, in any capacity.


WAL and storage


Storage vendors "from birth" are faced with a DBMS. It is for databases that business buys these insanely expensive complexes: from street price vaults for Dell-EMC, HP, Hitachi, NetApp, when setting up a budget, eyes are filled with tears from most top managers, if they, of course, do not receive a percentage of this price. But there is an engineering-marketing conflict. I will explain it using the example of Dell-EMC, but only because I remember where their documentation is.


So:


  1. The log is single-threaded and its operations cannot be reordered.
  2. The log is write-through, i.e. its latency is "an eternity" compared to CPU speed.
  3. An OLTP load is a large number of relatively small transactions.
  4. Most other DBMS loads can be parallelized one way or another.

Amdahl's law mercilessly tells us that a single-threaded, non-parallelizable load makes adding processors useless, and that performance will be determined by the log. At that point we could not care less about storage performance in IOPS; only latency matters.
But do not discount the other disk operations either: reads and writes to the data files and tempdb. Reading is also a "waiting" operation: until a data page is read from disk into memory, the processor cannot process it. But for these operations deep queues and reordering within the queue are possible: the DBMS often knows in advance which pages need to be loaded into memory and which flushed to disk, and queues up many reads at once. Since in this scenario what matters is when the last operation in the batch finishes, here, on the contrary, IOPS matter more than the latency of an individual operation. For scale: reads are 85-95% of operations in a typical OLTP system. Yes, writes are far fewer.


The storage vendors' engineers work closely with the database vendors and know all the technical nuances of how a DBMS works with the disk subsystem. Proper planning, partitioning and allocation of disk resources for a DBMS is a complex and important competence of the storage administrator. At the same Dell-EMC, even the basic white papers H14621 and H12341 with disk layout recommendations for SQL Server run to about a hundred pages. Hey, these are not detailed docs, these are the most general white papers! And there is a pile of more specific ones (h15142, h16389... there are tons of them). The "subcontractors" from VMware do not lag behind either: Architecting Microsoft SQL Server on VMware vSphere. Note that these documents are aimed not only, and not so much, at DBAs as at infrastructure and storage administrators.
I will also note that all these documents carve out separate LUNs for data, for logs and for the tempdb database. Yes, some fresh documents carefully say that for All-Flash solutions it makes no sense to put logs on physically separate media, but they still suggest cutting separate LUNs. If data and logs land on one LUN, then from the OS point of view it is one IO queue, and there will be a problem: latency on log operations will immediately be an order of magnitude higher, and because non-reorderable log writes sit in the queue, IOPS on the data files and tempdb will drop. This is not the "discovery of the century," it is an elementary truth of working with databases, and it was not made obsolete by the arrival of All-Flash. Yes, SSD latencies are an order of magnitude better than HDD latencies, but they are still a couple of orders of magnitude slower than memory operations. IO is still the DBMS bottleneck.
And the technical documents correctly emphasize that for transaction logs the number of IOPS is not what matters; what matters is that the latency be minimal (in modern documents: less than 1 ms).
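How do you check this on a live SQL Server? One option, a sketch using only the standard counters DMV (in Windows PerfMon, "Avg. Disk sec/Write" on the log volume tells a similar story):

    -- Log flush counters per database. The */sec counters are cumulative rate
    -- counters: take two samples and diff them to get per-second values.
    SELECT instance_name AS database_name,
           RTRIM(counter_name) AS counter_name,
           cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%:Databases%'  -- SQLServer:Databases or MSSQL$<instance>:Databases
      AND counter_name LIKE 'Log Flush%';  -- Log Flushes/sec, Log Flush Waits/sec, Log Flush Wait Time, ...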


And marketers need to sell. Hyperconvergence! Virtualization! Deployment flexibility! Deduplication! Easy setup! Lots and lots of IOPS! Beautiful presentations, a confident voice, strict suits. Well, how else do you sell a solution with a 6-7 digit price tag in dollars? What somehow gets forgotten is that you can get either latency or throughput out of a storage system, but not both at once; that a certain load-balancer license costs as much as another shelf; that if intensive writing lasts more than an hour, the controller memory runs out and performance drops to "as if there were no cache at all"; that training the customer's employees costs another 100,000 rubles for the first course alone, and similar tricks...


5 ms


Whether from listening to marketers, or out of laziness, or because of some personal quirks, storage administrators often do something like this: take a big shelf, pool it all into something flat, slice it into thin-provisioned LUNs and hand each server a LUN. Or two, because "the system partition deduplicates well." And when I show that from SQL Server's point of view the disk subsystem is in trouble, the same song begins: "5 ms is a great indicator," "100,000 IOPS," "your load on the storage is less than 5%."


NO



When some data load takes 5 minutes on a developer's laptop, while on a 40-core server with 1 TB of RAM and a half-million-dollar storage system the same task takes an hour, even the most patient customers will start asking questions about the reasonableness of the costs.


Average latency on the WAL partition vs. the number of transactions per second you will never exceed:

  Average WAL latency    Max transactions per second
  5 ms                   200
  1 ms                   1,000
  0.5 ms                 2,000
  0.1 ms                 10,000
  0.05 ms                20,000
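The arithmetic behind this table is simple: with queue depth 1, log flushes are strictly serialized, so one second cannot hold more flushes than one second divided by the average latency. (This is a deliberate simplification: group commit can pack several transactions into one flush, but it does not change the order of magnitude.)

    max TPS = 1 / avg latency
    e.g. 1 / 0.005 s = 200 transactions per second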

What to do


Tips for administrators and DBA


For OLTP, stop staring at "utilization" and IOPS. And definitely do not look at IOPS measured at a large queue depth: even on data partitions, a deep queue is usually a short burst or something else that does not affect actual OLTP performance.
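What to look at instead is per-file latency. A minimal sketch using the standard DMVs (the numbers are cumulative since startup, so for a live picture take two snapshots and diff them):

    -- Average read and write latency per database file; log files with
    -- avg_write_ms above ~1 ms are the first suspects for OLTP.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.type_desc,            -- ROWS (data) or LOG
           mf.physical_name,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY avg_write_ms DESC;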


Separating disk space into different LUNs is not a DBA whim. A database has several distinct disk load profiles. At a minimum, the following can be distinguished:

  1. Data files: mostly random reads plus lazy batched writes; deep queues are fine, IOPS matter.
  2. Transaction log: single-threaded sequential write-through at queue depth 1; latency is everything.
  3. tempdb: a mixed read/write load that does not need to survive a restart.
  4. Backups: long sequential reads and writes in large blocks; throughput matters.



There are other load types, but they are somewhat exotic (for example, a repository of files stored in the database as a FileStream directory). All these load types have different, often contradictory, disk requirements. If you dump them all onto one partition, you not only degrade performance; far more importantly, you lose the ability to understand why the system is slow, and you lose the ability to improve just the part that needs improving without a global upgrade of the storage system. Therefore, the main recommendation:


Read the recommendations of your storage and DBMS vendors, try to understand why they advise what they advise, and design the system with the different load types in mind.




MS SQL Server


If you have MS SQL and the log is the bottleneck, here are a few more options to consider:


  1. Batch your transactions. This is the right way. 1000 single-row INSERTs committed one by one are 5-30 times slower than the same 1000 INSERTs wrapped in one transaction: one commit means one wait on the log flush instead of a thousand. Just remember that a transaction should remain a unit of business logic, not "whatever we managed to cram in."
  2. tempdb is recreated on every restart, so its contents are "not precious"; it can be placed on the fastest local media without worrying about durability.
  3. For bulk loading, use BULK INSERT and minimal logging; this requires the "Simple" or "Bulk logged" recovery model, so weigh the consequences of switching between Simple/Bulk logged and Full carefully. The classic reference is The Data Loading Performance Guide; it is about ETL rather than OLTP, and so is We Loaded 1TB in 30 Minutes with SSIS, and So Can You.
  4. SQL Server has Delayed Transaction Durability: commits stop waiting for the log flush, at the price of possibly losing the last few transactions in a crash. See the sketch after this list.
  5. SQL Server has In-Memory OLTP. It helps in specific scenarios, but it is no silver bullet.
  6. And remember that synchronous AlwaysOn replicas add their own round-trip to every commit.
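Item 4 in code, a minimal sketch (the database and table names here are made up for illustration):

    -- Allow delayed durability at the database level (DISABLED | ALLOWED | FORCED).
    -- MyOltpDb and dbo.Orders are hypothetical names.
    ALTER DATABASE MyOltpDb SET DELAYED_DURABILITY = ALLOWED;

    -- Opt in per transaction: COMMIT returns without waiting for the log flush;
    -- a crash can lose the last few transactions, that is the trade-off.
    BEGIN TRANSACTION;
        INSERT INTO dbo.Orders (CustomerId, Amount) VALUES (42, 100.00);
    COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);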

***


That's all. Remember: 20,000 IOPS with 5 ms latency may be a fine result for a farm of 4-16 virtualized servers with mixed loads, but not for OLTP. For OLTP, look at the log latency, not at the IOPS.


PS: a few words about SSDs.

The most interesting thing here is Intel Optane. An ordinary SSD writes in pages of 4+ KB and erases in much larger blocks, so the controller constantly shuffles data around, and write latency is unstable. Vendors hide this behind caches and capacitors, but under long intensive writing it shows up anyway. Intel Optane is built differently: there is no erase-before-write and no garbage collection, and its write latency is on the order of tens of microseconds, and stable. Ordinary SSDs give on the order of 100-300 microseconds.
That said, this is no revolution. OLTP is drifting "upward" anyway, toward in-memory engines with ACID, and even 20 microseconds of latency is an "eternity" compared to memory access, so the niche for low-latency Optane is not that wide (will it take off? we'll see).
But for the transaction log specifically, I would already be looking at Optane.


Source: https://habr.com/ru/post/414269/

