
The reason for writing this article was a quite decent review, "How we tested VMware vSAN ..." by CROC. The review is worthy, but it contains a phrase I have been fighting for more than a decade. Storage administrators, virtualization specialists and integrators keep repeating: "A latency of 5 ms is an excellent result." The figure of 5 ms has not changed in ten years. I have heard it in person from highly respected admins at least a dozen times, from less respected ones dozens of times, and I have lost count of how often I have read it on the Internet. No, no, and no. For OLTP workloads, 5 ms, especially measured the way it is usually measured, is an epic fail. I have had to explain why many times already; this time I decided to collect my thoughts in a reusable form.
Let me say right away that the article mentioned above does not contain such errors; the phrase simply acted as a trigger.
Typical start
Everything described in this article applies to the common DBMSs used for typical business OLTP. I have the most experience with MS SQL Server, but at least for PostgreSQL, Oracle and Sybase most of the points and conclusions hold as well.
Nobody is ever happy with DBMS performance. If a large system contains a DBMS (and it almost always does), then that DBMS is the bottleneck, or immediately becomes one as soon as you optimize everything else. So the customer comes and says in a human voice: "Help! Save us! We paid $NNNNNNNN for servers and storage, but the speed is not growing! The administrator tuned it, the vendor advised us, and it still does not move." If the system's developers fit Lavrov's definition (I can do without the exact quotation) and the operations staff "fight incidents by rebooting the server," then the problem is usually simple and banal: missing indexes, crooked queries, fatal configuration errors (the ones the documentation warns about in bold: "never do this!!!"), excessive locking, deadlocks and other simple, understandable nonsense. Such cases are the majority, but not all of them. If the system is in real trouble, or the load has crossed some invisible limit, it will either die of these problems or move on to the next level.
SQL Server diagnostic tips: IMHO, the best tool right now is the SQL Server First Responder Kit promoted by Brent Ozar; it is being developed very actively. There is also a decent set of scripts by Glenn Berry, and he has not abandoned his project either. Both sets are beautiful in their own way, and reading their comments and queries opens up a lot of new things the first time around. I myself always start looking around with sys.dm_os_wait_stats, a quick glance at the error log, and finding out whether at least some working backup system exists.
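For those who want a starting point without installing anything, here is a minimal sketch of the kind of first look I mean (the list of excluded benign wait types is far from complete; treat it as an illustration, not a tuned filter):

```sql
-- Top cumulative waits since the last service restart.
-- WRITELOG and PAGEIOLATCH_* near the top point at the disk subsystem.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE',
                        N'XE_TIMER_EVENT', N'BROKER_TASK_STOP',
                        N'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;
```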
At this level the server is no longer under the director's desk, the drives are no longer inside the server but in a storage system, the developers know about indexes, the administrators already know how to use PowerShell, and the IT managers have started saying buzzwords like SLA and RPO/RTO. And at this level an interesting situation arises:
- The DBMS is the bottleneck.
- By all indicators, the server should be sufficient.
- On the software side the DBMS can still be improved, but it is hard (you either move to more expensive licenses, or enter the "red zone" of the Shipilev optimization curve).
- The disk system was bought at great expense and, it seems, is even configured somehow.
But no. The crocodile is not caught, the coconut does not grow, and system performance is the same as on the old server, or lower. I look into sys.dm_os_wait_stats and see WRITELOG, PAGEIOLATCH_SH and PAGEIOLATCH_EX at the top, with average wait times of 5+ ms. Well, typical: "Hey, admins and DBAs, your disk subsystem is the bottleneck," and here the old song about 5 ms begins:
- Our SLA says 5 ms
- But our array does 20,000 IOPS
- The vendor told us that all database files can live on one partition
- We have virtualization and hyperconvergence, we cannot allocate separate disks for the database
- According to our data, your utilization is 5%
- Everything is configured according to the recommendations
- Your database does not need high performance, it does not do more than 300 IOPS (and we have an array good for those same 20,000 IOPS)
By the way, all of the above applies not only to on-premises servers but also to cloud services and virtualization. The cloud has its own specifics, but the typical clinical picture is about the same: a moderately optimized database, intelligent development and maintenance staff, headroom in CPU and memory, and almost zero return on further investment.
So here it is: this whole song about "5 ms" is utter nonsense. If you say it yourself, read this article. If others say it to you, prepare your arguments. I used to get angry when I heard these words, but I am not angry anymore. Like that bowl of petunias from The Hitchhiker's Guide to the Galaxy, I have only one thought: "Oh no, not again..."
Who is to blame?
Why is the database so slow? Well, a typical server with 20-64 cores at 2-3 GHz should be capable of 50-150 billion simple operations per second, yet even the best (synthetic) database benchmarks show only 10,000-50,000 transactions per second on such machines. Hey! That is from one million to tens of millions of operations per transaction. This is not just a lot, it is monstrously much.
The cause of this overhead is the ACID set of transaction requirements:
- Atomicity: either the entire transaction completes, or none of it does.
- Consistency: at the start and at the end of a transaction the system is in a consistent state.
- Isolation: transactions do not see each other's intermediate states.
- Durability: once a transaction is successfully committed, the changes it made must survive, whatever happens.
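Before going further, a minimal illustration of what these requirements mean in practice (the dbo.accounts table is hypothetical, just for the example):

```sql
-- Hypothetical table: dbo.accounts(id int, balance money).
SET XACT_ABORT ON;  -- any runtime error rolls the whole transaction back

BEGIN TRANSACTION;
    UPDATE dbo.accounts SET balance -= 100 WHERE id = 1;
    UPDATE dbo.accounts SET balance += 100 WHERE id = 2;
COMMIT TRANSACTION;
-- Atomicity: either both updates become visible, or neither.
-- Durability: once COMMIT returns, the changes must survive a crash --
-- and it is the WAL (described below) that makes that promise affordable.
```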
By the way, to the letter these requirements are met almost nowhere and never, and in distributed systems simply never (the CAP theorem gets in the way). For our situation, the "D" requirement is probably the most expensive of all, and it is provided by the key mechanism of all common OLTP DBMSs: the WAL, write-ahead log (PostgreSQL), also known as the transaction log (SQL Server), also known as the REDO log (Oracle). Here it is: both the stone around performance's neck and the foundation of transaction durability.
What is WAL
Let's forget for a while about modern SSDs and fancy storage systems. Suppose we have a server with one or several disks in it.
Any transaction, even the insertion of a single record, is potentially (and in fact almost always) a non-atomic action. We almost always need to change not only the page where the record lives, but also index pages, and possibly service pages. Within one transaction the same page may change many times. In parallel, other transactions are executing, and adjacent transactions constantly "pull on" the same pages. If we waited for each page to be written to disk before continuing, which is essentially what Durability requires, we would have to write many times more and wait for each write to land on non-volatile media. No caches, no reordering of operations in the queue, otherwise integrity is lost! Moreover, we somehow need to note which data belongs to committed transactions and which does not yet (and what the data was before). To grasp the scale: a typical single hard disk (HDD) in this mode gives 50-100 IOPS, and this has been a constant for 20 years. Even a small transaction would require 5-10 write operations. Oh yes, and to know what to write, you first have to read: even very write-heavy OLTP systems read 3 times more than they write. So our transaction costs 20-40 IOs, which means 0.2-0.8 seconds per disk.
Two transactions per second. Not much, is it? Shall we spread the load across disks? Alas, we still have to wait until the previous write completes, so in the end there is no parallelism. What to do? Let's keep a log file in which we sequentially record all write operations to the database, plus transaction markers! The pros:
- Information about an operation can be much more compact than writing out the whole page (a typical page is 8 KiB; a log record is often 0.5-1 KiB).
- Instead of marking directly in the page whether its transaction is committed or not, it is enough to mark the beginning and the commit of the transaction in the log.
- Pages can be written not after every transaction, but many times less often. Reading and writing of data is completely decoupled from the log.
- The main thing: if we put the log on a separate disk and write its records sequentially, then, since the disk heads no longer have to reposition constantly, even a consumer HDD squeezes out up to 1000 IOPS in this mode. Given that small transactions "cost" 2-4 log records, we can squeeze out 200-400 TPS.
- In case of failure, the state of the data file can be restored from such a log, and when a transaction is cancelled, its changes can be rolled back using it.
This log is called write-ahead log / transaction log / REDO log.
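As a curiosity, you can peek into this log yourself. A minimal sketch using fn_dblog, an undocumented but well-known SQL Server function (fine for poking around, do not rely on it in production code):

```sql
-- The most recent log records of the current database:
-- every data change leaves its compact trace here.
SELECT TOP (10)
    [Current LSN],
    Operation,            -- e.g. LOP_INSERT_ROWS, LOP_BEGIN_XACT, LOP_COMMIT_XACT
    Context,
    [Transaction ID],
    [Log Record Length]   -- note how small these records are vs. an 8 KiB page
FROM sys.fn_dblog(NULL, NULL)
ORDER BY [Current LSN] DESC;
```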
Hooray! Super! We had 2 transactions per second, now we have 300: a 150x improvement. And at what cost? As it turns out, the price is significant:
- In all common DBMSs, logging is strictly sequential. One thread is responsible for writing the log. You have 100 processors? Cool. The log will still be written by a single thread, with a queue depth of exactly one.
- Still no OS caches, no reordering of operations: the durability requirement remains. Writes are write-through: until the disk answers "I wrote it, and I wrote it to the actual media, not to a cache," the DBMS does not continue.
- If you put the log file on a data disk, almost all the benefits of sequential writing disappear. Moreover, strictly speaking, if there are several databases on the server, each should get its own log disk.
- Rolling back a transaction (at least in MS SQL Server) means reading the log and restoring the previous state from it. This is as many write operations as the transaction performed, or even more. Rollback is expensive!
This explanation is very simplified, a back-of-the-envelope one, but it is enough for our topic. The WAL is the key, fundamental mechanism of transactional integrity; it is mandatory write-through, single-threaded, sequential-write-only and, in storage terms, runs at queue depth 1.
If you are interested in this topic: write-ahead logging in databases should be known, at least at a basic level, by anyone who in one way or another administers a DBMS or the DBMS infrastructure, or develops databases.
WAL and storage
Storage vendors have been dealing with DBMSs "from birth." It is for databases that business buys these insanely expensive systems: at the street prices of Dell-EMC, HP, Hitachi or NetApp arrays, the eyes of most top managers fill with tears at budgeting time, unless, of course, they get a percentage of that price. But there is an engineering-versus-marketing conflict here. I will explain it using Dell-EMC as an example, but only because I remember where their documentation lives.
So:
- The log is single-threaded, with no reordering.
- The log is write-through, i.e. its latency is "an eternity" compared to CPU speed.
- An OLTP load is a lot of relatively small transactions.
- Most other DBMS loads parallelize one way or another.
Amdahl's law mercilessly tells us that a single-threaded, latency-bound load makes adding processors useless: performance will be determined by the log. And at that point we stop caring about storage performance in IOPS; only latency matters.
But do not discount the other disk operations: reads and writes to the data files and tempdb. Reading is also a "waiting" operation: until a data page has been read from disk into memory, the processor cannot process it. But for these operations, deep queues and reordering within the queue are possible: the DBMS often knows in advance which pages need to be loaded into memory and which flushed to disk, and queues many reads at once. Since what matters in this scenario is when the last operation in the batch finishes, here, on the contrary, IOPS matter more than the latency of an individual operation. To grasp the scale: reads are 85-95% of operations in a typical OLTP system. Yes, writes are far fewer.
The storage vendors' development engineers work closely with the database vendors and know all the technical nuances of how a DBMS works with the disk subsystem. Proper planning, partitioning and allocation of disk resources for a DBMS is a complex and important competence of the storage administrator. At the same Dell-EMC, even the basic white papers H14621 and H12341 on disk layout recommendations for SQL Server run to a hundred pages. Hey! These are not detailed docs, these are the most generic white papers! There are plenty of more specific ones (h15142, h16389... there are tons of them). The "subcontractors" from VMware keep pace too: Architecting Microsoft SQL Server on VMware vSphere. Note that these documents are aimed not only (and not so much) at DBAs as at infrastructure and storage administrators.
I will also note that all these documents carve out separate LUNs for data, for logs and for the tempdb database. Yes, somewhere in the fresher documents it is carefully noted that for All-Flash solutions there is no point in putting logs on physically separate media, but they still suggest cutting separate LUNs. If data and logs land on one LUN, then from the OS point of view it is one IO queue, and here a problem arises: latency on log operations immediately becomes an order of magnitude higher, and because non-reorderable log writes sit in the queue, IOPS on the data files and tempdb drop as well. This is not the "discovery of the century," it is an elementary truth of working with databases, and it was not made obsolete or cancelled by the arrival of All-Flash. Yes, SSD operations are an order of magnitude faster than HDD operations, but they are still a couple of orders of magnitude slower than memory operations. IO is still the bottleneck of the DBMS.
And the technical documents correctly emphasize that for transaction logs the number of IOPS is not what matters; what matters is that latency be minimal (in today's terms: less than 1 ms).
Marketers, however, need to sell. Hyperconvergence! Virtualization! Deployment flexibility! Deduplication! Ease of setup! Lots and lots of IOPS! Beautiful presentations, confident voices, strict suits. How else do you sell a solution with a 6-7 digit dollar price tag? Somehow it gets forgotten that you can get either latency or throughput out of a storage system, but not both at once; that some license for a load balancer costs as much as another shelf; that if intensive writing lasts more than an hour, the controllers run out of cache and performance drops to "as if there were no cache"; that training the customer's staff costs another 100,000 rubles for the first course alone; and similar tricks...
5 ms
Whether from listening to marketers, from laziness, or because of some other cockroaches in the head, storage administrators often do something like this: take a big shelf, pool it all into one flat space, cut it into thin-provisioned LUNs, and hand out a LUN per server. Or two, because "the system partition deduplicates well." And when I show that SQL Server is suffering on the disk subsystem, the same song starts: "5 ms is a great result," "100,000 IOPS," "your load on the storage is less than 5%."
NO
- For OLTP systems, 5 ms on a partition with the WAL / transaction logs is an unacceptable figure. On near-commodity hardware costing 1000 (in words: one thousand) times less, the normal figure today is 0.1-0.3 ms, and tomorrow it will be 0.01 ms. Nobody needs 2008-vintage HDD speed at the price of a whole Moscow apartment block. No "enterprise serviceability" is worth that.
- The vendor writes that transaction logs are not demanding on IOPS and can be put on HDDs? That is true. But for that, no task other than writing the DBMS log may touch those disks, and the storage must acknowledge the write to the server as soon as the data reaches non-volatile memory (which is much earlier than it is actually written to the platters).
- Thin-provisioned disks for real OLTP databases are evil.
- For the WAL it is completely irrelevant how many IOPS can be squeezed out at a queue depth of 10 or 20. There is no such depth there.
- For the WAL, the observation that "the IO queue in the OS is only about 1" proves nothing. It will never be more than that.
- No, DBAs and database developers are not "clumsy woodpeckers who can't properly configure WAL writes to be parallel" (an actual opinion of one admin).
- The logic of utilization lovers, "since your crookedly configured system does not do 10,000 IOPS on one partition, it must be moved from the high-end array to mid-range," is wrong logic.
- If a 40-core server shows 2.5 percent CPU load, this does not mean it has nothing to do; most likely it means there is some task blocking all the others.
When some data load takes 5 minutes on a developer's laptop, while on a 40-core server with 1 TB of RAM and a half-million-dollar storage system the same task takes an hour, even the most patient customers start asking questions about the reasonableness of the costs.
| Average latency on the WAL partition | Transactions per second will never exceed |
|---|---|
| 5 ms | 200 |
| 1 ms | 1000 |
| 0.5 ms | 2000 |
| 0.1 ms | 10,000 |
| 0.05 ms | 20,000 |
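The table is simple arithmetic: with a single-threaded log at queue depth 1, the theoretical ceiling is one commit per log-write latency. A throwaway sketch that reproduces the numbers:

```sql
-- Max TPS ceiling for a single-threaded, queue-depth-1 log:
-- 1000 ms in a second divided by the average log write latency.
SELECT latency_ms,
       CAST(1000.0 / latency_ms AS int) AS max_tps
FROM (VALUES (5.0), (1.0), (0.5), (0.1), (0.05)) AS v(latency_ms);
```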
What to do
Tips for administrators and DBA
For OLTP, stop looking at "utilization" and IOPS. Separately, I will note: never look at IOPS measured at a large queue depth at all. Even on data partitions, deep queues are usually a short burst or something that does not affect actual OLTP performance.
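What to look at instead is latency as the database itself experiences it. A minimal sketch over sys.dm_io_virtual_file_stats (the counters are cumulative since restart, so compare two snapshots for live numbers):

```sql
-- Average I/O stall per database file; LOG files with avg_write_ms of 5+
-- are exactly the problem this article is about.
SELECT
    DB_NAME(vfs.database_id)                                    AS database_name,
    mf.type_desc,                                               -- ROWS = data, LOG = transaction log
    vfs.num_of_writes,
    vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms,
    vfs.num_of_reads,
    vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_ms DESC;
```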
Splitting disk space into separate LUNs is not a DBA's whim. A database has several distinct disk load profiles. At a minimum, the following can be distinguished:
- Work with data files. Usually random reads and writes in 8/64 KiB blocks, with reads at 80-95%. Queues arise during maintenance windows, during bulk loads, on inefficient or massive queries, and at checkpoints. Performance is mostly affected by read speed. It is important that 8/64 KiB block alignment is preserved end-to-end through the whole storage stack.
- Work with tempdb. The same as with data files, but reads are typically 40-75% and write responsiveness can matter. In modern MS SQL systems this database can be loaded several times harder than the user databases. In a non-clustered DBMS configuration this partition should be excluded from any storage-level replication: its contents are destroyed on service restart anyway.
- Work with archive data / DWH. Reads close to 100%, with a typical read block of 64 KiB. Queries read a lot and sequentially, so the queue can jump to 1000 and beyond.
- Work with transaction logs. Reads happen only for maintenance (backup, replication, etc.); application performance is affected only by writes. Writes are in 0.5-64 KiB blocks, with no queue, in a single thread. Latency is critical for the application.
- Backup and restore. From the database point of view this is reading in large blocks (often 1 MiB). It is important that this load can saturate the channels/buses (both FC and Ethernet) and, in some cases, the storage processors. Backing up one server can affect the performance of other servers on the same SAN/storage.
- Work with application files: logs, the default trace, binaries and so on. This load is rarely significant and matters mostly at system startup.
There are other load types, but they are somewhat exotic (for example, FileStream directories holding files stored in the database). All these load types have different, often contradictory disk requirements. If you dump them all onto one partition, you not only degrade performance; far more importantly, you lose the ability to understand why the system is slow, and you lose the ability to improve just the part that needs improving without a global storage upgrade. Hence the main recommendation:
Read the recommendations of the storage and DBMS vendors, try to understand why they advise what they advise, and design the system with the different load types in mind.
- Cut separate LUNs for the different load types; the Dell/EMC documents for SQL Server mentioned above are a good starting point.
- Measure latency, not just IOPS. As a sanity baseline, compare against some "naive" commodity box (say, a NUC with an SSD or a developer's desktop). If the expensive array loses to it on log latency, something is configured wrong.
- When someone tells a DBA that 5 ms is fine, point at the table above: that figure means a hard ceiling of 200 transactions per second.
- Remember that synchronous replication (MetroCluster and the like) adds its delay to every log write: an extra +0.5 ms turns 0.2 ms into 0.7 ms, more than 3 times worse.
- Do not forget tempdb: it has its own load profile, and with RCSI enabled the row-version traffic lands there as well.
- Latency and throughput are different things. Marketing sells throughput, but the transaction log needs latency; you cannot buy one and get the other.
Tips for MS SQL Server developers
If in MS SQL the bottleneck is the transaction log, there are a few things that can be done:
- Batch your changes. This is the right thing to do in general: one transaction inserting 1000 rows is 5-30 times faster than 1000 separate single-row INSERTs (see the sketch after this list). A staging table in tempdb is also an option: tempdb logging is much cheaper, since no redo information is needed there. Bulk loads are better done with BULK INSERT and minimally logged operations, which require the "Simple" or "Bulk logged" recovery model; keep in mind what switching between Simple/Bulk logged and Full does to your backup chain. The details are in The Data Loading Performance Guide, and for truly extreme cases (ETL rather than OLTP) see We Loaded 1TB in 30 Minutes with SSIS, and So Can You.
- SQL Server has Delayed Transaction Durability: if you can afford to lose the last moments of work on a failure, it removes the synchronous wait for the log flush (see the sketch after this list).
- SQL Server also has In-Memory OLTP. It can take the pressure off the log almost entirely, but it is a separate big topic with many restrictions (a sketch follows after this list).
- And remember that synchronous AlwaysOn replicas add their own latency to every commit.
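To illustrate the batching point above, a toy sketch (the table dbo.t is hypothetical; on a log-bound system the second variant pays the log-flush latency once instead of a thousand times):

```sql
-- Assumes a throwaway table: CREATE TABLE dbo.t (id int, payload char(100));
SET NOCOUNT ON;

-- Slow: each autocommit INSERT is its own transaction,
-- i.e. one synchronous log flush per row.
DECLARE @i int = 0;
WHILE @i < 1000
BEGIN
    INSERT INTO dbo.t (id, payload) VALUES (@i, 'x');
    SET @i += 1;
END;

-- Fast: one transaction, one commit, (roughly) one log flush for all rows.
BEGIN TRANSACTION;
DECLARE @j int = 0;
WHILE @j < 1000
BEGIN
    INSERT INTO dbo.t (id, payload) VALUES (@j, 'x');
    SET @j += 1;
END;
COMMIT TRANSACTION;
```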
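And a minimal sketch of Delayed Transaction Durability (MyOltpDb and dbo.t are hypothetical names; with ALLOWED individual transactions opt in, with FORCED everything is delayed):

```sql
-- Database-level switch: DISABLED (default) | ALLOWED | FORCED.
ALTER DATABASE MyOltpDb SET DELAYED_DURABILITY = ALLOWED;

BEGIN TRANSACTION;
    INSERT INTO dbo.t (id, payload) VALUES (1, 'x');
-- Returns without waiting for the log flush; the most recent transactions
-- may be lost on a crash - that is the trade-off.
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
```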
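For completeness, a sketch of an In-Memory OLTP table (assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup; SCHEMA_ONLY durability skips logging the data entirely at the price of losing it on restart):

```sql
-- Memory-optimized table; with SCHEMA_AND_DATA changes are still logged
-- (though more compactly), with SCHEMA_ONLY the log is not touched at all.
CREATE TABLE dbo.t_hot
(
    id      int       NOT NULL PRIMARY KEY NONCLUSTERED,
    payload char(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```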
***
That's all. The takeaway: an array doing 20,000 IOPS at 5 ms latency may be a great result for many workloads, but not for OLTP. For OLTP databases, watch latency first, not IOPS or utilization.
PS: a few words about SSDs and Intel Optane. Regular SSDs show their headline figures at large queue depths, which is not our WAL case. At queue depth 1, the write latency of an ordinary NAND SSD is typically 100-300 microseconds: good, but far from the marketing numbers, and it degrades as the drive fills and wears. Intel Optane is different: latency on the order of 1-20 microseconds, regardless of queue depth and load. That is exactly what the transaction log wants.
Looking a bit further ahead: OLTP keeps moving "into memory," toward in-memory processing with ACID preserved, and a log latency around 20 microseconds changes the rules of the game. Low-latency devices like Optane are the natural home for the WAL.
(Perhaps a topic for a separate article) about Optane.