Battle of MERGE. Chronicle with conclusions and morality

A few weeks before the important komitfest - the last one before the feature freeze version of PostgreSQL 11 - hackers readers, squeezing the left bag of chips, followed the MERGE thriller. Thriller director, 2ndQuadrant CEO Simon Riggs , with impressive persistence and ingenuity, tried to drag a patch that implements the syntax of the MERGE command into the version. Riggs Comiters since 2009, and with the status of a commiter, you can approve patches yourself. He was opposed by no less respected commuters and veterans of PostgreSQL. Passions were seething, obviously and implicitly, yet it did not come to direct insults - an amazing fact for the regulars of many domestic forums. However, some tension remained until now, when the question was settled, and there is nothing to argue about.

But passions (about them will be further), but I would like to dispassionately understand the essence of this completely non-contrived problem.

Merge outside

If it is quite simple, the thing is this: we have 2 tables with the same fields and different data. Let the name and age. We need to combine them into one. But it would be necessary to decide what to do with the personalities that are in both tables. Most likely, we will want everyone to be in the final table, and update the information to match individuals. It is clear that even in this formulation, this is a very common task. It can be solved without MERGE , making a complex query, you can use triggers and so on. But it is inconvenient. However, this task is solved by the non-canonical version of MERGE, which is called UPSERT (UPdate + inSERT).

The operator MERGE is in the standard SQL-2003 and already in all its glory in SQL-2008. It is implemented in Oracle, DB2 and MS SQL, which means the lack of MERGE will upset those who are thinking of switching from these DBMS to PostgreSQL. Simon Riggs' longing as soon as possible, already in PostgreSQL 11, was nourished by the wishes of the 2ndQuadrant clients, and not by ambitions or peevishness.

In fact, MERGE has rich capabilities, data does not necessarily have to be taken from tables, especially from those similar in structure.

The command syntax is:

  MERGE INTO tablename USING table_reference ON (condition) WHEN MATCHED THEN UPDATE SET column1 = value1 [, column2 = value2 ...] WHEN NOT MATCHED THEN INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...]);

You can, however, like this:

 MERGE [hint] INTO [schema .] {table | view} [table_alias] USING { subquery | [schema .] { table | view}} [table_alias] ON ( condition ) [ merge_update_clause ] [ merge_insert_clause ] [ error_logging_clause ] ;

This syntax is implemented in Oracle. If in words, MERGE performs actions that modify the entries in the target target_table_name table using the data_source in a single SQL command that can do INSERT, UPDATE or DELETE for the entries in target_table_name, according to the conditions. In this case, target_table_name can be a view, and the data_source can be a set of tables or views, the result of a subquery .

First, the MERGE operator performs a left outer join on the data_source with the target_table_name , suggesting 0 or more change candidates; in the specified order WHEN clauses are calculated; as soon as the condition is satisfied, the corresponding action is taken. The keywords WHEN [NOT] MATCH THEN are not often used in SQL , so we remind you that this is an if-else control construct in other languages. MERGE acts in the same way as UPDATE, INSERT or DELETE with respect to target_table_name , only the syntax of the whole command differs.

A clause with ON must make a connection across all primary key columns or, if other columns are specified, then a unique index must be used so that [NOT] MATCHED immediately determine the actions for the candidate record to eliminate interaction with other transactions.

MERGE deterministic command: you cannot update the same record several times in the same MERGE command.
Example:

 MERGE CustomerAccount CA USING RecentTransactions T ON T.CustomerId = CA.CustomerId WHEN MATCHED THEN UPDATE SET Balance = Balance + TransactionValue WHEN NOT MATCHED THEN INSERT (CustomerId, Balance) VALUES (T.CustomerId, T.TransactionValue);

or with a subquery:

 MERGE INTO bonuses D USING (SELECT employee_id, salary, department_id FROM employees WHERE department_id = 80) S ON (D.employee_id = S.employee_id) WHEN MATCHED THEN UPDATE SET D.bonus = D.bonus + S.salary*.01 DELETE WHERE (S.salary > 8000) WHEN NOT MATCHED THEN INSERT (D.employee_id, D.bonus) VALUES (S.employee_id, S.salary*.01) WHERE (S.salary <= 8000);

In IBM DB2, the syntax will work too. As they say , "under the hood" it will be performed similarly to the UPDATE FROM construct.
In MS SQL since 2008 there is also MERGE .

But even behind a single, standard syntax, the problem of choosing from a considerable number of mechanisms and methods of implementation begins. The team should work on different levels of transaction isolation, with different blocking algorithms, focusing on a highly competitive or not very mode of operation. And, as you can guess, to implement this complex logic, you need to touch on a lot of the DBMS components.

UPSERT, pseudo-MERGE

It is clear that the developers of the DBMS were looking for compromise solutions, refusing to literally reproduce the syntax of the standard. The advantage of this approach is freedom. You can use mechanisms that are organic for a specific DBMS; you can optimize the implementation for tasks that you consider most relevant for your users.

For example, in MySQL there is a REPLACE command that works as an INSERT , but if the new and old lines have the same values in the PRIMARY KEY or UNIQUE index, then the old line is killed before the new one is inserted. But there is also INSERT ... ON DUPLICATE KEY UPDATE , where INSERT and UPDATE occurs (instead of DELETE in REPLACE ). This is a UPSERT . INSERT IGNORE , there is an INSERT IGNORE that simply does not perform an insert without giving an error (but warning) under certain restrictions on the target table.

Chronicles PG MERGE

In the PostgreSQL community, talk about MERGE started in 2005, when Jaime Casanova asked: Did anyone in the community MERGE in developing MERGE ? Peter Eizentraut (Peter Eisentraut) suggested discussing whether to develop some of the MERGE variants for PostgreSQL: similar to the implementation in MySQL, or it is better to send forces to the Oracle- MERGE version of the MERGE type. However, is it worth making efforts in this direction?

In the middle of a short discussion, the main character of this narrative, Simon Riggs, appears with the words:
MERGE is useful for both OLTP systems and DW (Data Warehouse - data warehouses, that is, analytical applications, where complex queries, but not too competitive environment and data are rarely updated, and if they are updated, then usually in large chunks. <...> We can implement MERGE as a COPY FROM option, it will be very cool.

Everyone agrees: yes, cool. More precisely, almost everything: Stephen Frost (Stephen Frost): I think I am not the only one who says that we need a full-fledged one that meets the MERGE standard.

Bruce Momdzhian (Bruce Momjian) has another, more pragmatic sentence: it seems to me that we need to implement some options in MERGE that we can implement, and in others we will give an error (and in those cases when we need to block the entire table). And after we get feedback from users and we will think what to do next.

But so far nothing is happening.

The ice has broken

In 2008, Simon Riggs again calls to deal with MERGE - which of the ways to choose (by that time a new version of MERGE appears in the SQL-2008 standard, while draft). He describes in detail the current implementations of Oracle, IBM, and MS SQL and the alternative syntax from MySQL and Teradata. And a bit later, he already mentions the beginning of work in the 2ndQuadrant in this direction.

Peter Yazentrut writes in his blog : Of course, Riggs is one of the most qualified specialists, he can lead the work on the implementation of MERGE.

But here comes the first unexpected turn: a student who is a participant in the development of the GSoC program, that is, Google Summer of Code, is connected to the problem. His name is Boxuan Bxzhai - I do not dare to transcribe the surname. Soon he writes that the work is almost done.

But almost does not count. Greg Smith of the 2ndQuadrant (i.e., Simon Riggs’s associate) writes:
So, we have a patch, in a code of which half a dozen serious unsolved problems. On the small, I am silent. The problems are too deep to refine the code for the commitfest. Meanwhile, from Boksuana long been heard nothing. We could help him, but where is he? Who knows?

Discussion of ways to implement again flashes in 2014 , but again nothing happens: there is no code.

Finally, already in 2017, Simon Riggs writes:
I am working on code to commit MERGE to PostgreSQL 11 . We use the same mechanisms that form the basis of an already running INSERT ON CONFLICT , so no infrastructure changes will be needed, basically just implementing the syntax on top of it. But I write my code from scratch, I don’t use previous work.

We are talking about Peter Geoghegan ( VMware ) implemented by that time in 9.5 alternative INSERT .. ON CONFLICT UPDATE syntax INSERT .. ON CONFLICT UPDATE , different from the SQL standard, but still related to MERGE and REPLACE in MySQL.

At first, Simon’s work was greeted by a nice work! However, Robert Haas , although he supports, he warns about possible serialization anomalies. Like, dealing with INSERT .. ON CONFLICT UPDATE , without MERGE at its base, is somehow calmer.

UPSERT author in PostgreSQL, Peter Gagan:
I would not mix the ON CONFLICT DO UPDATE and MERGE . <...> For loading large portions of data ( bulk load ), I would, for example, use the merge join algorithm. <...> In general, the advantages of MERGE would be related to the fact that ordinary connections would work there in the usual way: nested loop, hash, merge . And in INSERT … ON CONFLICT there are no joins at all.

Haas: Like Peter, I think that if you do it this way, such a strong blockage when executing a DML query looks so-so. It is unlikely that someone will be pleased that only one person can work with MERGE at one time.

For the curious: Geygan examines the subtleties and coarseness of the differences between MERGE and MERGE here and here (we keep archived PostgreSQL correspondence on our website).

Simon resists. He appeals to the Newest History. Like, about sectioning, they also said "new syntax, nothing more." And it turned out to be a very useful thing. I do not propose to immediately implement everything that is in MERGE. We will do the same as with partitioning - we will break the development into phases.

And one more argument, in my opinion very convincing: Good. But let's choose. I offer a practical option. Soon he will knock 10 years from the first serious attempt to develop MERGE . Isn’t it time to start doing something, get some useful solution instead of waiting another 10 years of the Perfect Solution? Assuming that it exists at all.

Finally, the patch arrives in the community. What date? Guess please. No, they did not guess: Simon sends it on December 30, 2017. And it stipulates that this is a WIP patch, that is, Work in Progress is a patch in work.

Simon, January:
The patch is completed without any special bugs. 1200 lines of code plus tests and documentation. I'm going to commit it to this komitfest, and RLS (Row Level Security - protection at the recording level) and support sectioning will finish later.

Caste of Comiters

Here we will have to take another step aside and explain the role of the commiter in the community. The functions of the commiter, that is, the one who has the power to accept the patch in the next version, have historically changed. Once, at a time when there were still few developers, the right to distribute was distributed generously. For example, the famous (in a completely different field) Julian Assange received the title of cometer, being the author of only six patches. It’s not easy now to be a commiter, there is no upstart in the list of a couple dozen people. Boys Momdzhan ( EnterpriseDB ) has 13363 komitov, Tom Lane (Tom Lane, Crunchy Data ) 13127, and Robert Haas ( EnterpriseDB ) - 2074. By the way, the only comiter from Russia is Fyodor Sigaev (Teodor Sigaev, Postgres Professional ) with his 383 komitami . Simon Riggs himself has 449. I repeat: he, as a commiter, has enough authority to take and commit patches — his own and his employees. Another thing is that it is hardly worth doing this, frankly disregarding the opinion of other leading luminaries-cometers. They may also deprive the status of a commiter, but at least they will roll back ( revert ) the patch.

Fracture in battle

Of course, in the “bezbezhny” patch, made, in general, hastily, find all the new errors. New versions are coming in response.

At the end of January, a new actor appears: developer 2ndQuadrant Pavan (his name is everything, by name; completely Pavan Deolasee). Now the community is dealing with the tandem: Pavan sends new versions and thanks for the criticism, and Simon breaks through them with a remarkable marketing rush.

Haas: I don’t think it’s worth making unilateral decisions to exclude features that work everywhere. If we agree that some features will not be included in this patch - this is one thing. And quite another thing is that in the comments on this occasion everyone expressed disagreement. And we have not really heard the reasons why these features should be excluded.

The logic was presented as follows:

a priori, there are serious problems because they cannot but be in developments in the "cavalry attack" style.
support even important features like new sectioning in versions 10-11, CTE (Common Table Expressions = WITH-requests) or RLS (Row Level Security) can be done after the patch is adopted in the current version, but only if the proposed architecture is suitable for building on top of its necessary functionality.

The second Peter Gagan formulates as follows:
I usually pay attention to the support of various functionality, because if it is, then it strengthens the general confidence that the design is done as it should . And if such problems are caused by the support of expressions WITH [that is, CTE ], then I have a thought that the underlying architecture is such that it causes problems here and there.

Meanwhile, the hour X (the last komitfest) is approaching, and the clouds are gathering over the MERGE. It is not that the founding fathers specifically sought out serious problems in the architecture of Simon and Patan abruptly doing the patches. Problems did not have to look, they willingly opened themselves.

The end is near

The plot is accelerating. Despite the cool attitude of other committees to his venture, on April 2, Simon decides to commit the Command Command following SQL: 2016 patch, adds files, Depesz (Hubert Lyubachevsky) has time to announce it on his blog, but on the same day Simon rolls back everything because of mistakes.

The next day, commit again, adding support for WITH .

In response, the allegations are truly grave. Andres Freund (Anders Freund, EnterpriseDB ) writes:
The architecture for MERGE in parser and executor did not make me feel reliable. Creating hidden joins during parsing is a completely bad idea. This executor structure must be completely changed.

Tom Lane:
The design of the parsing tree is weak.

You overload the InsertStmt function, - he continues - it does not use INSERT at all, but at the same time it randomly has the same fields as the original one. And not all, but some. This is bad, it leads to confusion.

Let's dab observation Fyodor Sigaeva :
INSERT nodes connected to MERGE appeared in the parser, hung with a bunch of additional fields. If you look at the execution plan in ANALIZE , you will not immediately understand whether you are dealing with a regular INSERT , or with MERGE : in order to figure it out, you need to look at additional fields.

Simon, calmly: OK, we will change this and tomorrow we will send a new file .
Haas: I agree with Peter. Architecture selection failed.

Simon does not give up. On April 6 , responding to the criticism of Tom Lane, the commit is a new patch as amended in the parser.

Negotiations and surrender

Bruce Momdjan April 6 :
I want to note that people did not ask you to work hard to fix something urgently. They asked you to withdraw the patch. You can, of course, work by the sweat of your brow, in the hope that they will change their minds, but - once again - they did not ask you that.

Simon: If Tom [Lane] and Andres [Freund] for these remaining few days still feel that their fears are not dispelled, I will be happy to roll back the patch without further ado.

Tom Lane: I still vote for the patch to be rolled out. Even if he was perfect now, now there is no time for people to be convinced of this - up to his throat for other urgent matters.

Everything.

Simon said OK, and the battle for MERGE over. All patches have been rolled back, the topic has been moved to the next komitfest with the status of “Waiting for the author's finishing touches”. The show participants were reconciled.

However, judging by the correspondence of the last weeks, some tension seems to remain.

Promised morality

Fortunately, the PostgreSQL community has natural and formal mechanisms for (almost) conflict-free sifting attempts at immature solutions. Even if they are punched by respected developers at the level of the head of the company, whose contribution to the development of PostgreSQL is huge. And invest pushing customers who lack the functionality.
Unfortunately, the community often slips. It is inertia in making even unambiguously relevant developments. Sometimes irrational perfectionism is included. The experience of Postgres Professional, where I work, confirms this. We punched a large and important patch of INCLUDE-indexes for 3 years. The most useful series of patches for working with JSON / JSONB is still waiting. The expression “to give our development to the community” means not to actually give, but to break through : the guest is met with open arms and escorted to quarantine.

PS: Disclaimer from the author : we just wanted to show a piece of the life of the community. All matches are random :)
PPS: Samurai Natalia Levshinoy .

Source: https://habr.com/ru/post/412605/

All Articles