Column DBMS vs lowercase, how about a compromise?


Column DBMS actively developed in zero years, at the moment they have found their niche and practically do not compete with traditional lowercase systems. Under the cat, the author understands whether a universal solution is possible and how expedient it is.
“Progress is visible in everything ... you don’t need to be afraid that they will call you in the office and say:" We conferred here, and tomorrow you will be quartered or burned at your own choice. "It would be a difficult choice. I think he would be stumped by many of us. ”
Jaroslav Hasek. The adventures of the brave soldier Schweik.

Prehistory


How many databases there are, so much this ideological confrontation lasts. The author, out of curiosity, found in the bins the book of J. Martin from IBM [1] of 1975 and immediately came across the words (p. 183): “Binary relations are used in [...], i.e. relationship of only two domains. It is known that binary relations give the most flexibility to the base. However, in commercial tasks, relations of various degrees are convenient. ”Relationships are understood here as relational relations. And the mentioned works are dated 1967 ... 1970.

Let Sybase IQ was the first industrially used column DBMS, but at least at the idea level, everything was spoken 25 years before it.

At the moment, the following DBMSs are column-based or to some extent support this possibility (taken mainly here ):

Commercial


Free & open source


Differences


A relational relationship is a collection of tuples, in essence a two-dimensional table. Accordingly, there are two storage options - row or column. Separation is a bit artificial, logical. Database developers have long been no longer involved in planning drums and tracks. Optimally decomposing the DBMS data by file system (moms) is the task of the DBMS administrators, and how the file system manages the data on the physical disks is known mostly to the developers of the file systems.

It would be logical to provide the DBMS itself to decide in which order to store the data. Here we are talking about some hypothetical DBMS that supports both storage options and has the ability to assign a table to any of them. We do not consider quite popular option to support two databases - one for work, the second for analytics / reports. As well as column indexes a la Microsoft SQL Server. Not because it is bad, but to test the hypothesis that there is some more elegant way.

Unfortunately, no most hypothetical DBMS can choose how best to store data. Since does not have an understanding of how we are going to use this data. And without this, the choice is impossible, although it is very important.

The most valuable quality of a DBMS is the ability to quickly process data (and the ACID requirements, of course). The speed of the DBMS is mainly determined by the number of disk operations. From here there are two extreme cases:


But these are extreme cases, in life everything is not so obvious.


In other words, under no circumstances will our hypothetical database management system assume the responsibility of choosing between the row and column-based storage options, the DB designer must do this.

However, given the above, and the database designer will be in a situation of very difficult choice. He would be stumped by many of us.

What if


In essence, both the column and inline variants are extreme cases of the same idea — cut the table into “ribbons” and store data line by line within each tape. Just in one case, the tape is one; in the other, the tapes degenerate into one column.

So why not allow intermediate options - if the data of some columns come / read together, let them be on the same tape. And if there was no data in the tape (NULLs), then nothing needs to be stored. At the same time, the problem of the maximum row size is removed - you can split the table when there is a risk that the row does not fit on one page.

This idea is not so special that the author has ever seen it and applied it himself. The element of novelty is to enable the database designer to determine for himself how exactly his table will be divided into parts and in what form the data will fall on the disk.

We did it for ourselves as follows:


How can it look like in practice?

For example:

CREATE TABLE twomass_psc ( ra double precision, decl double precision, … -- #pragma page_break j_m real, j_cmsig real, j_msigcom real, j_snr real, h_m real, h_cmsig real, h_msigcom real, h_snr real, k_m real, k_cmsig real, k_msigcom real, k_snr real, -- #pragma page_break … coadd_key integer, coadd smallint ); 

For example, the main table of the atlas 2MASS is taken, the legend is here and here .
J , H , K are infrared sub-ranges, it makes sense to store data on them together, because in research they are processed together. Here, for example :


First available picture.

Or, even more beautiful:

It's time to confirm that this has some practical meaning.

results


Below is:



Let's take a closer look at how it works:




Sources


[1] J.Martin. Database organization in computing systems. “World”, 1978
[2] Column indices, features of use
[3] Daniel J. Abadi, Samuel Madden, Nabil Hachem. ColumnStores vs. RowStores: How Different Are They Really? , Proceedings of the ACM SIGMOD International Conference on Data Management, Vancouver, BC, Canada, June 2008
[4] Michael Stonebraker, Uğur Çetintemel. “One Size Fits All”: An Idea Whose Time Has Come and Gone , 2005

Source: https://habr.com/ru/post/413051/


All Articles