Library sorting

Take a reverse-ordered array and apply sorting to it with simple inserts .

Look at the creaking of the next element in the right place. For it, you need to free up the insertion space, because of which you have to move all the previously inserted elements.

And how it would be nice if there were free spaces between the elements inserted earlier! Then you would not have to drag a string of elements just for the sake of inserting one.

In 2004, three computer science experts — Michael Bender, Martin Farah-Colton, and Miguel Mostiro — decided this way to modify the sorting with simple inserts. They proposed to form an ordered part of the array, leaving gaps between the inserted elements.

The librarian needs the books to be arranged alphabetically on a long shelf: starting from the left of the letter “A”, the books stand flush with each other all the way to the “I”. If the library received a new book related to section "B", then to put it on the shelf in the right place, you have to move each book, starting somewhere in the middle of section "B" until the last "I". This is a sort of simple inserts. However, if the librarian reserved space in each section, it is enough for him to move only a few books to make room for new books. This is the basic principle of library sorting.

Algorithm

1. Create an empty auxiliary array, several times larger than the main one.
2. For the next item, look for the insertion point in the auxiliary array.
- 2.1. If you find a place to insert, then move the item and return to step 2.
- 2.2. If there is no space for insertion, re-balance the auxiliary array and return to step 2.
3. After processing all the elements we transfer them back to the main array.

At first glance, it seems that sorting is easy and simple. To dispel this illusion, consider the key points of the algorithm in more detail.

Size of auxiliary array

While there is no established opinion, how many times the auxiliary array should be larger than the main one.

If you take too much, then there will be a lot of space between the elements, but the search for the insertion point and rebalancing will be slower, due to the large distances between the elements. Rebalancing will happen less frequently, but will have to spend more resources on them. If you take too little, searching and rebalancing will be cheaper, but you will have to reformat the array more often. In general, there is still need to test with different values and method of scientific spear to determine the best option.

If we have determined how many times the auxiliary array is larger than the main array, then the formula for determining the exact number of elements for it looks like this:

NewSize = ε × (Size + 1) - 1

NewSize - the number of elements in the auxiliary array
ε - how many times the auxiliary array is larger than the main one
Size - the number of elements in the main array

If we simply multiply the size by the coefficient: NewSize = Size × ε , then for uniform distribution we will not have enough cells in the amount of ε - 1 pieces. That is, it is possible to arrange them evenly, but either the first filled cell or the last one will be right next to the edge of the auxiliary array. And we need the empty spaces of the filled cells to be reserved on all sides - including before the first element and after the last one.

It seems to be a trifle, but in fact it is important to rebalance, in order to guarantee free places for insertion in any place, including when processing the last elements of the main array.

Search insert location in auxiliary array

Of course, a binary search is needed here. However, the classic implementation does not suit us.

First, the auxiliary array basically consists of emptiness. Therefore, recursively dichotomizing the structure, we will mostly stumble upon unfilled cells. In these cases, you need to go a little left or right to the nearest non-empty cell. Then at the ends of the segment there will be significant elements that allow to calculate the arithmetic average and continue the binary depth search.

Secondly, do not forget about the edges. If you need to insert a minimum or maximum element, then a binary search among those previously inserted will not lead to anything. Therefore, it is worthwhile to provide for boundary cases - first check whether it is necessary to put an element near the left or right border of the array and, if not, then use a binary search.

Third, taking into account the specifics of the application, it is worth making additional amendments to minimize the number of rebalancing of the array. If the inserted element is equal to the value on one of the ends of the segment, then, perhaps, you should not insert it in the middle of the segment. It is more logical to put next to an element equal in value. This will more effectively fill the empty space of the auxiliary array.

You can come up with the fourth, fifth, and so on. The quality of the search for the insertion point directly affects the sorting speed, since the choice of unsuccessful insertion points leads to unnecessary rebalancing. For example, it may be worth dividing the segments not exactly in the middle, but closer to the left or right edge of the segment, depending on which edge is closer in value to the inserted element?

The binary search algorithm itself harbors pitfalls, and taking into account the above-mentioned additional nuances it finally becomes a completely non-trivial task.

Array rebalancing

Binary search is not the hardest thing to implement in this sorting. There is still rebalancing.

When there is no space for insertion (elements close by value are found, but there are no free cells between them), you need to shake the auxiliary array so that the insertion space is free. This shaking of the array is rebalancing.

Moreover, rebalancing is local or complete .

Local rebalancing

We shift as many elements as necessary to release the insertion point. The implementation of such a balancing is very simple, you just need to find out the empty cell nearest to the insertion point and use it to move several elements.

There are possible nuances. For example, in which side to look for the nearest vacant place? To avoid a situation where a shift is not possible (that is, if in one of the sides all cells are occupied to the very edge of the array), you can focus on the position of the insertion point relative to the middle of the array. If you need to insert in the left part of the array, then move to the right, if in the right - left. If ε ≥ 2, then this approach eliminates the situation in which a shift is impossible (because half of the auxiliary array has more than enough space for all elements).

In the author's interpretation of the algorithm, it is the local rebalancing that is meant.

Full rebalancing

An interesting local alternative is complete rebalancing. That is, in the auxiliary array, shift all the existing elements so that there are (almost) identical gaps between them.

I tried both options and for the time being I observe that with local rebalancing the algorithm works 1.5-2 times faster than with full. However, from the complete rebalancing can still be a good judge. For example, if local rebalancing has to be done too often, this means that there are many “blood clots” in the array in certain areas, slowing down the whole process. A complete rebalancing, carried out one-time, allows you to get rid of all local congestion at one stroke.

Let us analyze exactly how to produce a complete rebalancing.

First you need to calculate how many cells we can allocate for each element in the auxiliary array. It should be remembered that empty cells must be both before the first and after the last filled cell. The formula is as follows:

M - the number of cells that can be allocated to each element
NewSize - auxiliary array size
Count - the current number of non-empty elements in the auxiliary array

This fraction should be reduced to an integer value (i.e., rounded down). From the formula it is obvious that the more elements already transferred to the auxiliary array, the less cells we can allocate for the neighborhood of each element.

Knowing M , we easily get the exact position for each non-empty element in the auxiliary array, on which it must be after the completion of rebalancing.

NewPos = Number × M

NewPos - new item position after rebalancing
Number - what is the count of a non-empty element in the auxiliary array (1 ≤ Number ≤ Count)
M is the number of cells allocated to each element.

New positions are known, can one just go through non-empty elements in the auxiliary array and transfer them to other places? Oh, no, don't hurry. It is necessary not just to transfer the elements, it is important to preserve their order. As a result of binary search and insertion, elements can appear both strongly left and strongly to the right of the position in which they should be after rebalancing. And on the place where you need to move, there may be another element that also needs to be attached somewhere. In addition, it is impossible to transfer an element if there are other elements between its old and new positions in the auxiliary array - otherwise the elements will mix, and it is extremely important for us not to confuse the order.

Therefore, to rebalance it will not be enough to go through the usual cycle and simply shift each element from one pocket to another. We'll have to use recursion. If an element cannot be transferred to a new place (between its old and new positions there were other elements), then you first need to deal (recursively) with these uninvited guests. And then everything will be placed correctly.

Degenerate case

For most sorting inserts, a reverse-ordered array is the worst situation. And the sorting of the librarian, alas, is no exception.

The elements tend to the left edge of the auxiliary array, with the result that empty spaces quickly end. It is necessary to rebalance an array very often.

By the way, if we take an almost ordered array (the best case for sorting with simple inserts), we get the same problem. The newly arriving elements will hammer not the left, but the right side of the auxiliary array, which will also lead to too frequent rebalances.

Library sort effectively handles random data sets. Partial ordering (both direct and inverse) degrades speed performance.

Algorithmic complexity

On large sets of random data, the algorithm gives the time complexity O ( n log n ). Not bad at all!

On sets of random unique (or mostly unique) data, with proper selection of the coefficient ε and successful implementation of the binary search, the number of rebalances can be minimized, or even avoided. It can be argued that the algorithm has the best time complexity O ( n ).

A large percentage of recurring data and the presence of ordered (in direct or reverse order) subsequences in the array leads to frequent rebalancing of the auxiliary array and, as a result, to the degradation of the time complexity to O (n ² ) in the most unfavorable cases.

The minus of the algorithm, of course, is that O ( n ) additional memory is required for the auxiliary array.

Possible ways to improve

Although the algorithm itself is instructive and effective on random data, for a decade and a half, very few people showed interest in it.

If you google at the request of the “library sort”, you will find a cursory article in the English Wikipedia, the author's PDF (from which little is clear) and the rare reposting of this scant information. Plus there is a good visualization in Youtube, where the main and auxiliary arrays are originally combined. All links are at the end of the article.

Searching for the query “library sorting” is even more fun — in the sample you will find out by different sortings from different libraries, but these algorithms will have no relation to authentic library sorting .

And there is something to improve:

Empirical selection of the optimal coefficient ε .
Modification (taking into account the specifics of the general algorithm) of the binary search for the most efficient determination of the insertion point.
Minimize rebalancing costs.

If you polish these places, then maybe the library sort will even be equal to quick sort in speed?

Source

I did not have time to prepare the implementation in Python, there is a working version in PHP.

Main algorithm

function LibrarySort($arr) { global $arr_new;//  $e = 3;//     $rebalance_local = true;// (true)   (false)  //   $newSize = $e * (count($arr) + 1) - 1; $arr_new = array_fill(0, $newSize, null); //       $arr_new[LibrarySort_Pos(1, 1, $newSize)] = $arr[0]; //    -    //     $start = 0; $finish = $newSize - 1; $i = 1; //      while($i < count($arr)) { //        $pos = LibrarySort_BinarySearch($arr[$i], $start, $finish, $newSize); if($pos === false) {//        //    LibrarySort_Rebalance_Total($i, $newSize); } else {//  ,   ,    if($arr_new[$pos] !== null) {//   if($rebalance_local) {//  (+ ) LibrarySort_Rebalance_Local($arr[$i++], $pos, $newSize); } else {//  LibrarySort_Rebalance_Total($i, $newSize); } } else {//   ,   $arr_new[$pos] = $arr[$i++]; } } } //      $pos = 0; for($i = 0; $i <= $newSize - 1; $i++) { if($arr_new[$i] !== null) $arr[$pos++] = $arr_new[$i]; } return $arr; }

New element position in additional array after complete rebalancing

 // $number-    $count //     //$number -      ( )  //$count -       //$newSize -     //$number <= $count <= count($arr) <= $newSize) function LibrarySort_Pos($number, $count, $newSize) { return $number * floor(($newSize + 1) / ($count + 1)) - 1; }

Binary search for insertion point in auxiliary array

 //       //$search -     ,      //($start, $finish) -   ,     //$newSize -     function LibrarySort_BinarySearch($search, $start, $finish, $newSize) { global $arr_new;//  //      //      ,     //  ,       while($arr_new[$start] === null && $start < $newSize - 1) { ++$start; } //         , //         if($search == $arr_new[$start]) { return LibrarySort_PosNearby($start, $newSize); //  ,        } elseif($search < $arr_new[$start]) { //      //     if($start > 0) {// $start    $finish = $start; $start = 0; return floor(($start + $finish) / 2); } else {//$start == 0,      return $start;//    ,    } } //      ,     //  ,       while($arr_new[$finish] === null && $finish > 0) { --$finish; } //         , //         if($search == $arr_new[$finish]) { return LibrarySort_PosNearby($finish, $newSize); //  ,        } elseif($search > $arr_new[$finish]) { //      //     if($finish < $newSize - 1) {// $finish    $start = $finish; $finish = $newSize - 1; return ceil(($start + $finish) / 2); } else {//$finish == $newSize - 1,      return $finish;//    ,    } } //     , //,    -   //   ,       If($finish - $start > 1) {//   ,    3  $middle = ceil(($start + $finish) / 2); //   $middle_Pos = 0; // ""     $offset = 0; //         //,    /,      while($middle - $offset > $start && $middle_Pos == 0){ if($arr_new[$middle - $offset] !== null) { $middle_Pos = $middle - $offset; } elseif($middle + $offset < $finish && $arr_new[$middle + $offset] !== null) { $middle_Pos = $middle + $offset; } ++$offset; } //    , ,     , //              If($middle_Pos) { if($arr_new[$middle_Pos] == $search) { return LibrarySort_PosNearby($middle_Pos, $newSize); } else { if($arr_new[$middle_Pos] > $search) { $finish = $middle_Pos; } else {//$arr_new[$middle_Pos] < $search $start = $middle_Pos; } return LibrarySort_BinarySearch($search, $start, $finish, $newSize); } } else {//$middle_Pos == 0 -   (   )     return $middle;//   - ,     } } else {//  ,       return floor(($start + $finish) / 2); } return false;//  ,       }

If the search element is equal to one of the ends of the segment

 //    ,        //$start - ,        //$newSize -     function LibrarySort_PosNearby($start, $newSize) { global $arr_new;//  //       for($left = $start - 1; $left >= 0; $left--) { if($arr_new[$left] === null) {//  return $left;//   } elseif($arr_new[$left] <> $arr_new[$start]) {//     break; //   ,      } } //     ,    for($right = $start + 1; $right <= $newSize - 1; $right++) { if($arr_new[$right] === null) {//  return $right; //   } elseif($arr_new[$right] <> $arr_new[$start]) {//     break; //   ,      } } return $start; //          .      ,     }

Local rebalancing of additional array

 //    //$insert - ,    //$pos -            //$newSize -     function LibrarySort_Rebalance_Local($insert, $pos, $newSize) { global $arr_new;//  // $pos  $insert,       while($pos - 1 >= 0 && $arr_new[$pos - 1] !== null && $arr_new[$pos - 1] > $insert) {--$pos;} while($pos + 1 <= $newSize - 1 && $arr_new[$pos + 1] !== null && $arr_new[$pos + 1] < $insert) {++$pos;} $middle = (integer) $newSize / 2;//  if($pos <= $middle) {//      if($arr_new[$pos] !== null && $arr_new[$pos] < $insert) ++$pos; //  $right = $pos; while($arr_new[++$right] !== null) {} for($i = $right; $i > $pos; $i--) { $arr_new[$i] = $arr_new[$i - 1]; } } else {//      if($arr_new[$pos] !== null && $insert < $arr_new[$pos]) --$pos; //  $left = $pos; while($arr_new[--$left] !== null) {} for($i = $left; $i < $pos; $i++) { $arr_new[$i] = $arr_new[$i + 1]; } } $arr_new[$pos] = $insert; }

Full rebalancing of an additional array

 //    //$count -        //$newSize -     function LibrarySort_Rebalance_Total($count, $newSize) { global $arr_new;//  global $library_Number;//     global $library_LeftPos;//        $library_Number = $count; //        $library_LeftPos = $newSize - 1;// ,     //         $i = $newSize - 1; while($i >= 0) { if($arr_new[$i] !== null) {//   $pos = LibrarySort_Pos($library_Number, $count, $newSize);//   newSize=$newSize"); if($i == $pos) {//      --$i;//      } elseif($i < $pos) {//    $arr_new[$pos] = $arr_new[$i]; $arr_new[$i] = null; --$i;//      } else {//$i > $pos -     //      LibrarySort_RemoveLeft($i, $pos, $count, $newSize); $i = ($i > $library_LeftPos) ? $library_LeftPos - 1 : --$i; } --$library_Number;//      } else {// ,   --$i;//      } } }

Moving an item to the left with full rebalancing

 //     . //    ,   //$i -     ,    //$pos -       //$count -        //$newSize -     function LibrarySort_RemoveLeft($i, $pos, $count, $newSize) { global $arr_new;//  global $library_Number;//     global $library_LeftPos;//        $left = false; $left_Pos = false;//      $j = $i;//      //     while($j > 0 && $left === false) {//            --$j; //     if($arr_new[$j] !== null) $left = $j;//    } if($left === false || $left < $pos) {//   (  )    //     } else { //$left >= $pos,     --$library_Number;//,       $left_Pos = LibrarySort_Pos($library_Number, $count, $newSize);//     //        LibrarySort_RemoveLeft($left, $left_Pos, $count, $newSize); //  ,     } //    ,   $arr_new[$pos] = $arr_new[$i]; $arr_new[$i] = null; //,         if($pos < $library_LeftPos) $library_LeftPos = $pos; }

I had to code from scratch myself, based on the general description of the method. I did not see the speed close to quick sorting; my library sort variant sorts 10-20 times slower than quick sort. But the reason, of course, is that my implementation is too raw, much has not been taken into account.

I would like to see the version from the creators of the algorithm. I will write to the authors today (and throw off a link to this link), they will suddenly respond. Although ... I remember I tried to contact Allen Bichik ( ABC-sorting ) and Jason Morrison ( J-sorting ), but the result was the same as if I had written to Sportloto.

UPD. Martin Farah-Colton replied that they never did the implementation:

I’m never accepted that algorithms.

The main thing - the idea :)

Algorithm Characteristics

Title	Library Sort, Librarian Sort, Library sort
Other name	Gapped Insertion sort Sorting Inserts
The authors	Michael A. Bender, Martin Farach-Colton, Miguel Mosteiro
Year	2004
Class	Sorting inserts
Comparisons	there is
Time complexity	the best	O ( n )
	average	O ( n log n )
	the worst	O ( n ² )
Extra memory difficulty	O ( n )

Links

Library sort

Library Sort algorithm visualization

Insertion sort is O (n log n)

The authors of the algorithm:

Michael A. Bender
Martin Farah-Colton
Miguel Mosteiro

Articles series:

Excel application AlgoLab.xlsm
Exchange sorting
Sorting inserts
- Librarian sort
- Solitaire sorting
- Sort "Tower of Hanoi"
- Sort by Young table
- Inversion Sort
- Comparing sorting inserts
Sort by selection
Merge sort

Sorting added to AlgoLab. So you can experiment with small data sets.

You can decide for yourself how many times the auxiliary array is larger than the main one. To select the coefficient ε, you can right-click on the cell with the “Library sorting” and select the item “Change note”. And in a note, carefully put an integer value for e from 2 to 5. If you enter something else instead of these numbers, the default value = 2 will be used.

You can also choose the type of rebalancing. If you set local = 1, then local rebalancing will be used. If local = 0, complete.

And do not forget to set the optimal scale for the process sheet before starting the visualization, otherwise the auxiliary array will not fit on the screen.

Source: https://habr.com/ru/post/416653/

All Articles