How we added entrances to the map and reduced the size of our databases by 10%



At the end of last month, 2GIS started displaying building entrances. We have been showing entrances to organizations since as far back as 2013, and a building entrance would seem to be just another entrance. So why only now? Surely all the internal products and processes were ready, and all that remained was a bit of polish and a fix to the UI display.

Besides the standard answer ("there were other priorities"), there is a less standard one: "it's not that simple." This article is about what the difficulties were and how we solved them.

First, an entrance is not just an entrance: a single stairwell entrance can have several entry doors, usually on different sides of the building. The definition "a (multi-level) block of apartments with a common entrance" is also wrong: sometimes one door leads to the first floor of a stairwell and another to the second and higher floors.

Stairwell entrances with two entry doors

Second, users want to search for entrances.

Entrances in the search results

This is a much-requested feature, since entrances are far from always located where you would expect.

A house with far-from-obvious entrance numbering

Besides numbers, entrances can also have names (usually a single letter).

Letters in the names of entrances

Or, as in Kaliningrad, each entrance can even have its own address.

Independent numbering of entrances in Kaliningrad

Third, we decided that if we were building a search for entrances, why not support searching by apartment number right away. So we did, and collected apartment ranges, though so far without mapping them to floors. As with entrances, apartment names are not always purely numeric (variants with the letter "a" are the most common), and the ranges are not always contiguous. In the old houses of Saint Petersburg the numbering can be quite odd: apartments 1 and 3 may share a stairwell while apartment 2 sits in the opposite part of the building.

A typical stairwell entrance in old Saint Petersburg

A lyrical digression on data validation
Don't try to be too clever when validating the collected data. In the northern capital there are cases where one apartment can be reached from several entrances, and in some European localities an address includes, besides the house, the entrance name and the apartment number within that entrance. Saint Petersburg also boasts a building with two entrances numbered eight.

Fourth, we want to show entrances on the map at all times, not only when the user is looking at a particular building or entrance.

And, finally, there are a lot of entrances: by current estimates, about a hundred thousand in Moscow alone. Our first back-of-the-envelope estimate produced downright astronomical numbers; we had overestimated by a factor of six.

Findings:


Approaching the solution


The first few points require changes in both internal and external products, but that is routine work. The last one is a different matter entirely. Since we don't like to upset our users, we set ourselves a constraint: the data must be available offline, and adding it must not increase the size of the databases.

How? On the one hand, we need to reduce the current size of the data; on the other, we can design a storage format for the entrance information that keeps the amount of additional data minimal.

Reducing the amount of data


You can reduce it in two ways:


The evolution of the storage format before the era of entrances

The very first and easiest option


The traditional way to store complex data is to normalize it, put it into a tabular database and create auxiliary indexes. 2GIS once did exactly that, except that, to reduce the size, the contents of each table were sorted so that cell values coincided with the values of the previous row as often as possible. Columns were stored independently, and runs of identical values were collapsed.

A highly simplified example of optimizing the storage of a table with buildings:

Example of optimized table storage
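The idea can be sketched as follows (the data and function names are hypothetical; the real format was more involved): each column is stored independently, and runs of identical values in consecutive rows collapse into a value plus a repeat count.

```python
def collapse_column(values):
    """Run-length-collapse consecutive duplicates in one column."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def expand_column(runs):
    """Restore the original column from its runs."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

# Sorting the rows first makes equal cell values adjacent, so runs get longer.
street = ["Lenina", "Lenina", "Lenina", "Mira"]
packed = collapse_column(street)      # [["Lenina", 3], ["Mira", 1]]
```

The win depends entirely on the sort order: the more often a cell repeats the value above it, the fewer runs are stored.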

Normalization reduces redundancy, but it has a downside: to build a UI element for an object, you have to make several queries touching a large number of tables. At the time, however, these tables served not only display but practically every task, including text search and various kinds of route search. For all those tasks, the normalized data genuinely reduced the amount of information processed.

The map-drawing data already had its own binary format and was stored in a separate block. Gradually we introduced separate binary formats to speed up route search and text search. The database retained only the information used to display object cards and to link some objects to others.

Cards and links

Simplifying the work


How can working with the data be simplified? By fetching, in one go and by object identifier, all the data needed to draw a card. Since the online version already receives everything from the API in a single JSON response, we can also unify the formats used by all our products.

Both the JSONs for drawing cards and the links need to be fetched only for a limited number of objects, a few dozen at most at a time. But there are scenarios where individual attributes must be fetched at once for large samples, up to a hundred thousand objects. It is more logical to split such attributes into a separate entity with its own storage format: properties. Properties can be typed and stored more efficiently in binary form.

A naive approach to storing the JSONs is a key-value database. Take Moscow, the largest of our projects. Even in the simplest layout (the JSON itself per object, plus an 8-byte identifier and a separator character), the directory would take 1.9 GiB. That is almost six times larger than the previous version described, and still without links and properties.

We deliberately inflated the objects, stuffing them with everything that might be needed to display their cards, and on top of that we still have to store field names, quotes, commas and brackets.

Compressing the data


Normalization is not the only way to eliminate redundancy. The second popular method is compression. An LZMA archive of our incredibly fat file takes only 55 MiB.

Of course, we cannot use that format directly: to access an arbitrary object, you would have to unpack all the data stored before it, and LZMA archives do not unpack quickly.

How do we get fast random access on the one hand and shrink the required space with compression on the other? The answer is paging.

Binary Storage Format

Now that the data is broken into pages, we can compress each page separately. To reach a given spot we have to unpack and scan one page, which is much faster than unpacking and scanning the entire archive.
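A minimal sketch of such paging, assuming length-prefixed records and zlib (the codec the team used at first); the function names, the 2-byte length prefix and the in-memory page list are illustrative, not the real format:

```python
import struct
import zlib

PAGE_SIZE = 1024  # 1 KiB pages, as in the first version of the format

def build(records):
    """Pack length-prefixed records into pages; compress each page with zlib."""
    pages, index, cur, slot = [], [], b"", 0
    for rec in records:
        item = struct.pack("<H", len(rec)) + rec
        if cur and len(cur) + len(item) > PAGE_SIZE:
            pages.append(zlib.compress(cur))
            cur, slot = b"", 0
        index.append((len(pages), slot))   # (page number, slot within page)
        cur += item
        slot += 1
    if cur:
        pages.append(zlib.compress(cur))
    return pages, index

def read(pages, index, i):
    """Random access: decompress one page, then scan to the i-th slot."""
    page_no, slot = index[i]
    data, pos = zlib.decompress(pages[page_no]), 0
    for _ in range(slot + 1):
        (size,) = struct.unpack_from("<H", data, pos)
        rec, pos = data[pos + 2 : pos + 2 + size], pos + 2 + size
    return rec

records = [("object %d" % i).encode() for i in range(500)]
pages, index = build(records)
```

Reading any record costs one page decompression plus a short scan, instead of unpacking everything stored before it.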

All the data is stored in this format: the JSONs, the links and the properties. It remains to associate this data with object identifiers. For each identifier, we need to store one or more pairs <storage number, data number within the storage>.

Simplified format linking object identifiers to the data in the storages
All sequence numbers, offsets and sizes are stored in a compressed format similar to UTF-8, in which small values take up only one byte. This lets us shrink the links simply by sorting the contents of the binary storages in order of decreasing frequency of occurrence.
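As an illustration, here is one common variable-length scheme of this family (LEB128-style: 7 payload bits per byte, high bit set while more bytes follow). The article only says the real encoding is "UTF-8-like", so the exact bit layout here is an assumption:

```python
def encode_varint(n):
    """Encode a non-negative int; values below 128 take a single byte."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        if n:
            out.append(b | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7
```

This is exactly why sorting by frequency pays off: the most frequent items get the smallest sequence numbers, which then fit into one byte.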

Sorting to reduce the size

Some properties have values that repeat very frequently, so sorting yields a large size win. On the other hand, because of this sorting, a given piece of data cannot have the same sequence number in every storage.

Also, not every object has data in every storage, so storing the storage numbers is cheaper than keeping references to empty objects.

Speeding up data retrieval


This format has one problem: to find the entry that holds the indexes for a given identifier, we have to run a binary search over the identifier data. That means either loading 10.9 MiB into memory or performing 21 extra reads. Neither option suits us, so we reduce the number of reads by storing the data as a tree.

The tree format for fast data lookup by identifier

We group the data by 32 objects, and in the intermediate levels we store the first identifiers of up to 128 nested nodes. Adding three levels of tree reduced the number of extra reads to four (in practice, with caching of previously read tree nodes, to 1-3). The price: a little under 400 KiB.

At this stage, we store links and properties quite efficiently, but the JSONs still take a lot of space. Understandably so: the larger the page, the more objects fit into it and the better the compression algorithm can find redundant information; but, on the other hand, the more work is needed to read a single object. The JSONs are fairly large objects, so very few of them fit on one page. So we need to help the algorithm do its job better.

Splitting the JSONs into parts


First, many objects share the same data schema and differ only in attribute values. Second, many attribute values coincide even across objects with different schemas. So we move the schemas and attribute values into separate storages and store each JSON as a reference to its schema plus references to its attribute values.

Splitting a JSON into schema and data

The number of attribute names in our data schema is limited, so we can put them into a separate file and store a number instead of each name. We also make a few more tweaks, exploiting the fact that the JSONs here are always objects.

The actual storage format for JSON object schemas
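A simplified sketch of this factoring: schemas and attribute values are deduplicated through dictionaries, and each object becomes a schema reference plus value references. (The real format additionally numbers attribute names and packs everything in binary; that part is omitted here.)

```python
import json

def factor(objects):
    """Split each object into (schema id, value ids), deduplicating both."""
    schemas, values, rows = {}, {}, []
    for obj in objects:
        keys = tuple(sorted(obj))                     # the object's schema
        sid = schemas.setdefault(keys, len(schemas))
        vids = []
        for k in keys:
            v = json.dumps(obj[k], sort_keys=True)    # canonical value text
            vids.append(values.setdefault(v, len(values)))
        rows.append((sid, tuple(vids)))
    return rows, list(schemas), list(values)

def restore(row, schemas, values):
    """Rebuild the original object from its references."""
    sid, vids = row
    return {k: json.loads(values[v]) for k, v in zip(schemas[sid], vids)}

objects = [{"name": "House 1", "floors": 5},
           {"name": "House 2", "floors": 5}]
rows, schemas, values = factor(objects)
```

Two houses with the same schema share one schema entry, and the repeated value 5 is stored once.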

Yes, we have essentially pre-compressed the data ourselves, leaving less room for the algorithm to work, and we have slowed down access somewhat; but the algorithm still helps by compressing similar values stored next to each other.

We set the page size to 1 KiB, and it turns out that while optimizing the format so that, on the one hand, reads were not too slow and, on the other, the data packed as tightly as possible, we had already beaten the "optimized tables" both in size and in ease of use. But all that work was not in vain: the biggest win should come from compressing the attribute values, properties and schemas. We turn on zlib, check that reading data from the database takes little time compared to the rest of the work, and, satisfied with the result, switch to other tasks.

Getting rid of the unnecessary


We start the reduction by looking for data we can drop entirely. It turns out that over the lifetime of the format we have learned to do without the largest link type. That's 10% of the size!

Some code was still tied to this data, but the required changes were fairly trivial. Already-released applications, however, are not so easy to change. We try to maintain not only backward but also forward compatibility for as long as possible. The first is familiar to everyone; about the second, many happily never think at all. We have to support it because of the rather long tail of users who, for various reasons, have disabled automatic updates and are in no hurry to move to the new version of the application.

Distribution of users by version

At the very top is the distribution of users across recent versions of the Android application; below it, iOS.

It is easy to see that iOS users update much more willingly, but even among them there are plenty on old versions.

We also still release new data for the frozen Windows Phone version. WP8 users may make up only a small fraction of our audience, but in absolute numbers that is almost 200,000 people a month.

We have long had a mechanism that lets us release several data formats and automatically determines which versions should download which. A fine capability, but we still had to learn how to produce those formats. The first big task was a service that receives all the data and filters out what is new for the old database format and what is old for the new one.

A nice bonus of this work is a reduction in the size of monthly updates: the link type we removed used to change a lot from month to month, inflating the diffs.

Looking at the remaining data, we could squeeze out roughly another 10% in total, but the price would be incomparably higher. For now, we decided not to touch it.

Optimizing the current storage format


As mentioned above, we had made the pages 1 KiB in size and had not compressed all the binary storages.

The first thing we do is compress the link pages as well, and verify that the difference in retrieval speed is within the margin of error.

The next item is choosing the optimal page size. The larger the page, the more efficiently the data compresses, but the slower the lookups. And while time and memory costs grow linearly with page size, the gain becomes less and less noticeable. After testing, we decided to increase the size to 8 KiB.

Impact of page size on large samples
While larger pages expectedly slow down small samples, large samples of hundreds of elements actually become faster. This means that, ideally, a different page size should be chosen for each storage depending on how its data is used.

Together, these two changes alone shrink the database by 18%!

Changing the compression format


zlib is a classic, of course, but zstd offers higher decompression speed along with a better compression ratio. Moreover, zstd lets us first train a single dictionary over all the available data, save it once, and then compress every page with it. This noticeably slows down the generation of the database file, but squeezes out an additional 8%.

Adding the entrances


The easy way


The easiest way is to make each entrance a separate object, index them separately (by rough estimate, +10% to the index size) and store their geometry separately in the map-drawing data.

This approach would inflate the database by 3% in total. At the previous stages we had won back more than enough to relax and handle the entrances in the usual way, but... when the work started we were not yet sure of that, so work on an alternative format proceeded in parallel.

A detailed estimate, for those interested
Let's estimate the growth of the database package per object:

8 bytes - identifier,
6 bytes - numbers of the storages used (data + five properties; properties are kept apart from the main data and stored in binary form, since they are often needed for a large number of objects at once),
3 bytes - index in the data storage,
2 bytes - offset of the object's data,
5 bytes - data values: 2 bytes for the schema reference and 3 bytes on average for the apartment information (we assume a lot of repetition, with the data itself stored once),
6 bytes - offset of the coordinate data (the other properties have many duplicate values and will be collapsed),
8 bytes - coordinate values.

That's 38 bytes per object in total. For Moscow, this comes to 4.5 MiB for the more than 124 thousand entrances collected.

Next, we also need to store the link between a building and its entrances: 2.5 bytes per apartment building plus 8 bytes per entrance. Another 1 MiB.

Now let's count how much space the map will need.

First, the geometry. For polylines, the first point always takes 8 bytes, and each subsequent point is stored as a delta of the required precision. Here, decimeter precision is enough for us. Most entrances consist of two points close to each other, so we can assume the geometry itself will take 10 bytes. A total of 1.2 MiB.
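Under those assumptions the packing can be sketched like this (local metric coordinates, the first point as two 4-byte integers in decimeters, and one signed byte per delta coordinate; the real format and its precision handling are not spelled out in the article):

```python
import struct

def pack_polyline(points_m):
    """First point: 8 bytes (two 4-byte ints, decimeters); then 1-byte deltas.
    Assumes consecutive points are within 12.7 m of each other."""
    dm = [(round(x * 10), round(y * 10)) for x, y in points_m]
    out = struct.pack("<ii", *dm[0])
    for (x0, y0), (x1, y1) in zip(dm, dm[1:]):
        out += struct.pack("<bb", x1 - x0, y1 - y0)
    return out

def unpack_polyline(data):
    """Rebuild the polyline in meters from the packed form."""
    x, y = struct.unpack_from("<ii", data)
    points, pos = [(x / 10, y / 10)], 8
    while pos < len(data):
        dx, dy = struct.unpack_from("<bb", data, pos)
        x, y, pos = x + dx, y + dy, pos + 2
        points.append((x / 10, y / 10))
    return points

entrance = [(10.0, 20.0), (12.5, 19.3)]   # a typical two-point entrance
packed = pack_polyline(entrance)
```

A two-point entrance then takes exactly 8 + 2 = 10 bytes, matching the estimate above.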

We also need to associate an entrance identifier with its geometry object. The index is the same kind of binary storage as everything else, except the payload holds the link type (1 byte), the layer number (1 byte) and the object number (3 bytes), plus 8 bytes for the identifier and the fast-lookup tree. Another 1.5 MiB in total.

As said at the very beginning of the article, we want the entrances shown on the map at all times, and the easiest way to do that is to ship another layer with points, but... we can also reuse the geometry layer by creating a new map symbol that draws the desired icon at the last point of the polyline.

10% of the search index is 5.9 MiB. Summing everything up, we get 14.2 MiB, which is just over 3%.

The current approach


Instead of indexing the entrances and inflating the search index, we search for buildings, but additionally classify the words of the query, extracting from it the address (relevant for entrance search in Kaliningrad), the entrance and/or the apartment. As a result, we end up with a building identifier plus typed text fields by which we must find the right entrance.

Note
This is a controversial point. On the one hand, it reduces the database size; on the other, it restricts the query format: entrances will not be found for many queries that an honest search would have handled correctly.

Further, instead of shipping separate objects, we embed all the entrance information directly into the building objects.
And, finally, we move part of the geometry into the directory.

The last point deserves more detail.

First, note that most entrances consist of two points and have the same length. Such entrances can be stored as a point plus a direction, saving 1 byte per entrance.

Second, we assume that most buildings in modern cities follow standard designs, so the offsets of their entrance points relative to the building's center point will coincide up to a rotation.

We already have the buildings' center points. What if we add a new integer property to each building, its "direction", and give each entrance two more integers: the displacements, in decimeters, along and perpendicular to the direction line? Given how we store the JSONs, the entrance geometry will then take a little over two bytes on average.
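The encoding can be sketched as follows, assuming a local metric coordinate frame around the building's center (real data uses geographic coordinates, and the property layout here is simplified):

```python
import math

def encode_entrance(center, direction_deg, point):
    """Project the entrance point onto the building's direction axes
    and round the displacements to integer decimeters."""
    a = math.radians(direction_deg)
    ux, uy = math.cos(a), math.sin(a)            # unit vector along the direction
    dx, dy = point[0] - center[0], point[1] - center[1]
    along = dx * ux + dy * uy                    # meters along the direction
    across = -dx * uy + dy * ux                  # meters perpendicular to it
    return round(along * 10), round(across * 10)

def decode_entrance(center, direction_deg, along_dm, across_dm):
    """Rebuild the entrance point from the two integer displacements."""
    a = math.radians(direction_deg)
    ux, uy = math.cos(a), math.sin(a)
    along, across = along_dm / 10, across_dm / 10
    return (center[0] + along * ux - across * uy,
            center[1] + along * uy + across * ux)
```

Buildings of the same standard design can then share identical displacement values regardless of how each one is rotated, which is exactly what makes these integers compress well.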

An additional bonus: since the geometry now lives in the directory, we no longer need to store the mapping between an entrance identifier and an object number in the map layer.

The downside is more complicated code. Previously we simply told the map "show these objects"; now, to display entrances, we extract this data from the JSON and add dynamic objects to the map. That part is not too scary: by the time the entrance arrows need to be drawn, we already have the JSONs of the corresponding objects, so no extra trip to the database is needed. Displaying the entrance points is somewhat worse: we now have to determine in the background which buildings are visible, pull those buildings' data from the database, parse the JSONs and, if there are entrances, create dynamic objects for them.

Note
An ambiguous decision. Even with all the optimizations and reasonably fast code, we still perform extra work that could have been avoided at the cost of growing the database by 0.2% (972 KiB for Moscow).

Since we had already written the extra code to display the entrance arrows, nothing stopped us from also storing two-point entrances to organizations in the directory. The win here is small, but free.

How much did all this give us? The 3% turned into 0.5%. It could have been even less, but we kept the identifiers in the data (they compress rather poorly) to simplify the processing of user reports about inaccurate entrances.

Result


We added a large number of entrances while reducing the map data file by 0.5% and the directory data file by 26.6%. The latter is still the largest of our files, but it is only one of four "heavy" files, so the overall change is more modest: 10.1%.

How the size of the Moscow database changed over time

Not all entrances have been collected yet. The database size will grow a little over time (by current estimates, the same Moscow will gain another 0.4%), but in any case the goal of shipping entrances without increasing the size has been more than achieved.

What's next?


As for product improvements, we plan to support entrances and apartments in search suggestions, as well as when choosing the start and end points of a route. We are also thinking about displaying important entrances to buildings (primarily in shopping malls) the same way we display residential entrances.

On the technical side, we have several ideas to check that may further shrink the reference-data file, and we should take a careful look at the other files too.

And, of course, we will keep fixing the mistakes that remain.

One more thought: use JSON intelligently


From all of the above, one might conclude that there is no need to think hard about binary formats: just use JSON. That is not entirely true. It works for us because we only ever need at most a few dozen objects at a time. Plus we use rapidjson, which is fast and consumes little memory.

It is also worth limiting the transfer of data in JSON format from C++ to a UI written in another language.

First, you have to serialize it into a string.

Second, the parsers available on the UI side will re-parse that string, and will do it much more slowly.

It is better to extract all the data from the JSON yourself and hand it to the UI through simple interfaces tailored to displaying specific elements.

Source: https://habr.com/ru/post/415263/

