Sheepdog


Sheepdog is a scalable system that provides distributed block devices to virtual machines. Its development was begun in 2009 by developers from the Japanese company Nippon Telegraph and Telephone Corporation. Sheepdog is an open-source application under the GPLv2 license. The latest version, 0.9.3, was released in November 2015 and is to be succeeded by version 1.0, suitable for commercial use 1. (it has already been released - translator's note)


As a point of interest, the first version (0.1.0) was released in August 2010, and sheepdog support was immediately merged into the main QEMU development branch.
I ran my first tests of Sheepdog in November 2011 2, and the results were quite good for I/O operations. However, at that time Sheepdog still had problems recovering a failed node. That problem was probably resolved soon after, since development is quite lively, but at the time I chose another solution.

Features


The principle of Sheepdog is very well described in the published presentation, so I will limit myself to a brief overview.


It is scalable
Cluster capacity can be increased arbitrarily, both at the node level, adding capacity and data space during operation, and by increasing the number of nodes. The more nodes there are, the better the VDI I/O performance.


It is simple
Unlike other systems, such as Ceph, Sheepdog does not work directly with the file system but operates on blocks of a fixed size, and therefore does not require separate daemons to service metadata. All management is performed with a single tool, dog, which communicates directly with the sheep daemon.
(Ceph uses objects too - translator's note)


It copes with node failure
Each VDI consists of blocks (objects) that are replicated simultaneously to several nodes, so if one of them fails, the data remains available, and the objects from the failed node begin to replicate to the other nodes.


It supports snapshots at the block-device level
Snapshots in Sheepdog work the same way as in Btrfs: blocks already written to a VDI are preserved, and new data is written to new blocks.
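As a sketch, taking and cloning such a snapshot looks like this with the dog tool (invoked under its older name collie elsewhere in this article). The VDI name and tag are illustrative, and a running cluster is assumed, so this is a sketch rather than a runnable script:

```shell
# Freeze the current blocks of disk1 under the tag "snap1".
collie vdi snapshot -s snap1 disk1
# Create a writable clone that shares all unchanged blocks with snap1.
collie vdi clone -s snap1 disk1 disk1b
# Snapshots and clones show up alongside regular VDIs.
collie vdi list
```

Because the clone shares objects with the snapshot, it initially occupies almost no extra space.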


The following features may be problematic in certain circumstances:


Sheepdog has no SPOF
If a VDI is used as a block device through QEMU, a problem can arise if it is connected in several places simultaneously. This could be prevented by a SPOF 3, which Sheepdog does not have. However, in newer versions of Sheepdog a VDI can be locked so that no more than one connection is allowed.


The life cycle of data objects
VDI objects can only be deleted by deleting all clones and snapshots associated with them, exactly as in Btrfs. Therefore, removing unused snapshots may not be enough to free storage space.


Communication daemon


Sheepdog is ridiculously small compared to Ceph or GlusterFS. This is because it does not try to solve every problem itself; instead it makes maximum use of what already works.


In turn, it provides a block device that can be used just like a physical disk or a software RAID, etc.


It concerns itself only with distributing data objects between the nodes on which it runs.


However, it needs information provided by the communication daemon, a key component without which Sheepdog will not work.


The communication daemon does not exchange data between nodes; that is the job of the sheep daemons. Through it, the sheep daemons only find out which nodes are currently alive.


corosync


First of all, Sheepdog assumes that the nodes will communicate with each other through corosync. It supports up to 64 nodes; although in theory it could serve more, it is optimal for small clusters of up to 16 nodes.


Corosync is also used by Pacemaker, so as a rule there is no need to install anything else.


Installing corosync on debian


Corosync is in the distribution repositories, and its installation is simple:


$ apt-get install corosync libcorosync-common4 

Configure corosync


zookeeper


Sheepdog's developers recommend zookeeper for larger clusters. According to them, a test Sheepdog storage cluster with 1000 nodes 4 has been built and tested.


Installing zookeeper on Debian


 $ apt-get install zookeeper zookeeperd 

Start the daemon:


 $ /usr/share/zookeeper/bin/zkServer.sh start 

By default, zookeeper listens on port 2181.


Run sheepdog with zookeeper support:


 $ sheep -c zookeeper:IP1:PORT1,IP2:PORT2,IP3:PORT3 ...other...option... 

The bonus of zookeeper is that node configuration becomes clearer and simpler, but there is a problem: the Debian package does not include zookeeper support.


Therefore, to get Sheepdog with zookeeper support, you have to compile it from source, although I cannot rule out that the situation has changed by now.
(zookeeper support still requires compiling from source - translator's note)


Configuring the sheep daemon


A node becomes part of Sheepdog when the object manager, the sheep daemon, starts. It always runs in two instances:


  1. The first instance runs as a gateway (Gateway): it accepts I/O requests from clients (for example, from QEMU block-device drivers), calculates the target nodes, and forwards the requests to them for further processing. That is, it establishes many network connections.


  2. The other runs as a local object manager (Object Manager).

The sheep daemon's configuration parameters can be passed as command-line arguments at startup. If they are not given, default values are used, and you should be careful with these:


Port number
Unless specified otherwise, the sheep daemon runs on port 7000.


Path to the storage
Unless specified otherwise, the sheep daemon uses the directory /var/lib/sheepdog, and VDI objects are stored in its subdirectory obj.


Theoretically, nothing prevents several sheep instances from running on the same node; the main condition is that each uses its own port number and its own storage. The node's IP address is then no obstacle: each sheep instance running on a different port will automatically connect to the existing cluster!

An important detail is that the port number is part of the VDI container's configuration. You need to know this if you want to reconfigure the sheep daemon of an existing cluster to run on another port.

Therefore, if you start a sheep instance with a different port number but the same object-storage path, you can lose the data in the existing VDI containers.
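Running two independent instances on one node might look like this; the second port and store path are illustrative, and a live cluster is assumed, so this is a sketch rather than a tested recipe:

```shell
# Two sheep daemons on one node: each needs its own port and its own
# object store, as described above. Paths and ports are examples only.
sheep -p 7000 /var/lib/sheepdog     # first instance, the default port
sheep -p 7001 /var/lib/sheepdog2    # second instance, separate storage
```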

The sheep daemon as a gateway


On machines that have no storage space for VDI objects, the sheep daemon can be run in gateway-only mode with the -G flag.


In this case, local storage is not used at all when distributing VDI objects; the data goes directly to the other nodes.


The sheep daemon as an object manager


The second instance works as a local object manager: it accepts I/O requests from the gateway instance and performs read/write operations on the local object storage (Object storage).


Object Storage


By default, VDI objects in Sheepdog are stored in /var/lib/sheepdog/obj, which the sheep daemon also uses as part of its internal directory structure; this is the default storage path.


If you want VDI objects to be stored elsewhere, you can pass the mount point of another block device as a startup parameter:


 sheep ... /cesta_do_přípojného_bodu 

Several paths can even be passed at once. Newer versions of Sheepdog support so-called multi-device operation, which lets you increase storage capacity dynamically, when needed, without restarting Sheepdog. Growing the storage works like in Btrfs:


 sheep ... /cesta_do_A,/cesta_do_B,/cesta_do_C 

(the first listed directory is used only for metadata - translator's note)


Additional storage can also be added (or removed) on the fly via dog node md.


...


The multi-device functionality is especially useful when the storage file system does not support such pooling by design (unlike Btrfs or ZFS). In general, the choice of file system for object storage, with its properties, parameters, and settings, can significantly affect the I/O performance of the virtual machine.


Multi-device operation requires extended attributes from the file system, which is not a problem for modern file systems such as Btrfs 5 or ext4, but some older file systems, such as reiserfs or ext2 6, do not support them.

If you want to store objects on a file system that does not support extended attributes, you have to compile Sheepdog without multi-device support.

Storage type - plain versus tree


When formatting a cluster, you can specify, among other options, the storage type (backend storage): plain or tree. For the plain type, the directory structure looks like this:


 |- obj
 |  |- <epoch id>
 |  |  |- <object>
 |  |  |- <object>
 |  |  |- ...
 |  |- <epoch id>
 |  |  ...
 |- config
 |- epoch
 |  |- <epoch list>
 |  |- ...
 |- journal
 \- sheep.log

All VDI objects in the obj directory are placed into a subdirectory whose name is derived from the current epoch identifier; that is, the VDI objects of each epoch are stored separately. During a single epoch, however, a large number of VDI objects can accumulate in one directory, which eventually slows down file access. Therefore you can choose the second option, tree:


 |- obj
 |  |- aa
 |  |  |- <object>
 |  |  |- <object>
 |  |  |- ...
 |  |- ab
 |  |  ...
 |  |- meta
 |  |  |- <object>
 |  |  |- ...
 |  |- 0a
 |  |  ...
 |- config
 |- epoch
 |  |- <epoch list>
 |  |- ...
 \- sheep.log

With this storage type, the sheep daemon creates a set of 256 subdirectories named 00, ..., ff in the obj directory and then scatters the objects according to the last two characters of the VDI id, which is unique not only for each VDI container but also for each of its snapshots and clones.


VDI object names


When Sheepdog saves data to a VDI container, files appear in the obj datastore; each file has a name composed of several elements:


 ../obj/8f/00e8b18f00000005
           ^^

The first two characters indicate the object type: data objects start with 00..., metadata objects (which can be stored in a separate directory) with 80...


 ../obj/8f/00e8b18f00000005
             ^^^^^^

Then follows the VDI id. It is unique not only for each container but also for each of its snapshots and clones.


 ../obj/8f/00e8b18f00000005
        ^^       ^^

The last two digits of the VDI identifier determine, for the tree storage type, the subdirectory in which the object is placed.


 ../obj/8f/00e8b18f00000005
                   ^^^^^^^^

The hexadecimal VDI identifier is followed by the sequence number of the object within the VDI container.
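The decomposition above can be checked mechanically. This is plain string slicing (bash substring syntax) over the sample name from the article; nothing Sheepdog-specific is involved:

```shell
# Decompose a Sheepdog object file name into its parts.
name=00e8b18f00000005

type=${name:0:2}      # object type: 00 = data, 80 = metadata
vdi_id=${name:2:6}    # VDI id (e8b18f)
subdir=${name:6:2}    # last two chars of the VDI id -> tree subdirectory (8f)
seq=${name:8:8}       # sequence number of the object within the VDI

echo "$type $vdi_id $subdir $seq"
```

For the sample name this prints `00 e8b18f 8f 00000005`, matching the directory `../obj/8f/` in which the file sits.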


The epoch


The epoch subdirectory contains binary lists of the objects belonging to each epoch. The epoch number increases with every cluster change, that is, whenever a node is added or removed. Each such change triggers a recovery process during which the current state of the local objects on the nodes is checked, after which the epoch is incremented.


How to choose storage for VDI objects


The available storage capacity is calculated from the free space on the nodes. The space the sheep daemon selects always depends on how much space is available on the block device where the VDI objects are stored.


The size of a VDI container is only a virtual figure, unrelated to how much space its VDI objects actually occupy. It is important to understand how Sheepdog handles data in a cluster:


Sheepdog always tries to keep the data evenly distributed across all the machines of the current epoch.

This means that if one of the nodes fails, the epoch changes and Sheepdog immediately starts the recovery process: it recreates the missing VDI objects on the remaining nodes to compensate for the loss.


A similar situation arises when a new node is added: Sheepdog starts moving VDI objects evenly from the other nodes into the new node's storage, so that the percentage of used space on the nodes stays as balanced as possible. Use the following command to get a global overview of how much space is currently used on your nodes:


 nod1 ~ # collie node md info -A
 Id      Size    Used    Avail   Use%    Path
 Node 0:
  0      1.1 TB  391 GB  720 GB  35%     /local/sheepdog-data/obj
 Node 1:
  0      702 GB  394 GB  307 GB  56%     /local/sheepdog-data/obj
 Node 2:
  0      794 GB  430 GB  364 GB  54%     /local/sheepdog-data/obj
 Node 3:
  0      1.6 TB  376 GB  1.2 TB  22%     /local/sheepdog-data/obj
 Node 4:
  0      1.2 TB  401 GB  838 GB  32%     /local/sheepdog-data/obj
 Node 5:
  0      1.5 TB  370 GB  1.1 TB  24%     /local/sheepdog-data/obj
 Node 6:
  0      1.6 TB  388 GB  1.2 TB  23%     /local/sheepdog-data/obj
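As a sanity check on that output, the Use% column is simply Used divided by Size; for the Node 1 row (394 GB used of 702 GB total):

```shell
# Recompute Use% for Node 1 of the listing above: 394 GB of 702 GB.
awk 'BEGIN { printf "%d%%\n", 394 / 702 * 100 }'
```

This prints 56%, matching the column, and Avail is correspondingly Size minus Used.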

I / O performance


It is important to note that Sheepdog works differently from Ceph and has different priorities.


In Ceph, the weight of an OSD device plays a crucial role in the placement of data objects, together with the performance of the block device, its connection to the host, and its response speed. (actually not - translator's note)


Whether Sheepdog does anything similar, I do not know; maybe. For Sheepdog, the data comes first and I/O performance is secondary. Of course, with more powerful nodes its I/O performance may improve, but it always depends on the specific setup. (however, tests show better Sheepdog performance compared to Ceph - translator's note)


I added a new node to Sheepdog with data stored on a spinning 2 TB SATA II disk. The maximum write speed of this disk is about 80 MB/s. In practice it varies greatly, because SATA drives cannot read and write at the same time.

Initially, the average write speed to a VDI on this disk was about 20-30 MB/s, since, besides the VDI data, 392 GB of container data was being replicated to it as part of the recovery process, which lasted 6.5 hours. Afterwards, the write speed ranged between 40 and 55 MB/s.
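A back-of-envelope check of those numbers: 392 GB replicated in 6.5 hours (23400 seconds) works out to roughly 17 MB/s of recovery traffic on top of the regular VDI writes, which is roughly consistent with a disk that peaks at about 80 MB/s and cannot read and write simultaneously:

```shell
# Average recovery write speed in MB/s: 392 GB over 6.5 h = 23400 s.
echo $(( 392 * 1024 / 23400 ))
```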

Obviously, in this case, the write speed was limited by the I / O performance of the local block device.

For Sheepdog the following rule applies: "The more VDI objects sit on nodes with fast connections, the better the I/O performance."


Because VDI objects are moved in the background, a sudden death of a node slows down the replication of the data objects that occupy the most space, and this shows up as reduced I/O performance of VDI container operations.


Occupied space


When placing data objects, the amount of free space is crucial for the sheep daemon. The mechanism seems simple: the sheep daemon through which the VDI container's data flows periodically determines the ratio of free to occupied space on the nodes and sorts them by it. Data is then distributed to the nodes with the lowest utilization.


If the write path is predominantly fast, writes to the VDI container are fast too, since the sooner the VDI container's I/O operations complete, the sooner the sheep daemon can move on to the next operation.


Importantly, with Sheepdog no node ever overflows completely: if a node's utilization becomes significantly worse, the sheep daemon starts moving its data objects elsewhere.


Sheepdog behaves like Btrfs in that it uses only the space actually occupied. You can therefore create a virtual VDI container of 1 TB that really takes up only as much space as the data stored in it. From this point of view, it is desirable to use in VDI containers virtual-disk formats and file systems that can discard unused blocks.


Cluster startup


While all nodes can be stopped simultaneously, they cannot all be started at once! The nodes should be brought up gradually, starting with the node listed first in the node list.
(this is an extremely strange statement - translator's note)

VDI


This is the common Sheepdog abbreviation for a virtual disk, not the specific disk format of the same name 7. In essence, it is a virtual box with fixed-size slots into which Sheepdog puts the data that the client sends.


VDI creation


Before we create or import the first VDI, the cluster must be formatted. Formatting the cluster sets the parameters that will then be used by default when creating every subsequent VDI.


An example showing the creation of a new VDI named disk1 with a size of 1 GB:


 root@nod1:~# collie vdi create disk1 1G
 root@nod1:~# collie vdi list
   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag  Block Size Shift
   disk1        0  1.0 GB  0.0 MB  0.0 MB 2015-12-04 14:07   e8b18f      2        22

Id
VDI identifier


Size
The size of the VDI, which is not necessarily preallocated.
If the VDI holds an incremental format (qcow2 and the like) created with qemu-img convert, this value will not match the size of the virtual disk but will grow over time.


Used
Information about how much space the VDI's data objects occupy.
A VDI that does not require allocation of data objects at creation takes up no space at all, since no data objects have yet been created for it.


Shared
The amount of data objects shared with other VDIs.


Creation time
VDI creation time


Block size
The size of the VDI object. Note that it is given not in MB but as a power of two, in bytes. Older versions used only fixed-size 4 MB objects; current versions allow larger objects. The optimal object size for an ordinary virtual machine appears to be 64 MB (shift 26). The default value of 22 (4 MB) is also the minimum; a smaller value cannot be set. The smaller the object size, the more files Sheepdog has to handle when working with the VDI, and file handling is not cheap in terms of I/O: a large number of files, especially with slow SATA controllers, can sharply degrade read and write speeds. The maximum usable value is 31 (2 GB objects), which can pay off when a large amount of static data, such as backups, is written to the VDI sequentially.


VDI id
VDI identifier.
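Since Block Size Shift is a power-of-two exponent, the sizes mentioned above can be verified directly with shell arithmetic:

```shell
# Object size = 2^shift bytes; the three shifts discussed in the article.
echo $(( (1 << 22) / 1024 / 1024 ))   # shift 22 -> 4 MB (default, minimum)
echo $(( (1 << 26) / 1024 / 1024 ))   # shift 26 -> 64 MB (suggested optimum)
echo $(( (1 << 31) / 1024 / 1024 ))   # shift 31 -> 2048 MB, i.e. 2 GB (maximum)
```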


What does VDI contain?


The content of a VDI is data. It is a distributed block device, so Sheepdog does not care whether that data is useful or garbage. From this point of view, a VDI resembles an LVM logical volume: a preallocated VDI corresponds to a classic LV with a reserved range of extents, while an ordinary VDI resembles a thin LV created in a pool (see LVM thin provisioning), with the difference that the data extents (objects) are not stored on local block devices but are scattered between the nodes.


In this analogy, the VDI format plays the role of a file system. Some formats occupy the reserved extents (objects) sequentially; others map them like inodes and write data directly into them. A poor combination of the node's storage file system, the VDI format, and the guest's internal file system can significantly degrade I/O performance.

How to get information about VDI


To learn more about the VDI format, you can use qemu-img info:


 root@nod1:~# qemu-img info sheepdog:localhost:8000:disk1
 image: sheepdog:localhost:8000:test2
 file format: qcow2
 virtual size: 12G (12884901888 bytes)
 disk size: 4.0G
 cluster_size: 65536
 Format specific information:
     compat: 1.1
     lazy refcounts: false
     refcount bits: 16
     corrupt: false

The output shows that disk1 has a nominal size of 12G and currently occupies only 4G. Since it is in qcow2 format, it was obviously created as an incremental image.


 root@nod1:~# collie vdi list
   Name        Id    Size    Used  Shared    Creation time   VDI id  Copies  Tag  Block Size Shift
   disk2        0  4.0 GB  4.0 GB  0.0 MB 2015-12-04 16:07   825dc1      2        31
 root@nod1:~# qemu-img info sheepdog:localhost:8000:disk2
 image: sheepdog:localhost:8000:disk2
 file format: raw
 virtual size: 4.0G (4294967296 bytes)
 disk size: 4.0G
 root@nod1:~# find /datastore/obj/ | grep 825dc1
 /datastore/obj/meta/80825dc100000000
 /datastore/obj/c1/00825dc100000000
 /datastore/obj/c1/00825dc100000001

In this case, disk2 was created as a preallocated raw VDI of 4 GB with a block size of 2 GB, so it actually consists of just two 2 GB objects.


Export VDI to file


VDI content can be exported from Sheepdog in several ways. Probably the fastest is dog vdi read. The command is a bit confusingly named, but it simply means "read the contents of the VDI and send it to STDOUT", which can be redirected to a file:


 root@nod1:~# collie vdi read disk1 > /backups/soubor.raw 

If the VDI has 10G but only 2G is used, this still creates a file with the full capacity of 10 GB.
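The contrast with a sparse file, where nominal and actual size differ, can be shown without a Sheepdog cluster; the file name here is illustrative:

```shell
# A file produced by 'collie vdi read' occupies its full nominal size.
# A sparse file, by contrast, has a nominal size with no allocated blocks:
truncate -s 10M sparse.img    # nominal 10 MB, nothing written
stat -c %s sparse.img         # logical size in bytes: 10485760
du -k sparse.img              # allocated kilobytes: (close to) 0
rm sparse.img
```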


The VDI contents are exported unchanged, so if the VDI contains, for example, a virtual disk in compressed qcow2 format, the exported file can be used directly:


 .. -drive file=file:/disk1_exportovany_z_vdi,..,format=qcow2 .. 

Another way to get the contents of a VDI into a file is qemu-img convert. It is not as fast, but it can convert the VDI to another format, with all the options of the target virtual-disk format:


 root@nod1:~# qemu-img convert -f qcow2 -O raw -o preallocation=full,nocow=on sheepdog:localhost:8000:disk1 /disk1_exportovany_z_vdi 

Incremental backup


Create incremental backup


A delta between the first and second snapshots:


 root@nod1:~# collie vdi backup test -F snap1 -s snap2 /backups/soubor_diff 

Recover VDI from incremental backup


 root@nod1:~# collie vdi restore test -s snap1 /tmp/backup
 restoring /tmp/backup... done

When importing an incremental VDI backup, you must of course have the snapshot from which the backup was made.


Verification by reading the original contents of the test image:


  root@nod1:~# collie vdi read test 0 512 -s 3 

Import VDI from file


Importing an existing virtual disk stored as a file on the local FS works the same way as export, with the difference that dog vdi write is used ("read data from STDIN and write it into the VDI"):


 root@nod1:~# collie vdi write disk1 < /backups/soubor.raw 


  • Content can only be imported into an existing VDI.
  • An imported VDI always takes up more space than the original file, since the data blocks through which the VDI is restored also cover areas that are allocated but unused.

If the VDI does not exist yet and we do not know how much space the virtual disk needs, we can use qemu-img convert:


 root@nod1:~# qemu-img convert -f raw -O qcow2 -o redundancy=2:1 ./disk_ukladany_do_vdi sheepdog:localhost:8000:disk 

Although virtual-disk formats such as qcow2, qed, and others can be used inside a VDI, for I/O efficiency it is better to preallocate the data blocks.

http://www.sheepdog-project.org/doc/vdi_read_and_write.html


VDI check


The consistency of a VDI and its data objects can be checked as follows:


 root@nod1:~# collie vdi check disk1 

The check verifies the replicas of the VDI's objects across the nodes and repairs any that are inconsistent.

Note: a node can also be removed from the cluster manually with 'dog node kill', for example when its Ethernet interface fails and the sheep daemon has to be stopped.


Caching and VDI I/O


Caching significantly affects the I/O performance of a VDI. By default, Sheepdog honors SYNC requests, so written data does not linger in volatile caches; the mechanisms below trade some of that safety for speed.


VFS cache


I/O to a VDI can use the VFS cache when the sheep daemon is started with the -n flag. In that mode SYNC requests are ignored and written data may remain only in the VFS cache, so if the node fails, for example on power loss, any data that has not reached the disk is lost!


  sheep -n ... 

Sheepdog can also bypass the page cache entirely: direct I/O for the backend store is enabled with the -D flag.


Object cache


Another option is the object cache: the sheep daemon caches VDI objects on a local device, ideally a fast SSD, and serves VDI I/O from that cache; SYNC requests are satisfied once the data is in the cache.


The cache is enabled with the -w option:


 sheep -w size=20000,directio,dir=/dir ... 

size
The size of the object cache in megabytes.


directio
Tells the sheep daemon to bypass the page cache when accessing the object cache device; useful for SSDs.


dir
The directory in which the object cache is stored.

The state and contents of the object cache can be inspected and managed with the dog tool.


Before a VDI is opened elsewhere, flush its object cache with dog vdi cache flush, otherwise you may get stale VDI contents!

Journal


The journal speeds up writes to VDI objects. Instead of going straight through the VFS to the object, the data is first written to a journal file (/store_dir/journal/[epoch]/[vdi_object_id]) and only then to the object itself.


This improves I/O because journal writes are sequential rather than scattered back and forth across the object files.


It therefore makes sense to keep the Sheepdog journal on a fast device, such as an SSD; otherwise it competes for I/O with the VDI objects themselves.


Journaling is enabled when starting the sheep daemon:


 $ sheep -j size=256M ... 

Without an explicit directory the journal is created in the object store. If possible, it is better to place it on a separate block device:


 $ sheep -j dir=/dir,size=256M ... 

The dir= parameter sets where the journal is stored, for example a software RAID of SSDs.


Note: at startup the sheep daemon replays the contents of an existing journal. The skip option skips this replay:


 $ sheep -j dir=/dir,size=256M,skip ... 



  1. Presentation, June 2015. http://events.linuxfoundation.jp/sites/events/files/slides/COJ2015_Sheepdog_20150604.pdf
  2. http://www.abclinuxu.cz/blog/kenyho_stesky/2011/11/sheepdog-hrajeme-si-v-hampejzu
  3. SPOF (Single Point Of Failure): a single point through which all access passes. For a VDI, such a point can be created, for example, by exporting it over iSCSI with tgtd.
  4. one
  5. Btrfs — a COW (copy-on-write) file system, which matters for files that are constantly being rewritten, such as VDI objects: every rewrite fragments them further. Two mount options are therefore worth considering for the object store:


    autodefrag — automatic defragmentation of files rewritten in small random chunks.


    nocow — disables copy-on-write for newly created files (a similar recommendation exists, for example, for GlusterFS).


    Even so, Btrfs has properties that make it an interesting FS for a Sheepdog object store.

  6. Ext2 is an old file system. Unlike a COW file system such as Btrfs, ext2 has a fixed number of inodes created in advance, as do ext3 and ext4. When the inodes run out, Sheepdog can no longer create objects even though free space remains; in that case dog vdi check reports errors, and the ext2-backed storage can be removed with dog node md, after which the affected VDI objects are replicated elsewhere.


  7. Not to be confused with the vdi virtual-disk format; see QEMU_Block_Disk.

Links


https://github.com/collie/sheepdog/wiki — the project wiki, with documentation
http://www.osrg.net/sheepdog/ — the original Sheepdog page at Nippon Telegraph and Telephone Corporation
http://www.sheepdog-project.org/doc/index.html — documentation for Sheepdog 0.8.0, maintained by Valerio Pachera
http://www.admin-magazine.com/Archive/2014/23/Distributed-storage-with-Sheepdog — Udo Seidel's article on Sheepdog in issue 23 of Admin magazine, 2014

Source: https://habr.com/ru/post/412739/

