OpenStack: the whole truth about the "royal" release

Bulgakov's Woland said that "a brick never falls on anyone's head for no reason at all". Perhaps so, but when I was asked two and a half years ago whether I wanted to get to know OpenStack, that was exactly such a well-disguised brick (and at first not even a brick but a granite slab). The year 2016 became my "point of no return": it marked the start of my rapid immersion in the ideas of the open-source world, strongly shaped my way of thinking, and turned my life into a holiday. A "holiday" that is always with me.



2016 to the present day


OpenStack was not love at first sight. The first release I deployed for testing was Kilo, a name that in Russian rhymes neatly with the word for "dreary". It lasted exactly three weeks before I replaced it, full of hope, with Liberty, which, behind the attractive wrapper of its release notes, failed to meet high expectations. Mitaka brought fewer new features than fixes and patches, and it is still (!) used successfully in production environments. The Newton release was, in effect, a turning point in the history of OpenStack: it brought so many architectural, logical and, as a result, configuration changes that for many private clouds the upgrade path from the previous version was closed for good. But according to analysts, it was with the Ocata release in 2017 that the "golden age" of OpenStack began; it includes Pike, Queens and, I sincerely hope, Rocky, which is already on the starting blocks.


This article discusses the latest stable OpenStack release, Queens, and some of its innovations and shortcomings, from the perspective of someone who automated its deployment on Ubuntu 16.04 LTS (and is still working on it, because there is no limit to perfection).

There is little material about Queens online (official documentation and talks from the recent OpenStack Summit in Vancouver aside), and the feedback from cloud providers and system integrators can be counted on the fingers of one hand. That is not surprising: its predecessor, Pike, will be officially supported for another eight months, and with an upgrade procedure tested by hundreds of users and well documented, it looks like the safer choice for deployment. Our team, which followed the development of Queens closely from the very beginning, went further than most and confidently let the newcomer into production. So how deep was the rabbit hole?

(From here on, OpenStack terminology is used freely; the reader is assumed to be at least familiar with the typical architecture.)

Useful features


  1. A nice bonus for me personally is the extended functionality of the nova-manage utility: you can now delete hosts from one cell (Cells v2) and move them to another! In Pike we had to write our own database queries for this; now it is available out of the box.
  2. Creation of the user role (_member_ by default) in Keystone has been cut out of the bootstrap stage. The reason is the final transition to API v3, which uses different authentication and authorization mechanisms and also improves the security of the infrastructure (after all, the user role used to have a fixed id...). This does not mean the user role is no longer needed: sooner or later you will still have to create it.
    Forecast: Identity API v2 is deprecated as of this release; support for it will most likely be dropped completely within a year (pessimistically, in the Stein release).
  3. The Neutron l3 and dhcp agents gained automatic failover: networks and routers are automatically rescheduled from agents that have gone down to active ones. DVR HA functionality is enabled by default ( [DEFAULT]/router_distributed , [DEFAULT]/l3_ha ). DVR SNAT lives in a separate namespace. When a router is created, a SNAT namespace is created by default and takes one more IP address from the internal subnet.

    If the SNAT service fails, this setup makes it possible to switch quickly from the current router to the backup router of an l3 agent running on another node.
  4. All the functionality of the Neutron vpn-agent has moved into the l3-agent; the only change needed is in its config:

     [agent]/extensions = vpnaas 

    At last, the vpn agent no longer gets in the way of automated deployment logic when a particular role is enabled (previously the vpn and l3 agents were mutually exclusive: installing one package removed the other).
  5. The Glance registry (a proxy for database access in API v1) has been declared deprecated, so now only the glance-api service can and should be configured. It did not go without surprises, but more on them later.
  6. Mamma mia! Heat-dashboard now ships as a separate package (a plug-in panel) for Horizon. The plugin has changed a great deal, but the most unexpected feature is... drag & drop of objects while generating a template. It is easier to see it once than to read about it.


  7. Cinder has added support for volume multi-attach, the ability to attach one volume to several virtual machines. So far it is implemented only for a limited set of drivers (LVM, NetApp / SolidFire, Dell EMC ScaleIO and Oracle ZFSSA), but it is still a very big step forward for the whole project (the mechanism took more than two years to implement).
    Prediction: multi-attach support for Ceph RBD has not been announced yet; most likely it should be expected no sooner than in a year (optimistically, in the Stein release).

Annoying bugs


(The bugs listed below were found while deploying OpenStack Queens on Ubuntu 16.04.4 LTS (Xenial Xerus); they may not appear on other Linux distributions.)

  1. In the Linux community there is such a thing as a "maintainer": a specialist who looks after a particular software component (in this case, a binary package). Well, the Ubuntu maintainers have once again (the bug also existed in the Pike release) shipped problematic packages for the time-series database service, Gnocchi. Installing them via apt in a Python 2.7 environment "breaks" some related services running under libapache2-mod-wsgi: Keystone, Cinder, Nova Placement, Horizon. A sad situation when you want a single method of delivering packages to the system. Still, at least someone tried to build packages for Gnocchi; for Octavia there is still only pip and git.
  2. I would not swear to it, but perhaps the same maintainers had a hand in building the nova-compute packages. At least that would explain why, when they are installed (kernel builds 116, 119 and 124; package versions 17.0.1 to 17.0.4), the installer fails with an error:

     Setting up nova-compute-libvirt (2:17.0.1-0ubuntu1~cloud0) ...
     adduser: The user 'nova' does not exist.
     dpkg: error processing package nova-compute-libvirt (--configure):
      subprocess installed post-installation script returned error exit status 1
     dpkg: dependency problems prevent configuration of nova-compute-kvm:
      nova-compute-kvm depends on nova-compute-libvirt (= 2:17.0.1-0ubuntu1~cloud0); however:
       Package nova-compute-libvirt is not configured yet.

    The gist: during installation a script runs that should create the nova system group and the nova system user. The script fails, the installation aborts, and a workaround goes into the automation: create the group and the user before installing the packages. Incidentally, this bug is still open.
    UPD: while this article was being written, the bug was finally confirmed and given medium priority. Until the underlying problem (circular dependencies between the Nova packages) is resolved, the lead maintainer suggested another workaround: install the nova-common package first, then nova-compute. Package version 17.0.5, free of this flaw, can be expected in the near future.
  3. While testing the Neutron FWaaS service, it turned out that when you remove a firewall from a distributed router, what happens is... absolutely nothing. The firewall goes from the "online" status into an eternal "pending delete", and the problem can be solved only with database queries or by deleting the entire router (this bug also existed in the Pike release). The main problem right now is that a fix (one that really works!) was proposed several months ago but still has not made it into the latest stable release.
  4. At the stress-testing stage it turned out that "neutron-openvswitch-agent EATING CPU AAAAAAA": while idle, the openvswitch agent kept one processor core 100% busy:

     $ ps aux | grep 16233
     neutron 16233 99.5 0.0 311112 143156 ? Rs 19:47 67:11 /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_agent.ini --log-file=/var/log/neutron/neutron-openvswitch-agent.log
     $ time strace -c -p 16233
     % time     seconds  usecs/call     calls    errors syscall
     ------ ----------- ----------- --------- --------- ----------------
     100.00    0.000362           0     95725           epoll_wait
       0.00    0.000000           0        15           read
       0.00    0.000000           0         6           open
       0.00    0.000000           0         6           close
       0.00    0.000000           0         6           stat
       0.00    0.000000           0        15           fstat
       0.00    0.000000           0        20           sendto
       0.00    0.000000           0        79        41 recvfrom
       0.00    0.000000           0         2           setsockopt
       0.00    0.000000           0        94           epoll_ctl
     ------ ----------- ----------- --------- --------- ----------------
     100.00    0.000362                 95968        41 total

     real 0m10.300s
     user 0m0.324s
     sys 0m2.576s

    The service developers found a solution quickly; the fix ships in Neutron package version 12.0.1-0ubuntu1.1.
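The strace summary for the openvswitch agent is easier to judge as a rate. Below is a minimal Python sketch that uses only the numbers from that output (95725 epoll_wait calls, 95968 calls in total, roughly 10.3 seconds of wall-clock time); the function name is illustrative, and the wall-clock figure includes the time strace needed to attach and detach, so the result is an approximation.

```python
def syscall_profile(calls, total_calls, wall_seconds):
    """Return (calls per second, share of all syscalls) for one syscall."""
    rate = calls / wall_seconds
    share = calls / total_calls
    return rate, share


# Figures copied from the strace -c output in the bug description.
rate, share = syscall_profile(calls=95725, total_calls=95968, wall_seconds=10.3)
print(f"epoll_wait: {rate:.0f} calls/s, {share:.1%} of all syscalls")
```

That works out to roughly nine thousand epoll_wait calls per second from a supposedly idle process. It suggests the call kept returning immediately instead of blocking, which would be consistent with one core being pinned at 100%.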

A royal screw-up


Spoiler
Glance, from which no one expected any tricks, has been nominated for The Most Epicfail Service award, established by me personally, and leads by a wide margin over its competitors, the poorly documented and hard-to-debug services Designate, Octavia and Watcher. The final results will be announced no earlier than the Rocky release, but the favorite's position will be hard to shake.

In OpenStack Queens the developers introduced a new image import method, web-download, which lets the end user "upload" an image by URL. Once upon a time, when Image API v1 was not yet deprecated and API v2 had not replaced it, this functionality already existed. What could possibly go wrong?..


In the glance-api config, both image import methods, direct upload and import by URL, are enabled by default ( [DEFAULT]/enabled_import_methods ). With the simple goal of testing the option, I disable one of the methods, restart the service, and off we go!

 CRITICAL glance [-] Unhandled error: ValueError: tuple.index(x): x not in tuple
 ERROR glance Traceback (most recent call last):
 ERROR glance   File "/usr/bin/glance-api", line 10, in <module>
 ERROR glance     sys.exit(main())
 ERROR glance   File "/usr/lib/python2.7/dist-packages/glance/cmd/api.py", line 97, in main
 ERROR glance     fail(e)
 ERROR glance   File "/usr/lib/python2.7/dist-packages/glance/cmd/api.py", line 71, in fail
 ERROR glance     return_code = KNOWN_EXCEPTIONS.index(type(e)) + 1
 ERROR glance ValueError: tuple.index(x): x not in tuple
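The traceback shows that the error handler itself crashed: `fail(e)` tried to map the exception type to an exit code, and the type was not in the tuple. A minimal Python sketch of that failure mode follows. `KNOWN_EXCEPTIONS` and the `index(...) + 1` expression come from the traceback; the tuple's contents, the `ConfigValueError` class and the fallback exit code are assumptions made purely for illustration, not Glance's real code or its eventual fix.

```python
# Hypothetical stand-in: the real tuple in glance/cmd/api.py lists other types.
KNOWN_EXCEPTIONS = (RuntimeError, OSError)


class ConfigValueError(Exception):
    """Illustrative exception type that is NOT listed in KNOWN_EXCEPTIONS."""


def exit_code_naive(exc):
    # Same expression as in the traceback: raises ValueError for unknown types.
    return KNOWN_EXCEPTIONS.index(type(exc)) + 1


def exit_code_safe(exc, default=1):
    # Defensive variant: fall back to a generic exit code instead of crashing.
    try:
        return KNOWN_EXCEPTIONS.index(type(exc)) + 1
    except ValueError:
        return default


print(exit_code_safe(OSError("known type")))
print(exit_code_safe(ConfigValueError("unknown type")))
try:
    exit_code_naive(ConfigValueError("unknown type"))
except ValueError as err:
    print("naive handler failed:", err)
```

The practical consequence is that the original configuration error is hidden behind the handler's own ValueError, which is why the first log says nothing about the option that was actually wrong.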

After fiddling with a patch, I restart the service again:

 glance-api[26538]: ERROR: Value for option enabled_import_methods is not valid: Value should start with "["
 systemd[1]: glance-api.service: Main process exited, code=exited, status=4/NOPERMISSION
 systemd[1]: glance-api.service: Unit entered failed state.
 systemd[1]: glance-api.service: Failed with result 'exit-code'.

Seriously?! The option in the config ended up in this awkward form:

 [DEFAULT]/enabled_import_methods = [glance-direct] 
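The error message suggests that the option is parsed as a bracket-delimited list. Here is a rough pure-Python imitation of such a parser, written only to show why the bare value is rejected while the bracketed one passes; `parse_bounded_list` is a hypothetical helper, not the configuration library's actual implementation.

```python
def parse_bounded_list(raw):
    """Parse '[a,b]' into ['a', 'b']; reject values without brackets."""
    value = raw.strip()
    if not value.startswith("["):
        raise ValueError('Value should start with "["')
    if not value.endswith("]"):
        raise ValueError('Value should end with "]"')
    inner = value[1:-1]
    # Split on commas and drop empty items and surrounding whitespace.
    return [item.strip() for item in inner.split(",") if item.strip()]


print(parse_bounded_list("[glance-direct]"))
print(parse_bounded_list("[glance-direct,web-download]"))
try:
    parse_bounded_list("glance-direct")
except ValueError as err:
    print("rejected:", err)
```

Under this reading, a single method still has to be written as a one-element list, which is exactly the form shown in the config line.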

Naturally, I carefully filed a bug. While the problem was being worked on, the developers even published a notice about this behavior (under known issues) and sent a commit to the project's master branch. Two months after the bug was opened, the fix was deferred to a future release; happy Queens owners will have to wait for Glance package version 16.0.2.
All is well that ends well.

"Do not lose your head, shake your muscles"


In the course of the deployment automation work (which also included configuration and functional testing), Queens proved to be solid, not overloaded with new features and free of crutches sticking out everywhere, and it sets a high bar for its successor.

However, even though Queens is the seventeenth (!) OpenStack release, the general trend has persisted for many years: "the unforeseen is not unforeseeable unforeseen intuition". This line from an Atlantida Project track, in my opinion, best describes the whole range of feelings that come from working with the product. On the one hand, deploying the standard services (Keystone, Glance, Swift, Cinder, Nova, Neutron, Horizon) with basic settings has long been debugged and will not trouble even a novice engineer. On the other hand, when it comes to rolling out relatively young services, the aforementioned "holiday" begins: make sense of the meager documentation however you can, and get ready to spend days reading the code and sitting on a certain popular bug tracker.

Still, all this suffering is just a side effect of Stockholm syndrome. In reality, working with open source exercises the brain better than any logic game, develops useful skills exponentially, keeps you in good shape and certainly does not let you get bored. OpenStack was decidedly not love at first sight, and it has driven me to white heat and to the point of fainting, but it definitely deserves a bit of love now. And if you are ready to feel your way through the blackness of consoles, driven by the need to realize yourself (and to make the open-source world a little better), then this may be your path.

Source: https://habr.com/ru/post/415199/

