CMG impact 2016 conference overview

This article is about a conference that took place almost two years ago. Why write about such old events? First, in my opinion, few people know about this conference. Second, my impressions of it are still so strong, even two years later, that I simply cannot help sharing them. Third, I really wanted to write about it but wasn't sure how, since I had never written reviews before; this is my third attempt to describe this conference. And, of course, I want to thank Distillery, the company where I worked at the time, and my supervisor Sergei Meshcheryakov for the opportunity to attend.



The CMG impact international conference is held annually by CMG, the American association of experts in improving the performance of IT systems; the annual CMG conference has been held since 1980.

The conference is dedicated to performance engineering and capacity planning. The organizers, speakers, and participants are highly qualified specialists in IT and capacity planning, many of whom started out on mainframes, then moved into distributed systems, and now work at leading companies in the industry. The qualifications of many of them are astonishing. Companies working in performance monitoring and testing were represented, including Dynatrace, NewRelic, Soasta, JMeter, BMC, Moviri, BezNext, and many others.

The conference featured 120 talks by speakers from more than 15 countries, mainly the USA, Canada, and Europe. It ran for five days, November 7-11, 2016. The schedule was intense: plenary talks began at 8:00 in the morning, and sectional talks continued in 8 rooms until 19:00 in the evening, with a short lunch break around noon. Each working day ended with a general reception, during which you could talk personally with the speakers and discuss the talks presented. Choosing which talk to attend was rather difficult, since interesting talks often ran simultaneously in parallel sessions in different halls.

In this article I will briefly describe the reports I liked most.

In 1974 the Computer Measurement Group, the American association of IT performance specialists, established the annual A.A. Michelson Award for outstanding professional contribution to the evaluation of system performance. This award is traditionally presented at the CMG impact conference.

Opening of the conference and presentation of the A.A. Michelson Award


The conference opened with the presentation of the A.A. Michelson Award to Andre Bondi, an independent consultant and author of the book Foundations of Software and System Performance Engineering: Process, Performance Modeling, Requirements, Testing, Scalability, and Practice.
The first plenary session began with Andre's talk. Its key idea was that tuning alone will never improve performance severalfold. According to Andre, the greatest gains come from eliminating architectural errors in the system. I think many of you know this from personal experience; many times in my career I came to the same conclusion. For example, moving from a powerful commercial system to a more modest open-source one can still give a performance boost if, during the migration, the team removes the architectural errors of the previous version.

Achieving Scalability and Performance > 1M Concurrent Users at a Time
Lukas Sliwka, Grindr


The next interesting presentation came from Grindr CTO Lukas Sliwka. Grindr is an online dating app. Lukas talked at once about the company's transformation, about the development culture (they switched to Scrum), and about the technical transformation of the system. Grindr has over 2.5 million users, of which 1 million are active. Before the transformation, most users were in the USA and Canada, and the number of US users was declining.



Servers and data warehouses were also located mainly in the United States. Moreover, the application was already running on the most powerful server the hosting provider offered, so the company faced an acute optimization and scaling problem. It was solved radically: the team rewrote the entire project from Ruby on Rails to Scala, which took six months. Scala was chosen for two reasons: first, it allowed writing clean code with good performance; second, Java developers were easier to hire, unlike Node.js developers, who were expensive and mostly worked at Facebook.

Grindr's interesting experience shows that successful growth sometimes requires a new architecture. The team then analyzed response duration per country and found that the longer the response, the fewer users in that country. They reduced response times significantly by using a CDN and by distributing the data store across cloud servers with data centers in Europe, Asia, and Latin America. Once the performance issues were resolved, the number of users grew worldwide. This example demonstrates the direct relationship between short response times and the number of users.

The second part of the talk was devoted to team management. Grindr works on Scrum. Teams are organized by product: each team is fully responsible for some service or product and for the business value the user receives. It is a metrics-driven company, and each team has its own metrics and KPIs to achieve. Middle management is completely absent; the company has a flat structure, and each team decides for itself what to do and how in order to meet its goals. Lukas's talks are on YouTube, and an interview with him is available online.

Is Your Capacity Available?
Igor Trubin, Capital One Bank


Interestingly, the author began his talk by saying that Capital One Bank had decided to become an open information company. It is well known that banks usually do not talk about the technologies and processes they use, considering this information confidential. However, in the modern world, to compete for leading specialists and their talents, who as a rule go to Google and Facebook, banks need to be more open.

The talk was devoted to assessing system availability together with its performance margin, since no one needs capacity that cannot actually be used.

Igor described a concrete way to measure capacity together with its availability, using metrics such as mean time between failures, mean time to recover, downtime per year, and others. He gave a formula for calculating the availability of your entire infrastructure as users see it. See his report for the details.
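
As a back-of-the-envelope illustration of how such metrics combine, here is a minimal sketch assuming the classic MTBF/MTTR definitions; Igor's actual infrastructure-wide formula may well differ:

```python
# Classic availability estimate from failure statistics.
# Assumes the textbook MTBF/MTTR definitions; Igor's exact
# infrastructure-wide formula may differ.
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_per_year_hours(avail: float) -> float:
    return HOURS_PER_YEAR * (1 - avail)

# One failure a month (MTBF ~720 h), half an hour to recover:
a = availability(mtbf_hours=720.0, mttr_hours=0.5)
print(f"availability: {a:.5f}")                               # ~0.99931
print(f"downtime/year: {downtime_per_year_hours(a):.1f} h")   # ~6.1 h
```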

Digital Experience Capacity Planning
Amy Spellmann and Richard Gimark


A capacity planning talk for the IoT era, covering business metrics, IT metrics, and facilities metrics.

The talk was about the fact that many infrastructures already run close to their capacity, and asked what will happen when IoT is everywhere. For me, the most interesting thing about the talk was its scale: there are professionals who plan capacity for a whole site, for an organization, for an entire industry. How often do we rise to such a large-scale vision?

Enterprise performance assurance based on BigData analytics
Boris Zibitsker


Boris talked about BigData and approaches to working with big data. He emphasized predictive analysis, since the business must make decisions based on up-to-the-moment data. Time is money, and nothing has changed here over the years, except that time has become even more expensive; information is valuable only while it is relevant.
Working with big data clusters makes it possible to deliver the necessary analysis on time. Boris then described the stages of working with data. There are two main approaches: analyze data in a stream in real time, or collect it in a data lake and analyze it afterwards.

The talk described using big data processing algorithms and machine learning to perform root cause analysis (RCA) of failures, as well as to predict the future behavior of the system from such analyses.

An important point is to validate the reliability of the results by comparing actual behavior against the predicted behavior.

The Top 10 Performance Defects Costing Online Organizations Millions
Craig Hyde, Rigor


Craig described the top 10 most expensive defects that affect site performance. He cited figures that, on average, one second of user waiting can cost a company a potential $102 million. Interesting, isn't it? The company analyzed about 500 websites and compiled the top 10 issues that lead to poor performance. Craig's recommendations: use caching and a CDN, serve images at the correct resolution, and use compression. Most importantly, test the result: as it turned out, many people think they use caching, but about 70% of content may not be cached at all due to incorrect settings. Craig recommends establishing a performance baseline and sticking to it, setting a performance target to reach, and testing and optimizing bottlenecks. Tools: WebPageTest, Google Analytics, PageSpeed, Rigor's free report. The funniest cases for me were sites serving huge high-resolution images displayed at sizes where the detail is invisible, so reducing the resolution would cause no loss in perceived quality.
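
The caching point is easy to sanity-check yourself. Here is a minimal sketch using only the Python standard library (the URLs are placeholders; this only inspects response headers, not actual CDN behavior):

```python
# Verify that static resources actually come back with caching and
# compression headers. URLs below are placeholders.
from urllib.request import Request, urlopen

RESOURCES = [
    "https://example.com/static/app.js",
    "https://example.com/static/logo.png",
]

for url in RESOURCES:
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        cache = resp.headers.get("Cache-Control", "<missing>")
        enc = resp.headers.get("Content-Encoding", "<none>")
        print(f"{url}\n  Cache-Control: {cache}\n  Content-Encoding: {enc}")
```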

I did not manage to find Craig's slides, but there is a talk on the same topic by one of the company's employees.

Risky business
Jeff Buzen


A few words about Jeff: he teaches at Harvard and is the author of three books, the first published in 1971 and the latest in 2015, Rethinking Randomness: A New Foundation for Stochastic Modeling. There is a Wikipedia article about him and an interview with him online.

The talk was about performance modeling and forecasting. Jeff described the risks that arise when modeling a system: model risk, workload parameter risk, forecasting risk, application risk, usability risk. He walked through all the risks that appear when you model a system and try to predict how many resources will be needed and what availability is required. He gave examples of how not to write an SLA: "90% of requests complete in 0.5 seconds" is risky; "the queue length is below 33 for 90% of the time" is less risky. His book is about this. Forecasts built on classical mathematics do not always give correct, applicable results, even when the formulas themselves are correct, because predictions depend heavily on the model's assumptions. He considers forecasting based on Analysis of Alternatives (AoA) preferable.
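
To make "predictions depend on assumptions" concrete, here is a small illustration of my own, not from Jeff's slides. In the classic M/M/1 model the response time is exponentially distributed with rate (mu - lambda), so the 90th percentile follows directly, and a small error in the assumed arrival rate shifts it dramatically near saturation:

```python
# My illustration, not Jeff's: M/M/1 response-time percentiles.
# T ~ Exp(mu - lambda), so the p-th percentile is -ln(1-p)/(mu - lambda).
import math

def p90_response_time(arrival_rate: float, service_rate: float) -> float:
    assert arrival_rate < service_rate, "queue must be stable"
    return -math.log(1 - 0.90) / (service_rate - arrival_rate)

mu = 10.0                     # requests/sec the server can handle
for lam in (8.0, 9.0, 9.5):   # three plausible guesses at the load
    print(f"lambda={lam}: p90 = {p90_response_time(lam, mu):.2f} s")
# lambda=8.0: p90 = 1.15 s
# lambda=9.0: p90 = 2.30 s
# lambda=9.5: p90 = 4.61 s
```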

I sat there thinking how far all this is from my everyday experience. "We only have a 2x performance margin." "What? How do you know?" "Well, there was a peak and queues built up in the system; let's get a bigger server." "And what do the queues look like?" "Well, before we started using them, the system just fell over." Planning capacity by starting from a model of the system is some other planet.

How to get value out of BigData?
Renato Bonomini, Stefano Doni, Moviri


Next was a workshop from Moviri, a company founded by a lecturer at the Polytechnic University of Milan, with offices in Milan and Boston, specializing in performance and throughput analysis. It was about how important it is not only to collect a lot of data but also to extract value from it. The stack: YARN, HDFS, Pig, Hive, Spark, ZooKeeper, Cassandra, Cloudera, Kubernetes. The workshop showed how much more convenient it has become to track performance changes in systems running in containers.

Moviri invited me to their office, and I took them up on it, since I was going to Italy about a month later. It was very nice to meet Stefano Doni and Luca Forni, see the office, and talk about everything related to performance work, from analysis techniques to the problems consultants face when dealing with a customer's team.

Moviri also has a blog.

Performance or Capacity? Different approaches for different tasks
Alex Gilgur, Facebook


The talk will be useful to those engaged in performance prediction and capacity planning. Alex gives examples of which approaches should be used in each case. Although capacity and performance are related concepts, different forecasting methods should be used, with an emphasis on the ultimate goal of the work: do we want to understand how much equipment we need and how much headroom to provide, or do we want to predict system performance?
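
As a rough illustration of the distinction (my own sketch, not from Alex's talk; all numbers are made up): a capacity question extrapolates a resource trend toward a threshold, while a performance question maps utilization to expected latency.

```python
# My own sketch of the distinction, not from Alex's talk.
# Capacity question: when does CPU utilization cross 80%?
# Performance question: what latency do we expect at a given load?
import statistics

months = list(range(12))
cpu_util = [0.40 + 0.03 * m for m in months]   # made-up monthly averages

# Capacity: fit a linear trend and extrapolate to the 80% threshold.
fit = statistics.linear_regression(months, cpu_util)
months_to_80 = (0.80 - fit.intercept) / fit.slope
print(f"~{months_to_80:.1f} months until 80% CPU")

# Performance: a simple open-queue model, mean latency vs. utilization.
service_time = 0.05                            # seconds per request (assumed)
for util in (0.5, 0.7, 0.9):
    print(f"util={util:.0%}: mean latency ~ {service_time / (1 - util):.2f} s")
```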

Alex's article can be read online, and the slides are available as well.

How to get started.
Justin Martin, Cerner


This talk is about why you need performance monitoring at all. Really, though! Many people live without monitoring and everything seems fine. Indeed, many do not need it, right up until an article about their wonderful site gets published somewhere popular and people flock in to see what is there, or until something else brings many, many users.



In the talk, Justin explains quite simply how to get started: capacity management in 90 days.

Steps

  1. Determine the peak hours for your system (a small log-analysis sketch follows this list).
  2. Find the limits of your system's performance (the point at which it stops keeping up with requests and performance begins to degrade).
  3. Reduce performance losses.
  4. Balance demand: perhaps some load can be shifted from peak hours to less busy times.
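
For step 1, here is a minimal sketch of the kind of analysis involved (my illustration, not Justin's; it assumes an access log with an ISO timestamp starting each line):

```python
# Find peak hours from an access log. Assumes lines like
# "2016-11-07T14:03:55 GET /api/users" (one request per line).
from collections import Counter

def peak_hours(log_path: str, top: int = 3) -> list[tuple[int, int]]:
    by_hour: Counter[int] = Counter()
    with open(log_path) as f:
        for line in f:
            ts = line.split()[0]            # "2016-11-07T14:03:55"
            by_hour[int(ts[11:13])] += 1    # bucket by hour of day
    return by_hour.most_common(top)

for hour, n in peak_hours("access.log"):
    print(f"{hour:02d}:00 - {n} requests")
```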



Justin is on LinkedIn. You can find the slides via the article about the conference that I link at the end.

Dynamic Load Balancing Infrastructure
Yuri Ardulov, RingCentral


Yuri described RingCentral's transition from a monolith with legacy code to microservices. The problems with the monolith: it was difficult to change the configuration, continuous delivery was impossible, the required availability targets were hard to reach, A/B testing was impossible, and new functionality could not be enabled for only some users. The system was redesigned using containers and microservices; it became possible to resize the system online and to change a service's version without changing configuration. The microservice architecture separates an application routing layer, a balancing layer, and a service layer (which stores no state). After the changes, the Ops teams could do continuous delivery on the fly, availability rose from four nines to 99.998% (roughly from 53 minutes of downtime a year to 10), and the time to scale the system out and deploy new servers dropped to 4 hours.
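
As an illustration of that layering (my sketch, not RingCentral's actual design; the addresses and the 10% split are invented):

```python
# Illustrative only: a routing layer picks a service version (A/B),
# a balancing layer picks an instance, and instances hold no state.
import itertools

SERVICES = {
    "v1": ["10.0.0.1:8080", "10.0.0.2:8080"],
    "v2": ["10.0.1.1:8080"],   # new version, rolled out to a subset of users
}
_round_robin = {v: itertools.cycle(insts) for v, insts in SERVICES.items()}

def route(user_id: int) -> str:
    """Routing layer: send 10% of users to v2; balance round-robin."""
    version = "v2" if user_id % 10 == 0 else "v1"
    return next(_round_robin[version])

print(route(7), route(10), route(11))   # v1 and v2 instances interleaved
```

Because the service layer keeps no state, any instance can serve any request, which is what makes online resizing and per-user version switches cheap.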

Avoiding costs, delays and failed releases with Lifecycle Virtualization
Todd DeCapua, CSC digital brand services


Todd's talk focused on how to reduce problems with releases; the key idea is that when testing an application, it is important to consider its entire life cycle.

4 key components:

  1. Users: user virtualization that simulates the behavior of your real users.
  2. Services: virtualization of services, so that the whole flow can be checked end to end (a minimal stub example follows this list).
  3. Network: emulation of real network conditions.
  4. Data: virtualization of production data, so that the calls hitting the application closely resemble what will happen in reality.
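
For component 2, here is the flavor of a service-virtualization stub (my sketch, not from Todd's talk; the endpoint and payload are invented, and real virtualization tools do far more):

```python
# A minimal stub that stands in for a payment dependency during
# end-to-end tests. Endpoint and payload are invented examples.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PaymentStub(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/api/charge":
            body = json.dumps({"status": "approved", "txn_id": "TEST-0001"})
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8081), PaymentStub).serve_forever()
```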

In my experience, a large percentage of release failures come from the fact that very few teams fulfill all four conditions, and there is always something in production that nobody expected to see, and that something breaks the release. It is very important to build your test use cases from production data and real user behavior.

Todd's presentation included a worked example of exactly this.

Todd is the author of a book about performance optimization, performance testing, and interpreting the results.

The book describes how important a culture of performance testing is in an organization, and how it can be introduced if no one is doing it now. It gives several examples from practice, each with the task the team faced, the options for solving it, the approach the team chose, and the consequences in the real system. In my opinion, these stories are often also about how difficult it is to predict end-user behavior, and how important it is to run a small focus group of real users and see how well your prediction matches the group's behavior.

Implementing Metrics-Driven DevOps: Why and How!
Andreas Grabner, Dynatrace


Andreas Grabner's talk was about DevOps practices in various companies and why a short time-to-market matters. Andreas used a very apt metaphor about photography. Many people remember film photos: you take a picture, get it developed and printed, and only then see that it failed, but you have no chance to retake it, since the moment has passed.

Now everything is different: you take a photo, upload it to Instagram, immediately get feedback, and can touch up whatever your followers are asking for. You are still in the moment and get a reaction in real time.

Now back to software. The way it used to be: new features were planned, implemented, tested, and shipped in, say, one release per year. And only then did you learn that the functionality on which so much money and effort was spent is needed by two users out of millions. Has this happened to you?

And now? Agile and the ability to release at least daily, as Facebook does, mean you can immediately assess how much a feature is in demand by users: whether it is worth embellishing further, or better to throw it out right away and not waste time and energy.

I highly recommend this talk to any business trying to explain to its team why it needs Scrum and Agile! Many dyed-in-the-wool enterprise companies with releases three times a year are now trying to become faster and leaner.



Strictly speaking, the talk is not about that but about how to build DevOps practices so that releases happen often and go well. Yes, that is possible. It is important to use metrics, monitor the state of the application and its load, and build a proper deployment process around them. Andreas is a performance advocate and has quite a lot of useful talks on this topic with unusual, memorable slides.
Here is another talk by Andreas with slightly different slides.

Performance Testing in New Contexts
Eric Proegler, Soasta


At the beginning of the talk, Eric looked back at how system architecture has changed since 2000, and at how the move to the cloud changed, or should have changed, the way systems are designed with cloud scalability in mind. Eric gave an example from practice: a TV company launched voting in a mobile app, and right during the show, when users were supposed to vote, the system could not bear the load and became unavailable. With startup culture it is difficult to predict how many users there will be: you might plan for 20 thousand, and the application quickly reaches 50 thousand. There are many tools for performance testing in the cloud: BlazeMeter (JMeter), Selenium, Gatling, Grinder. They are free, but visualizing the results in them is not very convenient, so it is recommended to use either a separate visualization tool (Tableau) or your own database in order to analyze what happened. When testing web applications, it is important that the simulated users are geo-distributed. It is advisable to run a small performance test on every build and compare the results against a baseline.
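
A minimal sketch of that last recommendation (mine, not Eric's; the file format and threshold are assumptions): fail the build if the 90th-percentile latency regresses against a stored baseline.

```python
# Compare a build's perf-test results against a stored baseline.
# Assumes each file holds one latency sample in ms per line.
import statistics, sys

TOLERANCE = 1.10   # fail if p90 regresses by more than 10%

def p90(path: str) -> float:
    samples = [float(x) for x in open(path)]
    return statistics.quantiles(samples, n=10)[-1]   # 90th percentile

baseline, current = p90("baseline.txt"), p90("current.txt")
print(f"baseline p90={baseline:.1f} ms, current p90={current:.1f} ms")
if current > baseline * TOLERANCE:
    sys.exit("performance regression: p90 exceeds baseline tolerance")
```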

Eric's slides can be viewed on SlideShare.

Eric also writes a blog.

In addition to the talks, the conference also held several round-table discussions, where a few experts discussed a given topic in a lively, active dialogue with the audience. In my opinion, this is a very interesting format: participants could share their often impressive experience, and you could walk up and discuss interesting questions with every expert and participant.

Offtopic and conversations behind the scenes


A special part of my impressions of the conference is the informal communication with speakers and participants. Thanks to the format of the discussions, lunches, round tables, and receptions, you can get to know people from different companies all over the world and exchange experience. There is a great deal of networking at the conference, and people are very open and willing to share what they know.

Separately, I want to mention Debbie Sheetz. Debbie worked at BMC, started out analyzing performance on mainframes, and then moved to distributed systems. Her experience in monitoring is vast and very interesting.

I was also lucky to talk with Anush Najaryan from MathWorks, whose experience also deserves attention.

Unfortunately, the number of talks makes it impossible to attend them all, and the talks are not recorded, so there is no way to review them at home. The organizers select speakers based on submitted papers, which are then distributed to conference participants, but the collection contains no materials from the invited speakers.

Here is Anush's article about the conference.

CMG impact is a very useful and interesting conference with an amazing atmosphere of knowledge sharing, and I recommend attending it.

Source: https://habr.com/ru/post/412767/
