
Much has been said about the advantages of working at a product company, and it is hard to add anything original. Less well known is how to keep a product "healthy" and what there is to do at a product company besides building features. In this article we explain how we operate the product at Juno and how the operations team and technical specialists are involved.
We do not claim that our way is the only correct one. We keep trying things, making mistakes, and learning from them. We hope our experience will be useful to you.
About us: Juno is a ride-hailing service in the US that is part of the Gett group of companies.
At Juno, engineers write code in Go, Swift, Kotlin, Python, and React.js in the mobile, Backend, Frontend, Data Science, and Technical Operation Support teams, building a service that has become part of the daily life of tens of thousands of drivers and hundreds of thousands of New Yorkers.
What does operating a product consist of?
Let's look at how operations work at Juno and break the process down into its component parts.
We have identified three key components:
- Operations office
- Metrics and monitoring
- Incident Investigation
The purpose of operating a product is to respond to problems and changes in a timely manner, regardless of their nature.
To do this, you need to:
- Determine the health of the system
- Understand how changes within the system affect performance
- Understand how changes in the market affect performance
- Understand when a change becomes a problem
With this approach, business decisions are based on data. Our operations team works in New York, since Juno is currently available only in that city.
The team’s daily to-do list looks like this:
- Monitor and respond quickly to regulatory changes. Typical examples are a new toll road or a relocated driver waiting area at the airport. As soon as we learn of such a change, a team member goes to the location to update the map correctly and assess the list of possible problems. Once the development team has updated the map on the servers, the team member tests the changes in the field and makes sure they work correctly.
- Conduct field research. When we launched the service in New York, the plan was to first recruit enough drivers for the service to work reliably in any part of the city. As a result, for the first couple of months the drivers who joined us drove without passengers and only occasionally received trips from beta testers. These trips were not enough to gather the information we needed, so we sent the operations team into the field to assess the quality of the service and hear drivers' complaints about the app. The approach proved useful, and we now use it whenever we release significant changes or need to test hypotheses.
- Maintain a "Calendar of Events": a list of events, holidays, and weather phenomena that can affect the number and quality of trips. It helps us understand and anticipate changes in key indicators (for example, the number of trips or the number of drivers online) that are not obvious to the development team in Minsk. Some events are easy to look up (weather, the Super Bowl, a marathon, a cycling race, etc.), but others are harder to spot. For example, in our first year it came as a surprise that Ramadan strongly affects the number of drivers ready to accept an order. Many drivers in the US are Muslim, and they do not work on the holiday. This is hard to take into account from Minsk.
- Track changes in business metrics. In the third month after launch, while the number of trips was growing rapidly, we found that there were not enough drivers online, which hurt pickup times and passengers' willingness to ride with us. It turned out that a competitor had launched a campaign guaranteeing drivers higher pay for trips during the morning and evening rush hours. The information was quickly passed to Minsk, and soon we were able to offer similar terms. This helped us win drivers back and keep growing.
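As an illustration of the "Calendar of Events" idea, here is a minimal sketch of how such a calendar could be used to annotate a day on which a key indicator moved unexpectedly. The `CalendarEvent` type, the `events_on` helper, and the sample entries are assumptions made for this example and do not describe Juno's actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CalendarEvent:
    name: str
    start: date
    end: date  # inclusive


def events_on(day, calendar):
    """Return the names of all calendar events that cover the given day."""
    return [event.name for event in calendar if event.start <= day <= event.end]


# Hypothetical entries, for illustration only.
CALENDAR = [
    CalendarEvent("City marathon", date(2018, 11, 4), date(2018, 11, 4)),
    CalendarEvent("Snowstorm warning", date(2018, 11, 15), date(2018, 11, 16)),
]

# Annotate a day on which the number of drivers online dropped unexpectedly.
print(events_on(date(2018, 11, 16), CALENDAR))  # ['Snowstorm warning']
```

A lookup like this lets a remote team see at a glance whether a drop in a metric coincides with a known event before treating it as a problem.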
Metrics and monitoring
At Juno, every team has metrics, which we have agreed to divide into:
- Business metrics.
- Technical metrics.
Business metrics are a set of indicators that let you assess the "health" of the product. We loosely divide them into two groups:
- Online. The obvious ones are the number of drivers and passengers online and the number of trips by status. Less obvious ones include the number of new users, the conversion from the price-estimate screen to an actual trip order, the average wait time for a car in a given area, the speed of the airport queue, etc.
- Offline. Not all information can be received and processed in real time, and it is not always necessary. When we plan driver promotions or new features, we are interested in long-term trends or in how users react to an A/B experiment, be it a new design, a new feature, or an additional discount.
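To make one of the less obvious online metrics concrete, the sketch below computes the conversion from the price-estimate screen to a trip order. The event names `price_estimate_shown` and `trip_ordered` and the input shape are hypothetical, chosen only for illustration, and the events are assumed to arrive in chronological order.

```python
def estimate_to_order_conversion(events):
    """Share of users who ordered a trip after seeing a price estimate.

    `events` is an iterable of (user_id, event_name) pairs in chronological
    order. Returns None if nobody saw the estimate screen.
    """
    saw_estimate = set()
    ordered = set()
    for user_id, name in events:
        if name == "price_estimate_shown":
            saw_estimate.add(user_id)
        elif name == "trip_ordered" and user_id in saw_estimate:
            ordered.add(user_id)
    if not saw_estimate:
        return None
    return len(ordered) / len(saw_estimate)


# Hypothetical sample: two users saw the estimate, one of them ordered.
sample = [
    ("u1", "price_estimate_shown"),
    ("u2", "price_estimate_shown"),
    ("u1", "trip_ordered"),
]
print(estimate_to_order_conversion(sample))  # 0.5
```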
We use Tableau to build analytical reports from the collected metrics. A Business Intelligence (BI) team is responsible for these reports. It works in the Tel Aviv office alongside the product team. Both teams work closely with their colleagues in New York, which makes it possible, based on the BI team's analysis, to evaluate how well our actions worked, formulate hypotheses to test in the field, and adjust the product development plan.
Alongside these are technical metrics, which in one way or another reflect the state of the system as a whole.
Technical metrics are a set of indicators of whether individual components are working correctly, from which we draw conclusions about the system as a whole. They show how long calls between services take, how much memory the services consume, and whether there are critical errors in message passing between them. Juno has a lot of these metrics. They are somewhat redundant, but in critical situations that redundancy helps find the cause of a problem quickly. The following help us track and use technical metrics:
- Dashboards. They display the system's significant vital signs. Each development team has its own set of metrics that show how a given change affected the microservices it owns. For example, one team monitors metrics related to driver payouts and passenger payments, while another watches metrics for driver search or the number of coordinates received.
- Logs. We log events from mobile devices and backend microservices. In 2017 the logs took up 400-500 gigabytes per week; by 2018 this figure had doubled. We are interested in the following events: requests from microservices to external data sources and to other microservices, requests received from and sent to clients, and all kinds of errors (business and technical). Note that the information is anonymized: personal data such as passwords and banking information is not logged.
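One common way to keep passwords and banking details out of logs is to redact sensitive fields before an event is written. The sketch below shows the idea. The field names in `SENSITIVE_KEYS` and the `[REDACTED]` placeholder are assumptions for this example; the article does not describe how Juno's own anonymization is implemented.

```python
# Hypothetical list of field names that must never reach the logs.
SENSITIVE_KEYS = frozenset({"password", "card_number", "cvv"})


def redact(value):
    """Return a copy of a log payload with sensitive fields masked.

    Walks nested dicts and lists; everything else is returned unchanged.
    """
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {
    "event": "payment_failed",
    "user": {"id": 42, "password": "hunter2"},
    "cards": [{"card_number": "4111111111111111", "brand": "visa"}],
}
safe_event = redact(event)
print(safe_event["user"])  # {'id': 42, 'password': '[REDACTED]'}
```

Masking at a single choke point, rather than in every call site, makes it harder for a new service to leak a field by accident.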
To monitor performance, we use Grafana and Prometheus. When developing a new service or adding a new feature, developers add the necessary metrics to the service, and then each team sets up its own alerts.
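The logic behind a typical threshold alert can be sketched in a few lines. This is plain Python rather than a real Prometheus alerting rule, and the metric, threshold, and sample values are made up for illustration. The alert fires only when the metric stays below the threshold for several consecutive samples, which filters out one-off dips.

```python
def should_alert(samples, threshold, min_consecutive):
    """Return True if the last `min_consecutive` samples are all below `threshold`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False
    return all(sample < threshold for sample in samples[-min_consecutive:])


# Hypothetical per-minute counts of drivers online in one area.
drivers_online = [120, 90, 80, 70]
print(should_alert(drivers_online, threshold=100, min_consecutive=3))  # True
```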
Based on the configured alerts, the technical support team performs an initial analysis and escalates the problem to the development or business teams for resolution.
If the problem is technical and threatens the normal operation of the service, the technical support team opens a production issue. An automated process immediately notifies everyone concerned, including the customer support team (Customer service aka Helpdesk aka L1 support), so that it can prepare for a possible surge of calls.
Incident Investigation
Over time, we came to the conclusion that every serious incident should be followed by a "debriefing". We then change our processes so that we can avoid similar events in the future or cope with them better.
The elements mentioned above (metrics, dashboards, alerts, and logs) help us understand what happened. The teams sit down together, analyze the changes in technical and business indicators, take the mistakes into account, and draw lessons.
We have to deal not only with production incidents but with any situation where "what happened?" cannot be answered quickly. This is where the tech support team (TechSupport aka L2 support) helps.
What kinds of problems does technical support solve? Many assume it is a boring job, as in The IT Crowd, where three nerds in a basement do nothing but say: "Have you tried turning it off and on again?" In reality, the questions are complex and ambiguous.
The first level of customer support is organized on the "follow the sun" principle, which makes round-the-clock user support possible without night shifts. During European hours the Tel Aviv office is on duty, and during American hours the Portland office takes over. This team's task is to listen to and understand the "pain" of the driver or passenger, to reassure them and, where possible, to help. The people who work there answer questions about how the service works. The team is not "technical", however, and as soon as a request requires digging into technical details, it is passed on to the technical support team.
The technical support team works in Minsk and is part of the development center. It handles only technical issues and does not communicate with drivers and passengers directly. Its tasks are incident investigation and process automation.
In the case of a production incident, the technical support team's task looks like this: a bug was found or a deployment failed, we noticed the problem and fixed it, but we still need to work out how it affected the system and what needs to be restored from the product's point of view:
- Was data corrupted? Was its integrity broken?
- How did the incident affect users?
- Were all users affected?
- What can be fixed?
The questions are simple, but answering them requires a very good understanding of how the system works and how its behavior changed during the incident. The continuous deployment process must also be taken into account, since something may change every minute.
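A first pass at the questions "how did the incident affect users?" and "were all users affected?" can be sketched as a filter over the records that fall inside the incident window. The record shape `(user_id, timestamp, succeeded)` and the sample data are assumptions for this example; a real investigation would draw on the logs and metrics described earlier.

```python
from datetime import datetime


def incident_impact(records, start, end):
    """Split users active during [start, end) into affected and all active.

    `records` is an iterable of (user_id, timestamp, succeeded) tuples.
    Returns (affected_users, active_users) as two sets.
    """
    active = set()
    affected = set()
    for user_id, timestamp, succeeded in records:
        if start <= timestamp < end:
            active.add(user_id)
            if not succeeded:
                affected.add(user_id)
    return affected, active


# Hypothetical incident window and request records.
start = datetime(2018, 5, 1, 10, 0)
end = datetime(2018, 5, 1, 10, 30)
records = [
    ("u1", datetime(2018, 5, 1, 10, 5), False),
    ("u2", datetime(2018, 5, 1, 10, 10), True),
    ("u3", datetime(2018, 5, 1, 11, 0), False),  # outside the window
]
affected, active = incident_impact(records, start, end)
print(sorted(affected), sorted(active))  # ['u1'] ['u1', 'u2']
```

Comparing the two sets shows whether the incident hit everyone who was active or only a subset, which in turn narrows down what needs to be restored.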
As an example of a case where technical support is needed for the product to work correctly, consider "I did not take this trip". The driver picked up a different passenger and completed a trip that our passenger does not want to pay for. Here we need to distinguish a legitimate request from attempted fraud, where the user is trying to avoid paying for a service they received.
If the same kind of request comes in more than once, the technical support team automates its handling and hands it over to the customer support team as a web application. This reduces request processing time and lets us avoid "inflating" the technical support team. Even so, the technical support engineer vacancy is always open, because people grow and move on to other development teams.
All roads lead to Rome
It is no accident that this article describes the technical support team in such detail. The team has become the place where information from all sources flows together. A single point of contact reduces the number of intermediaries and therefore the number of distortions.
This does not mean that the technical support team is the main link in operating the product. A product company is a living organism in which every organ is important and necessary. You cannot say which matters more to a person: the brain or the heart, the lungs or the circulatory system. Only the harmonious development and interaction of all organs guarantees the healthy functioning of an organism, or of an IT company.
Health to you and your products!