Learn what production readiness means and how to create a production readiness checklist of requirements. In this article, Yonatan Boguslavski explains why you need to evaluate your services and how to improve their production readiness.
What is production readiness?
We develop software products to answer customer needs. Products should handle production-level traffic and be compliant with all data and security requirements. To do this, our services should be production-ready.
There is no single definition of “production-ready”. It’s a spectrum, and production readiness differs slightly from one organization to the next, depending on its needs. For example, a “production-ready” service at an early-stage startup might only require monitoring support, alerts, and a few unit tests. In a large corporation, by contrast, production readiness may require adequate unit and integration test coverage, documentation, monitoring, alerts, tracing support, and more.
In this production readiness 101 post, I will give a general idea of which requirements can go into a checklist, covering all or most aspects of production readiness. You can focus on the ones that best fit your product and your engineering organization.
Production readiness: a checklist
Getting to production readiness is no walk in the park. As DevOps matures, the software ecosystem is becoming more complex. Production environments can consist of a growing number of moving parts: different environments, thousands of microservices, frameworks, and dependencies. As a result, distributed systems are hard to observe, which creates a heavy cognitive load for developers. Coordination between developer teams becomes costly, and the developer experience suffers: onboarding is long, because new developers need to get to grips with tribal knowledge about the architecture, the resources, and what even counts as production readiness.
Shift-left production readiness
Shift-left production readiness can help.
As with shift-left testing and shift-left security (and almost every other phase of the development life cycle), shift-left production readiness is crucial for reducing incidents and MTTR. In many ways, the core of modern production readiness is shifting production operations left to developers, but with solutions and approaches that do not require them to be infrastructure or DevOps experts, because that isn’t always humanly possible.
Custom views developers can understand
With the shift-left paradigm, a lot of the responsibility is now in the developers' hands. But you cannot expect every developer to be an expert in k8s, security, data, and a lot more. You need to give developers a platform that consolidates all the relevant data they need to fix the problem in one intuitive view, with neither too much nor too little information. You don't want developers running SSH and kubectl commands in various terminals.
Using a service catalog
A Service Catalog is a unified interface showing engineering teams everything they need to know about the microservices, software and underlying resources - the architecture.
The organization’s service catalog should support the developers as they perform many tasks, from scaffolding new microservices with best practices (for example, configuring basic alerts and showing them on Grafana) up to the production phase when, for example, the developer needs to find the relevant runbook for fixing a bug in production. A consistent view that shows your microservices architecture is crucial for scaling engineering efforts and helping developers stay efficient and autonomous, with a strong sense of ownership. An up-to-date service catalog can be very helpful when an incident occurs and will help reduce the MTTR. It should be live, not reside in documentation or a spreadsheet.
What should be in such a catalog?
- Who owns a service: In case of downtime or errors, you can immediately contact the responsible team or developer instead of hunting down suspected service owners.
- The Slack channel for the service: A dedicated Slack channel for each service is a great way to interact with all its relevant stakeholders: developers, SREs, and more.
- Who is on-call
- Used packages (In-House / External): In case of a security vulnerability, it’s much easier to find all services that are impacted.
- Latest Deployment: Reviewing the latest commit messages and the latest deployments makes it much easier to determine what has gone wrong.
- Documentation: Most organizations store documentation in all kinds of places (README files, Confluence, Notion, and more). It really helps to have one central place with links to the relevant documentation.
- Service dependencies: Today organizations have hundreds of microservices, and developers need to quickly understand who or what depends on the service they own and which services their service depends on.
- Audit Log on catalog entities: When it is easy to see all the last operations that happened on entities in the service catalog (for example, S3 buckets, deployments, k8s), you can understand issues faster
- Which checks/tests were performed on the microservices: It is important to remember that every service is unique, so we need to define a general checklist for all the services but also special checks and tests for the individual services. Creating an accurate checklist will be crucial in reducing the MTTR. It is important to distinguish between tests that check the core logic of the services and tests that check the production-readiness of the services.
and many more...
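To make a few of the fields above concrete, here is a minimal sketch of a catalog entry and two of the lookups the list describes: finding every service that uses a given package, and finding who depends on a service. Everything in it is an assumption made for illustration: the `CatalogEntry` fields, the helper names, and the sample services are hypothetical and not tied to any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One service in a hypothetical service catalog."""
    name: str
    owner: str              # owning team or developer
    slack_channel: str
    on_call: str
    packages: set[str] = field(default_factory=set)    # in-house and external
    depends_on: set[str] = field(default_factory=set)  # names of upstream services
    docs_url: str = ""


def services_using(catalog: list[CatalogEntry], package: str) -> list[str]:
    """Return the names of services that use the given package."""
    return sorted(e.name for e in catalog if package in e.packages)


def dependents_of(catalog: list[CatalogEntry], service: str) -> list[str]:
    """Return the names of services that depend directly on `service`."""
    return sorted(e.name for e in catalog if service in e.depends_on)


# Sample data, invented for this example.
catalog = [
    CatalogEntry("checkout", "payments-team", "#svc-checkout", "dana",
                 packages={"requests", "internal-auth"}, depends_on={"inventory"}),
    CatalogEntry("inventory", "supply-team", "#svc-inventory", "lee",
                 packages={"internal-auth"}),
    CatalogEntry("search", "discovery-team", "#svc-search", "sam",
                 packages={"requests"}, depends_on={"inventory"}),
]

# Which services are affected if "internal-auth" has a vulnerability?
print(services_using(catalog, "internal-auth"))
# Who is impacted if "inventory" goes down?
print(dependents_of(catalog, "inventory"))
```

A real catalog would populate these entries automatically from source control, CI/CD, and the cloud provider rather than by hand, which is what keeps it live.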
Release cycles are becoming shorter, and some organizations release a new version a couple of times a day. In this case, shift-left testing is the only way to go. Without it, QA has too many tests to run, which causes a never-ending loop between the developers and QA. It’s not only impossible, it’s also a recipe for production bugs.
With shift-left testing, we force ourselves to do brainstorming sessions and think about roadblocks, bottlenecks, and possible performance failures. Even though these discoveries may require new design options, they will ensure a better outcome.
Adopting a Production and Reliability Mindset
Everyone, not just SREs, should adopt a production and reliability mindset.
Developers should implement systems and processes that make detecting, mitigating, and preventing incidents easier. They need to really understand the problems and issues their customers are facing. In addition, they need to use monitoring and analytics tools to provide a continuous, holistic view of infrastructure health and enrich it with relevant data to reduce the time it takes to solve an incident.
It is also important to reinforce this mindset during developer onboarding. When a new developer joins the organization, we usually invest in shortening the time until the first or tenth commit. Don’t forget how important it is to be familiar with the production environment. A good KPI for the quality of onboarding can be the time it takes for a developer to be on-call on their own.
Automation can help perform tasks with limited manual intervention in order to streamline development and incident management. Automation is one of the critical principles for avoiding incidents and has benefits such as increasing productivity and scalability while reducing the chances of failure. In addition, automation increases the agility and the independence of the developers and reduces DevOps grunt work.
In some organizations, even non-technical people make app configuration changes by editing YAML files. This isn’t a good idea, since it is error-prone. Instead, give non-technical internal users, such as customer success, a friendly UI for changing configurations when needed. Just like in a service catalog, the UI should contain all the relevant context for executing the operations. Automation should also include manual approval processes where necessary and support "if this, then that" UI forms.
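As a rough illustration of what could sit behind such a UI form, the sketch below validates a requested configuration change against a small schema and blocks settings that need manual approval until someone signs off. The setting names, bounds, and the `ConfigChange` type are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema: allowed settings, their bounds, and whether
# a change needs manual approval before it can be applied.
SCHEMA = {
    "max_seats": {"min": 1, "max": 500, "needs_approval": False},
    "rate_limit_per_minute": {"min": 10, "max": 10_000, "needs_approval": True},
}


@dataclass
class ConfigChange:
    key: str
    value: int
    requested_by: str
    approved_by: Optional[str] = None


def validate(change: ConfigChange) -> list[str]:
    """Return a list of problems; an empty list means the change may be applied."""
    rule = SCHEMA.get(change.key)
    if rule is None:
        return [f"unknown setting: {change.key}"]
    problems = []
    if not rule["min"] <= change.value <= rule["max"]:
        problems.append(
            f"{change.key} must be between {rule['min']} and {rule['max']}"
        )
    if rule["needs_approval"] and change.approved_by is None:
        problems.append(f"{change.key} requires manual approval")
    return problems


def apply_change(config: dict, change: ConfigChange) -> dict:
    """Return a new config with the change applied, or raise if it is invalid."""
    problems = validate(change)
    if problems:
        raise ValueError("; ".join(problems))
    return {**config, change.key: change.value}
```

Because validation returns readable messages instead of failing on a malformed file, the form can show the user exactly what to fix, which is the main advantage over hand-edited YAML.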
Incidents will happen, and your organization should have a disaster recovery plan that’s both documented and tested (and don’t forget regular backups).
If your organization wants an efficient disaster recovery (DR) process, you should use infrastructure as code (IaC). If you’re using manual processes or complex chains of tooling, the DR process will take longer. The reliability of an application depends on the ability to pivot and the speed of redeployment. Be sure you know what that process looks like and how to put in place the right practices, tooling, and underlying processes to make deployment as straightforward as possible.
IaC also helps you track and audit changes to your infrastructure. Because your infrastructure is represented in code, commits to your Git repository record who made each change, when, and why. You’ll be able to look at the code and know how environments were built, what’s happening, and why.
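As a small sketch of that audit trail, the snippet below reads the Git history of an infrastructure path and turns each commit into a who/when/why record. It assumes the `git` CLI is installed; the function names and the repository path in the usage note are hypothetical.

```python
import subprocess

# git pretty-format placeholders: %h = short hash, %an = author name,
# %ad = author date, %s = subject. %x1f emits a unit-separator byte,
# so subjects containing other punctuation are split correctly.
LOG_FORMAT = "%h%x1f%an%x1f%ad%x1f%s"


def parse_log(output: str) -> list[dict]:
    """Turn `git log` output in LOG_FORMAT into who/when/why records."""
    records = []
    for line in output.split("\n"):
        if not line:
            continue
        commit, author, date, subject = line.split("\x1f", 3)
        records.append(
            {"commit": commit, "who": author, "when": date, "why": subject}
        )
    return records


def infra_history(repo_dir: str, path: str, limit: int = 20) -> list[dict]:
    """Return the most recent commits that touched `path` in the repository."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "-n", str(limit), "--date=iso-strict",
         f"--pretty=format:{LOG_FORMAT}", "--", path],
        check=True, capture_output=True, text=True,
    )
    return parse_log(result.stdout)
```

For example, `infra_history(".", "terraform/")` would list the latest changes under a hypothetical `terraform/` directory. The "why" is only as good as the commit messages, so this works best when the team writes descriptive ones.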
Conduct Post-Mortem Meetings
One of the most powerful ways to prepare for future incidents is to study and learn from patterns in past incidents.
A post-mortem meeting is held at the end of an incident. The goal is to look at the incident from start to finish to determine what went right and what can be improved. By the end of the meeting, you should have identified best practices and future improvement opportunities.
Benefits of Production Readiness
The end goal of production and operational readiness is to improve customer experience, deploy more frequently with more confidence, and minimize toil. In addition, development teams respect the code base more, leading to higher-quality code and increased pride of ownership.
How to Review Production Readiness
Each organization needs a platform that shows which checks are available, which checks each service failed, how severe each failure is, and which action items will fix it. The developers must understand and buy into the initiatives around production and operational readiness. It is also essential to hold a recurring meeting dedicated to going over the status of the services and whether they have improved in recent weeks.
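The kind of review described above can be sketched as a list of checks, each with a severity and an action item, run against a service's metadata. This is only an illustration under stated assumptions: the `Check` type, the four example checks, and the metadata keys (`owner`, `runbook_url`, `alert_count`, `readme`) are hypothetical, and a real organization would define its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    severity: str                   # "high", "medium", or "low"
    passed: Callable[[dict], bool]  # takes service metadata, returns pass/fail
    action_item: str                # what to do when the check fails


# Example checks, invented for this sketch.
CHECKS = [
    Check("has owner", "high", lambda s: bool(s.get("owner")),
          "Assign an owning team"),
    Check("has runbook", "medium", lambda s: bool(s.get("runbook_url")),
          "Link a runbook"),
    Check("has alerts", "high", lambda s: s.get("alert_count", 0) > 0,
          "Configure basic alerts"),
    Check("has readme", "low", lambda s: bool(s.get("readme")),
          "Add a README"),
]

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def review(service: dict, checks: list[Check] = CHECKS) -> list[tuple[str, str, str]]:
    """Return failed checks as (severity, name, action item), most severe first."""
    failed = [c for c in checks if not c.passed(service)]
    failed.sort(key=lambda c: SEVERITY_ORDER[c.severity])
    return [(c.severity, c.name, c.action_item) for c in failed]


service = {"name": "checkout", "owner": "payments-team", "alert_count": 0}
for severity, name, action in review(service):
    print(f"[{severity}] {name}: {action}")
```

Running the same checks on every service and comparing the results week over week gives the recurring review meeting something concrete to track.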
Production readiness is getting harder to achieve. Teams are hit with unprecedented change and growing application complexity. Organizations need to provide one place where developers can see all the context relevant to them and perform all the operations required to handle an incident. A developer portal is the best approach.