What is production readiness?
Production readiness is the process of ensuring that software components are secure, reliable, and able to perform at the expected level. Achieving it reduces the chance of downtime, minimizes the number of critical incidents or failures, and gives users a better experience.
The idea stems from the production readiness review process described in Google’s SRE book. The process relies on numerous factors that all play a role in making software production-ready, but these factors look different in different engineering organizations.
Production readiness is difficult to achieve, partly because it requires applying a number of standards across the software development lifecycle, such as code review, testing, monitoring, security and access controls, documentation, deployment workflows and more. The process touches everything from code to post-production operations, and the organization’s production engineering requirements can change over time.
In simpler terms, production readiness is similar to the ‘definition of done’, an idea born from the product management world.
The ‘definition of done’ says that each stage must be completed in its entirety before work counts as ‘done’. But nothing in the software world is ever truly ‘done’: it has to be continually reviewed, monitored and maintained. For example, a service that was production-ready when it was scaffolded won’t necessarily remain that way, because requirements change and services (and their components) can degrade.
There are different ways to make sure the production readiness checklist is actually reviewed:
- By the DevOps engineer that is performing the action (for instance, when scaffolding a service)
- By the developer
- Using manual lists, stored in Excel sheets or Jira
- Using automated checks, such as scorecards or self-service actions in an internal developer portal
It is important that reviewing and maintaining these lists is easy and doesn’t involve manual work, even more so when services that have already been deployed need to be reviewed for production readiness.
Importance of a production readiness checklist
A production readiness checklist is exactly what it sounds like: a ready-made list of everything that you need to check about your software for production readiness.
Ensuring that software is production-ready is closely tied to software standardization; both encompass the steps needed to ensure smooth operation in a live environment.
The checklist is important because it confirms that, from an engineering viewpoint, the service has the appropriate resilience, security and performance. A slowdown, shutdown or breach could have a hugely detrimental impact on your business’s reputation and bottom line, and it could also have negative consequences internally for the engineering team. On the flip side, using a checklist can improve the user experience, and as a byproduct you can retain (and grow) customer trust and revenue.
Getting started with a production readiness checklist
Before we get into what you have to include in your list, there are some things you should bear in mind when planning your production readiness checklist:
The checklist shouldn’t be static
The software development life cycle is continually evolving, so new frameworks, dependencies and technologies should be factored into any checks. Because of this evolution, the checklist, just like the ‘definition of done’, should be treated as the first step of your production readiness checks rather than the last. Even if you’ve ensured there are no bugs or vulnerabilities before deployment, you need a way to check that new vulnerabilities have not appeared, and an approach to resolving them that is embedded in the organization’s standards. This is where ongoing review of readiness comes into play.
Automated production readiness checks are vital
While the idea of a checklist sounds like a manual approach in itself, it should evolve into automated checks to improve efficiency and accuracy. The only truly manual parts are compiling the list itself and verifying with all stakeholders that it makes sense.
When it comes to the checks themselves, each organization will have its own approach, but it’s clear that manual checks using spreadsheets, project management software or Configuration Management Databases (CMDBs) are inefficient and may not be up to date, which can undermine the trust that engineers have in the process. Automated checks, which rely on production readiness scorecards in an internal developer portal, can monitor and validate readiness criteria continuously and perform checks consistently, without human error. Automation also lets these checks go a step further: providing alerts when issues arise, enforcing policies, triggering tests and validating configurations, all of which make the process more efficient and reliable.
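As a rough sketch of what an automated check could look like, the Python below evaluates a service record against a few rules and reports which ones fail. The `Service` fields and the rules themselves are hypothetical and chosen only for illustration; they are not taken from any particular portal's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Service:
    # Hypothetical metadata a service catalog might hold.
    name: str
    owner: Optional[str]
    has_runbook: bool
    open_critical_vulns: int


@dataclass
class Rule:
    title: str
    check: Callable[[Service], bool]


# Illustrative rules; a real scorecard would reflect your own standards.
RULES = [
    Rule("Has an owner", lambda s: bool(s.owner)),
    Rule("Has a runbook", lambda s: s.has_runbook),
    Rule("No open critical vulnerabilities", lambda s: s.open_critical_vulns == 0),
]


def evaluate(service: Service, rules: List[Rule] = RULES) -> dict:
    """Run every rule against the service and summarize the result."""
    failed = [rule.title for rule in rules if not rule.check(service)]
    return {
        "service": service.name,
        "passed": len(rules) - len(failed),
        "total": len(rules),
        "failed": failed,
    }
```

Running `evaluate` on a schedule, or whenever service metadata changes, is one way to keep the result current instead of relying on a one-off review.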
Checklists vary greatly and are difficult to put together
Creating a production readiness checklist is challenging due to the diverse requirements of different software components (e.g., APIs, microservices). These standards vary based on numerous factors, including the infrastructure, underlying technology, and the role of each component within the overall engineering ecosystem.
Each organization requires its own set of production readiness metrics and checklists tailored to its unique:
- Business needs (e.g. highly regulated industries handling sensitive data); and
- Technical environments (e.g. externally exposed services needing robust security measures, or adherence to specific Kubernetes standards).
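One way to picture this tailoring is a shared base checklist plus extra checks that apply only to services with certain traits. The sketch below assumes hypothetical trait names and check titles; your own would come from your business and technical requirements.

```python
# Checks every service gets, regardless of its role (illustrative).
BASE_CHECKS = ["owner assigned", "runbook documented", "SLOs defined"]

# Hypothetical extra checks, keyed by a trait of the service.
EXTRA_CHECKS = {
    "externally_exposed": ["penetration test completed"],
    "handles_sensitive_data": ["encryption at rest verified", "compliance review passed"],
    "runs_on_kubernetes": ["resource limits set"],
}


def build_checklist(traits: set) -> list:
    """Return the base checks plus the extras for each trait the service has."""
    checklist = list(BASE_CHECKS)
    for trait in sorted(traits):  # sorted for a stable, repeatable order
        checklist.extend(EXTRA_CHECKS.get(trait, []))
    return checklist
```

An internal tool would get only the base checks, while a public API that handles sensitive data would get both sets of extras.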
Core components of a production readiness checklist
A comprehensive production readiness checklist for a service addresses multiple factors to ensure it’s good to go. These include:
Security:
- Conduct vulnerability scans (are you connected to relevant scanners?)
- Identify vulnerabilities through a security audit
- Ensure you have SLOs and maximums set for vulnerabilities
- Put role-based access controls in place
- Ensure authentication and authorization methods are in place for each service
- Run static application security testing (SAST) using tools like Snyk to monitor code in the CI/CD pipeline
- Make sure secrets are properly managed
- Perform penetration tests and dynamic application security testing (DAST) at the appropriate times
- Use scanning tools to check that all dependencies are on the correct versions
- Implement data encryption for both data at rest and in transit
- Verify compliance with industry security standards
- Check for other common malicious activities
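To illustrate the "maximums set for vulnerabilities" item, here is a small sketch that compares open vulnerability counts against per-severity limits. The limits and severity names are assumptions; in practice the counts would come from whichever scanner you are connected to.

```python
# Hypothetical maximum number of open vulnerabilities allowed per severity.
MAX_OPEN = {"critical": 0, "high": 2, "medium": 10}


def over_limit(open_counts: dict, limits: dict = MAX_OPEN) -> dict:
    """Return each severity that exceeds its limit, with the amount of excess."""
    return {
        severity: open_counts.get(severity, 0) - limit
        for severity, limit in limits.items()
        if open_counts.get(severity, 0) > limit
    }
```

An empty result means the service is within its limits; anything else can fail the check or raise an alert.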
Scalability:
- Ensure the architecture is designed to handle increased loads efficiently
- Stress test the application’s components to check their limits
- Check whether your application can handle user or data growth
- Use performance monitoring for SLOs
- Establish performance benchmarks and then check these are met
- Automate the CI/CD release process to enhance scalability
- Execute automated unit and integration tests that must pass before release
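The "establish performance benchmarks and then check these are met" item can be sketched as a percentile check on latency samples. The nearest-rank percentile method and the 95th-percentile budget used here are assumptions for illustration, not a prescribed benchmark.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_benchmark(latencies_ms: list, p95_budget_ms: float) -> bool:
    """True if the 95th-percentile latency is within the budget."""
    return percentile(latencies_ms, 95) <= p95_budget_ms
```

Feeding this with results from a stress test gives a pass or fail signal that can gate a release.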
Reliability:
- Define and monitor compliance with service-level objectives (SLOs), service-level indicators (SLIs) and service-level agreements (SLAs)
- Ensure disaster recovery plans are documented and tested
- Keep regular backups of data
- Ensure redundancy mechanisms are in place
- Include automated rollback capabilities to revert to a stable version if needed
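Monitoring compliance with an SLO is often expressed as an error budget: the share of requests that are allowed to fail before the objective is missed. The sketch below assumes a simple request-based SLO; the function name and inputs are illustrative.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left for the period.

    1.0 means no budget has been used; 0 or less means the SLO is breached.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures
```

For example, with a 99.9% target and 100,000 requests, about 100 failures are allowed, so 25 failures leaves roughly three quarters of the budget.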
Observability:
- Implement monitoring with comprehensive KPI and health metrics, logging, and tracing
- Ensure you are alerted via your preferred method (Slack, email, etc.) if the status of your services changes (through broken thresholds or inconsistencies)
- Use dashboards for real-time status
- Use logging for incidents and errors
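Alerting on a change of status, rather than on every reading, might look like the following sketch. The metric names, thresholds and message format are hypothetical, and delivery to Slack or email is left out.

```python
def status_for(value: float, warn: float, crit: float) -> str:
    """Classify a reading against warning and critical thresholds."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def detect_changes(previous: dict, readings: dict, thresholds: dict):
    """Return the new status per metric and a message for each status change."""
    current, alerts = {}, []
    for metric, value in readings.items():
        warn, crit = thresholds[metric]
        status = status_for(value, warn, crit)
        current[metric] = status
        before = previous.get(metric, "ok")  # unseen metrics start as "ok"
        if before != status:
            alerts.append(f"{metric}: {before} -> {status} (value={value})")
    return current, alerts
```

Keeping the previous status means a metric that stays above its threshold produces one alert, not one per reading.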
Ownership:
- Identify owners of services and components, and include easily discoverable contact information and methods
- Map upstream and downstream dependencies
- Identify and make discoverable related teams, stakeholders, and team members
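Dependency mapping can start from a simple record of what each service depends on; the reverse view (who depends on a given service) can then be derived. The service names below are made up for illustration.

```python
from collections import defaultdict


def dependents_of(depends_on: dict) -> dict:
    """Invert a 'service -> things it depends on' map into 'thing -> its dependents'."""
    dependents = defaultdict(list)
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].append(service)
    return {name: sorted(users) for name, users in dependents.items()}


# Hypothetical catalog data.
DEPENDS_ON = {
    "cart": ["payments", "kafka"],
    "orders": ["kafka"],
}
```

With this data, `dependents_of(DEPENDS_ON)` shows that a change to `kafka` affects both `cart` and `orders`, which is the kind of information an owner needs before a risky change.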
Incident Management:
- Ensure runbooks have been documented and are accessible
- Assign on-call responsibilities for incidents
- Designate owning teams for each service
- Establish escalation policies
- Test incident response process with a drill
- Ensure on-call is able to find the information they need easily during resolution
Addressing these areas ensures software is production-ready, capable of meeting user demands and maintaining reliability throughout its lifecycle.
Not all services need to track every metric listed; additional metrics might include FinOps, specific Kubernetes standards, or application security standards, which aren't always part of SRE activities.
Where to store the production readiness checklist
Where you store your checklist matters because it affects how easy it is to find, use, update and even delete (by mistake!). Companies often store the checklist inside the GitHub repo as a Markdown (.md) file. The benefit is that it lives in the same space as the code and won’t get lost; the downside is that it might not be as easily accessible. Alternatives include spreadsheets, which, just like manual checks, can be painstaking to use and update by hand.
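If the checklist does live in the repo as a Markdown file, it can still be read by a script. The sketch below assumes the file uses GitHub-style task list items (`- [ ]` and `- [x]`); the checklist content is an invented example.

```python
import re

# Matches task list items such as "- [ ] item" or "* [x] item".
ITEM = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_checklist(markdown: str) -> list:
    """Return (title, done) pairs for every task list item in the text."""
    items = []
    for line in markdown.splitlines():
        match = ITEM.match(line)
        if match:
            items.append((match.group(2).strip(), match.group(1).lower() == "x"))
    return items


def unchecked(markdown: str) -> list:
    """Titles of the items that are not yet ticked off."""
    return [title for title, done in parse_checklist(markdown) if not done]
```

A CI job could call `unchecked` on the file and fail the build while items remain open, which removes some of the manual follow-up.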
Key takeaways
In conclusion, a production readiness checklist is essential for guaranteeing that your services are secure, scalable, reliable, and observable. It also plays a critical role in implementing continuous integration and deployment (CI/CD), setting service level objectives (SLOs), and establishing robust disaster recovery and rollback plans. Incorporating these elements from the initial launch and throughout subsequent updates ensures the ongoing health and effectiveness of your services.