Launching Products Reliably (SRE)
How do you launch products? How do you do that repeatedly and reliably?
Let’s look at how one of the big tech companies (Google, in this case) does it.
A number of things need to be in order for a successful launch, including (but not limited to):
- Builds
- Configurations
- Monitoring and logging stack, set up to track the right metrics
- Rollout strategy (canary, blue-green, etc.); the choice depends on your implementation (e.g., a feature flag framework in Google’s case)
- Rollback strategy (should be tested beforehand)
Getting the obvious out of the way, let’s proceed.
Now, do you have a launch checklist? (Pssst: You should).
Why do you need one? Because a checklist reduces the chance of failure and keeps launches consistent.
Ex:
Could a user potentially abuse the service?
Action Item: Implement rate limiting and quotas.
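One common way to implement that action item is a token bucket. The sketch below is a minimal, single-process illustration, not how Google does it; the class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock              # injectable so the limiter can be tested
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request fits the budget."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice you would keep one bucket per user or API key (for example in a dictionary keyed by user ID) and reject or queue requests when `allow()` returns False. A quota is the same idea over a longer window, such as a daily budget.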
Each checklist item should be concise, practical, and actionable enough for the developer(s).
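To keep items actionable, it can help to store each one as a question paired with its action item. The following is a minimal sketch; the `ChecklistItem` type, its fields, the `open_items` helper, and the second sample item are hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    area: str            # grouping label, e.g. "Client Behavior" (assumed field)
    question: str        # what the team has to answer before launch
    action_item: str     # concrete step to take if the answer is unsatisfactory
    done: bool = False   # set to True once the action item is complete


def open_items(checklist):
    """Return the items that still block the launch."""
    return [item for item in checklist if not item.done]


checklist = [
    ChecklistItem("Client Behavior",
                  "Could a user potentially abuse the service?",
                  "Implement rate limiting and quotas."),
    ChecklistItem("Development Process",
                  "Are configurations under version control?",
                  "Move configurations into version control.",
                  done=True),
]

for item in open_items(checklist):
    print(f"[{item.area}] {item.question} -> {item.action_item}")
```

Keeping the checklist as data rather than a free-form document makes it easy to review in version control and to report what is still open before a launch.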
A launch checklist (LC) can be defined for multiple areas of the product:
Architecture
Ex: Have you identified the dependencies and request paths correctly? Are they provisioned correctly? Have they been tested and reviewed?
Integration
Ex: Is the monitoring stack set up and up to date (in case of new features)?
Capacity Planning
Ex: Are the outreach teams (marketing, blogging, PR, etc.) up to date with the launch? How much traffic is expected, and do you have the resources to handle it?
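The traffic question usually comes down to simple arithmetic. Here is a rough sketch, assuming you have an estimate of peak queries per second and a measured per-instance capacity; the function name, the headroom fraction, and the spare-instance count are illustrative assumptions, not recommended values.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.25, spare_instances=1):
    """Estimate how many instances a launch needs.

    headroom: fraction of extra capacity kept above the expected peak.
    spare_instances: extra instances so one can fail without overload (N+1).
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    target_qps = peak_qps * (1 + headroom)
    return math.ceil(target_qps / qps_per_instance) + spare_instances
```

For example, an expected peak of 1200 QPS with 250 QPS per instance and 25% headroom gives 1500 / 250 = 6 instances, plus one spare, for a total of 7.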
Failure Modes
Ex: What if one of the components goes down? Is there a single point of failure?
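A first pass at the single-point-of-failure question can be automated from an inventory of components. The sketch below assumes a hypothetical inventory format (a dictionary with `replicas` and `zones` per component); real systems would pull this from their deployment configuration.

```python
def single_points_of_failure(components):
    """Return names of components that run as one replica or in only one zone."""
    risky = []
    for name, info in components.items():
        # A component is flagged if losing one replica or one zone takes it down.
        if info["replicas"] < 2 or len(set(info["zones"])) < 2:
            risky.append(name)
    return sorted(risky)


# Hypothetical inventory used only for illustration.
components = {
    "frontend": {"replicas": 3, "zones": ["a", "b", "c"]},
    "database": {"replicas": 1, "zones": ["a"]},
    "cache": {"replicas": 2, "zones": ["a", "a"]},
}
print(single_points_of_failure(components))  # ['cache', 'database']
```

This only catches redundancy gaps that are visible in the inventory. Shared dependencies, such as two replicas behind the same load balancer, still need a manual review.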
Client Behavior
Ex: How does a misbehaving user or a DoS attack affect your service?
Processes (Manual and Automated)
Ex: What if a cluster or a data center (DC) is compromised? Can your service still handle the load? How would you move it to a new cluster or DC?
Development Process
Ex: Are you using version control (not only for code but also for configurations)?
External Dependencies
Ex: Do any third parties or partners depend on your service? Are they aware of the updates and how these would affect their service(s)?
Rollout Planning
Ex: Canary, blue-green, etc.
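A canary rollout behind a feature flag is often implemented by hashing each user into a stable bucket and enabling the feature for a growing percentage of buckets. The sketch below shows that general technique under stated assumptions; `in_rollout` and its parameters are hypothetical and do not describe Google's feature flag framework.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Return True if `user_id` falls inside the first `percent` of 100 buckets."""
    # Hash the flag and user together so each flag gets its own stable split.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because a user's bucket never changes, raising `percent` from 1 to 5 to 50 only adds users and never flips someone back to the old behavior. Rolling back is a matter of setting `percent` to 0.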
Google has a dedicated role for this: Launch Coordination Engineers, experienced SREs who help development teams work through these challenges and deliver and deploy quickly.
You can read more about this topic at https://sre.google/sre-book/reliable-product-launches/.