It's only barely autumn. The leaves just started turning and I haven't even gone apple picking yet. Why are we talking about this? To some extent, planning for our various roadmaps is continuous. But accounting for predictable events should happen at least a couple of months out. We're just about one and a half months away from Black Friday and Cyber Monday, some of the biggest days for Button and our partners.
Chillier days ahead
Button engages in a commonly observed practice in software development: code freezes, specifically around major holidays and other revenue-sensitive periods. These periods involve a seasonal peak in business activity (e.g., shopping), and many Button personnel will be out-of-office. In addition, many of our partners have similar freezes during the same periods. Thus, any production issue that arises during these periods can be both extremely costly and more challenging than usual to recover from.
Due to this elevated risk, we endeavor to avoid all non-essential modifications to our systems, including but not limited to releasing new features, enabling new partners, and fixing non-critical bugs. The goal is to minimize moving parts and general system entropy. So even when a change may seem "innocent enough" or intuitively low-risk, it is still to be avoided. We are being deliberately more conservative in our approach than usual.
To freeze or not to freeze?
Code freezes bring several benefits:
Increased pre-freeze stability: Button's approach to the code freeze (more on that later!) results in increased scrutiny and transparency about changes to production that are applied prior to the start of the freeze. This means that the state of our environments and applications is well understood by those who need to support it.
Quality of life: Practically everyone, but especially the software development on-call team and their families, benefits from the improved quality of life that comes with fewer critical production alerts.
Increased post-freeze stability: Generally speaking, systems converge on stability when they are not under the continuous modification that results from normal software development. The exception to this "rule" is issues encountered by exposing a system to unprecedented scale demands.
Consistent maintenance: Button's software development on-call team is responsible for ensuring that production remains stable 24/7, including during company holidays. The more static our environments are, the easier it is for that team to ensure stability with minimal effort and hand-off between team members (i.e., the team doesn't need to spend time away from their families just to transition duties).
Code freezes also have drawbacks:
Pre-Freeze instability: Code freezes can promote a rush to ship features before a hard deadline. Rushing can lead to cutting corners or making mistakes, which leads to a short-term decrease in system stability.
Perceived lack of progress: What are they doing all day if the software development team isn't releasing code changes? We're still making changes to the code, but we're unable to introduce those changes to the production environment.
Post-Freeze instability: The work that happens during or just prior to the freeze is promoted to production after the freeze is done. But those changes have been sitting around, potentially diverging from the previously well-understood state of the system. The longer a set of changes sits without being released, the riskier it is to release it (e.g., certain assumptions may no longer be valid, specific details of the changes may have been forgotten, etc.). There is also the risk of a thundering herd of changes in which multiple benign deployments turn out to be incompatible with one another. Changes A and B, by themselves, may not have caused any issues, but deployed together they result in degradation.
Where Button lands
Button's policy for code freezes, as with most of our policies, attempts to find a healthy balance between safety and progress. Generally, code freezes are beneficial during times of elevated risk or decreased staff availability. The problems with code freezes are mostly associated with the hard deadline they represent.
Rather than impose a hard deadline by which all production deployments must be performed, we set a schedule over which a team of "Freeze Admins" becomes progressively stricter about the changes applied to our production environment. This group comprises Engineering management and leads. Button's production environment "slushes" over as we get closer to more sensitive times of the year, specifically Black Friday/Thanksgiving (US)/Cyber Monday and the winter holiday season.
[Image: daily forecast posted in the holiday freeze Slack channel]
Realistically, there is still a hard deadline: the day the Freeze Admins no longer allow releases to production. The practice of the slush is to raise awareness, heighten visibility, and prevent any surprise/last-minute deployments.
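The slush schedule described above can be pictured as a simple date-to-strictness lookup. The following is only a minimal sketch: the level names, the dates, and the `level_on` helper are all hypothetical and are not Button's actual schedule or tooling.

```python
from datetime import date
from enum import Enum


class FreezeLevel(Enum):
    """Hypothetical strictness levels; the names are illustrative only."""
    OPEN = "open"      # normal deployment process
    SLUSH = "slush"    # Freeze Admins apply increasing scrutiny
    FROZEN = "frozen"  # only essential changes are allowed


# Made-up example schedule: (date the level takes effect, level),
# sorted by date. Real dates would be set by the Freeze Admins.
SCHEDULE = [
    (date(2024, 11, 11), FreezeLevel.SLUSH),
    (date(2024, 11, 25), FreezeLevel.FROZEN),
    (date(2024, 12, 3), FreezeLevel.OPEN),
]


def level_on(day, schedule=SCHEDULE):
    """Return the level in effect on `day`.

    The level is taken from the latest schedule entry that starts on or
    before `day`; before the first entry, production is OPEN.
    """
    current = FreezeLevel.OPEN
    for start, level in schedule:
        if day >= start:
            current = level
    return current


if __name__ == "__main__":
    for d in (date(2024, 11, 1), date(2024, 11, 15), date(2024, 11, 29)):
        print(d.isoformat(), level_on(d).value)
```

A lookup like this could drive the daily forecast message, so the current level is always visible before anyone asks to deploy.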
Every request to deploy to production must be approved by a Freeze Admin. In general, this group is accountable for the stability of Production and comprises Engineering Leadership as well as Engineering Team Leads.
Requests are formatted as an articulation of:
What: A description of the proposed changes from both a business and technical perspective, typically accompanied by a pull request that has been reviewed and approved by a peer engineer.
Why: The reasons for the change, specifically 1/ the technical and business value of the change, and 2/ the urgency of the change (why it must be deployed now as opposed to after the freeze).
Risk: What could happen if we do or do not deploy the changes? How confident are we in the changes?
Requesting approval for these changes is relatively easy. The deploying software developer posts the above information in a dedicated Slack channel, tagging the Freeze Admins. The Freeze Admins then approve the change, deny it, or request more information. The approval, if given, is valid for half a day. A team of representatives from the Business and Product teams has the oversight power to request that a change be deferred if it would disrupt business operations.
A sample request is below (partner, application, and individual names modified/redacted):
Hello @freeze-admins - Seeking approval to push [repository]#[PR number], part two of two (link previous change) & should be the final change from me this week.
What: Updates the [application] configuration for XYZShop to add a custom `utm_campaign` value, only on app links, when country code is detected (i.e., the vast majority of traffic).
Why: Ticket APP-1234, as before. Many orders will not be tracked properly during a high volume period if this is not released. See ticket for revenue estimates.
Risk: Low, and similar to any change in this part of [application], the risk should be limited to XYZShop only. I will test in staging before prod (e.g., verify linking works and attributes are correct), and have the gracious assistance of Jim and Martha to help with testing.
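For readers who think in code, the request format and approval window can be sketched as a small data structure. This is an illustration under stated assumptions, not Button's tooling: the `DeployRequest` class and its methods are hypothetical, and "half a day" is interpreted here as 12 hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: "valid for half a day" is read as a 12-hour window.
APPROVAL_WINDOW = timedelta(hours=12)


@dataclass
class DeployRequest:
    """Hypothetical model of a freeze-period deploy request."""
    what: str  # description of the change (business and technical)
    why: str   # value of the change and why it cannot wait
    risk: str  # what could happen if we do or do not deploy
    approved_at: Optional[datetime] = None

    def missing_fields(self) -> List[str]:
        """Return the names of any required sections left blank."""
        return [
            name for name in ("what", "why", "risk")
            if not getattr(self, name).strip()
        ]

    def approve(self, now: datetime) -> None:
        """Record an approval; incomplete requests are rejected."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("request is missing: " + ", ".join(missing))
        self.approved_at = now

    def can_deploy(self, now: datetime) -> bool:
        """True only while an approval exists and has not expired."""
        if self.approved_at is None:
            return False
        return self.approved_at <= now <= self.approved_at + APPROVAL_WINDOW


if __name__ == "__main__":
    request = DeployRequest(
        what="Add a custom utm_campaign value for XYZShop app links",
        why="Orders would not be tracked properly during a high volume period",
        risk="Low; limited to XYZShop, tested in staging first",
    )
    approved = datetime(2024, 11, 20, 9, 0)
    request.approve(approved)
    print(request.can_deploy(approved + timedelta(hours=3)))
    print(request.can_deploy(approved + timedelta(hours=13)))
```

An expiring approval means a change cannot be approved early in the slush and then deployed much later, when the level of risk may have changed.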
Does it work?
Yes! We have utilized a code freeze process since 2017. Overall, we have successfully prevented costly mistakes with minimal agony or disruption to roadmaps. Every year, we iterate on this process to improve what works and eliminate what doesn't.
The most important effects of doing this are:
A stable platform on which our partners can rely
A quiet period for the engineering on-call team
The ability for Buttonians to spend time with the important people in their lives
Do you value fast-moving, high-trust engineering environments that operate at a meaningful scale in a distributed cloud environment? View our open roles and join the Button team today!