The author argues against the common practice of on-call rotations, particularly as implemented by many tech companies. They contend that being constantly tethered to work, even when "off," is detrimental to employee well-being and ultimately unproductive. Instead of reactive on-call systems interrupting rest and personal time, the author advocates for a proactive approach: building more robust and resilient systems that minimize failures, investing in thorough automated testing and observability, and fostering a culture of shared responsibility for system health. This shift, they believe, would lead to a healthier, more sustainable work environment and ultimately higher quality software.
In a blog post titled "Take this on-call rotation and shove it," author Scott Mitelli expresses deep dissatisfaction with on-call rotations as they are commonly implemented in the software industry. He argues that the practice, often presented as a responsibility shared among team members, in reality significantly degrades engineers' quality of life. In his view, the constant anticipation of disruptions, the intrusiveness of alerts, and the requirement to be perpetually available harm mental well-being, disrupt personal time, and ultimately lead to burnout.
Mitelli challenges the purported benefits of on-call rotations, in particular the notion that they foster a sense of ownership and a shared understanding of the system. He suggests that the pressure and anxiety of being on call often overshadow any learning opportunities, and that the fear of making mistakes under pressure can stifle experimentation and innovation. He further contends that the disruption to sleep, family life, and leisure creates chronic stress that is unsustainable in the long term.
The author proceeds to explore alternative approaches to managing system reliability and incident response. He advocates for a more proactive approach that prioritizes building robust and resilient systems from the outset, thereby minimizing the need for constant intervention. He also suggests investing in comprehensive automated monitoring and alerting systems that can effectively filter noise and escalate issues only when genuinely necessary. Furthermore, he champions the concept of dedicated site reliability engineering (SRE) teams as a more specialized and sustainable solution for managing complex systems, arguing that these specialized teams can develop the expertise and dedicated focus necessary to handle incidents effectively without imposing the burden on the entire development team.
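As a rough illustration of alerting that filters noise and escalates only when necessary, the sketch below pages a human only when an alert is critical and the failure has persisted across several consecutive health checks. It is not taken from the blog post; the function name `should_page`, the severity labels, and the default threshold of three checks are all assumptions chosen for the example.

```python
from typing import Sequence


def should_page(
    check_results: Sequence[bool],
    severity: str,
    min_consecutive_failures: int = 3,  # hypothetical threshold
) -> bool:
    """Decide whether an alert should wake a human.

    check_results holds health-check outcomes, oldest first,
    where True means the check passed and False means it failed.
    """
    # Non-critical alerts are never paged; they can wait for working hours.
    if severity != "critical":
        return False

    # Not enough history yet to tell a blip from a sustained failure.
    if len(check_results) < min_consecutive_failures:
        return False

    # Page only if every one of the most recent checks failed.
    recent = check_results[-min_consecutive_failures:]
    return all(not passed for passed in recent)
```

With these assumed defaults, a single failed check or a non-critical alert does not page anyone, while three consecutive critical failures do. Real systems would tune the threshold and severity levels to their own failure patterns.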
In essence, Mitelli's central argument is that the current on-call rotation model is flawed and outdated, and should be replaced by more sustainable approaches to system reliability and incident management. He concludes with a call to action, urging companies and engineers to critically evaluate the impact of on-call rotations and to explore alternative models that prioritize the well-being and professional development of their engineering teams. He envisions a future in which engineers can focus on building innovative and reliable systems without the constant dread of being summoned to address production issues at any moment.
Summary of Comments ( 19 )
https://news.ycombinator.com/item?id=43498213
Hacker News users largely agreed with the author's sentiment about the burden of on-call rotations, particularly poorly implemented ones. Several commenters shared their own horror stories of disruptive and stressful on-call experiences, emphasizing the importance of adequate compensation, proper tooling, and a respectful culture around on-call duties. Some suggested alternative approaches like follow-the-sun models or no on-call at all, advocating for better engineering practices to minimize outages. A few pushed back slightly, noting that some level of on-call is unavoidable in certain industries and that the author's situation seemed particularly egregious. The most compelling comments highlighted the negative impact poorly managed on-call has on mental health and work-life balance, with some arguing it can be a major factor in burnout and attrition.
The Hacker News post titled "Take this on-call rotation and shove it" drew 19 comments discussing various aspects of on-call work and the author's perspective. Several commenters agreed with the author's frustrations regarding poorly implemented on-call rotations, particularly the lack of proper compensation and the disruption to personal life.
One compelling comment thread focused on the distinction between being "on-call" and effectively working a second shift. Commenters argued that true on-call work should be compensated appropriately for the inconvenience and disruption, even if no incidents occur. However, if the on-call duty consistently requires active work and prevents personal time, it should be treated as regular work hours and compensated accordingly. This discussion highlighted the importance of clearly defined expectations and fair compensation for on-call responsibilities.
Several users shared their own experiences with dysfunctional on-call rotations, echoing the author's sentiments about the negative impact on well-being and work-life balance. These anecdotes served to validate the author's claims and illustrate the prevalence of this issue in the tech industry.
Another point of discussion revolved around the importance of building resilient systems that minimize the need for constant on-call intervention. Commenters suggested that prioritizing proactive measures, such as thorough testing, robust monitoring, and automated remediation, can significantly reduce the burden on on-call engineers. This preventative approach was presented as a more sustainable solution compared to relying on reactive responses to frequent incidents.
Some comments also touched upon the cultural aspect of on-call work, emphasizing the need for companies to foster a supportive environment that recognizes and values the contributions of on-call engineers. Suggestions included providing adequate training, clear escalation paths, and mechanisms for feedback and improvement.
While commenters did not agree with every point the author made, many shared his view of the challenges of on-call work and of the need for better practices within the industry. Overall, the discussion centered on how to manage on-call rotations effectively while protecting the well-being of engineers.