As Gene Kim states in The DevOps Handbook, “How technology work is managed and performed predicts whether our organisations will win in the marketplace or even survive.”
In the 7+ years I’ve worked at YLD across a diverse portfolio of clients, I’ve seen firsthand the challenges and misconceptions around DevOps in client projects. A common mistake is thinking that having a dedicated DevOps team means full adoption and an end to infrastructure issues. This isn’t true.
In this article, I’ll clear up misconceptions, highlight common antipatterns, and explain how Platform Engineering and Site Reliability Engineering (SRE) are crucial to evolving DevOps and building a strong DevOps culture.
Understanding DevOps and its evolution
DevOps is a collaborative approach where development and operations teams work together to release software more quickly and efficiently. Traditionally, these teams worked separately. That siloed workflow led to long feedback loops and extended timelines across the development lifecycle. Developers would start new projects while operations addressed issues with previous code, resulting in unresolved technical debt and longer project completion times.
As DevOps has evolved, Platform Engineering and SRE have become vital in managing shared responsibilities throughout the development lifecycle, and that shared ownership is key to nurturing a successful DevOps culture.
By embracing end-to-end responsibilities and adopting a collaborative model, organisations have seen faster development and deployment, along with improved delivery standards, higher deployment frequency, and shorter mean time to recovery (MTTR). This approach helps ensure high-quality digital products reach users more quickly and reliably.
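To make two of these measures concrete, here is a minimal Python sketch of how deployment frequency and MTTR could be calculated. The function names and the sample timestamps are hypothetical, and it assumes MTTR is measured from detection to resolution; your own tooling may define the window differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def deployments_per_week(deploy_times, period_start, period_end):
    """Number of deployments inside the period, divided by its length in weeks."""
    weeks = (period_end - period_start) / timedelta(weeks=1)
    in_period = [t for t in deploy_times if period_start <= t < period_end]
    return len(in_period) / weeks


# Hypothetical sample data: two incidents and six deployments over two weeks.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),    # 30 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),  # 90 minutes
]
deploys = [datetime(2024, 1, day, 12, 0) for day in (2, 4, 8, 9, 11, 12)]

mttr = mean_time_to_recovery(incidents)
freq = deployments_per_week(deploys, datetime(2024, 1, 1), datetime(2024, 1, 15))

print(mttr)  # 1:00:00
print(freq)  # 3.0
```

Tracking both numbers over time, rather than as one-off snapshots, is what shows whether a change in team structure is actually helping.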
Common DevOps antipatterns
As DevOps gained popularity, many organisations rushed to adopt it, often incorrectly. Common issues include overemphasising tools, prioritising speed over quality, and unintentionally creating new silos within DevOps teams. Here are some common examples of counterproductive DevOps practices:
Merging development and operations teams
Simply merging development and operations teams and calling it a DevOps team isn’t enough. True DevOps transformation requires a cultural shift, new team structures, and updated management practices. The right setup depends on the organisation’s needs and maturity, so a flexible, adaptable structure is key.
Removing engineering operations entirely
As engineering roles evolve, some mistakenly think streamlining workflows in DevOps means getting rid of the operations team. However, expecting developers to handle everything, including infrastructure, testing, maintenance, and monitoring, is unrealistic. This overload can hurt productivity, leading to a “one step forward, two steps back” situation.
The need for a dedicated DevOps team
Creating a dedicated “DevOps team” often leads to confusion and mismatched expectations among team members. A dedicated DevOps team can become highly task-focused, mitigating issues at each step of the development lifecycle and then moving on to the next task. Similar to a waterfall approach, this method contradicts the collaborative culture vital for a high-functioning engineering environment.
Too focused on tools
Organisations often focus too much on tools, emphasising their features over how they integrate into the overall workflow and product strategy. While advanced tools and experts can be valuable, they can also cause fragmented DevOps pipelines and siloed processes if not managed well. This tool-centric approach can lead to inefficiencies and block collaboration and innovation, as tools might create obstacles rather than enable smooth workflows.
Solving DevOps antipatterns with Platform Engineering & Site Reliability Engineering (SRE)
Many development teams try to adopt practices from larger organisations before they’re ready or skilled enough. This often leads to obstacles and slows progress, contributing to DevOps antipatterns.
So, how do we tackle these issues? By having Platform Engineering and SRE teams work closely with both development and operations, we can implement practices that complement each other and encourage effective collaboration.
Platform Engineering builds a strong, scalable foundation for applications and services, creating business value at a lower cost. By enabling effective monitoring and analysis, Platform Engineering supports strong Observability practices. With a centralised infrastructure, development and operations teams are freed from the burden of managing infrastructure. This allows them to focus on innovation instead of routine operational tasks, which also helps prevent silos from forming.
A key benefit of Platform Engineering is the development of internal developer platforms (IDPs), which simplify code delivery and reduce cognitive load. This boosts productivity, enhances engineers’ autonomy, and promotes better collaboration with less friction. IDPs also standardise digital products, improving maintenance and scalability.
SRE, on the other hand, focuses on automating operational tasks to manage production systems effectively. SREs act as first responders during incidents and ensure reliability by proactively testing automated operations. Their goal is to automate any repetitive task, freeing up time for more valuable project work.
When considering the scalability of digital products, SRE plays a critical role in creating reliable software systems that can handle large-scale operations. By managing these responsibilities, SRE helps balance new feature rollouts with system reliability.
Together, Platform Engineering and SRE eliminate the rigid segregation of responsibilities that often leads to friction and silos within teams.
Rest easy with Observability & Monitoring
Many organisations think they’re doing Observability when they’re actually just Monitoring.
Monitoring alerts SREs to issues, but Observability helps them understand why those issues are happening. Correctly classifying incidents from P-0 to P-5 can save costs and improve resource management. The key elements of Observability (logs, traces, and metrics) are vital for accurately classifying incidents and making the most of resources.
- Logs: Useful for collecting and storing data about events within the system for troubleshooting and problem identification.
- Traces: These track the journey of requests through the system, aiding in understanding operations and diagnosing issues.
- Metrics: They measure various aspects of system performance, helping monitor and identify trends.
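As a rough illustration of how the three signals differ, the sketch below records all of them for a single request using only the Python standard library. The names (`handle_request`, `log_event`, `span`, the metric keys) are hypothetical, and a production system would normally use a dedicated telemetry library rather than in-memory lists.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: aggregated counts across all requests
spans = []           # traces: timed steps that share one trace ID per request
log_lines = []       # logs: discrete, structured event records


def log_event(trace_id, message, **fields):
    """Store one structured log line as JSON, tagged with the trace ID."""
    line = json.dumps({"trace_id": trace_id, "message": message, **fields})
    log_lines.append(line)
    return line


@contextmanager
def span(trace_id, name):
    """Record how long one step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(order_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    with span(trace_id, "load_order"):
        log_event(trace_id, "loading order", order_id=order_id)
    with span(trace_id, "charge_card"):
        if order_id < 0:
            metrics["errors_total"] += 1
            log_event(trace_id, "invalid order", order_id=order_id, level="error")
    return trace_id


handle_request(42)
failed_trace = handle_request(-1)

# The metric says that something failed; the trace ID ties together the logs
# and spans that explain why.
related_logs = [json.loads(l) for l in log_lines if json.loads(l)["trace_id"] == failed_trace]
```

The shared trace ID is the design choice that matters here: it lets an engineer move from an alert on a metric to the exact logs and timings for the affected request.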
Incidents will always occur in unexpected ways, sometimes affecting just a small part of the system and other times impacting it on a much larger scale. This emphasises the importance of having engineers on call, despite the associated costs.
Never compromise on this function, as SREs play a vital role in managing and addressing incidents. Using logs, traces, and metrics helps prevent unnecessary incident escalations during false positives. We all deserve peace of mind, which leads me to the next crucial point: minimising operational fatigue.
Minimising operational fatigue
The high stakes of service reliability and the risk of outages can put a lot of pressure on on-call engineers. That pressure can affect their well-being and lead to mistakes that threaten your services’ availability.
You’ll want your SREs to invest 50% of their time in project work, with the remaining half split between operational tasks and being on call. Balancing quality with on-call responsibilities means engineers manage incidents as they occur and then run retrospectives to learn from them.
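One simple way to keep an eye on that balance is to compare logged hours against the 50% target. The sketch below is only an illustration: the category names and the sample week are hypothetical, and it assumes time is tracked in hours per category.

```python
def time_shares(hours_by_category):
    """Return each category's share of the total logged hours."""
    total = sum(hours_by_category.values())
    return {name: hours / total for name, hours in hours_by_category.items()}


def project_time_ok(hours_by_category, minimum_share=0.5):
    """True when project work meets the minimum share of total time."""
    return time_shares(hours_by_category).get("project", 0.0) >= minimum_share


# Hypothetical week for one SRE, in hours.
week = {"project": 18, "operations": 12, "on_call": 10}

shares = time_shares(week)
print(shares["project"])      # 0.45
print(project_time_ok(week))  # False
```

A result like this one is a prompt for a conversation about workload rather than a hard failure.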
Shift-left mindset to improve efficiency and outcomes
As the demand for faster software delivery grows, the shift-left approach is crucial for boosting efficiency and cutting costs by addressing technical debt early. Identifying issues and defects before code reaches production yields significant savings. This approach also fosters continuous learning and process improvement across engineering teams.
Similarly, SREs reinforce a shift-left mindset by automating build testing and validation with service-level indicators (SLIs) and service-level objectives (SLOs). They also influence architectural decisions early to ensure resiliency and scalability.
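To show how an SLI and an SLO can feed an automated check, here is a minimal Python sketch. It assumes an availability SLI defined as successful requests divided by total requests, and it uses the common SRE notion of an error budget (the failure allowance implied by the SLO), which is my addition for illustration. The function names and numbers are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that succeeded."""
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Fraction of the error budget left: 1.0 is untouched, 0 or below is exhausted."""
    budget = 1.0 - slo  # failures the SLO allows
    spent = 1.0 - sli   # failures actually observed
    return 1.0 - spent / budget


def release_allowed(sli, slo):
    """Gate a rollout on whether any error budget remains."""
    return error_budget_remaining(sli, slo) > 0


# Hypothetical measurement window: 99,950 good requests out of 100,000.
slo = 0.999
sli = availability_sli(good_requests=99_950, total_requests=100_000)
remaining = error_budget_remaining(sli, slo)

# Roughly half of the budget is left, so the gate passes.
print(round(remaining, 3), release_allowed(sli, slo))
```

Wiring a check like this into the build is one way to balance new feature rollouts against reliability: when the budget runs out, the pipeline pauses releases until the service recovers.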
Both platform engineers and SREs adopt the shift-left approach to ensure high quality and reliability are built in from the start of product development, resulting in significantly improved uptime.
Here are key practices to follow to reduce errors in production and improve code quality:
- Continuous testing: Automates testing throughout the development process and uses simulations to check incomplete systems. By building continuous testing into the CI/CD pipeline, teams can catch and fix issues earlier, which improves software quality and cuts down on debugging time.
- Continuous deployment: Automates the entire process of provisioning and deploying new builds, making the deployment cycle more efficient and speeding up software updates. This works hand-in-hand with continuous testing to quickly and effectively test new code as it’s rolled out. It reduces manual effort, speeds up feedback, and ensures that updates reach users faster and more reliably.
- Continuous security: Makes security a built-in part of the development process by integrating it into the CI/CD pipeline. This includes automated security tests, vulnerability scans, and compliance checks at every stage. By focusing on security from the outset, teams can identify and address potential threats early, keeping their infrastructure and applications safe.
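The way these three practices chain together can be sketched as a pipeline that runs each stage in order and stops at the first failure, so a build that fails its tests or security scan never reaches deployment. This is an illustrative Python model, not a real CI/CD configuration; the stage functions are hypothetical stand-ins for your actual test, scan, and deploy steps.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first one that fails."""
    report: List[str] = []
    for name, check in stages:
        if not check():
            report.append(f"{name}: FAILED")
            return False, report
        report.append(f"{name}: ok")
    return True, report


# Hypothetical stand-ins for real pipeline steps.
def unit_tests() -> bool:
    return True


def vulnerability_scan() -> bool:
    return False  # pretend the scan found a problem


def deploy() -> bool:
    return True


ok, report = run_pipeline([
    ("test", unit_tests),
    ("security", vulnerability_scan),
    ("deploy", deploy),
])

print(ok)      # False
print(report)  # ['test: ok', 'security: FAILED']
```

Because the deploy stage is last, it only runs once every earlier check has passed, which is the essence of building quality and security in from the start.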
The key to successful DevOps
To fully embrace the DevOps model, it’s crucial to understand where team responsibilities overlap and when it’s beneficial to separate Platform Engineering (PE) and SRE functions from DevOps.
When hiring for PE and SRE roles in a DevOps setup, look for engineers with strong tech skills, good collaboration abilities, and a curious mindset. Cross-functional training is crucial for helping team members understand each other’s roles and challenges.
These practices build empathy and enhance collaboration, crucial for effective DevOps adoption and adaptability to changing needs.
Final reflections
Mastering DevOps is still a tough nut to crack for many organisations. As a CTO, you’ll sleep better at night knowing that solid DevOps practices are in place and your team has the right mindset and tools. This means better productivity and efficiency across the board.
For developers, a strong DevOps foundation not only improves morale but also makes the entire development process more transparent. It leads to real benefits in production, like lower costs and better observability. With the right DevOps approach, developers gain a deeper understanding of how their applications work, which boosts their confidence and overall productivity, ultimately delivering more value to end users.
To find out more about any aspect of effective DevOps adoption, contact us.