What is a Site Reliability Engineer (SRE)?

The great splintering of engineering types.

Jan 17, 2023

Hey Everyone,

I recently started a Talent Collective for software engineering and A.I. jobs. You can see the jobs here, and you can join the collective here. I’m noticing a lot more SRE-type jobs; SRE stands for site reliability engineer.

As the Modern Data Stack (MDS) evolves, new roles are increasingly being defined in data science and software engineering. How the SRE, Platform Engineering and DevOps roles are maturing is fairly interesting.

According to AWS:

What is site reliability engineering?
Site reliability engineering (SRE) is the practice of using software tools to automate IT infrastructure tasks such as system management and application monitoring. Organizations use SRE to ensure their software applications remain reliable amidst frequent updates from development teams. SRE especially improves the reliability of scalable software systems because managing a large system using software is more sustainable than manually managing hundreds of machines.

The SRE role is common in large enterprises, but in a data-first world, smaller firms increasingly need it too.

As data grows and travels across clouds and to the edge and back, more experts are saying that organizations need an SRE-like role for data.

DevOps gained popularity as a way to combat siloed workflows, poor collaboration and a lack of visibility across the software development lifecycle. Now, in 2023, demand is growing for SREs.

DevOps teams don’t necessarily have someone specifically dedicated to developing systems that increase site reliability and performance.

That’s where a site reliability engineer (SRE) comes into the picture.

Site reliability engineering was originally developed at Google. In the words of Ben Treynor Sloss, SRE is “what happens when you ask a software engineer to design an operations function.”

So what does that actually look like in terms of roles?

SRE is kind of like a more proactive form of quality assurance (QA). Site reliability engineers are dedicated full-time to creating software that improves the reliability of systems in production. Their work typically also includes:

Fixing issues
Responding to incidents
Usually taking on-call responsibilities

Aside from its growing role today, SRE’s biggest claim to fame might be the four golden signals of monitoring:

Latency
Traffic
Errors
Saturation
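
To make the four signals a little more concrete, here is a minimal Python sketch that summarizes them for a single observation window. Everything in it is an assumption made for illustration: the `Request` record shape, the `golden_signals` function name, the choice of a nearest-rank 95th percentile for latency, counting 5xx-style status codes as errors, and measuring saturation as peak in-flight requests divided by an assumed capacity. Real monitoring stacks define these differently, so treat it as a sketch rather than how any particular tool works.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape, for illustration)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP-style status code


def golden_signals(requests: Sequence[Request], window_s: float,
                   in_flight_peak: int, capacity: int) -> dict:
    """Summarize latency, traffic, errors and saturation for one window."""
    if window_s <= 0 or capacity <= 0:
        raise ValueError("window_s and capacity must be positive")

    total = len(requests)

    # Latency: nearest-rank 95th percentile of request durations.
    durations = sorted(r.duration_ms for r in requests)
    p95: Optional[float] = None
    if durations:
        rank = max(1, math.ceil(0.95 * total))
        p95 = durations[rank - 1]

    # Errors: share of requests with a 5xx-style status (an assumption).
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_s,
        "error_rate": errors / total if total else 0.0,
        # Saturation: how "full" the service was, relative to assumed capacity.
        "saturation": in_flight_peak / capacity,
    }


if __name__ == "__main__":
    sample = [Request(10.0, 200), Request(20.0, 200),
              Request(30.0, 500), Request(40.0, 200)]
    print(golden_signals(sample, window_s=2.0, in_flight_peak=5, capacity=10))
```

In practice an SRE would rarely compute these by hand; the point is only that each signal answers a different question: how slow, how busy, how broken, and how full.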

Some argue that the data reliability engineer, often called the site reliability engineer (SRE) for data or the database reliability engineer (DBRE), could be the missing role needed to create clarity in an ever more complicated stack. Roles are becoming increasingly specialized as the Modern Data Stack becomes more seamless and yet more complex at the same time.

Whatever your predictions for the Modern Data Stack in 2023, the importance of SREs and DBREs is on the rise.

Even though the site reliability engineer (SRE) role has become prevalent in recent years, many people—even in the software industry—don't know what it is or does. Site Reliability Engineering: How Google Runs Production Systems, written by a group of Google engineers, is considered the definitive book on site reliability engineering. Google vice president of engineering Ben Treynor Sloss coined the term back in the early 2000s.

Continue reading this post for free, courtesy of Michael Spencer.

Machine Economy Press
