Site Reliability Engineering

Site Reliability Engineering (SRE)

Modern web applications need constant attention and improvement, even when they are serving little traffic. This includes monitoring performance, reviewing the application's architecture, and managing application defects.
Earlier, these activities were handled by front-line developers and system administrators, but the work often lagged behind or left gaps. Site Reliability Engineering addresses these gaps by improving availability, efficiency, capacity planning, and monitoring.

Why SRE over DevOps

DevOps is mainly about streamlining development and operations to build a robust product, whereas SRE is the practice of creating and maintaining a highly resilient service. DevOps focuses primarily on automation; SRE focuses on the stability and scalability of the production environment, as well as on observability.

What is SRE

SRE allows software engineers to own the day-to-day operations of an application in the production environment. It covers practices such as real-time monitoring and alerting for applications and services, and it uses automation and better development practices to improve the system's health and availability.
SRE unites development and operations by applying software engineering to systems work, with the goal of a highly productive system. It is the practice of creating and maintaining a highly resilient service, with a focus on the stability of the production environment, observability, and reliability at scale.

Key SRE Capabilities

Monitoring and reviewing application performance stats

Enabling diagnostics for key performance issues

Log indexing and pattern analytics

Isolating defects and feature requests

SRE Attributes

SLI

Service Level Indicator (A quantitative measure of a service's health, such as availability or latency.)

SLO

Service Level Objective (The target value or range that an SLI is expected to meet.)

SLA

Service Level Agreement (A business agreement with customers that states the level of service promised.)

Error Budget

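An error budget states, as a number, how much unreliability a service is allowed over a given period: 100% minus the availability target. For example, a 99.9% availability target leaves an error budget of 0.1%.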
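The relationship between SLI, SLO, and error budget can be made concrete with a little arithmetic. The following is a minimal Python sketch, not part of any product API: the function names, the request-based availability SLI, and the 30-day window are all assumptions chosen for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully (hypothetical definition)."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    budget = 1.0 - slo   # allowed failure fraction
    spent = 1.0 - sli    # observed failure fraction
    return 1.0 - spent / budget


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Budget remaining: {error_budget_remaining(sli, slo=0.999):.0%}")
    print(f"Allowed downtime: {allowed_downtime_minutes(0.999):.1f} min / 30 days")
```

With a 99.9% SLO, 500 failed requests out of one million consume about half of the budget, and the same SLO permits roughly 43 minutes of downtime in a 30-day window.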

SRE (SLI/SLO) Monitoring using Cavisson Monitoring Suite

Insights into application performance with SLO-focused metrics within an interactive dashboard.

CPS/CPM and errors for each service
Error, latency, and throughput stats, along with the time taken by integration point calls
Insight into infrastructure health with disk and system load stats to identify potential issues
Drill down to individual requests for detailed insight and RCA
Synthetic monitoring to check real-time availability of applications

SLO-driven real-time alerts

Trigger alerts with manual or dynamic thresholds for different severity states
Drill down from alerts for detailed insight and RCA
Configure alerts across different metrics, including percentile/rate and custom metrics
Geo map and health dashboard make it possible to track uptime and system/application health
Identify patterns and trends in behavior, and correlate them to assess the ongoing viability of SLOs
Business performance monitoring using a business KPI dashboard (Order/Revenue/Cart)
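To illustrate the difference between a manual and a dynamic threshold, here is a minimal Python sketch. It is an assumption-laden illustration, not how Cavisson Monitoring Suite computes thresholds: the baseline rule (mean plus a multiple of the standard deviation), the function name `classify`, and the severity labels are all hypothetical.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def classify(
    value: float,
    history: Sequence[float],
    manual_critical: Optional[float] = None,
    warn_k: float = 2.0,
    crit_k: float = 3.0,
) -> str:
    """Return a severity label for the latest metric value.

    A manual threshold, when given, is a fixed limit chosen by an operator.
    The dynamic thresholds are derived from recent history instead
    (assumed rule: mean + k * sample standard deviation).
    """
    if manual_critical is not None and value > manual_critical:
        return "critical"
    if len(history) < 2:
        return "unknown"  # not enough data to build a baseline
    baseline = mean(history)
    spread = stdev(history)
    if value > baseline + crit_k * spread:
        return "critical"
    if value > baseline + warn_k * spread:
        return "warning"
    return "normal"


if __name__ == "__main__":
    recent_latency_ms = [100, 102, 98, 101, 99]  # made-up sample data
    for latest in (101, 104, 110):
        print(latest, classify(latest, recent_latency_ms))
```

The dynamic rule adapts as the history changes, whereas the manual limit stays fixed until someone edits it; real systems often combine both.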

Key metrics to focus on


Health and Performance metrics
SLO violation duration graph
Session Duration
Business Transaction Response Time/Load
Error Rate
Batch Latency
Throughput
Counts of cache hits
Database Response Time
Real-time performance