Better Alerts, Notifications & Incident triggering with Monitoring, Functions and Search on Oracle Cloud Infrastructure

Running an Ops, Production Support, or SRE team is a lot like running an Emergency Response Team.

MTTR (Mean Time to Repair/Resolve/Remediate)

Successful emergency responses depend on the ability of the responder (human or machine) to extract the maximum information from the incident reporter (human or machine).

What if incident reporters can report incidents better?

"For an SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves. Therefore, while we believe that software-based automation is superior to manual operation in most circumstances, better than either option is a higher-level system design requiring neither of them - an autonomous system. Or to put it another way, the value of automation comes from both what it does and its judicious application. We'll discuss both the value of automation and how our attitude has evolved over time."
(Niall Murphy, John Looney, Michael Kacirek)

What if people were trained to make better 911 calls?

The solution

It’s all about Context

Example Alarms Payload

Title

Status

  • OK_TO_FIRING: Things are getting worse and need your attention
  • FIRING_TO_OK: Things are getting better and you can panic a little less
  • REPEAT: Whatever it is that you’re doing isn’t helping
  • RESET: Either you’re wasting your time checking because the resource doesn’t exist, or something is very wrong
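
As a rough illustration, the four statuses above could drive how a responder (human or machine) reacts. The action names (`page`, `inform`, `escalate`, `investigate`) and the function `describe_transition` are hypothetical, chosen only for this sketch.

```python
# Hypothetical mapping from alarm status to a suggested action and a short hint.
# The status names come from the list above; the actions are assumptions.
TRANSITION_HINTS = {
    "OK_TO_FIRING": ("page", "Things are getting worse and need your attention"),
    "FIRING_TO_OK": ("inform", "Things are getting better"),
    "REPEAT": ("escalate", "The current remediation is not helping"),
    "RESET": ("investigate", "The resource may not exist, or something is very wrong"),
}


def describe_transition(status):
    """Return an (action, hint) pair for a status, defaulting to 'investigate'."""
    return TRANSITION_HINTS.get(status, ("investigate", f"Unknown status {status!r}"))


if __name__ == "__main__":
    action, hint = describe_transition("REPEAT")
    print(f"{action}: {hint}")
```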

Severity

Alarm Metadata

Alarms are boolean evaluations of metric queries, with a state variable that tracks transitions from good to bad and back. The metadata tells you:
1. The Metric Query and its evaluation
2. The result of the List Metrics activity
3. The number of resources for which this evaluation is passing
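
A minimal sketch of pulling these three pieces out of an alarm message follows. The field names (`type`, `alarmMetaData`, `query`, `totalMetricsFiring`, `dimensions`, `resourceId`) and the sample payload are assumptions modeled on the general shape of an alarm message; check them against a real message before relying on them.

```python
import json

# Hypothetical sample payload; every value here is made up for illustration.
SAMPLE_ALARM = json.dumps({
    "title": "High CPU on web tier",
    "type": "OK_TO_FIRING",
    "severity": "CRITICAL",
    "alarmMetaData": [{
        "query": "CpuUtilization[1m].mean() > 80",
        "namespace": "oci_computeagent",
        "totalMetricsFiring": 2,
        "dimensions": [
            {"resourceId": "ocid1.instance.oc1.phx.exampleuniqueid1"},
            {"resourceId": "ocid1.instance.oc1.phx.exampleuniqueid2"},
        ],
    }],
})


def extract_alarm_metadata(raw):
    """Flatten an alarm message into one summary dict per metadata entry."""
    alarm = json.loads(raw)
    summaries = []
    for meta in alarm.get("alarmMetaData", []):
        dims = meta.get("dimensions", [])
        summaries.append({
            "title": alarm.get("title"),
            "status": alarm.get("type"),
            "severity": alarm.get("severity"),
            "query": meta.get("query"),
            "namespace": meta.get("namespace"),
            # Fall back to counting dimensions if the count field is absent.
            "firing_count": meta.get("totalMetricsFiring", len(dims)),
            "resource_ids": [d["resourceId"] for d in dims if "resourceId" in d],
        })
    return summaries


if __name__ == "__main__":
    for summary in extract_alarm_metadata(SAMPLE_ALARM):
        print(summary)
```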

About Metric Queries

metric[interval]{dimensionname="dimensionvalue"}.groupingfunction.statistic

Further Reading: Metrics

1. Multiple resources of the same type are grouped under a namespace
2. Every namespace has multiple metrics, therefore a resource emits multiple metrics

n:n relationship between metrics and resources
1:1 relationship between metrics & metric queries
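
To make the query template concrete, here is a small sketch that assembles a query string from its parts. The helper `build_metric_query` and the example metric and dimension names are hypothetical. It does no escaping or validation, so treat it as a way to read the template, not a query builder.

```python
def build_metric_query(metric, interval, dimensions=None, grouping=None, statistic="mean()"):
    """Assemble metric[interval]{name="value"}.groupingfunction.statistic."""
    dims = ""
    if dimensions:
        pairs = ", ".join(f'{name}="{value}"' for name, value in dimensions.items())
        dims = "{" + pairs + "}"
    parts = [f"{metric}[{interval}]{dims}"]
    if grouping:
        parts.append(grouping)  # the grouping function is optional in this sketch
    parts.append(statistic)
    return ".".join(parts)


if __name__ == "__main__":
    print(build_metric_query(
        "CpuUtilization", "1m",
        dimensions={"resourceDisplayName": "web-1"},
        grouping="grouping()",
        statistic="mean()",
    ))
```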

Enrichment Options

How does one locate this resource with ease?
--------------------------------------------
1) What is the resource's display name or OCID?
2) Which compartment does it belong to (name or OCID)?
3) In which region is it present?
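
Part of the answer can sometimes be read straight from the OCID, which has the form `ocid1.<resource type>.<realm>.[region][.future use].<unique id>`. The sketch below splits an OCID into those fields and leaves the display name and compartment to two lookup callables, which are hypothetical stand-ins for whatever API calls you use. The region field is empty for some resource types, and may hold a short key rather than a full region name.

```python
def parse_ocid(ocid):
    """Split an OCID into resource type, realm, region (may be None) and unique id."""
    parts = ocid.split(".")
    if len(parts) < 5 or parts[0] != "ocid1":
        raise ValueError(f"not a recognizable OCID: {ocid!r}")
    return {
        "resource_type": parts[1],
        "realm": parts[2],
        "region": parts[3] or None,  # empty for resources without a region
        "unique_id": parts[-1],
    }


def locate(ocid, lookup_display_name, lookup_compartment):
    """Answer the three questions above; the lookups are injected (hypothetical)."""
    info = parse_ocid(ocid)
    info["ocid"] = ocid
    info["display_name"] = lookup_display_name(ocid)
    info["compartment"] = lookup_compartment(ocid)
    return info


if __name__ == "__main__":
    # Stub lookups standing in for real API calls.
    print(locate(
        "ocid1.instance.oc1.phx.exampleuniqueid",
        lookup_display_name=lambda _ocid: "web-1",
        lookup_compartment=lambda _ocid: "prod",
    ))
```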

Enrich with Supplementary Data

For a Compute instance, if I see CPU utilization shoot up, do I also need the Block Volume attachments of that instance and the number of TCP connections the upstream load balancer is receiving?
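
One way to keep this kind of supplementary data manageable is a small registry of fetchers per resource type. The sketch below is only one possible design: the registry, the decorator, and both fetchers are hypothetical, and the fetchers return placeholders where real code would call the relevant APIs.

```python
# Registry of supplementary-data fetchers, keyed by resource type (hypothetical design).
SUPPLEMENTARY = {}


def supplements(resource_type):
    """Decorator that registers a fetcher for a resource type."""
    def register(fn):
        SUPPLEMENTARY.setdefault(resource_type, []).append(fn)
        return fn
    return register


@supplements("instance")
def block_volume_attachments(ocid):
    # Placeholder: real code would list the volumes attached to this instance.
    return {"block_volume_attachments": []}


@supplements("instance")
def load_balancer_connections(ocid):
    # Placeholder: real code would query the upstream load balancer's metrics.
    return {"lb_tcp_connections": None}


def gather_supplementary(resource_type, ocid):
    """Run every registered fetcher; one failing fetcher must not block the alert."""
    extra = {}
    for fetch in SUPPLEMENTARY.get(resource_type, []):
        try:
            extra.update(fetch(ocid))
        except Exception as exc:
            extra[fetch.__name__ + "_error"] = str(exc)
    return extra


if __name__ == "__main__":
    print(gather_supplementary("instance", "ocid1.instance.oc1.phx.exampleuniqueid"))
```

Swallowing fetcher errors is deliberate here: an enriched alert that is missing one field is still better than an alert that never arrives.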

The Implementation

Better Incident reporting

Read Alarm Body -> 
Extract Alarm metadata ->
Enrich Metadata with Resource Fetch ->
Publish Notification ->
Configure Notification to Publish to HTTP Endpoint/Slack/PagerDuty
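
The steps above can be sketched as a single function. This is a minimal illustration, not the actual Function code: `fetch_resource` and `publish` are hypothetical callables standing in for the resource lookup and the Notifications publish call, and the payload field names are assumptions.

```python
import json


def handle_alarm(raw_body, fetch_resource, publish):
    """Read the alarm body, extract metadata, enrich it, and publish the result."""
    alarm = json.loads(raw_body)  # Read Alarm Body

    # Extract Alarm metadata: collect the resource ids the alarm refers to.
    resource_ids = [
        dim["resourceId"]
        for meta in alarm.get("alarmMetaData", [])
        for dim in meta.get("dimensions", [])
        if "resourceId" in dim
    ]

    # Enrich Metadata with Resource Fetch.
    enriched = {
        "title": alarm.get("title"),
        "status": alarm.get("type"),
        "severity": alarm.get("severity"),
        "resources": [fetch_resource(rid) for rid in resource_ids],
    }

    # Publish Notification; the subscription decides where it is delivered.
    publish(json.dumps(enriched))
    return enriched


if __name__ == "__main__":
    body = json.dumps({
        "title": "High CPU",
        "type": "OK_TO_FIRING",
        "severity": "CRITICAL",
        "alarmMetaData": [{"dimensions": [{"resourceId": "ocid1.instance.oc1.phx.exampleuniqueid"}]}],
    })
    handle_alarm(body, lambda rid: {"ocid": rid, "display_name": "web-1"}, print)
```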

Better Email Readability

Read Alarm Body -> 
Extract Alarm metadata ->
Enrich Metadata with Resource Fetch ->
Format Email ->
Send Email with SMTP credentials through Email Delivery
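
The last two steps of the email flow might look like the sketch below, which uses only the Python standard library. The shape of the `enriched` dict, the subject format, and the SMTP host, port, and credentials passed to `send_email` are all assumptions to be replaced with your own values.

```python
import smtplib
from email.message import EmailMessage


def format_alarm_email(enriched, sender, recipient):
    """Build a readable plain-text email from an enriched alarm (assumed shape)."""
    msg = EmailMessage()
    msg["Subject"] = f"[{enriched['severity']}] {enriched['title']} ({enriched['status']})"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [
        f"Status:   {enriched['status']}",
        f"Severity: {enriched['severity']}",
        "",
        "Resources:",
    ]
    for res in enriched.get("resources", []):
        lines.append(f"  - {res.get('display_name', '?')} ({res.get('ocid', '?')})")
    msg.set_content("\n".join(lines))
    return msg


def send_email(msg, host, port, username, password):
    """Send over SMTP with STARTTLS; host, port and credentials are supplied by you."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)


if __name__ == "__main__":
    example = {
        "title": "High CPU",
        "status": "OK_TO_FIRING",
        "severity": "CRITICAL",
        "resources": [{"display_name": "web-1", "ocid": "ocid1.instance.oc1.phx.exampleuniqueid"}],
    }
    print(format_alarm_email(example, "alerts@example.com", "oncall@example.com"))
```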
