Better Alerts, Notifications & Incident triggering with Monitoring, Functions and Search on Oracle Cloud Infrastructure

One of the most defining characteristics of a reliability/monitoring/production-support engineer is working on-call shifts, mitigating incidents, and responding to production issues.
Running an Ops, Production Support, or SRE team is a lot like running an Emergency Response Team.
MTTR (Mean Time to Repair/Resolve/Remediate)
Your team could be equipped with sophisticated real-time operations and incident-response tooling, but it wouldn’t matter if incident information is generated inefficiently upstream: issues will overwhelm the system regardless.
Successful emergency responses depend on the ability of the responder (human or machine) to extract maximum information from the incident reporter (human or machine).
What if incident reporters can report incidents better?
"For an SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves. Therefore, while we believe that software-based automation is superior to manual operation in most circumstances, better than either option is a higher-level system design requiring neither of them - an autonomous system. Or to put it another way, the value of automation comes from both what it does and its judicious application. We'll discuss both the value of automation and how our attitude has evolved over time"
Niall Murphy, John Looney, Michael Kacirek
What if people were trained to make better 911 calls?

The solution
Enrich the alarm notifications with OCI Functions that use OCI Search to fetch resource context, or even call the resource endpoint for detailed information about the resource. One could also configure the function to perform special tasks that aren’t limited to pulling resource information.

It’s all about Context
Let’s understand the alarm payload:

Title
What you configure as the alarm title. Remember to write a meaningful title.
Status
OK_TO_FIRING: Things are getting worse and need your attention
FIRING_TO_OK: Things are getting better and you can panic a little less
REPEAT: Whatever it is that you’re doing isn’t helping
RESET: You’re either wasting your time by checking because the resource doesn’t exist, or something is very wrong
Severity
How important this alert is to you.
Alarm Metadata
The part that provides you with resource information on why the alarm fired.
Alarms are boolean evaluations of metric queries, with a state variable that tracks transitions from good to bad or vice versa.
They contain critical pieces of information about:
1. The Metric Query and the evaluation
2. What was the result of the List Metrics Activity
3. Number of resources where this evaluation is passing
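To make the payload fields above concrete, here is a minimal Python sketch of reading an alarm notification body. The JSON field names (`title`, `type`, `severity`, `alarmMetaData`, `query`, `totalMetricsFiring`) are assumptions about the message format and should be checked against a real notification; `AlarmEvent`, `parse_alarm` and `needs_attention` are hypothetical helper names introduced only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AlarmEvent:
    title: str
    status: str      # OK_TO_FIRING, FIRING_TO_OK, REPEAT or RESET
    severity: str
    metadata: list = field(default_factory=list)


def parse_alarm(raw: str) -> AlarmEvent:
    """Parse an alarm notification body (JSON text) into an AlarmEvent.

    Assumption: the status is carried in a field named "type" and the
    per-query details in a list named "alarmMetaData".
    """
    doc = json.loads(raw)
    return AlarmEvent(
        title=doc.get("title", ""),
        status=doc.get("type", "UNKNOWN"),
        severity=doc.get("severity", "UNKNOWN"),
        metadata=doc.get("alarmMetaData", []),
    )


def needs_attention(event: AlarmEvent) -> bool:
    # A transition into firing, or a repeat while firing, calls for action.
    return event.status in {"OK_TO_FIRING", "REPEAT"}


# Illustrative payload; all values are made up.
sample = json.dumps({
    "title": "High CPU on web tier",
    "type": "OK_TO_FIRING",
    "severity": "CRITICAL",
    "alarmMetaData": [{"query": "CpuUtilization[1m].mean() > 80",
                       "totalMetricsFiring": 2}],
})
event = parse_alarm(sample)
print(event.title, event.status, needs_attention(event))
```

Keeping the parsing in one small function means the rest of the enrichment logic never touches raw JSON keys, so a change in the payload format only needs to be handled in one place.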
About Metric Queries
metric[interval]{dimensionname="dimensionvalue"}.groupingfunction.statistic
Further Reading: Metrics
1. Multiple resources of the same type are grouped under a namespace
2. Every namespace has multiple metrics, therefore a resource emits multiple metrics
n:n relationship between metrics and resources
1:1 relationship between metrics and metric queries
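As a rough illustration of how the pieces of the metric query template fit together, the following Python sketch assembles a query string from its parts. `build_metric_query` is a hypothetical helper, and the metric name, dimension name and function names in the usage line are illustrative values rather than a statement of what any particular namespace emits.

```python
from typing import Dict, Optional


def build_metric_query(
    metric: str,
    interval: str,
    dimensions: Optional[Dict[str, str]] = None,
    grouping: Optional[str] = None,
    statistic: str = "mean()",
) -> str:
    """Assemble metric[interval]{name="value"}.groupingfunction.statistic."""
    query = f"{metric}[{interval}]"
    if dimensions:
        # Each dimension filter is rendered as name="value".
        pairs = ", ".join(f'{name}="{value}"' for name, value in dimensions.items())
        query += "{" + pairs + "}"
    if grouping:
        query += f".{grouping}"
    return f"{query}.{statistic}"


# Illustrative values only.
print(build_metric_query(
    "CpuUtilization", "1m",
    dimensions={"resourceDisplayName": "web-1"},
    grouping="grouping()",
))
```

Because each metric query targets exactly one metric, a helper like this maps one set of arguments to one query, which mirrors the 1:1 relationship noted above.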
Enrichment Options
The incident response mode dictates which enrichment strategy to choose. A machine-readable OCID is the better choice if the responder is a machine and not a human. The responder could also be a combination of both, in which case you would want to mine as much information as possible.
How does one locate this resource with ease?
1) A resource display name or OCID?
2) Which compartment does it belong to (name or OCID)?
3) In which region is it present?
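Part of the answer is already encoded in the OCID itself. The sketch below splits an OCID into its dot-separated segments, assuming the documented layout `ocid1.<resource type>.<realm>.[region][.future use].<unique id>`. `OcidParts` and `parse_ocid` are hypothetical names, and the example identifier is made up. Display names and compartment names are not part of the OCID and would still need a separate lookup.

```python
from typing import NamedTuple


class OcidParts(NamedTuple):
    resource_type: str
    realm: str
    region: str      # empty string when the OCID carries no region
    unique_id: str


def parse_ocid(ocid: str) -> OcidParts:
    """Split an OCID into resource type, realm, region and unique id."""
    parts = ocid.split(".")
    # Expected segments: version, type, realm, region, [future use], unique id
    if len(parts) not in (5, 6) or not parts[0].startswith("ocid"):
        raise ValueError(f"not a recognizable OCID: {ocid!r}")
    return OcidParts(
        resource_type=parts[1],
        realm=parts[2],
        region=parts[3],
        unique_id=parts[-1],
    )


# Made-up identifier for illustration.
print(parse_ocid("ocid1.instance.oc1.phx.exampleuniqueid"))
```

The region segment may be a short code or a full region name depending on the resource, so a human-facing notification would likely want to translate it before display.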
Enrich with Supplementary Data
For Compute Instances, if I see the CPU Utilization shoot up, do I also need the Block Volume attachments to that instance and the number of TCP connections the upstream load balancer is receiving?
The Implementation
Better Incident reporting
Read Alarm Body ->
Extract Alarm metadata ->
Enrich Metadata with Resource Fetch ->
Publish Notification ->
Configure Notification to Publish to HTTP Endpoint/Slack/PagerDuty
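The flow above can be sketched in Python roughly as follows. This is a minimal outline, not the actual function: `lookup` and `publish` are hypothetical callables supplied by the caller (in practice `lookup` would probably wrap an OCI Search query and `publish` a Notifications topic), and the `alarmMetaData`, `dimensions` and `resourceId` keys are assumptions about the alarm payload.

```python
import json
from typing import Callable, Dict, List


def extract_resource_ids(alarm: dict) -> List[str]:
    """Collect unique resource OCIDs from the alarm metadata dimensions."""
    ids: List[str] = []
    for meta in alarm.get("alarmMetaData", []):
        for dims in meta.get("dimensions", []):
            rid = dims.get("resourceId")
            if rid and rid not in ids:
                ids.append(rid)
    return ids


def enrich(alarm: dict, lookup: Callable[[str], Dict[str, str]]) -> dict:
    """Return a copy of the alarm with a 'resources' list of looked-up context."""
    enriched = dict(alarm)
    enriched["resources"] = [
        {"id": rid, **lookup(rid)} for rid in extract_resource_ids(alarm)
    ]
    return enriched


def handle(raw_body: str,
           lookup: Callable[[str], Dict[str, str]],
           publish: Callable[[str], None]) -> dict:
    alarm = json.loads(raw_body)       # Read Alarm Body
    enriched = enrich(alarm, lookup)   # Extract metadata, enrich with resource fetch
    publish(json.dumps(enriched))      # Publish Notification
    return enriched
```

Passing the lookup and publish steps in as arguments keeps the enrichment logic testable without cloud credentials, and lets the same handler feed an HTTP endpoint, Slack or PagerDuty by swapping the `publish` callable.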
Better Email Readability
Read Alarm Body ->
Extract Alarm metadata ->
Enrich Metadata with Resource Fetch ->
Format Email ->
Send Email with SMTP credentials through Email Delivery
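The last two steps of the email flow might look like the following sketch, which uses only the Python standard library. The shape of the `enriched` dictionary (`title`, `type`, `severity`, `resources` with `displayName` and `id`) is an assumption carried over for illustration, and the SMTP host, port and credentials are placeholders that the caller must supply from their own Email Delivery configuration.

```python
import smtplib
from email.message import EmailMessage


def format_alarm_email(enriched: dict, sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text email that puts severity, title and resources up front."""
    msg = EmailMessage()
    severity = enriched.get("severity", "UNKNOWN")
    msg["Subject"] = f"[{severity}] {enriched.get('title', 'Alarm')}"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"Status: {enriched.get('type', 'UNKNOWN')}", "", "Resources:"]
    for res in enriched.get("resources", []):
        lines.append(f"  - {res.get('displayName', '?')} ({res.get('id', '?')})")
    msg.set_content("\n".join(lines))
    return msg


def send_email(msg: EmailMessage, host: str, port: int,
               username: str, password: str) -> None:
    """Send the message over SMTP with STARTTLS; all arguments are placeholders."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

Separating formatting from sending makes it easy to preview or unit-test the email body without an SMTP connection.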