Observability-Debt-Shielded AIOps for Regulated Finance and Energy Systems: Telemetry-Parity Incident Triage and SLO Risk Scoring
Author(s): Amol Diwakar Agade, Samta Balpande
Publication #: 2602023
Date of Publication: 09.05.2021
Country: United States
Pages: 1-12
Published In: Volume 7 Issue 3 May-2021
DOI: https://doi.org/10.62970/IJIRCT.v7.i3.2602023
Abstract
Modern banking organizations run hybrid systems that combine microservice architectures, mainframes, batch and streaming pipelines, and many third-party SaaS services across on-premises and hybrid cloud environments. In regulated industries such as finance, energy, and nuclear, site reliability teams increasingly use AIOps to lower mean time to detect and mean time to restore. However, AIOps is commonly applied to telemetry whose coverage and data quality are uneven because of vendor limitations, legacy constraints, cost controls, and risk-based security limits. When telemetry (metrics, logs, audit records, and event traces) is uneven, service level objective (SLO) risk models and incident triage can easily become biased: services that emit richer telemetry receive more attention, while services with poor-quality data go unnoticed until they cause major failures. This paper introduces Observability-Debt-Shielded AIOps, which measures observability debt as quantifiable state variables. It defines telemetry-parity goals for routing and managing incidents through the ticketing workflows of IT Service Management (ITSM) tools, and it computes uncertainty-calibrated SLO burn risk and incident severity scores when healthy data is unavailable. The paper also provides a reference architecture that works with DevOps pipelines and change-control policies. It describes an algorithm for the Observability Debt Index (ODI) and methods for combining logs, metrics, and traces while accounting for missing values. It further includes policy-as-code controls that link reliability automation to the ongoing governance needs of banking and other regulated industries.
The result is a practical AIOps design that is statistically valid and ready for use in regulated industries.
Keywords: AIOps, observability debt, telemetry data quality, Site Reliability Engineering (SRE), service level objectives (SLO), incident triage, incident severity scoring, IT service management (ITSM), hybrid cloud banking systems, policy as code.