I’m a big fan of maturity models. They help teams clearly articulate their vision and define a path forward. You can tie the product roadmap and projects to the model and justify budgets needed to reach the desired maturity level.

Gartner offers the following “Analytics Spectrum” that describes how analytics platforms evolve in two main dimensions:

  1. Sophistication level of the analytics
  2. Amount of human intervention required in the decision-making process towards a desired action.

The most common form of analytics is descriptive, with a few organizations offering some level of diagnostics. Predictive analytics is not yet mature, but we clearly see increasing demand for better prediction models over longer time horizons. As for prescriptive analytics, the icing on the cake: very few organizations have reached that level of maturity, and those that have apply it only in very specific use cases.

Gartner Analytics Maturity Model

As you can imagine, at the highest maturity level, an analytics platform provides insights about what is going to happen in the future and takes automated actions to react to those predictions. For example, an e-commerce website can increase the price of a specific product if demand is expected to increase significantly. Additionally, if the system detects a price increase by competitors, it can send a marketing campaign to customers interested in that product to head off declining sales, or it can scale infrastructure up or down based on changes in traffic volume.
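To make this concrete, a prescriptive rule of this kind might look like the following minimal sketch. Every name, threshold, and action here is a hypothetical assumption for illustration, not part of any real pricing system:

```python
# Hypothetical prescriptive-analytics rule: adjust price and choose side
# actions based on a demand forecast and a competitor-price signal.

def choose_action(current_price: float,
                  forecast_demand_change: float,
                  competitor_raised_price: bool) -> tuple[float, list[str]]:
    """Return an adjusted price and a list of automated side actions."""
    price = current_price
    actions = []
    if forecast_demand_change > 0.20:           # demand expected to jump >20%
        price = round(current_price * 1.05, 2)  # raise the price by 5%
        actions.append("raise_price")
    if competitor_raised_price:
        # Head off declining sales with a targeted campaign.
        actions.append("send_marketing_campaign")
    return price, actions

price, actions = choose_action(100.0,
                               forecast_demand_change=0.30,
                               competitor_raised_price=True)
print(price, actions)  # 105.0 ['raise_price', 'send_marketing_campaign']
```

Real prescriptive systems replace these hard-coded thresholds with learned models, but the decision shape is the same: prediction in, automated action out.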

Taking the Gartner model into consideration, I have developed a new maturity model that takes a slightly different (but closely related) approach to help you evaluate the current state of your monitoring/analytics system and plan which areas to invest in. Use this model as a guide, since each company will be at its own level of maturity for each monitoring system capability.

The table below lists the Monitoring System Key Capabilities: Collect (business and infrastructure metrics), Detect, Alert, Triage, and Remediate. Each capability has five maturity levels, along with the KPIs it affects, which I explained in more detail in the first post of this series: TTD (Time to Detect), TTA (Time to Acknowledge), TTT (Time to Triage), TTR (Time to Recover), and SNR (Signal-to-Noise Ratio).

 

Collect (Business Metrics) — affected KPIs: TTD, TTR

  1. Key metrics at site/company level
  2. Key metrics at product line and geography level
  3. Secondary-level metrics at product line, geography, and customer/partner level
  4. Key and secondary metrics at page, OS, and browser level
  5. Fine-grained dimensions per transaction

Collect (Infrastructure Metrics) — affected KPIs: TTD, TTR

  1. Key metrics for key components at site level
  2. Key metrics for key components at availability zone/data center level
  3. Key metrics per component across the entire technology stack (database, network, storage, compute, etc.)
  4. Key metrics per instance of each component
  5. Fine-grained dimensions per component/instance

Detect — affected KPI: TTD

  1. Human factor (using dashboards, customer input, etc.)
  2. Static thresholds
  3. Basic statistical methods (week over week, month over month, standard deviation) and ratios between different metrics
  4. Anomaly detection based on machine learning
  5. Dynamic anomaly detection based on machine learning, with prediction

Alert — affected KPIs: SNR, TTA

  1. Human factor (using dashboards, customer input, etc.)
  2. An alert is triggered whenever detection happens on a single metric
  3. The system can suppress alerts using de-duping, snoozing, and minimum durations
  4. Alert simulation and enriched alerts
  5. Correlated and grouped alerts to reduce noise and support faster triaging

Triage — affected KPI: TTT

  1. Ad hoc (tribal knowledge)
  2. Initial playbook for key flows
  3. Well-defined playbook with a set of dashboards/scripts to help identify the root cause
  4. Set of dynamic dashboards with drill-down/drill-through capabilities to help identify the root cause
  5. Auto-triaging based on advanced correlations

Remediate — affected KPI: TTR

  1. Ad hoc
  2. Well-defined standard operating procedure (SOP), manual restore
  3. Suggested actions for remediation, manual restore
  4. Partial auto-remediation (scale up/down, fail over, rollback, invoke business process)
  5. Self-healing
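To make the Detect levels concrete, here is a minimal sketch contrasting a level-2 static threshold with a level-3 statistical method. The metric values are invented for illustration:

```python
import statistics

def static_threshold_alert(value, threshold):
    """Detect maturity level 2: fire whenever a single metric crosses a fixed line."""
    return value > threshold

def zscore_alert(history, value, num_std=3.0):
    """Detect maturity level 3: a basic statistical method that flags values
    more than num_std standard deviations away from the recent mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return value != mean
    return abs(value - mean) / std > num_std

# A metric that normally hovers around 100 spikes to 150:
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(static_threshold_alert(150, threshold=200))  # False: the fixed line was set too high
print(zscore_alert(history, 150))                  # True: the spike is far outside normal variation
```

The static threshold either misses anomalies (set too high, as here) or floods the team with alerts (set too low); the statistical method adapts to what is normal for the metric, which is one reason the higher Detect levels improve both TTD and SNR.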

 
One thing to consider is that the “collect” capability refers to how much surface area the monitoring system covers. Due to the dynamic nature of the way we do business today, that coverage is a moving target: new technologies are introduced, new services are deployed, architectures change, and so on. Keep this in mind, as you may want to prioritize and measure progress in data coverage.

You can use the following spider diagram to visualize the current state vs. the desired state of the different dimensions. If you want to enter your own maturity levels and see a personalized diagram, let me know and I’ll send you a spreadsheet template to use (for free, of course).

Analytics Maturity Spider Diagram

The ideal monitoring solution is completely aware of ALL components and services in the ecosystem it monitors and can auto-remediate issues as soon as they are detected. In other words, it is a self-healing system.

Some organizations have achieved partial auto-remediation (mainly around core infrastructure components) by leveraging automation tools integrated with the monitoring solution. Obviously, getting to that level of automation requires a high level of confidence in the quality of the detection and alerting system: alerts must be very accurate, with low (near-zero) false positives.
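A partial auto-remediation hook of this kind might look like the following sketch. The alert fields, runbook entries, and confidence floor are all illustrative assumptions, not a real automation tool’s API:

```python
# Hypothetical partial auto-remediation: an alert is mapped to a standard
# action only when it is well known AND detection confidence is high;
# everything else falls back to a human.

RUNBOOK = {                       # alert type -> automated action
    "high_cpu": "scale_up",
    "bad_deploy": "rollback",
    "primary_db_down": "fail_over",
}

def remediate(alert: dict, confidence_floor: float = 0.95) -> str:
    """Auto-remediate only high-confidence, well-known alerts; near-zero
    false positives are a precondition for this level of automation."""
    action = RUNBOOK.get(alert["type"])
    if action is None or alert["confidence"] < confidence_floor:
        return "page_on_call"     # fall back to manual triage
    return action

print(remediate({"type": "high_cpu", "confidence": 0.99}))  # scale_up
print(remediate({"type": "high_cpu", "confidence": 0.80}))  # page_on_call
```

The confidence gate is the key design choice: it limits automation to exactly the cases where the alerting system has earned trust, which is why detection quality has to mature before remediation can.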

When you are looking to invest in a monitoring solution, you should consider what impact it will have on your overall maturity level. Most traditional analytics solutions may have good collectors (mainly for infrastructure metrics), but may fall short when it comes to accurate detection and alerting; the inevitable result, of course, is a flood of alerts. A recent survey revealed that the top two monitoring challenges organizations face are: 1) quickly remediating service disruptions and 2) reducing alert noise.

The most effective way to address those challenges is by applying machine learning-based anomaly detection that can accurately detect issues before they become crises, enabling the teams to quickly resolve them and prevent them from having a significant impact on the business.


Written by Avi Avital

Avi has managed the technology and business operations of global organizations for more than a decade. As VP Customer Success, he leverages his experience building large-scale analytics and AI systems at PayPal and DHL, to lead Anodot's global CS team. Avi’s unique strategic and creative approach, coupled with his experience and passion for making a difference, help him deliver high value to customers, employees and businesses.
