Evaluating the Maturity of Your Analytics System

I’m a big fan of maturity models. They help teams clearly articulate their vision and define a path forward. You can tie the product roadmap and projects to the model and justify budgets needed to reach the desired maturity level.

Gartner offers the following “Analytics Spectrum” that describes how analytics platforms evolve in two main dimensions:

Sophistication level of the analytics
Amount of human intervention required in the decision-making process towards a desired action.

The most common form of analytics is descriptive, with few that offer some level of diagnostics. Predictive Analytics are not yet mature, but we clearly see an increasing demand for a better prediction model and for longer durations. As for prescriptive analytics — the icing on the cake — there are very few organizations that have reached that level of maturity and are applying it in very specific use cases.

Gartner Analytics Maturity Model

As you can imagine, at the highest maturity level, an analytics platform provides insights about what is going to happen in the future and takes automated actions to react to those predictions. For example, an e-commerce web site can increase the price of a specific product if the demand is expected to increase significantly. Additionally, if the system detects a price increase by competitors, they can send a marketing campaign to customers that are interested in that product to head off declining sales, or they can scale up or down based on changes in traffic volumes.

Taking the Gartner model into consideration, I have developed a new maturity model which takes a slightly different (but very much related) approach to help you evaluate the current state of your monitoring/analytics system and plan in which areas you want to invest. This model is to be used as a guide since each company will be at its own level of maturity for each of the monitoring system capabilities.

Moving down the left side of the table below, we see the Monitoring System Key Capabilities: Collect (business and infrastructure metrics), Detect, Alert, Triage, Remediate. The numbers from left to right are the different levels of maturity of each of these capabilities. And lastly, on the right, are the KPIs affected by each capability, that I explained in more detail in the first post of this series: TTD (Time to Detect), TTA (Time to Acknowledge), TTT (Time to Triage), TTR (Time to Recover), and SNR (Signal to Noise Ratio).

Monitoring System Key Capability	Maturity Level					Affected KPIs
Monitoring System Key Capability	1	2	3	4	5	Affected KPIs
Collect (Business Metrics)	Key metrics at Site/Company level	Key metrics at product line, geography level	Secondary level metrics at product line, geography, customer/partner	Key and Secondary metrics at page, OS and browser level	Fine grain dimensions per transaction	TTD, TTR
Collect (Infrastructure Metrics)	Key metrics for key components at Site level	Key metrics for key components at availability zone/data center level	Key metrics per component in the entire technology stack (database, network, storage, compute etc.)	Key metrics per instance of each component	Fine grain dimensions per component/instance	TTD, TTR
Detect	Human factor (using dashboards, customer input etc.)	Static Threshold	Basic statistical methods (week over week, month over month, standard deviation), ratios between different metrics	Anomaly detection based on machine learning	Dynamic anomaly detection based on machine learning with prediction	TTD
Alert	Human factor (using dashboards, customer input etc.)	Alert is triggered whenever detection happens on single metric	The system can suppress alerts using de-duping, snoozing, minimum duration	Alerts simulation, enriched alert	Correlated and grouped alerts to reduce noise level and support faster triaging	SNR, TTA
Triage	Ad Hoc (tribal knowledge)	Initial play book for key flows	Well defined play book with set of dashboards/scripts to help identify the root cause	Set of dynamic dashboards with drill down/through capabilities and to help identify the root cause	Auto Triaging based on advanced correlations	TTT
Remediate	Ad Hoc	Well defined standard operating procedure (SOP), manual restore	Suggested actions for remediation, manual restore	Partial auto-remediation (scale up/down, fail over, rollback, invoke business process)	Self-Healing	TTR

One thing to consider is that the “collect” capability refers to how much surface area is covered by the monitoring system. Due to the dynamic nature of the way we do business today, it’s kind of a moving target — new technologies are introduced, new services are being deployed, architecture change, and so on. Keep it in mind as you might want to prioritize and measure progress in the data coverage.

You can use the following spider diagram to visualize the current state vs. the desired state of the different dimensions. If you want to enter your own maturity levels and see a personalized diagram, let me know and I’ll send you an spreadsheet template to use (for free, of course).

The ideal monitoring solution is completely aware of ALL components and services in the ecosystem it is monitoring and can auto-remediate issues as soon as they are detected. In other words, it is a self-healing system.

There are some organizations that have partial auto-remediation (mainly around core infrastructure components) by leveraging automation tools integrated into the monitoring solution. Obviously, to get to that level of automation requires a high level of confidence in the quality of the detection and alerting system, meaning the alerts should be very accurate with low (near zero) false positives.

When you are looking to invest in a monitoring solution, you should consider what impact it will make on the overall maturity level. Most traditional analytics solutions may have good collectors (mainly for infrastructure metrics), but may fall short when it comes to accurate detection and alerting; the inevitable result, of course, is a flood of alerts. A recent survey revealed that the top 2 monitoring challenges organizations face are: 1) quickly remediating service disruptions and 2) reducing alert noise.

The most effective way to address those challenges is by applying machine learning-based anomaly detection that can accurately detect issues before they become crises, enabling the teams to quickly resolve them and prevent them from having a significant impact on the business.

Evaluating the Maturity of Your Analytics System

Written by Avi Avital

You'll believe it when you see it

Featured resources

Automated Internal Operations for Managed Service Providers

Multi Cloud Visibility & Reporting for Managed Service Providers

Cloud Unit Economics for Managed Service Providers