ML-based technologies can’t infer causation, a process which is still people-dependent. But robust solutions provide the full contextual envelope needed for swift resolution of incidents.

The idea of uncovering the root cause of a problem has a universal appeal. The systems we are dealing with on a daily basis are complex. When an issue occurs, we naturally opt to resolve the deeper underlying problem rather than settle for treating the symptoms or the downstream effects. 

It’s no wonder, then, that many ML-driven monitoring solutions, especially in the AIOps domain, now offer some level of automated root cause analysis (RCA). Automated root cause analysis is the missing link between autonomous detection and autonomous remediation. Bridging that gap holds, on the face of it, the promise of a new level of automation. Alas, in reality, uncovering a root cause is conceptually complex and still impossible without a human in the loop, even with today’s bleeding-edge AI technology.

The big ‘why’

Root cause analysis has to do with causation: it answers the question of why an incident happened, rather than what happened, where, and how much. However, the concept of causation has eluded science and philosophy for millennia. Over the years, both disciplines have devised different methodologies for establishing a causal connection between phenomena. Statistical analysis provides a set of tools that help in moving from correlation to causation. Counterfactual conditionals are another approach, which seeks to uncover causation by asking what would have been true under different circumstances. But the fact of the matter is that causation is always open to interpretation.

That is why one of the main pillars of the scientific method is the strict prohibition against confusing correlation with causation. “Correlation is not causation” is a truism taught to every first-year science student: never draw conclusions solely from a correlation between X and Y, which may be mere coincidence. Instead, the scientist must understand the basic mechanisms that connect the two. This understanding rests on discovering the causal relationships between the various factors of the model. This, of course, requires a model (or theory): only once there is a model can the data be connected with confidence. Data without a model is just noise, or in the words of Prof. Judea Pearl, the most prominent researcher working on causation theory today, “data are dumb”.

Correlations aplenty do not make causation

Correlations, however, are the only thing that AI, and ML specifically, can provide, at least at this moment in their evolution. Machine learning is an unparalleled tool for correlating over vast amounts of data, metrics and dimensions. But while machine learning methodologies make it possible to move with unprecedented efficiency from finite sample estimates to probability distributions, they are incapable of advancing to the next stage of causal logic: from probability to cause-and-effect relationships.

Correlations cannot answer “why” questions. Rather, they can show connections, influences and contributing factors. When faced with two incidents happening simultaneously, ML can point them out, but it is unequipped to say whether they are indeed related and, if so, which one caused the other and through what mechanism. Getting to the root cause of one or both incidents demands interpretation and an understanding of the real-world physical and temporal models they refer to, which machine learning algorithms simply do not possess.
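A small simulation can make this gap concrete. The sketch below is a toy model rather than a description of any real system: the metric names (latency, errors), the hidden load variable and the noise levels are all assumptions chosen for illustration. Two metrics are both driven by a common cause, so they correlate strongly in observational data, yet setting one of them from the outside leaves the other untouched.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0, intervene=False):
    """Return (latency, errors) samples from a hypothetical toy system.

    Both metrics depend on a hidden 'load' variable; neither causes the other.
    With intervene=True, latency is set independently of load, mimicking an
    experiment in which latency is controlled directly.
    """
    rng = random.Random(seed)
    latency, errors = [], []
    for _ in range(n):
        load = rng.gauss(0, 1)              # hidden common cause
        if intervene:
            lat = rng.gauss(0, 1)           # latency forced from outside
        else:
            lat = load + rng.gauss(0, 0.3)  # latency follows load
        err = load + rng.gauss(0, 0.3)      # errors follow load only
        latency.append(lat)
        errors.append(err)
    return latency, errors


observed = pearson(*simulate())
intervened = pearson(*simulate(intervene=True))
print(f"observational correlation:  {observed:.2f}")
print(f"interventional correlation: {intervened:.2f}")
```

The observational correlation comes out high and the interventional one close to zero. An algorithm that sees only the first dataset cannot tell this system apart from one in which latency really does drive errors; that distinction comes from the model, not from the data.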

Causal questions can never be answered from data alone. However, the current excitement about pouring more data and compute resources into machine learning models has blinded most researchers and many vendors to the fact that AI technology still cannot infer causality.

Adopting a contextual mindset

So despite decades of R&D dedicated to root cause analysis, at present causality still eludes AI and ML. Leading analysts like Gartner agree that root cause analysis is still a people-dependent process. In some cases, when the ML model is given contextualized information and knowledge of similar past events, intelligent systems can learn how those events affected the metrics and use that knowledge to forecast expected behavior. This approach can be useful, for example, when metrics are affected by events such as version releases, promotional campaigns, holidays or even weather.
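As a rough illustration of the event-aware idea, the sketch below learns a separate expected level for each kind of event from past observations and judges new values against the baseline that matches their context. It is a deliberately simple stand-in, not a description of how any particular product works: the event tags, the sample values, the function names and the 20 percent tolerance are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def expected_by_event(history):
    """Map each event tag to the mean metric value seen under that tag."""
    by_tag = defaultdict(list)
    for value, tag in history:
        by_tag[tag].append(value)
    return {tag: mean(values) for tag, values in by_tag.items()}


def is_anomalous(value, tag, baselines, tolerance=0.2):
    """Flag a value that deviates from its tag's baseline by more than
    `tolerance`, expressed as a fraction of the baseline."""
    expected = baselines.get(tag, baselines["none"])  # fall back to no-event level
    return abs(value - expected) / expected > tolerance


# Hypothetical daily values of one metric, each labeled with the event in effect.
history = [
    (100, "none"), (104, "none"), (96, "none"),
    (150, "promotion"), (160, "promotion"),
    (60, "holiday"), (64, "holiday"),
]

baselines = expected_by_event(history)
print(baselines)
print(is_anomalous(155, "promotion", baselines))  # judged against promotion days
print(is_anomalous(155, "none", baselines))       # same value on an ordinary day
```

The same reading is normal during a promotion and anomalous on an ordinary day. The event labels are the context a human supplies; the algorithm only learns what usually accompanies them.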

In other cases, where previous causal context is lacking, AI-driven correlations can still give human-driven root cause analysis a valuable head start by presenting the context in which incidents occur. Context can be created by presenting contributing factors, surrounding events, and influencing conditions and states.

Correlation analysis can help pinpoint the underlying issue and consequently help reduce Time to Resolution (TTR). In one example from the telco industry, Anodot identified and correlated multiple performance degradations impacting customer experience across various apps. The quick detection of the anomalies, coupled with correlation analysis, led to an accelerated understanding of the root cause of the problem and a fast resolution that minimized impact on subscribers.

Creating this kind of powerful context depends on the robustness of the solution. Only by analyzing 100 percent of the business’s data, and correlating across the full range of metrics and dimensions, can related incidents and influencing factors be exposed. Then, the causal relations between them can be quickly hypothesized and tested, leading to a swift resolution not only of the symptom, but of the actual root cause. 
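One simple way to surface related incidents is to group anomalies whose time windows overlap, so that an engineer sees them together as a single candidate incident. The sketch below shows only that temporal grouping, under stated assumptions: the Anomaly record, the metric names and the timestamps are hypothetical, and a real system would also weigh shared dimensions and metric similarity.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    metric: str
    start: int  # any comparable time unit, e.g. epoch seconds
    end: int


def group_by_overlap(anomalies, slack=0):
    """Group anomalies whose time windows overlap, allowing `slack` units of gap."""
    groups = []
    current = []
    current_end = None
    for anomaly in sorted(anomalies, key=lambda a: a.start):
        if current and anomaly.start <= current_end + slack:
            # Starts before the running group ends: same candidate incident.
            current.append(anomaly)
            current_end = max(current_end, anomaly.end)
        else:
            if current:
                groups.append(current)
            current = [anomaly]
            current_end = anomaly.end
    if current:
        groups.append(current)
    return groups


# Hypothetical anomalies detected on four different metrics.
anomalies = [
    Anomaly("app_a.latency", 100, 160),
    Anomaly("app_b.error_rate", 130, 190),
    Anomaly("checkout.revenue", 185, 240),
    Anomaly("cdn.cache_hit_ratio", 900, 930),
]

for group in group_by_overlap(anomalies):
    print([a.metric for a in group])
```

The grouping says only that the first three anomalies happened together. Which of them, if any, caused the others remains a hypothesis for the engineer to form and test.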

So while automated root cause analysis in itself cannot find causes efficiently, a human assisted by advanced correlation and RCA tools can. This synergy allows engineers to quickly find relevant information in the vast sea of data, to postulate hypotheses and have machine learning algorithms analyze the evidence for them. This “Cyborg Approach”, the powerful combination of man and machine, provides an unparalleled advantage.

Written by Anodot

