Zero touch networks is a term used to describe autonomous networks that can heal and adjust themselves based on signals in the data they collect and analyze across all network activity. The term “zero touch” refers to automation so highly developed that it monitors networks and services and acts on faults with minimal (if any) human intervention, spanning early detection of emerging problems, autonomous learning, autonomous remediation, decision making, and support for various optimization objectives. Zero touch networks are built on advanced machine learning that not only identifies anomalies, but also enables autonomous remediation through robust correlations and root cause analysis.
According to Gartner, monitoring systems perform three processes: Observe; Engage; Act. Technologies that deliver these capabilities end-to-end are referred to as zero touch. For communications service providers (CSPs) and other verticals running extremely complex systems, these fully autonomous monitoring technologies are a key concept on the road to digital transformation. End-to-end network and service monitoring and automation are considered essential if telcos are to realize the full potential of their networks, which are becoming increasingly complex and will continue to do so going forward.
The road to zero touch
While zero touch technologies are still in their infancy, as network monitoring and alerting platforms mature there is a growing expectation that they will progress from anomaly detection to full remediation without a human in the loop. Over the last five years, telecom network monitoring has evolved to the point that autonomous remediation (sometimes referred to as “the action phase”) is likely to become a dominant feature for leading CSPs.
But to get there, robust machine learning capabilities are key. AI, and specifically unsupervised machine learning, enables the transformation of traditional network and service operations into automated, intelligent operations through three crucial steps: anomaly detection, correlations and root cause analysis, and, finally, remediation.
1. Anomaly detection
In the first stage, ML enables real-time monitoring of 100% of the network data from connections, devices, radio networks, current and legacy core networks, services, transport, IT operations and any other source. This is already a big leap forward from many existing CSP monitoring solutions that still monitor data in silos, preventing CSPs from obtaining the single view of the network that is essential for identifying glitches spanning multiple capabilities, domains and environments.
In addition to a holistic view of the network, leading monitoring platforms feature fully autonomous baselining that also accounts for different seasonalities and constantly and optimally adapts to change. By monitoring the full scope of data using adaptable algorithms that take seasonality, trends and other behavioral variabilities into account, anomalies are detected faster and false alarms are reduced to a minimum.
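To make the idea of seasonality-aware baselining concrete, here is a minimal sketch of the approach described above: it learns a separate normal range for each hour of the day, so evening traffic peaks don't trigger false alarms and quiet-hour anomalies aren't masked. The KPI values, hour buckets, and 3-sigma rule are illustrative assumptions, not any vendor's actual algorithm.

```python
from statistics import mean, stdev

def seasonal_baseline(history):
    """Build a per-hour-of-day baseline (mean, std) from (hour, value) samples."""
    buckets = {}
    for hour, value in history:
        buckets.setdefault(hour % 24, []).append(value)
    return {h: (mean(vs), stdev(vs)) for h, vs in buckets.items() if len(vs) > 1}

def is_anomaly(baseline, hour, value, k=3.0):
    """Flag a sample that deviates more than k standard deviations
    from the learned normal range for that hour of day."""
    mu, sigma = baseline[hour % 24]
    return abs(value - mu) > k * max(sigma, 1e-9)

# Simulated KPI: traffic steps up in the evening, so a single static
# threshold would either miss daytime issues or fire nightly false alarms.
history = [(h, 100 + 50 * (h >= 18) + (d % 3)) for d in range(14) for h in range(24)]
baseline = seasonal_baseline(history)
print(is_anomaly(baseline, 3, 102))   # normal night-time level → False
print(is_anomaly(baseline, 3, 160))   # evening-level traffic at 3 a.m. → True
```

A production system would of course learn weekly and holiday seasonality as well, and adapt the baseline continuously rather than from a fixed history window.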
2. Correlations and root cause analysis
Machine learning based solutions correlate billions of metrics, related events and glitches across multi-technology (3G/4G/5G) and multi-vendor networks. These correlations provide the full context of what is happening, enabling teams to swiftly get to the root cause of every issue for the fastest possible remediation. Correlation analysis across multiple architectural layers is a must if CSPs are to effectively determine the probable cause of acute problems such as outages, as well as service degradation and slow leaks.
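The cross-layer correlation step can be sketched in a few lines: cluster anomalies that occur close together in time, then use topology knowledge to pick the probable root cause. The event names, layer numbering, and "lowest layer wins" heuristic below are simplifying assumptions for illustration; real systems correlate across far richer dependency graphs.

```python
# Hypothetical anomaly events: (timestamp_seconds, layer, description).
# Lower layer numbers sit closer to the physical network, so within a
# correlated burst the lowest-layer anomaly is treated as the probable cause.
events = [
    (100, 3, "billing-api latency"),
    (102, 2, "core-gw packet loss"),
    (101, 1, "fiber-link-7 errors"),
    (500, 3, "app-server restart"),
]

def correlate(events, window=30):
    """Cluster anomalies whose timestamps fall within `window` seconds
    of the previous event, approximating 'the same incident'."""
    clusters, current = [], []
    for ev in sorted(events):
        if current and ev[0] - current[-1][0] > window:
            clusters.append(current)
            current = []
        current.append(ev)
    if current:
        clusters.append(current)
    return clusters

for cluster in correlate(events):
    root = min(cluster, key=lambda ev: ev[1])  # lowest layer = probable root cause
    print(f"{len(cluster)} correlated anomalies; probable root cause: {root[2]}")
```

Here the billing-API and gateway symptoms are grouped with the fiber-link errors into one incident, while the later server restart stands alone, which is exactly the context teams need to avoid chasing downstream symptoms.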
3. Remediation
By autonomously pinpointing network anomalies and mapping the relations between them, ML-based monitoring paves the way for autonomous remediation. Today, these automated, closed-loop processes can be observed in low-level tasks, such as “bounce the server” or “open a ticket” scripts, which still require a human in the loop. The technological roadmap, however, leads towards automation rule mapping and a fully automated ML remediation engine. In that scenario, the ML-based system goes through phases 1 and 2 (anomaly detection, then correlation and root cause analysis), recommends an action based on previous incidents, executes it through the remediation engine, and fine-tunes its operations through a closed feedback loop, steadily improving its reactions.
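The closed feedback loop just described can be sketched as follows: the engine recommends the action with the best historical success rate for a given root cause, executes it, and folds the outcome back into its history so future recommendations improve. The playbook contents, action names, and success-rate scoring are illustrative assumptions only.

```python
# Hypothetical remediation playbook learned from past incidents:
# root cause -> {action: [successes, failures]}.
playbook = {
    "core-gw packet loss": {"restart gateway": [3, 1], "reroute traffic": [5, 0]},
}

def recommend(root_cause):
    """Pick the action with the best historical success rate."""
    actions = playbook[root_cause]
    return max(actions, key=lambda a: actions[a][0] / sum(actions[a]))

def remediate(root_cause, execute):
    """Phase 3: act on the root cause, then close the feedback loop
    by recording whether the action actually worked."""
    action = recommend(root_cause)
    success = execute(action)           # e.g. a call to an orchestration API
    playbook[root_cause][action][0 if success else 1] += 1
    return action, success

# Simulated executor that always succeeds, standing in for real automation.
action, ok = remediate("core-gw packet loss", execute=lambda a: True)
print(action, ok)
```

Because every outcome updates the playbook, the engine's recommendations shift over time toward the actions that actually resolve each class of incident, which is the essence of the closed-loop learning the roadmap points to.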
Only by providing these three ML-based monitoring tiers can AI-based solutions progress towards the zero touch vision. Today, the “action” phase of monitoring is still lacking in most solutions. Many current solutions are rule-based and rely on static threshold models best suited for data-at-rest, and therefore cannot deliver the benefits inherent in the latest technologies. In older solutions, correlation rules are limited and mostly applied to fault alarms, which are reactive, rather than to granular time-series data.
Of course, minimizing time to remediation requires anomaly detection and correlation analysis carried out on live data streams, which in turn calls for advanced streaming correlation techniques.
Since this is the direction the domain is heading, it’s a good idea to ask prospective vendors where they stand on automated actions. Autonomous remediation is predicted to become a dominant feature of leading platforms; in the meantime, it’s crucial to verify that a platform is ML-based and can effectively communicate granular data and insights both to IT stakeholders and to other IT systems that can act on them in the remediation phase.