How to Find, Calculate, and Determine Outliers in Your Data?
In our previous post, we discussed what data outliers are and why they are important to your business. Outliers can indicate both problems and opportunities, both of which need to be addressed quickly in order to either grow revenue or lessen losses. In this post, we’ll discuss how automated detection systems actually detect those outliers.
Finding outliers in time series data
Outliers can occur in any data set, from abnormally bright pixels in an image to an isolated spike in time series data, which is the format in which KPIs and other business metrics are reported. (The market often uses the terms ‘outliers’ and ‘anomalies’ interchangeably.) Our solution is designed and optimized for time series data, so not every type of outlier detection algorithm is used in the system.
Specifically, methods for calculating outliers that employ clustering (proximity methods based on k-means clustering, for example) are poorly suited to one-dimensional time series data. Since time series data often doesn’t have a Gaussian distribution, outlier detection methods which assume one (like many methods based on extreme value analysis) are also ruled out. Approaches for calculating outliers based on projection methods require visual mapping and manual identification of outliers, which won’t work at the scale of millions of metrics.
Additionally, there’s a common drawback when data scientists actually try to use these approaches on real data: it often takes several iterations of manual investigation and visualization to determine which algorithm to use and then how to tune it, since each algorithm has its own set of tunable parameters.
Thus, a fully automated outlier detection system would include a meta-algorithm to select the algorithm most appropriate for each metric and then tune it. This is in fact one of our key innovations: using advanced machine learning to give our customers their own cloud-based data scientist.
Using a meta-algorithm is the key: How to calculate and determine outliers in time series data
What the automated outlier detection system does for each time series:
- Classifies the metric and selects a model based on that classification:
Is it a “smooth time series” (stationary), or is the distribution multimodal, sparse, discrete, etc.? This step is critical for the performance of the outlier detection system because the distribution determines the model, which in turn determines which algorithms can be used for determining outliers.
- Initializes that model:
The system reads in new data points sequentially, updating and tuning the model in order to learn the normal behavior for that metric. Since a metric’s normal behavior may include seasonality, we use a proprietary algorithm, “Vivaldi” (based on autocorrelation with subsampling), to detect it. Vivaldi is extremely accurate while at the same time not computationally expensive, which allows Anodot to find outliers in a way that both eliminates false positives and performs well, even when analyzing millions of metrics.
- Updates and refines the model:
Once a data point is read, we determine if it’s an outlier. If not, that point is used to update and tune the model. If it is indeed an outlier, the system will label it as such.
If the outlier is persistent (a change to a new normal), Anodot will update the model to take into account the anomalous behavior. It starts by giving the new outlier a much smaller weight compared to a normal data point, and increases it gradually the longer it persists, until it has equal weight with non-anomalous data. This adaptability allows the system to adjust to permanent, substantial changes in the normal behavior of a metric while at the same time alerting users to the change at the moment it happens. When finding outliers in real time at the scale of millions of metrics, manual re-selection and re-tuning of a model is simply impractical.
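The gradual re-weighting of persistent outliers described above can be illustrated with a minimal Python sketch. This is not Anodot's implementation: the model here is a simple exponentially weighted mean and variance, and the class name `AdaptiveBaseline`, its parameters (`alpha`, `threshold`, `ramp`, `warmup`) and the linear weight ramp are all assumptions chosen for illustration.

```python
import math


class AdaptiveBaseline:
    """Illustrative online baseline; every name and default here is hypothetical."""

    def __init__(self, alpha=0.1, threshold=3.0, ramp=10, warmup=10):
        self.alpha = alpha          # learning rate applied to a normal point
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.ramp = ramp            # consecutive outliers needed to reach full weight
        self.warmup = warmup        # points consumed before anything is flagged
        self.mean = None
        self.var = 0.0
        self.seen = 0
        self.streak = 0             # length of the current run of outliers

    def update(self, x):
        """Consume one data point and return True if it is labeled an outlier."""
        self.seen += 1
        if self.mean is None:       # the first point only initializes the model
            self.mean = float(x)
            return False

        delta = x - self.mean
        std = math.sqrt(self.var)
        is_outlier = self.seen > self.warmup and abs(delta) > self.threshold * std

        if is_outlier:
            self.streak += 1
            # Assumed linear ramp: a new outlier gets 1/ramp of the normal
            # weight, growing to full weight once it has persisted.
            weight = min(1.0, self.streak / self.ramp)
        else:
            self.streak = 0
            weight = 1.0

        # Exponentially weighted mean and variance, scaled by the weight.
        rate = self.alpha * weight
        self.mean += rate * delta
        self.var = (1.0 - rate) * (self.var + rate * delta * delta)
        return is_outlier


# A level shift from about 10 to about 20: flagged at first, then absorbed.
series = [10 + 0.5 * (i % 5 - 2) for i in range(50)]
series += [20 + 0.5 * (i % 5 - 2) for i in range(60)]
model = AdaptiveBaseline()
flags = [model.update(x) for x in series]
```

In this toy model the first points after the shift are labeled as outliers and barely move the baseline. As the shift persists, their weight grows and the variance widens, so the model settles on the new level without manual re-tuning.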
This top-level meta-algorithm is necessary for two important reasons: for the initial model selection and for the continuous model re-selection and re-tuning as the business realities underlying the metrics change. As mentioned above, at the scale of millions of metrics, manual model selection, re-selection and tuning is impossible. Far more time and money would be spent on managing such a manual system than would be spent investigating and acting on the insights generated by it.
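To make the idea of automated model selection concrete, here is a small Python sketch of a classifier that labels a metric and maps the label to a model family. The rules, thresholds (`sparse_share`, `max_levels`), function names and model names are all hypothetical placeholders, not the classification Anodot actually performs. The sketch also does not test for stationarity or multimodality, which a real system would need to do.

```python
def classify_metric(values, sparse_share=0.5, max_levels=10):
    """Return a coarse label for a metric; the rules are illustrative only."""
    if not values:
        raise ValueError("need at least one data point")
    zero_share = sum(1 for v in values if v == 0) / len(values)
    if zero_share >= sparse_share:
        return "sparse"      # mostly zeros, with occasional activity
    if len(set(values)) <= max_levels:
        return "discrete"    # only a handful of distinct levels
    return "smooth"          # default: a continuous-valued series


# Placeholder model names, keyed by the label returned above.
MODEL_FOR_CLASS = {
    "sparse": "event-rate model",
    "discrete": "state-frequency model",
    "smooth": "mean-and-variance baseline",
}


def select_model(values):
    """Pick a model family for a metric from its classification."""
    return MODEL_FOR_CLASS[classify_metric(values)]
```

Running a step like this once per metric, and again whenever a metric's behavior changes, is what removes the manual selection work at scale.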
Intelligent outlier monitoring for quick insights
That’s the end goal: actionable insights. Not false positives, not alert storms, not man-hours spent squinting at graphs and plots. The layers of machine learning, statistics and software that power Anodot convert your raw data into a cohesive, concise picture of the problem, which can then be used to solve that problem.
More intelligent monitoring means quicker outlier identification. Less time impacted by an outlier can mean less lost revenue and fewer lost customers. It’s guaranteed that outliers will occur in your data, but with Anodot’s detection system, you can act before those outliers become problems.