In our previous post, we discussed what data outliers are and why they are important to your business. Outliers can indicate problems or opportunities, and either one needs to be addressed quickly in order to grow revenue or limit losses.
One of the problems with statistical outliers is that they can be difficult to detect within the context of time series data. In this post, we’ll discuss how automated detection systems can be tuned to identify outliers reliably across thousands to billions of metrics.
How to Identify Outliers?
Outliers can occur in any data set, from abnormally bright pixels in an image to an isolated spike in time series data, the format in which KPIs and other business metrics are reported. (The market often uses the terms ‘outliers’ and ‘anomalies’ interchangeably.) Our solution is designed and optimized for time series data, so not every type of algorithm for finding outliers in data is used in the system.
Specifically, this means that methods for identifying outliers which employ clustering (proximity methods based on k-means clustering, for example) cannot be used when dealing with one-dimensional time series data. Since time series data often doesn’t have a Gaussian distribution, outlier detection methods which assume one (like many methods based on extreme value analysis) are also ruled out.
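To illustrate the point about the Gaussian assumption, here is a small sketch using only the Python standard library. It applies the familiar "three standard deviations from the mean" rule to two synthetic series that contain no injected outliers: one Gaussian and one with a long right tail. The series, the seed, and the helper name `three_sigma_flag_rate` are assumptions made for this example; they do not come from any real metric or from our system.

```python
import random
import statistics


def three_sigma_flag_rate(values):
    """Fraction of points farther than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    flagged = sum(1 for v in values if abs(v - mean) > 3 * std)
    return flagged / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Two synthetic "metrics" with no injected outliers (illustrative data only).
gaussian_metric = [rng.gauss(100.0, 10.0) for _ in range(n)]
skewed_metric = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # long right tail

gaussian_rate = three_sigma_flag_rate(gaussian_metric)
skewed_rate = three_sigma_flag_rate(skewed_metric)

print(f"Gaussian metric flagged: {gaussian_rate:.2%}")
print(f"Skewed metric flagged:   {skewed_rate:.2%}")
```

On the skewed series the same rule flags a noticeably larger share of perfectly ordinary points, which at the scale of many metrics turns into a steady stream of false positives.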
Approaches for finding outliers based on projection methods require visual mapping and manual identification of outliers, which won’t work at the scale of millions of metrics.
Additionally, there’s a common drawback when data scientists actually try to use these approaches on real data: it often takes several rounds of manual investigation and visualization to determine which algorithm to use and then how to tune it, since each algorithm has its own set of tunable options.
Thus, a fully automated outlier detection system would include a meta-algorithm to select the algorithm most appropriate for each metric and then tune it. This is, in fact, one of our key innovations: using advanced machine learning to give our customers their own cloud-based data scientist.
How to Calculate and Determine Outliers in Time Series Data: Using a Meta-Algorithm Is the Key
What an automated system for identifying outliers does for each time series:
- Classifies the metric and selects a model based on that classification:
Is it a “smooth time series” (stationary), or is the distribution multimodal, sparse, discrete, etc.? This step is critical for the performance of the outlier detection system because the distribution determines the model, which in turn determines which algorithms can be used for determining outliers.
- Initializes that model:
Reads in new data points sequentially, updating and tuning the model in order to learn the normal behavior for that metric. Since a metric’s normal behavior may include seasonality, we use a proprietary algorithm, “Vivaldi” (based on auto-correlation with subsampling), to detect it. Vivaldi is extremely accurate without being computationally expensive, which allows Anodot to find outliers in a way that both eliminates false positives and performs well, even when analyzing millions of metrics.
- Updates and refines the model:
Once a data point is read, we determine if it’s an outlier. If not, that point is used to update and tune the model. If it is indeed an outlier, the system will label it as such.
If the outlier is persistent (a change to a new normal), Anodot will update the model to take into account the anomalous behavior. It starts by giving the new outlier a much smaller weight compared to a normal data point, and increases it gradually the longer it persists, until it has equal weight with non-anomalous data.
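A minimal sketch of this gradual re-weighting follows. It is not Anodot's actual implementation: the linear ramp, the exponentially weighted baseline, the fixed `tolerance`, and every name and default value (`anomaly_weight`, `ramp_length`, `initial_weight`, `learning_rate`) are assumptions chosen to keep the example short.

```python
def anomaly_weight(persisted_points, ramp_length=20, initial_weight=0.05):
    """Weight of an anomalous point relative to a normal point (1.0).

    Rises linearly from initial_weight to 1.0 as the anomaly persists.
    """
    progress = min(persisted_points / ramp_length, 1.0)
    return initial_weight + (1.0 - initial_weight) * progress


def update_baseline(baseline, value, weight, learning_rate=0.1):
    """Move an exponentially weighted baseline toward value, scaled by weight."""
    return baseline + learning_rate * weight * (value - baseline)


def track(values, baseline, tolerance=10.0):
    """Yield (value, is_outlier, baseline) while adapting to persistent shifts.

    A fixed tolerance around the baseline stands in for a real outlier test.
    """
    persisted = 0  # consecutive anomalous points seen so far
    for value in values:
        is_outlier = abs(value - baseline) > tolerance
        if is_outlier:
            weight = anomaly_weight(persisted)
            persisted += 1
        else:
            weight = 1.0
            persisted = 0
        baseline = update_baseline(baseline, value, weight)
        yield value, is_outlier, baseline


# Illustrative series: a stable level followed by a permanent shift.
series = [100.0] * 10 + [150.0] * 200
results = list(track(series, baseline=100.0))
print("first shifted point flagged:", results[10][1])
print("last point flagged:", results[-1][1])
print("final baseline:", round(results[-1][2], 2))
```

The first points after the shift are flagged and barely move the baseline. As the shift persists, each point counts for more, until the baseline settles at the new level and the points stop being flagged.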
This adaptability allows the system to adjust to permanent, substantial changes in the normal behavior of a metric while at the same time alerting users to the change at the moment it happens. When finding outliers in real time at the scale of millions of metrics, manual re-selection and re-tuning of a model is simply impractical.
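The seasonality step in the list above relies on Vivaldi, which is proprietary and is not reproduced here. As a rough stand-in, the sketch below uses plain textbook auto-correlation, without subsampling, to pick a season length from a short list of candidate lags. The function names, the candidate lags, and the `min_correlation` threshold are all assumptions for illustration.

```python
import math


def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0  # a flat series has no seasonal structure to measure
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def detect_season_length(values, candidate_lags, min_correlation=0.5):
    """Return the candidate lag with the strongest autocorrelation, or None."""
    best_lag, best_score = None, min_correlation
    for lag in candidate_lags:
        if lag >= len(values):
            continue  # not enough history to evaluate this lag
        score = autocorrelation(values, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Illustrative hourly metric with a daily cycle, two weeks long.
hourly = [10 + 5 * math.sin(2 * math.pi * h / 24) for h in range(24 * 14)]
season = detect_season_length(hourly, candidate_lags=[12, 24, 168])
print("detected season length (hours):", season)
```

Restricting the search to a few candidate lags (half-day, daily, weekly) is a design choice that keeps the cost per metric low. It assumes that business metrics mostly follow calendar-driven cycles.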
The top-level meta-algorithm is necessary for two important reasons: the initial model selection, and the continuous model re-selection and re-tuning as the business realities underlying the metrics change. As we’ve mentioned above, at the scale of millions of metrics, manual model selection, re-selection and tuning is impractical. If you’re wondering how to find outliers in a data set, remember this: manual systems are nearly impossible to scale.
Far more time and money would be spent on managing such a manual system than on investigating and acting on the insights it generates.
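To make the model selection step of the meta-algorithm more concrete, here is a deliberately simplified sketch. The class labels (sparse, discrete, smooth) follow the ones named in this post, but the classification rules, the thresholds, and the model family names in `MODEL_FOR_CLASS` are hypothetical. A production system would use far richer tests than counting zeros and distinct values.

```python
def classify_metric(values, sparse_share=0.8, max_discrete_levels=10):
    """Assign a coarse label to a series using simple, hypothetical rules."""
    if not values:
        raise ValueError("need at least one data point")
    zero_share = sum(1 for v in values if v == 0) / len(values)
    if zero_share >= sparse_share:
        return "sparse"  # mostly zeros, with occasional events
    if len(set(values)) <= max_discrete_levels:
        return "discrete"  # only a handful of distinct levels
    return "smooth"


# Hypothetical registry: each label maps to the name of a model family.
MODEL_FOR_CLASS = {
    "sparse": "event-rate model",
    "discrete": "level-frequency model",
    "smooth": "seasonal baseline model",
}


def select_model(values):
    """Pick a model family for a metric based on its classification."""
    return MODEL_FOR_CLASS[classify_metric(values)]


# Illustrative metrics (made-up data).
errors_per_minute = [0] * 95 + [1, 0, 2, 0, 1]
replica_count = [3, 3, 4, 4, 3, 5, 5, 4, 3, 3] * 10
latency_ms = [100 + 0.37 * i + (i % 7) for i in range(100)]

for name, metric in [
    ("errors_per_minute", errors_per_minute),
    ("replica_count", replica_count),
    ("latency_ms", latency_ms),
]:
    print(name, "->", select_model(metric))
```

Because the routing is a pure function of the data, it can be re-run whenever a metric's behavior changes, which is what makes automated re-selection feasible at scale.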
Intelligent outlier monitoring for quick insights
How can outliers affect your business in the real world? Let’s say that you detect an outlier in time series data: many of your customers are purchasing more items than normal. You want to understand whether your sales promotion is working or whether there’s a price glitch on a product.
Anodot helps you out by first detecting the outlier and then correlating it with any other metrics or events that can help explain it. If there’s a correlation with the sale event, you know that the promotion is working and can decide to extend it (making sure you have enough inventory). If it correlates with an anomalous decrease in expected revenue (as happened recently on Prime Day, when Amazon sold a $3,000 camera for less than $100), you can be fairly certain that there’s a price glitch that needs to be addressed.
Accounting for outliers turns fuzzy data into actionable insights. Not false positives, not alert storms, not man-hours spent squinting at graphs and plots. The layers of machine learning, statistics and software that power Anodot convert your raw data into a cohesive, concise picture of the problem, which can then be used to solve it.
More intelligent monitoring means quicker outlier identification. Less time impacted by an outlier can mean less lost revenue and fewer lost customers. It’s guaranteed that outliers will occur in your data, but with Anodot’s detection system, you can act before those outliers become problems.