tl;dr ThirdEye comes batteries included with detection methods that cover most use-cases. Here’s an overview of major issues that effective detection needs to consider. And because I know you just can’t wait to roll your own detection logic, ThirdEye also includes a powerful debugger UI for algorithm development and tuning.
ThirdEye is an open-source monitoring platform for business and system metrics. It’s part of the Pinot analytics platform and comes batteries included. While ThirdEye supports proprietary algorithm plugins to cover specialized requirements, we’ve also implemented a number of detection methods in open-source that cover a wide range of typical use-cases. True to Pareto, they cover 80 percent of use-cases with 20 percent of the effort. They are also transparent and easy to understand for first-time adopters.
Rather than launching into a detailed methodology section, I’d like to give an idea of the day-to-day challenges ThirdEye faces in effectively monitoring various metrics at LinkedIn. ThirdEye primarily covers time series with a granularity between 1-minute and daily intervals. These time series range from smooth aggregate business metrics, such as page views, to jittery system metrics, such as latency percentiles and error rates. While each type of metric has its own challenges, there are common aspects.
Seasonality

We frequently observe weekly and daily seasonality in business metrics, which isn’t surprising. A professional networking website like LinkedIn would be expected to see most of its traffic during daytime and on weekdays, and less traffic at night and on weekends. Hard-coding this behavior by itself is ineffective, however, as other activity, such as weekly batch computation on Hadoop clusters, may execute primarily during low-traffic periods.
When we implemented general-purpose detection methods, we therefore focused on accurate handling of specific weekly and daily patterns by default. In other words we want to achieve great detection performance (high precision and recall) on data with seasonality without requiring any additional configuration from the user.
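As a rough illustration of how a seasonality-aware baseline can work without extra configuration — a toy sketch with hypothetical names, not ThirdEye’s actual implementation — each point can be predicted from the same slot in preceding seasonal periods, with a median making the estimate robust to a single bad week:

```python
import numpy as np

def seasonal_baseline(series, period, window=4):
    """Expected value for each point: the median of the same slot in up to
    `window` preceding seasonal periods (e.g. period=7*24 for hourly data
    with weekly seasonality). Points with no history remain NaN."""
    series = np.asarray(series, dtype=float)
    baseline = np.full_like(series, np.nan)
    for t in range(period, len(series)):
        # gather the same slot from previous periods
        refs = [series[t - k * period]
                for k in range(1, window + 1)
                if t - k * period >= 0]
        baseline[t] = np.nanmedian(refs)
    return baseline
```

Deviations of the observed series from such a baseline, rather than from a global mean, are what make daily and weekly patterns unremarkable to the detector.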
Calendar effects and expected change
In addition to obvious daily and weekly patterns there are calendar effects throughout the year that need not occur in precise intervals and may affect only “part of” a metric. Depending on the monitoring use-case this effect may be considered “anomalous” or fully expected and unremarkable.
An example of this is the Easter holiday across many parts of Europe. The start of the Easter holiday is determined by the type of calendar (Julian vs Gregorian) and moon phases. The holiday turns into a one-and-a-half week long vacation for families with school-age children, while it’s merely a long weekend for most young professionals. It therefore impacts different segments of users in different ways, which also translates to differentiated impact on system metrics.
Here ThirdEye’s detection algorithms benefit from the knowledge base of ThirdEye’s Root-Cause Analysis (RCA) component. RCA already tracks holiday databases and extracts relationships between your computer systems and metrics from metadata, configuration, and previous analysis sessions. Additionally, ThirdEye allows users to track future expected events and their impact on specific metrics.
Naturally, while this approach covers major metrics, the stored knowledge can never be fully complete. If ThirdEye still detects an anomaly that users consider “expected”, such as a drop in website traffic during the July 4th weekend in the US, we do the next best thing: we avoid alerting about the “bounce-back” on the following business days (but ThirdEye would alert you in case of a lack thereof).
Temporary outliers

Temporary outliers are the “anomalies” in anomaly detection: a server failure causing a spike in error rates, or a holiday decreasing desktop-based page views. Affected metrics temporarily deviate from their usual behavior and are expected to move back into line once the root cause has been mitigated. Detecting the anomaly is only part of the problem, however. The other part is figuring out what would have happened without the anomaly. This is relevant both for root-cause analysis and for correct detection of outliers in the future.
For detection to be automated, we cannot simply rely on rules defined by expert users across thousands of metrics. Rather, we have to learn from time series behavior, extract relevant features, and fit an effective model. Unfortunately, the real world isn’t a clean-room laboratory, and outliers in the training data can be detrimental to model fit and detection performance. When the system has been monitoring a (similar) metric for a longer time and a set of user-labeled outlier regions is available, excluding data anomalies is easy. Bootstrapping the monitoring process from scratch for a previously unknown metric in an automated way is non-trivial: exclude too much and detection becomes noisy; don’t exclude enough and detection may miss critical outliers in the future.
For example, if a mid-week holiday decreases page traffic this week we may want to send out an alert, but we do not want to alert anyone the following week when the traffic returns to normal levels. Therefore, the outlier period needs to be excluded from any future baseline computation. This can become tricky when multiple anomalies overlap or appear in rapid succession or when there are strong underlying trends in the data.
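A minimal sketch of this exclusion idea — hypothetical names throughout, not ThirdEye’s code — is a baseline that skips reference points falling inside known anomaly ranges, so last week’s outlier doesn’t distort this week’s expectation:

```python
import numpy as np

def masked_baseline(series, anomaly_ranges, period=7, window=4):
    """Seasonal mean baseline that ignores reference points inside
    labeled anomaly ranges (half-open [start, end) index pairs)."""
    series = np.asarray(series, dtype=float)
    mask = np.zeros(len(series), dtype=bool)
    for start, end in anomaly_ranges:
        mask[start:end] = True            # mark outlier region
    baseline = np.full(len(series), np.nan)
    for t in range(period, len(series)):
        # same slot in previous periods, skipping masked (anomalous) points
        refs = [series[t - k * period]
                for k in range(1, window + 1)
                if t - k * period >= 0 and not mask[t - k * period]]
        if refs:
            baseline[t] = np.mean(refs)
    return baseline
```

Overlapping anomalies and strong trends make the real problem harder than this sketch suggests, but the core mechanic — never letting a labeled outlier serve as a reference — is the same.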
Missing data

Another common issue for anomaly detection in a production environment is low data quality. A general-purpose monitoring service like ThirdEye supports data ingestion from many different sources. When not backed by a scalable analytics store like Pinot, some users prefer data quantity over data quality, especially when ongoing production issues (found by ThirdEye) are prioritized over maintaining internal ETL pipelines. Thus, we sometimes find ourselves confronted with gaps or intermittent data dropout.
We support algorithms in ThirdEye with a library of common smoothing methods and tools that are resistant to noise. There isn’t any free lunch here, as tolerance to noise comes with a potential degradation in detection latency. For example, if a time series frequently sees dropout of data points, ThirdEye alerts may be delayed by another data point or two until we are sufficiently certain that an observed deviation indeed comes from the underlying metric rather than the ingestion process.
Many built-in algorithms can tolerate data dropout with a trade-off for either precision or detection latency (or a well-maintained ETL pipeline and analytics system). From our practical experience of operating ThirdEye here at LinkedIn, we know that this choice is use-case and team-specific and can therefore be managed via alert-specific configuration.
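One way to express this latency-for-tolerance trade-off — sketched below with hypothetical names, not ThirdEye’s actual logic — is to require several consecutive observed (non-missing) deviations before confirming an anomaly, so that a dropped-out point delays the alert rather than triggering or suppressing it:

```python
import math

def confirmed_deviation(values, expected, threshold=3.0, confirm=2):
    """Return the index at which an anomaly is confirmed, or None.
    Missing points (None/NaN) neither extend nor reset the streak of
    deviating points; `confirm` controls the latency/robustness trade-off."""
    streak = 0
    for i, v in enumerate(values):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue                      # dropout: wait for more data
        if abs(v - expected) > threshold:
            streak += 1
            if streak >= confirm:
                return i                  # deviation confirmed here
        else:
            streak = 0                    # real data back in range: reset
    return None
```

Raising `confirm` mirrors the configuration choice described above: alerts arrive a data point or two later, but transient ingestion hiccups stop paging anyone.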
Permanent change points
Finally, the only constant in a growing business is change. The best model eventually becomes useless as the world around it changes. Business changes, technologies are replaced, and your product’s end-users will always find a way to surprise you. This change also reflects itself in the metrics monitored by ThirdEye.
Detection algorithms in ThirdEye handle this change in different ways: from moving windows and exponential decay to change-point detection based on temporal distributional change, or even manual user input. And while accurate change-point detection is already challenging, we often cannot simply stop monitoring a recently changed metric until sufficient data has accumulated to re-fit a model after the change point.
ThirdEye implements methods that attempt to project properties from the previous sample onto the changed time series. For example, this allows us to detect unexpected effects of ramp up in A/B tests as anomalies but continue to monitor the metric reasonably well at the new, higher level by re-using established seasonality but scaling expected mean and variance proportionally. While this topic deserves an article by itself, I think this example conveys the fundamental principle.
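The principle can be sketched as follows. Assuming a simple per-slot seasonal profile (a toy model with hypothetical names, not ThirdEye’s algorithm), we can re-use the seasonal shape learned before the change point and rescale it by the ratio of post- to pre-change levels:

```python
import numpy as np

def rescaled_expectation(series, change_point, period):
    """Expected values after `change_point`: per-slot means learned from
    the pre-change sample, scaled by the ratio of post-change to
    pre-change overall levels (variance could be rescaled analogously)."""
    series = np.asarray(series, dtype=float)
    pre, post = series[:change_point], series[change_point:]
    # seasonal shape from the pre-change sample, one mean per slot
    slot_means = np.array([np.nanmean(pre[s::period]) for s in range(period)])
    # how much the overall level shifted at the change point
    level_ratio = np.nanmean(post) / np.nanmean(pre)
    slots = np.arange(change_point, len(series)) % period
    return slot_means[slots] * level_ratio
```

With only a few post-change points to estimate the new level, the established weekly shape keeps monitoring usable long before enough data exists to re-fit a full model.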
Live preview and debugger UI
With many methods and heuristics implemented in ThirdEye we still have not found a silver bullet to reliably address all of these concerns at all times. We therefore set out to make it easy for users to judge detection performance by providing a preview and debugger UI for algorithm configuration and development.
For ThirdEye users the UI overlays the time series of choice with ranges of detected anomalies and allows comparison of various configuration settings and tuning methods. This way any algorithm can be back-tested and evaluated against historical data quickly before deploying it for real-time detection. Furthermore, our users can often troubleshoot small issues with algorithms themselves without having to reach out for in-depth support.
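A toy version of such a back-test evaluation — hypothetical range-overlap scoring, not ThirdEye’s actual metric — might compare detected anomaly ranges against user-labeled ones:

```python
def backtest(detected, labeled):
    """Score detected anomaly ranges against labeled ones (both lists of
    half-open [start, end) index pairs), counting any overlap as a match."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    true_pos = sum(any(overlaps(d, l) for l in labeled) for d in detected)
    matched = sum(any(overlaps(l, d) for d in detected) for l in labeled)
    precision = true_pos / len(detected) if detected else 1.0
    recall = matched / len(labeled) if labeled else 1.0
    return precision, recall
```

Running this over historical data for each candidate configuration gives exactly the kind of side-by-side comparison the preview UI makes visual.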
For algorithm developers the preview UI goes a step further and allows them to expose the internal state of the detection algorithm in the form of time series and other output and overlay it with the time series being analyzed. These capabilities have super-charged the development and iteration on detection algorithms for both the open-source distribution of ThirdEye as well as for algorithms tailored to specific internal data sets at LinkedIn. Algorithm development went from log scraping and break points to a visual debugger experience. Additionally, this visual debug output goes a long way in adding transparency to algorithm behavior for power-users and helps first-time adopters build trust in the technology.
I hope this article gives you some food for thought for building your own detection algorithms and convinces you that “off-the-shelf” in ThirdEye already goes a long way for most of your use-cases. And whether you want to build algorithms yourself or rather observe the inner workings of existing ones, I’m sure ThirdEye’s preview UI will make your day.