tl;dr: Automated anomaly detection is, technologically, easy. Detection performance matters, but system adoption is also driven by transparency, amenability to existing processes and change, and usability.
My work at LinkedIn’s Data Infrastructure group over the past year focused on anomaly detection and root-cause inference, specifically on integrating high-level business metrics with low-level system telemetry and external events. Our team builds ThirdEye, a monitoring and analytics platform for clouds and large distributed systems.
Personally, one of my major motivations for joining this effort was ThirdEye’s identity as a first-class open-source citizen that is deployed at scale in a production environment. ThirdEye sits on top of the remarkably powerful, and equally open-source, Pinot analytics platform for high-dimensional time series (Pinot in SIGMOD 18, LinkedIn Eng Blog). While ThirdEye also ties into a dozen different proprietary databases and APIs at LinkedIn, it is inherently useful even without these connectors.
It is great to see ThirdEye’s user community grow and to hear about use-cases and concerns about automated anomaly detection from many different perspectives. One standout observation is that in the business- and system-metrics context anomaly detection and root-cause analysis are, technologically, easy. There are numerous systems, both open and closed, that perform time series analysis, clustering, event correlation, etc. Yet, the vast majority of monitoring, even at large, modern Internet companies, still operates on manual thresholds, eyeballing, and simple rules of thumb.
In this article I want to explore some of the reasons why adopting an automated monitoring solution isn’t as clear-cut a choice as the determined data scientist, software engineer, or product manager thinks it is. I certainly thought so. And apparently the title caught your attention too.
Every engineering team operating a particular service has their home-grown way of monitoring their system. There are company-specific APIs and standards, but the monitoring of system metrics is typically an afterthought, since most effort goes first into shipping and then into scaling and operating a new service.
Business metrics get generated via merging and transformation of different data sets. This leads to a loss of source information and, by adding transformations and data cleaning, may also mask data valuable for detection and root-cause inference. Additionally, the data pipelines typically still have delays of several hours (or days), as they have grown over a long time horizon and cannot easily be replaced by a single streaming solution.
Any anomaly detection platform, however, requires data as its raw material. Expecting prospective users to write (and maintain!) an additional ETL pipeline for the anomaly detection system is a big hindrance to adoption. Even worse, this pipeline may introduce additional delays, show numeric discrepancies, or become a source of instability itself.
ThirdEye partially works around this by providing various system connectors, but in practice substantial effort is required to streamline and unify metric logging and processing. The upside is that tech companies typically attempt to unify and integrate their data pipelines anyway, especially during episodes of regulatory change such as the upcoming activation of the European GDPR.
Critical system and business metrics already have established processes for monitoring, reporting, and troubleshooting. Even if a new monitoring system can deliver better detection performance – in terms of recall – it will not be adopted trivially.
The entire reporting chain, from operations staff through managers up to executives, is organized around established thresholds and severity measures. Changing these processes in a large, established enterprise is slow and laborious.
A common example of this is the reliance on week-over-week numbers for comparison and reporting. There are numerous statistical methods that generate numeric baselines which eliminate outliers, such as recent holidays or software deployments, from the comparison. Yet, alerts and reports must carry week-over-week numbers or else they are considered untrustworthy and useless as input to existing processes.
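As a toy illustration (my own sketch, not ThirdEye code), the following contrasts a naive week-over-week baseline with a median-over-recent-weeks baseline. A one-off outlier, such as a holiday exactly one week ago, distorts the former but is discounted by the latter:

```python
SEASON = 7  # daily data with weekly seasonality

def week_over_week(series, t):
    """Baseline = the value exactly one week earlier."""
    return series[t - SEASON]

def median_of_weeks(series, t, weeks=4):
    """Baseline = median of the same weekday over the past `weeks` weeks."""
    values = sorted(series[t - w * SEASON] for w in range(1, weeks + 1))
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

# Four flat weeks, except a holiday dip exactly one week before "today":
history = [100.0] * 7 * 3 + [40.0] + [100.0] * 6
t = len(history)
series = history + [100.0]  # today's value is perfectly normal

wow = week_over_week(series, t)      # skewed by the holiday dip
robust = median_of_weeks(series, t)  # median ignores the one-off dip
```

Against the week-over-week baseline of 40, today’s normal value of 100 looks like a massive spike; the median baseline correctly reports no change. This is exactly the kind of baseline that existing reporting processes often refuse to accept.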
Automated root-cause analysis equally battles established run-books for troubleshooting. Even if we perform a correct analysis and determine, e.g., a hardware failure to be the reason for an alert, a human typically still has to run through the steps in the run-book as required. Cynically, from the perspective of established processes, the value proposition of root-cause analysis derives mainly from the automation of leg work rather than from finding the actual cause.
With ThirdEye we support algorithmic detection and autonomous root-cause inference but at the same time enable manual configuration of processes, outputs and detection rules. An interesting approach to driving process change is the parallel, comparative execution of established rules and algorithmic results. This helps users and management to build trust in detection performance and enables a gradual shift toward automated monitoring.
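The comparative-execution idea can be sketched roughly as follows (all names, rules, and thresholds here are hypothetical stand-ins, not ThirdEye’s implementation): run the legacy threshold rule and an algorithmic detector over the same window and tally where their verdicts agree and diverge:

```python
def rule_detector(value, threshold=80.0):
    # Legacy rule: alert whenever the metric drops below a fixed threshold.
    return value < threshold

def algo_detector(value, history):
    # Stand-in algorithmic detector: flag values more than three
    # median-absolute-deviations below the historical median.
    s = sorted(history)
    median = s[len(s) // 2]
    mad = sorted(abs(v - median) for v in history)[len(history) // 2]
    return value < median - 3 * max(mad, 1e-9)

def compare(window, history):
    """Return counts of (both fired, rule only, algorithm only)."""
    both = rule_only = algo_only = 0
    for v in window:
        r, a = rule_detector(v), algo_detector(v, history)
        both += r and a
        rule_only += r and not a
        algo_only += a and not r
    return both, rule_only, algo_only

history = [100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 100.0]
window = [100.0, 95.0, 50.0]  # normal, moderate dip, severe dip
```

Here the algorithm catches the moderate dip that the fixed threshold misses, while both agree on the severe one; a running tally like this is what lets users judge whether they trust the new detector before switching over.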
Sending out email alerts is easy. Filtering your email inbox is easy too. More invasive alerting, such as paging or automated calls at 3:00am, is a sure way to upset large numbers of operations staff and on-call engineers quickly, especially if alerts are false alarms or go to the wrong recipient.
If the detection system sends out false alarms or non-actionable alerts, it will be ignored very quickly and discarded as a nuisance. The worst thing that can happen in this scenario is management insisting top-down on the use of a bad system. This has brought down entire projects, teams, and careers.
This is where alert precision matters, but it equally requires an up-to-date view of responsibilities. The best detection is useless if it alerts the wrong person. This ties in with established processes, where different teams use different on-call rotations, escalation structures, and investigation processes.
Even if a monitoring solution performs on point and finds the correct recipients, a larger wave of alerts should still be ordered by priority. Root-cause analysis can go a long way here by identifying the most critical issues and possibly common causes. In ThirdEye we take the approach of including basic root-cause information with the alerts. For example, drops in business metrics may be accompanied by recent holidays in affected regions. This helps our users mentally triage issues and minimizes fatigue.
One of the most consistent concerns I hear from users considering the adoption of ThirdEye, both inside and outside of LinkedIn, is a high-level question about how the system makes decisions about anomalies. Rightly so. Analysts, operators, and engineers have an intricate understanding of the systems, metrics, and events they monitor on a daily basis.
Black-box algorithms have a hard time explaining which inputs matter, how much, and whether the inferred relationships “make sense”. Users strongly prefer transparent solutions they can understand intuitively. In my experience, it is overwhelmingly preferable to provide transparent but noisy results over opaque ones with higher accuracy. Root-cause analysis can alleviate part of this opaqueness, but detection algorithms themselves are typically the primary focus.
In my opinion, a big part of this ties back to established processes again. Ultimately, it is the human users who are held accountable for their system working as expected. If automated anomaly detection is to take over part of the monitoring effort, its decision making has to be transparent to be verifiable ahead of time, and explainable after the fact if it does not perform as expected.
This has spawned a fascinating effort within the ThirdEye team to develop prediction and detection heuristics that are easy to understand intuitively, yet show strong performance when compared to “more rigorous” statistical tests and algorithms. Many of these are implemented in the open-source project, such as seasonal median aggregation.
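A heuristic in the spirit of seasonal median aggregation might look like the following sketch (function names, parameters, and the tolerance are my own illustrative choices, not the project’s exact implementation): the baseline for each point is the median of the values at the same offset in the preceding seasons, and points deviating too far from that baseline are flagged:

```python
def seasonal_median_baseline(series, period, lookback=3):
    """Baseline for each point = median of the values one, two, ...
    `lookback` seasons earlier (lookback kept odd for a true median)."""
    baseline = []
    for t in range(lookback * period, len(series)):
        prior = sorted(series[t - k * period] for k in range(1, lookback + 1))
        baseline.append(prior[len(prior) // 2])
    return baseline

def flag_anomalies(series, period, lookback=3, tolerance=0.2):
    """Indices whose relative deviation from the seasonal median
    baseline exceeds `tolerance`."""
    start = lookback * period
    baseline = seasonal_median_baseline(series, period, lookback)
    return [t for t, (v, b) in enumerate(zip(series[start:], baseline), start)
            if b > 0 and abs(v - b) / b > tolerance]
```

The appeal of such a heuristic is precisely its explainability: “today deviates more than 20% from the median of the last three same-weekday values” is a sentence an on-call engineer can verify by hand, which a fitted black-box model rarely allows.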
Existing rules and monitoring solutions have an established track record. While it may be spotty, users have already spent time and energy to adapt to it. The introduction of a new platform nixes this effort unless it can guarantee equal results – at least in the beginning.
Finally, ThirdEye itself is under active development and continuously scaling. This does not always happen without friction, especially for custom or cutting-edge features. If the monitoring system itself is shaky or does not function consistently across all types of monitoring data, users are quick to hesitate. This isn’t surprising: after all, the platform’s prime directive is monitoring other systems’ reliability.
Another devious source of instability may be the data sources tapped by the monitoring platform. As users become familiar with the depth and variety of data available for detection and analysis, they will notice data being incorrect or unavailable intermittently. Unless the UI does a good job explaining that this is a problem at the source, they may conclude that the platform isn’t working correctly.
A final aspect of introducing a new platform is user education. Often a misunderstanding of the system’s features is perceived as system failure. Usability and user interface are crucial, as are user training and troubleshooting resources. Even so, we still find ourselves investing substantial effort into helping new users. User groups and interviews go a long way here, as you may not even learn about perceived problems otherwise – such as the frequent use of terms reminiscent of math or statistics, which can trigger visceral reactions.
We have taken ThirdEye a long way in terms of UI and user communication, and there does not seem to be an end in sight. Every iteration of the user interface broadens accessibility to larger groups of users and creates new challenges. Similarly, the documentation becomes increasingly detailed to cover numerous edge-cases, and we have invested massively in automated unit and end-to-end integration testing.
As the world and its systems keep changing, the definition of “normal” changes as well. Simple alerts and rules are easy to adapt. Expanded an experiment to twice as many users? Just double the traffic threshold. If alerting is driven by a supervised black-box algorithm, this may not be so easy.
Any data-driven detection system has a notoriously hard time adapting to changes in the world. Yes, there are techniques for automated change-point detection. And with enough integration with other systems we can add some degree of “intelligence”, such as expecting a doubling of traffic when doubling the scope of an automated A/B test. Despite this, the final authority on what represents the “new normal” is a human.
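For illustration, a textbook CUSUM-style change-point check (a deliberate simplification, not ThirdEye’s algorithm) accumulates deviations from a reference level and reports a change once the cumulative drift exceeds a bound:

```python
def cusum_change_point(series, reference, drift=0.0, threshold=5.0):
    """Return the index where the cumulative deviation from `reference`
    first exceeds `threshold` (in either direction), or None if no
    change is detected. `drift` dampens sensitivity to small noise."""
    pos = neg = 0.0
    for i, v in enumerate(series):
        pos = max(0.0, pos + (v - reference) - drift)  # upward drift
        neg = max(0.0, neg + (reference - v) - drift)  # downward drift
        if pos > threshold or neg > threshold:
            return i
    return None

# A level shift from 10 to 13 is detected a couple of points after it
# occurs; a flat series triggers nothing.
shifted = [10.0] * 5 + [13.0] * 5
```

Note the inherent lag: even this detector only confirms the shift a few samples in, and it cannot say whether the new level is a regression or the intended result of, say, a product launch. That judgment remains with the human.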
There must be efficient ways for a human to inform the system about expected changes before (or after) the fact. However, relying only on the “human in the loop” isn’t good enough either. A detection system quickly becomes a nuisance again if it requires dedicated feedback for each individual alert whenever something changes. The ThirdEye team has invested in learning from user feedback across alerts and domains. This is a challenging, but very interesting, aspect of ongoing development.
Another type of change is the adoption of new technologies and processes throughout the organization. APIs may change and systems may be deprecated. Teams merge and split. At scale it would be impossible for ThirdEye’s engineering team to keep up with all of this. We therefore provide numerous APIs and interfaces for external teams to connect to and plug in their dedicated business logic. This is a delicate balance between stability and one-off solutions that re-invent the wheel for different teams. Here, an extensive ongoing dialog with users is the only solution.
The development of ThirdEye, and research in anomaly detection and root-cause analysis at large, is an incredibly insightful journey cutting across a vast number of aspects of modern Internet businesses and technology. It is good to see numerous developers and researchers address these challenges. However, when diving deep into the technical details it is easy to forget that adoption is driven by other aspects too. Ease of use, amenability to existing processes and change, and transparent operation are critical as well.