
Synthetic Time Series for Testing in ThirdEye

tl;dr Pinot officially entered incubation with Apache. ThirdEye is part of the Apache Pinot platform and we recently added Quick Start support, so you can try it out hands-on in less than 15min. ThirdEye is capable of generating synthetic dimensional time series on the fly for cool demos and for testing your own detection algorithms later down the line.

Over the past several months we streamlined the setup process for ThirdEye in brand-new environments. Extensible and customizable software is great, but requiring a big up-front time investment just to try it out isn’t. Interactive analysis and visualization are core features of ThirdEye and we want you to be able to experience this directly.

An interesting challenge in setting up and/or demoing ThirdEye in a new environment is the need for useful data for analysis and visualization. We originally planned to use existing data sets, either by anonymizing system metrics we have here at LinkedIn or by using publicly available time series data, e.g. financial data or our cloud workload traces. We ultimately decided to go beyond static data and generate synthetic time series on the fly instead. This not only provides an infinite supply of data for demonstration purposes, but also enables our algorithm developers to test and debug new algorithms with previously unseen data. As an additional benefit, model-generated data is portable, well-described, and takes up minimal space.

A core value proposition of ThirdEye’s interactive analysis is the ability to slice-and-dice data on demand. For synthetic time series this means that we need to actually generate this data with dimensional information in the first place – and the dimensional data must remain consistent for the duration of the analysis. As we wanted to make model-generated data useful for algorithm testing as well, we attempted to enable as much customization as possible without overwhelming users evaluating our platform. One part of this is sensible default settings; another is retaining the ability for fine-grained configuration and customization. Finding the right trade-offs here isn’t trivial and there will always be room for improvement in the future.

Generating synthetic data in ThirdEye

Currently, our “mock” data source supports generating millisecond to daily granularity time series data, with dedicated weekly and daily seasonal components. We chose a configuration model which uses separate parameters for each sub-dimension of a metric. We then compute the aggregate of a metric by adding up the slices of the resulting data cube. The benefits are simplicity and the power to express realistic scenarios, the trade-off is verbose configuration:
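To make this concrete, here is a hypothetical sketch of what such a per-sub-dimension configuration could look like, written as a Python dictionary purely for illustration – the field names and structure are my own shorthand, not ThirdEye’s actual configuration schema:

    # Hypothetical mock-data configuration for a "pageViews" metric.
    # Field names are illustrative shorthand, not ThirdEye's real schema.
    mock_config = {
        "metric": "pageViews",
        "granularity": "1hour",
        "dimensions": ["country", "browser", "device"],
        "slices": [
            {"country": "US", "browser": "chrome", "device": "desktop",
             "mean": 10000.0, "std": 400.0, "dailyAmplitude": 0.3, "weeklyAmplitude": 0.1},
            {"country": "US", "browser": "safari", "device": "mobile",
             "mean": 2500.0},  # noise and seasonality fall back to defaults
            {"country": "DE", "browser": "chrome", "device": "desktop",
             "mean": 1200.0, "std": 80.0},
            # combinations that are not listed simply produce no data (gaps)
        ],
    }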

On the positive side, we can customize time series amplitude and noise at different frequencies for each individual sub-dimension. At a minimum, ThirdEye only requires the mean value for a given sub-dimension. We initialize a Gaussian noise component and the additive daily and weekly seasonality components with defaults that can be overridden as needed. Furthermore, ThirdEye only creates data for dimension combinations explicitly defined in the configuration, so you can express gaps in the data by omitting “Safari on Windows Mobile on Pixel 2” (in theory at least).

On the negative side, there is the sheer size of the configuration for high-dimensional data, which becomes intractable for 5+ dimensions, e.g. 50 countries x 20 page ids x 10 browsers x 10 operating systems x 10 devices = 1m combinations. The flip side is that many of these combinations may not actually appear in the wild. Another, more subtle, disadvantage is the inability to synthesize non-additive data sets at this time.
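To illustrate the generative model described above – a per-slice mean plus additive daily and weekly seasonality and Gaussian noise – here is a rough Python sketch. It is an approximation of the idea, not the actual ThirdEye implementation:

    import numpy as np

    def generate_slice(hours, mean, std=None, daily_amp=0.25, weekly_amp=0.1, seed=42):
        """Sketch of one sub-dimension's synthetic hourly series:
        mean + additive daily/weekly seasonality + Gaussian noise."""
        rng = np.random.default_rng(seed)
        t = np.arange(hours)
        std = std if std is not None else 0.05 * mean                  # default noise level
        daily = daily_amp * mean * np.sin(2 * np.pi * t / 24)          # daily seasonality
        weekly = weekly_amp * mean * np.sin(2 * np.pi * t / (24 * 7))  # weekly seasonality
        noise = rng.normal(0.0, std, size=hours)                       # Gaussian noise
        return mean + daily + weekly + noise

    # The aggregate metric is the sum over all configured slices (data-cube roll-up).
    slices = [generate_slice(24 * 14, 10000.0, 400.0), generate_slice(24 * 14, 2500.0)]
    aggregate = np.sum(slices, axis=0)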

Overall, this setup has served us well for testing internally and we are confident you will quickly get the hang of modifying the existing mock data and generating your own – simply look over the sample config files.

Another requirement for the effective adoption of a tool like ThirdEye is a seamless transition from testing to full integration. We made several changes to enable incremental, step-by-step integration with your existing infrastructure, from “fully isolated and synthetic” to “all real”, and have already tested the procedure with our partners. Our Quick Start guide contains (optional) steps to integrate data from your existing Pinot cluster. Once you’re set up you can trivially remove the synthetic data source, and what remains is an integrated and fully functional installation of ThirdEye.

Detection methods in ThirdEye got you covered

tl;dr ThirdEye comes batteries included with detection methods that cover most use-cases. Here’s an overview of major issues that effective detection needs to consider. And because I know you just can’t wait to roll your own detection logic, ThirdEye also includes a powerful debugger UI for algorithm development and tuning.

ThirdEye is an open-source monitoring platform for business and system metrics. It’s part of the Pinot analytics platform and comes batteries included. While ThirdEye supports proprietary algorithm plugins to cover specialized requirements, we’ve also implemented a number of detection methods in open-source that cover a wide range of typical use-cases. True to Pareto they cover 80 percent of use-cases for 20 percent of the effort. They are also transparent and easy to understand for first-time adopters.

Rather than launching into a detailed methodology section I’d like to give an idea of the day-to-day challenges ThirdEye faces in monitoring various metrics at LinkedIn effectively. ThirdEye primarily covers time series with a granularity between 1-minute and daily intervals. The types of time series range from smooth aggregate business metrics, such as page views, to jittery system metrics, such as latency percentiles and error rates. While each type of metric has its own challenges, there are common aspects.

Seasonality

We frequently observe weekly and daily seasonality in business metrics, which isn’t surprising. A professional networking website like LinkedIn would be expected to see most of its traffic during daytime and on weekdays, and less traffic during night time and on weekends. Hard-coding this behavior by itself is ineffective, however, as other activity such as weekly batch computation on Hadoop clusters may execute primarily during low-traffic periods.

When we implemented general-purpose detection methods, we therefore focused on accurate handling of specific weekly and daily patterns by default. In other words we want to achieve great detection performance (high precision and recall) on data with seasonality without requiring any additional configuration from the user.
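As a simple illustration of what seasonality-aware detection can look like (a toy sketch, not ThirdEye’s actual algorithm), a baseline can be built from the same point in time in previous weeks, which captures both daily and weekly patterns without any per-metric configuration:

    import pandas as pd

    def week_over_week_baseline(series: pd.Series, weeks: int = 4) -> pd.Series:
        """Median of the same timestamp in the previous `weeks` weeks.
        Assumes a pandas Series with a DatetimeIndex."""
        lags = [series.shift(freq=pd.Timedelta(weeks=w)) for w in range(1, weeks + 1)]
        return pd.concat(lags, axis=1).median(axis=1).reindex(series.index)

    def detect_anomalies(series: pd.Series, weeks: int = 4, k: float = 3.0) -> pd.Series:
        """Flag points deviating from the seasonal baseline by more than k sigma."""
        residual = series - week_over_week_baseline(series, weeks)
        return residual.abs() > k * residual.std()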

Calendar effects and expected change

In addition to obvious daily and weekly patterns there are calendar effects throughout the year that do not necessarily occur at precise intervals and may affect only “part of” a metric. Depending on the monitoring use-case this effect may be considered “anomalous” or fully expected and unremarkable.

An example of this is the Easter holiday across many parts of Europe. The start of the Easter holiday is determined by the type of calendar (Julian vs Gregorian) and moon phases. The holiday turns into a one-and-a-half-week vacation for families with school-age children, while it’s merely a long weekend for most young professionals. It therefore impacts different segments of users in different ways, which also translates to differentiated impact on system metrics.

Here ThirdEye’s detection algorithms benefit from the knowledge base of ThirdEye’s Root-Cause Analysis (RCA) component. RCA already tracks holiday databases and extracts relationships between your computer systems and metrics from metadata, configuration, and previous analysis sessions. Additionally, ThirdEye allows users to track future expected events and their impact on specific metrics.

Naturally, while this approach covers major metrics, the stored knowledge can never be fully complete. If ThirdEye still detects an anomaly that users consider “expected”, such as a drop in website traffic during the July 4th weekend in the US, we do the next best thing and avoid alerting about the “bounce-back” on the following business days (though ThirdEye would alert you if the bounce-back failed to materialize).

Temporary outliers

These are the “anomalies” in anomaly detection, such as a server failure causing a spike in error rates or a holiday decreasing desktop-based page views. Affected metrics temporarily deviate from their usual behavior and are expected to move back into line once the root-cause has been mitigated. Detecting the anomaly is only part of the problem however. The other is figuring out what would have happened without this anomaly. This is relevant for both root-cause analysis as well as correct detection of outliers in the future.

For detection to be automated we cannot simply rely on rules defined by expert users across thousands of metrics. Rather we have to learn from time series behavior, extract relevant features, and fit an effective model. Unfortunately, the real world isn’t a clean room laboratory and outliers in the training data can be detrimental to model fit and detection performance. When the system has been monitoring a (similar) metric for a longer time and a set of user-labeled outlier regions is available, excluding data anomalies is easy. Bootstrapping the monitoring process from scratch for a previously unknown metric in an automated way is non-trivial. Exclude too much and detection becomes noisy, don’t exclude enough and detection may miss critical outliers in the future.

For example, if a mid-week holiday decreases page traffic this week we may want to send out an alert, but we do not want to alert anyone the following week when the traffic returns to normal levels. Therefore, the outlier period needs to be excluded from any future baseline computation. This can become tricky when multiple anomalies overlap or appear in rapid succession or when there are strong underlying trends in the data.
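A minimal sketch of that bookkeeping – assuming a pandas series and a list of labeled anomaly windows, not the actual ThirdEye code – is to mask the anomalous ranges before any future baseline computation:

    import pandas as pd

    def mask_anomalies(series: pd.Series, anomaly_windows) -> pd.Series:
        """Replace labeled anomaly periods with NaN so they do not contaminate
        future baseline or model-fitting computations (assumes a DatetimeIndex)."""
        cleaned = series.copy()
        for start, end in anomaly_windows:
            cleaned.loc[start:end] = float("nan")
        return cleaned

    # e.g. exclude a mid-week holiday from next week's baseline input:
    # baseline_input = mask_anomalies(page_views, [("2019-07-03", "2019-07-05")])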

Noise suppression

Another common issue of anomaly detection in a production environment is low data quality. A general-purpose monitoring service like ThirdEye supports data ingestion from many different sources. When not backed by a scalable analytics store like Pinot, some users prefer data quantity over data quality, especially when ongoing production issues (found by ThirdEye) are prioritized over maintaining internal ETL pipelines. Thus, we sometimes find ourselves confronted with gaps or intermittent data dropout.

We support algorithms in ThirdEye with a library of common smoothing methods and tools that are resistant to noise. There isn’t any free lunch here, as tolerance to noise comes with a potential degradation in detection latency. For example, if a time series frequently sees dropout of data points, ThirdEye alerts may be delayed by another data point or two until we are sufficiently certain that an observed deviation indeed comes from the underlying metric rather than the ingestion process.
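Two toy examples of such tools – my own illustrations, not the functions that ship with ThirdEye – are a robust rolling-median smoother and a rule that requires a deviation to persist for a few points before alerting, trading detection latency for noise tolerance:

    import pandas as pd

    def smooth(series: pd.Series, window: int = 5) -> pd.Series:
        """Rolling median is robust to isolated missing or garbage points."""
        return series.rolling(window, min_periods=1).median()

    def confirmed_alerts(deviations: pd.Series, min_consecutive: int = 3) -> pd.Series:
        """Only alert once a boolean deviation flag has held for `min_consecutive`
        consecutive points; brief dropout-induced blips are ignored."""
        run = deviations.fillna(False).astype(int).rolling(min_consecutive).sum()
        return run >= min_consecutive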

Many built-in algorithms can tolerate data dropout with a trade-off for either precision or detection latency (or a well-maintained ETL pipeline and analytics system). From our practical experience of operating ThirdEye here at LinkedIn we know that this choice is use-case and team-specific, and it can therefore be managed via alert-specific configuration.

Permanent change points

Finally, the only constant in a growing business is change. The best model eventually becomes useless as the world around it changes. Business changes, technologies are replaced, and your product’s end-users will always find a way to surprise you. This change also reflects itself in the metrics monitored by ThirdEye.

Detection algorithms in ThirdEye handle this change in different ways: from moving windows and exponential decay, to change-point detection based on temporal distributional change, to manual user input. And while accurate change-point detection is already challenging, we often cannot just stop monitoring a recently changed metric until sufficient data has accumulated to re-fit a model after the change point.

ThirdEye implements methods that attempt to project properties from the previous sample onto the changed time series. For example, this allows us to detect unexpected effects of ramp up in A/B tests as anomalies but continue to monitor the metric reasonably well at the new, higher level by re-using established seasonality but scaling expected mean and variance proportionally. While this topic deserves an article by itself, I think this example conveys the fundamental principle.
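A very rough sketch of that projection idea (illustrative only, not ThirdEye’s implementation): keep the seasonal shape learned before the change point, but rescale the expected mean and spread proportionally to the level estimated from the few samples observed after it.

    import numpy as np

    def project_after_change(pre_expected, pre_std, post_samples):
        """Scale the pre-change expected curve (which carries the learned
        seasonality) and its standard deviation proportionally to the new level."""
        scale = np.mean(post_samples) / np.mean(pre_expected)
        return np.asarray(pre_expected) * scale, pre_std * scale

    # e.g. after an A/B test ramp-up lifted the metric by ~40%:
    # expected, std = project_after_change(expected_last_week, std_last_week, latest_points)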

Live preview and debugger UI

With many methods and heuristics implemented in ThirdEye we still have not found a silver bullet to reliably address all of these concerns at all times. We therefore set out to make it easy for users to judge detection performance by providing a preview and debugger UI for algorithm configuration and development.

For ThirdEye users the UI overlays the time series of choice with ranges of detected anomalies and allows comparison of various configuration settings and tuning methods. This way any algorithm can be back-tested and evaluated against historical data quickly before deploying it for real-time detection. Furthermore, our users can often troubleshoot small issues with algorithms themselves without having to reach out for in-depth support.

For algorithm developers the preview UI goes a step further and allows them to expose the internal state of the detection algorithm in the form of time series and other output and overlay it with the time series being analyzed. These capabilities have super-charged the development and iteration on detection algorithms for both the open-source distribution of ThirdEye as well as for algorithms tailored to specific internal data sets at LinkedIn. Algorithm development went from log scraping and break points to a visual debugger experience. Additionally, this visual debug output goes a long way in adding transparency to algorithm behavior for power-users and helps first-time adopters build trust in the technology.

I hope this article gives you some food for thought for building your own detection algorithms and convinces you that “off-the-shelf” in ThirdEye already goes a long way for most of your use-cases. And whether you want to build algorithms yourself or rather observe the inner workings of existing ones, I’m sure ThirdEye’s preview UI will make your day.

Crunching Numbers Comfortably with IPython Notebook and Pandas

A big part of building distributed computer systems is delivering proof that they actually work. Besides a live demo with a shiny front-end and a polished slide deck, raw numbers are ultimately necessary to show that promises of robust availability, high throughput, and low latency are kept during real-world use. And sometimes you may need numbers for debugging as well.

My clear personal favorite for data analysis and visualization (and light programming) is Python, and by extension IPython Notebook, matplotlib, seaborn, and the time-series analysis framework Pandas. Their integration has become seamless over the past few years and they are very well suited for pretty much any task, from quickly visualizing application logs to taking in-depth looks at time series and performing statistical inference. As examples of the successful use of these tools I can offer our recent work on validated simulation of IaaS clouds and SLAs for spot instances. When looking closely at these publications the astute reader will find the telltale signs of graphs generated with these tools.

If you haven’t used IPython Notebook yet, I highly recommend you invest 1-2 hours in getting familiar with the basics. Personally, it took some time to overcome my internal inertia and finally spend the time necessary – and I haven’t looked back since. It makes life quite a bit easier. I also had the opportunity earlier this year to talk to Brian Granger – one of the masterminds behind IPython – and heard about the plans for expanding IPython’s scope with project Jupyter. I’m excited to see what’s coming down the pipeline in terms of high-performance analytics for those lengthy production log files we have sitting around.

Pandas had a steep learning curve for me as well; it took some time to get my head around some of the intermediate indexing and slicing techniques. Once I figured this out, however, productivity shot through the roof. Importing text, CSV, JSON, and XML? No problem. Joining three different data sets on different columns and getting aggregate statistics a la SQL? Check. Plotting intermediate results to debug heavy scipy use? Quick and easy. Things that took hours before get done in minutes now. It was well worth spending an afternoon to get familiar with it.
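For a flavor of that SQL-style workflow, here is a small self-contained example; the data and column names are made up for illustration:

    import pandas as pd

    # Hypothetical inputs: request logs, a host-to-cluster mapping, and SLA targets.
    requests = pd.DataFrame({"host": ["a", "a", "b", "c"],
                             "latency_ms": [12.0, 480.0, 25.0, 31.0]})
    hosts = pd.DataFrame({"host": ["a", "b", "c"], "cluster": ["web", "web", "db"]})
    slas = pd.DataFrame({"cluster": ["web", "db"], "p99_target_ms": [300.0, 150.0]})

    # Join the three data sets on their respective columns ...
    merged = requests.merge(hosts, on="host").merge(slas, on="cluster")

    # ... and compute aggregate statistics a la SQL GROUP BY.
    summary = merged.groupby("cluster")["latency_ms"].agg(
        p50="median", p99=lambda s: s.quantile(0.99), count="count")

    # Quick bar plot of intermediate results for debugging (matplotlib under the hood).
    summary["p99"].plot(kind="bar", title="p99 latency by cluster")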

Despite all this greatness, there’s a caveat. For presentation slides I still find myself falling back on Microsoft Excel for most visualization. Yes, I know my coolness factor just took a hit. The WYSIWYG (“what-you-see-is-what-you-get”) formatting capabilities are still more time-efficient than figuring out the various corner-cases of matplotlib calls. That being said, I usually prepare the data plotted with Excel using the aforementioned Python tools.

Probably the easiest way to get it all set up is a pre-configured Python distribution such as Continuum Anaconda. An installation from scratch with pip and co is possible as well, but depending on your platform you will end up dealing with version conflicts manually. In case you run into any roadblocks there’s a solid user base for all these tools. This means that stackexchange.com is an invaluable resource for troubleshooting in addition to the official IPython and Pandas docs.