Synthetic Time Series for Testing in ThirdEye

tl;dr Pinot officially entered incubation with Apache. ThirdEye is part of the Apache Pinot platform and we recently added Quick Start support, so you can try it out hands-on in less than 15 minutes. ThirdEye can generate synthetic dimensional time series on the fly for cool demos and for testing your own detection algorithms later down the line.

Over the past several months we streamlined the setup process for ThirdEye in brand-new environments. Extensible and customizable software is great, but requiring a big up-front time investment just to try it out isn’t. Interactive analysis and visualization are core features of ThirdEye and we want you to be able to experience this directly.

An interesting challenge in setting up and/or demoing ThirdEye in a new environment is the need for useful data for analysis and visualization. We originally planned to use existing data sets, either by anonymizing system metrics we have here at LinkedIn or using publicly available time series data, e.g. financial data or our cloud workload traces. We ultimately decided to go beyond static data and generate synthetic time series on the fly instead. This not only provides an infinite supply for demonstration purposes, but also enables our algorithm developers to test and debug new algorithms with previously unseen data. As an additional benefit, model-generated data is portable, well-described, and takes up minimal space.

A core value proposition of ThirdEye’s interactive analysis is the ability to slice-and-dice data on demand. For synthetic time series this means that we need to actually generate this data with dimensional information in the first place – and the dimensional data must remain consistent for the duration of the analysis. As we wanted to make model-generated data useful for algorithm testing as well, we attempted to enable as much customization as possible without overwhelming users evaluating our platform. One part of this is sensible default settings; another is retaining the ability for fine-grained configuration and customization. Finding the right trade-offs here isn’t trivial and there will always be room for improvement in the future.

Generating synthetic data in ThirdEye

Currently, our “mock” data source supports generating time series data at millisecond to daily granularity, with dedicated weekly and daily seasonal components. We chose a configuration model which uses separate parameters for each sub-dimension of a metric. We then compute the aggregate of a metric by adding up the slices of the resulting data cube. The benefits are simplicity and the power to express realistic scenarios; the trade-off is verbose configuration.

On the positive side, we can customize time series amplitude and noise at different frequencies for each individual sub-dimension. At minimum, ThirdEye needs only the mean value for a given sub-dimension. We initialize a Gaussian noise component and the additive daily and weekly seasonality components with defaults that can be overridden as needed. Furthermore, ThirdEye only creates data for dimension combinations explicitly defined in the configuration, so you can express gaps in the data by omitting “Safari on Windows Mobile on Pixel 2” (in theory at least).
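To make the model above concrete, here is a minimal sketch in Python. It is not ThirdEye's implementation or configuration format: the names `CONFIG`, `DEFAULTS`, `generate_slice`, `generate_cube` and `aggregate`, the parameter keys, the default values and the sinusoidal shape of the seasonal components are all assumptions chosen for illustration. It only shows the idea described in the text: each explicitly configured sub-dimension gets a mean plus Gaussian noise plus additive daily and weekly components, and the aggregate is the sum of the matching slices.

```python
import math
import random

DAY = 24 * 3600       # seconds per day
WEEK = 7 * DAY        # seconds per week

DIMENSIONS = ("country", "browser")  # hypothetical dimension names

# Hypothetical defaults used when a slice only specifies its mean.
DEFAULTS = {"std": 1.0, "daily": 0.0, "weekly": 0.0}

# One entry per explicitly defined dimension combination.
# ("de", "safari") is omitted on purpose, so no data exists for it.
CONFIG = {
    ("us", "chrome"): {"mean": 100.0, "std": 5.0, "daily": 20.0, "weekly": 10.0},
    ("us", "safari"): {"mean": 40.0},
    ("de", "chrome"): {"mean": 30.0, "daily": 5.0},
}


def generate_slice(params, timestamps, rng):
    """Mean + additive daily/weekly seasonality + Gaussian noise."""
    p = {**DEFAULTS, **params}
    series = []
    for t in timestamps:
        daily = p["daily"] * math.sin(2 * math.pi * (t % DAY) / DAY)
        weekly = p["weekly"] * math.sin(2 * math.pi * (t % WEEK) / WEEK)
        noise = rng.gauss(0.0, p["std"])
        series.append(p["mean"] + daily + weekly + noise)
    return series


def generate_cube(config, timestamps, seed=42):
    """Generate every configured slice with its own reproducible RNG."""
    cube = {}
    for key, params in config.items():
        # Seeding per slice keeps each slice stable across repeated calls.
        rng = random.Random(f"{seed}:{'|'.join(key)}")
        cube[key] = generate_slice(params, timestamps, rng)
    return cube


def aggregate(cube, **filters):
    """Sum all slices whose dimension values match the given filters."""
    selected = [
        series
        for key, series in cube.items()
        if all(key[DIMENSIONS.index(dim)] == value for dim, value in filters.items())
    ]
    if not selected:
        return []  # combination not configured: a gap in the data
    return [sum(values) for values in zip(*selected)]


# One week of hourly timestamps (seconds since an arbitrary origin).
timestamps = [hour * 3600 for hour in range(7 * 24)]
cube = generate_cube(CONFIG, timestamps)
total = aggregate(cube)                    # whole metric
us_only = aggregate(cube, country="us")    # one slice of the cube
```

Seeding each slice separately is one simple way to keep the dimensional data consistent while slicing and dicing: the per-country series always add up to the total, no matter in which order they are requested.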

On the negative side, there is the sheer size of the configuration for high-dimensional data, which becomes intractable for 5+ dimensions, e.g. 50 countries x 20 page ids x 10 browsers x 10 operating systems x 10 devices = 1 million combinations. The flip side is that many combinations may not actually appear in the wild. Another, more subtle, disadvantage is the inability to synthesize non-additive data sets at this time.
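The arithmetic behind that example is easy to check. The short sketch below uses the cardinalities from the text; the dictionary keys and the helper name `full_cube_size` are hypothetical and only serve the illustration.

```python
import math
from itertools import accumulate
from operator import mul

# Cardinalities taken from the example in the text.
CARDINALITIES = {
    "country": 50,
    "page_id": 20,
    "browser": 10,
    "operating_system": 10,
    "device": 10,
}


def full_cube_size(cardinalities):
    """Number of combinations if every dimension value pairs with every other."""
    return math.prod(cardinalities.values())


# Size of the cube as dimensions are added one at a time.
growth = list(accumulate(CARDINALITIES.values(), mul))

print(full_cube_size(CARDINALITIES))
print(growth)
```

Each added dimension multiplies the number of entries a fully specified configuration would need, which is why only listing the combinations that actually occur keeps the configuration manageable.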

Overall, this setup has served us well for testing internally and we are confident you will get the hang of modifying existing and generating your own mock data quickly – simply look over the sample config files.

Another requirement for setting up a tool like ThirdEye effectively is a seamless transition from testing to full integration. We made several changes to enable incremental, step-by-step integration with your existing infrastructure, from “fully isolated and synthetic” to “all real”, and have already tested the procedure with our partners. Our Quick Start guide contains (optional) steps to integrate data from your existing Pinot cluster. Once you’re set up you can trivially remove the synthetic data source, and what remains is an integrated and fully functional installation of ThirdEye.
