Here’s a selection of collaborative projects I’m involved in – or have been involved with in the past. This list only includes publicly accessible projects and references.
Efficient monitoring of system and business metrics at scale
ThirdEye performs automated monitoring of key business and system metrics in real time and powers anomaly detection and root-cause analysis for core use cases at LinkedIn. Its machine-learning-driven detection engine learns from user feedback and improves accuracy over time. ThirdEye integrates with a number of analytics platforms such as Pinot and the Apache Hadoop ecosystem.
ExFed cloud federation
Job federation across IaaS clouds with strict availability SLAs
The ExFed cloud federation platform is an end-to-end prototype for job federation with strict availability SLAs on pre-emptible IaaS instances (“spot instances”). ExFed makes statistical predictions about the near-term availability of pre-emptible resources and uses them to meet requested availability guarantees through careful admission control. It builds on earlier research results in trace-based simulation of IaaS clouds at scale with minimal knowledge about cloud internals.
(“ExFed: Efficient Cross-Federation with Availability SLAs on Preemptible IaaS Instances” @ IC2E 2017)
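To illustrate the general idea of availability-driven admission control, here is a minimal sketch. It is not ExFed's actual method: it assumes a simple model in which each pre-emptible instance independently survives the job's duration with a predicted probability, and the names `prob_at_least` and `admit` are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n instances survive.

    Assumption for this sketch: instances fail independently and each
    survives with the same predicted probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def admit(required, available, survival_prob, sla):
    """Admit a job only if the predicted availability meets the requested SLA.

    required:      instances the job needs to keep running
    available:     pre-emptible instances that could be assigned to it
    survival_prob: predicted per-instance survival probability
    sla:           requested availability guarantee, e.g. 0.99
    """
    if required > available:
        return False
    return prob_at_least(required, available, survival_prob) >= sla
```

In this model, a job that needs one instance and is given two instances with a 50% survival prediction each has a 75% chance of staying available, so it would be admitted under a 0.7 SLA and rejected under a 0.9 SLA.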
Validated simulation of IaaS cloud infrastructure (UC Santa Barbara)
TraceSim is a highly accurate IaaS cloud simulator for the development of production-quality schedulers. The Monte Carlo-style simulation is calibrated to match a specific real-world cluster and validated via trace execution at observable scale. Validated simulation can then be used for quantitative evaluation of schedulers, which reduces engineering expenses for prototyping and testing, and for detailed capacity planning.
(“Using Trustworthy Simulation to Engineer Cloud Schedulers” @ IC2E 2015)
(“Providing Lifetime Service-Level-Agreements for Cloud Spot Instances” @ GCA 2015)
Apache Helix Auto-Scaling
Cluster Auto-Scaling with Apache Helix and YARN (LinkedIn)
This Helix module automates the life-cycle management of distributed systems and adds autonomous Service-Level-Objective-based (SLO-based) capacity planning. Similar to Amazon Auto Scaling, it adjusts the size of a cluster up and down automatically based on objectives defined by the operator, without violating constraints, and it recovers from failures automatically by replacing failed instances. Notably, these capabilities are independent of any specific cloud service provider and can be deployed in private or hybrid enterprise clouds.
Delphi / Pythia
Autonomous management of multi-tenant database clusters (UC Santa Barbara)
Pythia is a machine-learning-based cluster controller that ensures balanced quality of service across SQL database clusters in the presence of dynamic load changes. It characterizes tenant databases and generates packings of individual tenants into database servers. The characterization and assignment are adapted in real time, and packings are optimized autonomously over time.
(“Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs” @ SIGMOD 2013)
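As a rough illustration of what "packing tenants into database servers" means, the sketch below uses a plain first-fit-decreasing heuristic. This is a stand-in technique, not Pythia's machine-learning-based approach. The function name, the single integer load value per tenant, and the uniform server capacity are all assumptions made for the example.

```python
def first_fit_decreasing(tenant_loads, capacity):
    """Pack tenants onto servers with a first-fit-decreasing heuristic.

    tenant_loads: dict mapping tenant name to an integer load estimate
                  (hypothetical unit, e.g. percent of one server)
    capacity:     integer capacity of each (identical) server
    Returns a list of servers, each a list of tenant names.
    """
    servers = []  # each entry: {"free": remaining capacity, "tenants": [...]}
    # Place the heaviest tenants first; they are the hardest to fit.
    for name, load in sorted(tenant_loads.items(), key=lambda kv: kv[1], reverse=True):
        if load > capacity:
            raise ValueError(f"tenant {name!r} exceeds server capacity")
        for server in servers:
            if server["free"] >= load:
                server["tenants"].append(name)
                server["free"] -= load
                break
        else:
            # No existing server has room, so open a new one.
            servers.append({"free": capacity - load, "tenants": [name]})
    return [server["tenants"] for server in servers]
```

A static heuristic like this ignores how tenant load changes over time, which is the part the description above attributes to real-time characterization and continuous re-optimization.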
Highly efficient parallel data processing (UC San Diego)
TritonSort performs distributed batch processing for very large datasets (100TB+). Its design and implementation aim for resource efficiency, maximizing disk and network I/O. TritonSort set a new world record in 2011 and to date remains the top-performing system for large datasets in terms of performance and energy efficiency in the competitive Sort Benchmark.
(“TritonSort: A Balanced Large-Scale Sorting System” @ NSDI 2011)
Quant-trading Wiki (currently offline)
Open-source implementations of quantitative trading strategies
The Quant wiki serves as an open-source reference to quantitative trading strategies, backtests and related background information. It focuses on working implementations of algorithms and reproducible evidence. The wiki was created with the intent to provide a structured overview of publicly available information on quantitative trading and stem the tide of front-page turnover in the vibrant online community of quants.
Integration platform for engineering software tools (TU Vienna)
The Engineering Service Bus provides a platform for integrating software tools from different (engineering) domains and substantially reduces the setup time for project-specific tool chains. Going beyond a plain Enterprise Service Bus (ESB) approach, it allows existing setups to be adapted easily when software components change, which supports dynamically changing development processes.
Simulation of assembly workshops and fault-tolerant production control (TU Vienna)
Simulation of Assembly Workshops provides a framework for modeling automated production in factories. These models are used to approximate the efficiency of control strategies under normal conditions and to evaluate their behavior in the presence of machine and communication failures.