
Creating Validated Simulation Models of IaaS Clouds

In a recent effort to empirically evaluate a newly proposed power-aware scheduler for private IaaS clouds, we ran into problems obtaining accurate simulation results for the two cloud testbeds we were working with. This prompted us to investigate a way of creating validated, yet light-weight simulation models using an approach inspired by Perturbation theory. The approach augments a simple cloud model with measurements taken from a small subset of an actual production system to produce highly accurate predictions at scale.

The “power manager” we investigate is designed to learn from and validate against production traces with multi-month time frames. The sheer duration of these traces makes it necessary to use faster-than-realtime simulation. Furthermore, we want to make predictions about the performance of the scheduler at a larger scale than we can observe. Since we have the luxury of access to two production-quality testbeds, we are also required to deliver a fully functional scheduler that handles workloads replayed on the real-world clouds flawlessly. This article is meant to be an accessible, high-level how-to for our work on “Using Trustworthy Simulation to Engineer Cloud Schedulers”, published at IC2E 2015.

Our primary goal is to build light-weight simulation models of our two specific private IaaS clouds to evaluate the power manager. These “clouds” have 5 and 8 nodes, respectively, and are very small compared to commercial cloud installations. Due to their small size we can get away with a simple model while still achieving high accuracy. Having a production system at hand, we choose an approach inspired by Perturbation theory and start with a parsimonious (“solvable”) model of our cloud, e.g. a simple model derived from the architecture overview diagram. We then iteratively perturb (“refine”) the model until the desired level of accuracy is achieved.

We lay out a minimal model with the end goal of evaluating the power manager in mind. The power manager draws on node-occupancy information and modifies the power states of individual nodes in our cloud. Our simulation should therefore accurately reproduce node utilization, occupancy, and power states, and should allow for a pluggable scheduling algorithm and faster-than-realtime execution. Anything beyond this is optional. Hence, we start with a simple model in which instance requests (and timer events) arrive at a scheduler which places instances on individual nodes. These instances then execute on their node for a fixed duration until they complete. Furthermore, the scheduler observes the system state in fixed intervals (epochs) and may hibernate and wake up nodes as needed.
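To make this concrete, here is a minimal sketch of such a simulation loop in Python. It is not the actual simulator from the paper; the event types, the first-fit placement, the node capacity and the 60-second epoch length are all illustrative assumptions.

```python
# Minimal sketch of the simple cloud model: an event queue drives instance
# arrivals, completions and periodic scheduler epochs. Node capacity, epoch
# length and the placement rule are illustrative assumptions.
import heapq

EPOCH = 60.0  # assumed scheduler observation interval in seconds

class Node:
    def __init__(self, node_id, capacity=4):
        self.node_id = node_id
        self.capacity = capacity   # assumed max concurrent instances per node
        self.instances = set()
        self.awake = True

class CloudSim:
    def __init__(self, num_nodes, requests, power_policy):
        # requests: iterable of (arrival_time, instance_id, duration)
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.power_policy = power_policy   # pluggable: may hibernate/wake nodes
        self.events = [(t, "arrival", (iid, dur)) for t, iid, dur in requests]
        self.events.append((0.0, "epoch", None))
        heapq.heapify(self.events)
        self.now = 0.0

    def place(self, instance_id, duration):
        # First-fit placement on an awake node with spare capacity.
        for node in self.nodes:
            if node.awake and len(node.instances) < node.capacity:
                node.instances.add(instance_id)
                heapq.heappush(self.events,
                               (self.now + duration, "complete",
                                (instance_id, node.node_id)))
                return
        # No capacity available: the request is dropped in this toy model.

    def run(self):
        while self.events:
            self.now, kind, payload = heapq.heappop(self.events)
            if kind == "arrival":
                self.place(*payload)
            elif kind == "complete":
                instance_id, node_id = payload
                self.nodes[node_id].instances.discard(instance_id)
            elif kind == "epoch":
                self.power_policy(self.now, self.nodes)
                if self.events:  # stop ticking once the workload is drained
                    heapq.heappush(self.events, (self.now + EPOCH, "epoch", None))

# Example: replay two requests with a no-op power policy standing in for the
# power manager under evaluation.
sim = CloudSim(num_nodes=8,
               requests=[(0.0, "vm-1", 300.0), (10.0, "vm-2", 120.0)],
               power_policy=lambda now, nodes: None)
sim.run()
```

The power_policy hook is where a power manager would plug in; the no-op lambda in the example merely stands in for it.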

From experience working with clouds we also decide to run the simulation in a Monte Carlo fashion, which allows for non-determinism in state-transition times and failures in the system. This is necessary since multiple runs of the same workload – on the same cloud – are still subject to concurrency issues and tend to produce similar but not necessarily identical results.
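In code this amounts to wrapping each simulation run with a fresh random seed and reporting the spread of the resulting metric instead of a single value. The run_once and sample_wakeup_latency functions below are hypothetical stand-ins, not part of our actual tooling.

```python
# Hedged sketch of a Monte-Carlo wrapper: repeat the simulation with
# different seeds and report mean and spread of the metric of interest.
import random
import statistics

def sample_wakeup_latency(rng):
    # Assumed: node wake-up time in seconds drawn from a plausible range.
    return rng.uniform(20.0, 40.0)

def run_once(rng):
    # Stand-in for one full simulation run; here it just sums sampled latencies
    # and would normally return e.g. total node-hours spent powered on.
    return sum(sample_wakeup_latency(rng) for _ in range(10))

def monte_carlo(trials=100):
    results = [run_once(random.Random(trial)) for trial in range(trials)]
    return statistics.mean(results), statistics.stdev(results)

print(monte_carlo())
```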

Our Perturbation approach then demands that we validate our model’s predictions at small scale – a single node in our case – against measurements of the real system. As expected, the predictions of the initial model diverge by a 15 percent margin, and we enter our first round of iterative “perturbing” of the model. Comparing the logs of the production system with those of the simulation, three types of unaccounted-for overheads stand out: instance setup, instance teardown and node power state changes. We extract these three overheads as variables for our model by storing their empirical distributions and later sampling from them during the simulation. “Perturbing” our model by introducing these three variables indeed produces simulation results within 1 percent of the real 1-node cluster. But do these predictions hold when scaling up the simulation without collecting additional samples?
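A minimal sketch of how such variables can be represented: keep the raw measurements collected on one production node and resample them with replacement whenever the simulation needs a setup, teardown or power-transition delay. The EmpiricalDistribution class and all sample values below are invented for illustration.

```python
# Sketch of the three empirical overhead variables: store measured durations
# from a single production node and resample them during the simulation.
# All sample values below are made up for illustration.
import random

class EmpiricalDistribution:
    def __init__(self, samples):
        self.samples = list(samples)       # measured durations in seconds

    def draw(self, rng):
        return rng.choice(self.samples)    # resample with replacement

instance_setup    = EmpiricalDistribution([22.1, 25.4, 23.8, 30.2])
instance_teardown = EmpiricalDistribution([4.9, 5.3, 6.1, 5.7])
power_transition  = EmpiricalDistribution([41.0, 38.5, 45.2, 39.9])

rng = random.Random(0)
# e.g. stretch an instance's lifetime in the simulator by its sampled overheads:
effective_duration = 300.0 + instance_setup.draw(rng) + instance_teardown.draw(rng)
```

Porting the model to a different cluster then only requires replacing the three sample lists with measurements taken from a single node of that cluster.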

For the avid reader, the unsurprising answer to this question is “yes”. Even predictions made for our cloud with all 8 nodes validate against out-of-sample (i.e. later) measurements taken from the real-world system, with an error in the 1-2 percent range. Interestingly, this “cloud-specific” model of our 8-node cluster is trivially portable to the 5-node cluster and produces equally accurate results after swapping out the three empirical variables for measurements from a single node of the other cluster.

While these results are encouraging, a number of qualifications are in order. Most importantly, scaling up a simple model cannot be expected to remain accurate for large systems. Contention for shared resources such as network bandwidth or storage will put a cap on linear scaling sooner rather than later. Another aspect worth addressing is the assumption that nodes are homogeneous. Without this assumption, different nodes require different empirical distributions, which increases model complexity and the required sample count. Taken to the extreme, each node is represented by its own samples, which defeats the purpose of the approach. On the other hand, from a methodological perspective there is no restriction on “perturbing” the model further to account for these issues.

A practical takeaway of our Perturbation experiment is that it is indeed possible to build a simulation model that scales with the size of an implementation effort. The parsimonious simulation model is manually designed by the developer, which allows for qualitative considerations, and augmented with empirical measurements, which add quantitative information. The subsequent real-world testing of a component developed with the help of this perturbation model then generates new insights and empirical measurements at scale, which in turn can be fed back into the model through iterative “perturbing”.

Cloud Simulators for Research and Development

My personal interest lies in the area of scheduling and resource allocation in IaaS clouds. The effectiveness of a new scheduling algorithm often only becomes visible over a long period of time, under heavy load on the system. When working with production traces spanning multiple months, empirical evaluation in real-time becomes infeasible. The academic community has picked up on this issue and produced a large variety of simulators that allow schedulers to be evaluated faster than realtime. For a taxonomy of evaluation methods for large-scale systems, I highly recommend having a look at the survey by Gustedt et al. from 2009.

Looking specifically at the simulation approach, system evaluation is typically performed from a specific perspective – that of the application or of the infrastructure provider – and delivers accordingly tailored results. A subset of these simulators is presented below. Another, complementary summary of existing work by Oujani can be found online as well.

Infrastructure simulators:

CloudSim. One of the primary frameworks used for simulating clouds in academic research. It is the brain-child of the developers of GridSim and has been used in a number of studies as it is highly customizable. Extensions to CloudSim include CloudAnalyst and NetworkCloudSim, which add, among other things, a GUI and facilities for simulating geo-distributed applications.

GreenCloud. Built on NS2, its primary focus lies on exploring the impact of network layouts on cloud performance and energy consumption.

iCanCloud. Focuses on predicting application performance, energy consumption and cost with different hardware platforms and resource allocation schemes.

MDCSim. A commercial entrant in the area, relying on detailed models of individual hardware components to produce predictions about a cloud’s performance at scale. The original publication targets 3-tier web applications rather than generic IaaS cloud infrastructures.

DCSim. Simulates IaaS clouds with a specific focus on dynamic power- and SLA-optimization via VM migration. Its authors use tiered scale-out workloads and evaluate the advantage of VM migration and replication strategies over static provisioning.

GDCSim. Primarily concerned with the thermal aspects of power management in data centers, integrating existing modeling tools. It specifically investigates the interaction of workload intensity and resource management policies with the heat dissipation and fluid dynamics of different physical data center layouts.

Application-perspective simulators:

PICS. A recent entrant in the cloud simulation field, with a focus on accurately reproducing job execution times and cost on public clouds from traces.

EMUSim. Uses emulation of Bag-of-Tasks applications to extract performance properties and simulate their behavior at larger scale more accurately. An evaluation step ensures that emulation and simulation agree at observable scales.

These simulators are typically based on discrete-event simulation, building compound models from smaller sub-models. This approach makes them highly customizable, but it also makes calibrating and validating them against real-world measurements a significant problem. Notably, while application-perspective simulators are published with results validating their accuracy against measurements from real-world executions, this step appears to be missing for most infrastructure simulators. They are thus mostly applicable to exploratory research and design studies rather than exact performance prediction.