Cloud Traces and Production Workloads for Your Research

Only interested in the raw trace data? Skip to the end.

(EDIT 2022-07-14: added Alibaba GPU and micro-services traces. H/T Yuyang Wang)
(EDIT 2022-07-14: added Chameleon Cloud traces. H/T Maël Madon)
(EDIT 2021-06-09: added new Azure Traces (2019, 2020, Serverless, DNN training). H/T Apoorve Mohan)
(EDIT 2021-06-09: added new Google Traces (2019). H/T Apoorve Mohan again, and again)
(EDIT 2021-06-09: added IBM Docker registry paper. H/T Yue Cheng)

(EDIT 2021-06-09: Thank you to the many contributors and commenters! Without your help, this awesome collection wouldn’t have happened. Special thanks to Apoorve Mohan, Dachuan Huang, Saurabh Jha, Yue Cheng, and the folks at ResearchGate. Full change log at the bottom)

Whenever there’s a new idea for a cloud scheduler, my first step is a quick draft of the algorithm in an IaaS cloud simulation framework – punching out every idea on a production system simply isn’t feasible. The simulator then needs to be fed with platform configuration about the system hardware and some type of utilization trace. The easiest type of workload trace to obtain is one generated from synthetic distributions, but this has some limitations. The traces we work with at minimum contain (a) job start times, (b) some notion of job size, such as duration or amount of data to process, and (c) a job type, such as the instance type or another form of constraint. When I speak of workload traces in this article, I am specifically referring to traces of batch jobs with fixed units of work. As an example, for one of our recent papers about SLA enforcement for IaaS spot instances this means in detail (a minimal record sketch follows the list):

  • request timestamp
  • instance life-time
  • instance core count
  • any additional data …
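
To make this concrete, here is a minimal sketch of how such a request record might be represented inside a simulator. The field names and the dataclass itself are illustrative choices, not the schema of any published trace.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpotRequest:
    """One batch request, roughly as used in our spot-instance experiments.

    Field names are illustrative, not the schema of any published trace.
    """
    request_ts: float             # request timestamp, seconds since trace start
    lifetime_s: float             # instance life-time in seconds
    cores: int                    # instance core count
    extra: Optional[dict] = None  # any additional data (priority, user id, ...)

# A request arriving 120 s into the trace for a 1-core instance
# that runs for one hour.
req = SpotRequest(request_ts=120.0, lifetime_s=3600.0, cores=1)
```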

Generating realistic cloud workloads synthetically has spawned an entire branch of research. My focus in this article, however, is a practical description of the steps I personally take when developing and evaluating a new cloud scheduler.

I usually start with a synthetic trace whose job inter-arrival times and durations are generated from an exponential distribution, with a uniform core size – in our example a core count of 1 – for all requests. If the new scheduler doesn’t provide satisfactory results with this, it’s back to the drawing board. The next stage uses a log-normal distribution for arrivals and durations, as this better models the long-tail properties of jobs encountered in real-world traces. A final extension of the synthetic traces is then the introduction of a non-homogeneous mix of instance sizes – which has been the demise of quite a few ideas. While the synthetic approach is a useful basis for testing, it does not re-create the kinds of challenges that production traces pose, such as change points in user behavior, time-varying auto-correlation, and seasonality in the workload.
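
A minimal sketch of this progression, assuming NumPy; the distribution parameters below are placeholders rather than values fitted to any real workload.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def synthetic_trace(n_jobs: int, stage: str = "exponential"):
    """Return a list of (start_time_s, duration_s, cores) tuples.

    stage "exponential": exponential inter-arrivals and durations, 1 core each.
    stage "lognormal":   log-normal inter-arrivals and durations (heavier tail)
                         plus a non-homogeneous mix of instance sizes.
    All parameters are placeholders, not fitted to a real trace.
    """
    if stage == "exponential":
        inter_arrivals = rng.exponential(scale=30.0, size=n_jobs)   # mean 30 s
        durations = rng.exponential(scale=600.0, size=n_jobs)       # mean 10 min
        cores = np.ones(n_jobs, dtype=int)                          # uniform size
    else:
        inter_arrivals = rng.lognormal(mean=3.0, sigma=1.0, size=n_jobs)
        durations = rng.lognormal(mean=6.0, sigma=1.5, size=n_jobs)
        cores = rng.choice([1, 2, 4, 8], size=n_jobs, p=[0.6, 0.2, 0.15, 0.05])
    start_times = np.cumsum(inter_arrivals)
    return list(zip(start_times, durations, cores))

trace = synthetic_trace(10_000, stage="lognormal")
```

The two stages mirror the order described above: if a scheduler already struggles with the exponential trace, the heavier-tailed variant with mixed instance sizes is not worth running yet.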

When a scheduler prototype enters serious consideration, I am a strong proponent of using traces recorded from production systems for evaluation. Unfortunately, this is where evaluation becomes difficult. Besides handling the technological complexity of the scheduler, a logistical problem comes up: the scarcity of publicly available production traces. This can be a big challenge for the aspiring cloud researcher. I’ve listed a number of notable exceptions below, but generally companies in the cloud space either do not record utilization traces over the long term or they heavily guard these traces and rarely allow the interested researcher a glimpse. If researchers do get access, they often cannot name the source of the traces and cannot re-distribute the raw data used as the foundation for their work. This in turn creates problems with the reproducibility of results and slows down the overall innovation process. The desire to protect a company’s competitive edge is understandable, and yet the availability of anonymized traces would spark innovation and drastically support academic research.

Fortunately, there are exceptions to this rule of scarcity. Here is a selection of public traces that we have found valuable in testing the real-world suitability of cloud schedulers:

Alibaba GPU traces. The released trace contains a hybrid of training and inference jobs running state-of-the-art ML algorithms. It was collected from a large production cluster with over 6,500 GPUs (on ~1,800 machines) in Alibaba PAI (Platform for Artificial Intelligence), spanning July and August of 2020.

Alibaba Micro-Services traces. The released traces contain detailed runtime metrics of nearly twenty thousand microservices. They were collected from Alibaba production clusters of over ten thousand bare-metal nodes over twelve hours in 2021.

Chameleon Cloud traces. Data from the OpenStack Nova/Blazar/Ironic services, as well as software to extract the appropriate data. The Chameleon data spans samples from 2017 to 2020.

Azure Public Dataset. A very large trace of anonymized cloud VMs in one of Azure’s availability zones. Contains CPU and memory utilization plus deployment batch size. Cortez et al. analyze the original trace in their SOSP 2017 paper. Microsoft keeps adding new traces over time (2019, 2020, Serverless, DNN training).

Google cluster workload. Published by Google in an effort to support large-scale scheduling research, these traces from a Google data center cell have attracted analysis efforts from a number of researchers in the meantime, e.g. an analysis by Sharma et al. The trace covers a 1-month time frame and 12,000 machines and includes anonymized job constraint tags. Google added a new trace in 2019 as well.

IBM Docker Registry traces. More of a server access trace than raw VM status, but increasingly relevant with the adoption of Kubernetes and containerization. Anwar et al. published the matching paper at USENIX in 2018.

Blue Waters HPC traces. (uses LDMS) Cray Gemini toroidal network traces from NCSA’s Blue Waters cluster. Especially relevant for HPC networking studies. Jha et al. present the trace with their work on Monet.

Mustang and Trinity HPC traces. HPC cluster traces from Los Alamos National Labs. The Mustang trace is a smaller, cloud-like trace with node counts and group IDs, whereas the Trinity trace comes from a large-scale supercomputer with a backfill scheduler. G. Amvrosiadis et al. analyze the traces and summarize the results.

Alibaba Cluster Trace Program. Data center traces for VMs with batch workloads and DAG information. Contains a 12-hour trace and a longer 8-day trace, with CPU and memory allocation. Lu et al. analyze the trace.

CERIT-SC grid workload. Traces from a cluster running cloud and grid applications on a shared infrastructure. Contains traces with resource footprint, instance groups, and allocated hosts. Klusácek and Parák analyze the trace.

TU Delft Bitbrains traces. Two data sets about VM allocation in a distributed data center focused on financial applications. One trace uses SAN storage, the other has a mixed population. Provides fine-grained CPU, memory, disk, and network utilization data over several weeks. Shen et al. analyze the trace. There are several other traces under “datasets”.

Eucalyptus IaaS cloud workload. Anonymized multi-month traces scraped from the log files of 6 different production systems running Eucalyptus private IaaS clouds. Published as part of a study by Wolski and Brevik. The traces contain start and stop times for instances, their size, and the node allocation as decided by the native scheduler. We added the traces from our IC2E 2015 paper on trustworthy cloud simulation as well.
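
As a usage note, a first sanity check I find useful on instance-level traces like these is reconstructing how many cores are allocated over time from the start and stop events. The sketch below assumes pandas and hypothetical column names (start_ts, stop_ts, cores, and the file name itself); the actual files use their own layout.

```python
import pandas as pd

# Hypothetical file and column names; adapt them to the actual trace layout.
df = pd.read_csv("eucalyptus_trace.csv")   # columns: start_ts, stop_ts, cores

# +cores at every instance start, -cores at every instance stop ...
events = pd.concat([
    pd.DataFrame({"ts": df["start_ts"], "delta": df["cores"]}),
    pd.DataFrame({"ts": df["stop_ts"],  "delta": -df["cores"]}),
]).sort_values("ts")

# ... and the running sum gives the number of allocated cores at any point in time.
events["allocated_cores"] = events["delta"].cumsum()
print(events[["ts", "allocated_cores"]].tail())
```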

Yahoo cluster traces. A number of data sets from Yahoo’s production systems. Most notably contains system utilization metrics from PNUTS/Sherpa and HDFS access logs for a larger Hadoop cluster. Additionally provides data sets with file access statistics and time-series for testing anomaly detection algorithms.

Cloudera Hadoop workload. (no trace) Similar to the above, with data from production systems of anonymous Cloudera customers and Facebook, analyzed by researchers from UC Berkeley. Unfortunately, the raw data is not available.

OpenCloud Hadoop workload. Taken from a Hadoop cluster managed by CMU’s Parallel Data Lab, these traces provide very detailed insights into the workload of a cluster used for scientific applications over a 20-month period. Includes timestamps, slot counts, and more. K. Ren et al. investigate the traces in depth.

Facebook Hadoop workload. A number of 1-hour segments from Facebook’s Hadoop traces published as part of UC Berkeley AMP Lab’s SWIM project. Some segments contain arrival times and duration, whereas others provide the amounts of data processed.

Notably, several of these traces stem from Hadoop clusters and are limited to data-mining applications. More generic IaaS-type workloads can be found in the Eucalyptus traces and, potentially, the Google trace. I want to emphasize that these are very different types of batch workloads that can offer interesting insights into the behavior of a cloud system under varying conditions. I hope this short reference provides a jumping-off point for both researchers and engineers to get their hands on a broader variety of production traces.

Change Log

(EDIT 2022-07-14: added Alibaba GPU and micro-services traces. H/T Yuyang Wang)
(EDIT 2022-07-14: added Chameleon Cloud traces. H/T Maël Madon)
(EDIT 2021-06-09: added new Azure Traces (2019, 2020, Serverless, DNN training). H/T Apoorve Mohan)
(EDIT 2021-06-09: added new Google Traces (2019). H/T Apoorve Mohan again, and again)
(EDIT 2021-06-09: added IBM Docker registry paper. H/T Yue Cheng)
(EDIT 2021-06-09: moved change log to the bottom)
(EDIT 2020-02-03: added Blue Waters HPC network traces. H/T Saurabh Jha)
(EDIT 2019-07-04: added Mustang and Trinity HPC traces. H/T Apoorve Mohan, again)
(EDIT 2019-03-11: added Azure and Alibaba traces. H/T Apoorve Mohan)
(EDIT 2018-02-21: added TU Delft Bitbrains and CERIT-SC traces. Via ResearchGate)
(EDIT 2017-08-01: added traces from our IC2E 2015 paper “Using Trustworthy Simulation to Engineer Cloud Schedulers”)
(EDIT 2015-09-15: added Yahoo cluster traces. H/T Dachuan Huang)

46 thoughts on “Cloud Traces and Production Workloads for Your Research”

  1. Hi Alex, This is very interesting, thank you.
    I was wondering if you know where to get traces of the datanodes.
    I’m doing research on optimizing inter-server traffic, so I need traces of the inter-node communication.

    On my local cluster, each datanode has a trace named:
    hadoop-hdfs-datanode-serve_name.log.

    Among others, it has the following information:
    INFO datanode.DataNode (DataXceiver.java:writeBlock(603)) – Receiving BP-186-192.198.1.54-419:blk_825_01 src: /192.168.1.59:50810 dest: /192.168.1.54:50010

    Thanks.

  2. Hi Arik,
    I primarily collect job tracker data and IaaS instance traces, so unfortunately I don’t have any HDFS-specific logs. The Yahoo and Google cluster traces may contain some processed data on file access.

  3. What actually is a task? Can you give an example of a Google cloud task? What is a task actually doing?
    Please answer

  4. Hi Rupali, the google trace is anonymized, i.e. there’s no info about the specific applications. AFAIK the trace contains primarily batch jobs. The Sharma et al. paper may have some pointers.

  5. Hi Alex,
    I must say, this is a very helpful post.
    As I am focusing on the VNE problem, I wonder if I could get some related traces.

    The traces should contain the request ID, the number of VMs requested and their communication graph, the configuration of each VM, etc.

    Thanks

  6. Hi Chinmaya,
    Trace anonymization typically requires configuration and comm patterns to be removed. That said, some traces have fairly predictable patterns (e.g. certain classes of hadoop jobs). The Wolski and Brevik paper also has details about the origins of the Eucalyptus traces.

  7. Hi Alexander,

    This is a very interesting blog! I thank you for the data you provide.
    I am a researcher dealing with resource allocation and optimization in data center networks.
    I need real workload traces for Facebook data centers in order to evaluate my proposals.
    You’ve already provided a link for FB traces above, but unfortunately, the traces are a little bit old (from November 2010).
    A new study, “Inside the Social Network’s (Datacenter) Network”, published in ACM SIGCOMM 2015, considers a more recent Facebook DC workload, from 2015.
    Any idea how I can find this dataset?
    I will be very thankful if you can provide it for me!

    Bests

  8. Hi Boutheina,
    I recommend you contact the authors of the paper directly. Chances are, however, you’ll have to get access/approval right from the source (i.e. Facebook).

  9. Hello Alex,
    I am an aspiring PhD student and I am interested in using the Google cluster logs to achieve the following:
    - Show the heterogeneity of the cluster infrastructure and workloads
    - Show how memory and CPU have been over-provisioned
    - Show the difficulty of predicting cloud workloads
    - Show that there are enormous opportunities for saving energy if resource provisioning is done right.
    However, I do not know how to get the Google cluster log or what I should know in order to achieve this. Any help will be appreciated.

  10. Hi Alex,
    I am a PhD student working on energy efficiency and resource optimization in data centers. Could you please point me to traces that include the workload plus the energy consumption of the data centers, including cooling and other air-conditioning consumption?

  11. Hi Ehsan,
    Unfortunately I don’t have any traces for auxiliary systems power consumption. You could check Gupta’s (ASU) “GDCSim” paper. There might be some pointers.

  12. Please help, I am working on optimization of virtual machine performance and I need Google traces or AWS traces.

  13. Hi ajayi

    Just google “Google cluster data” and you can get it from GitHub.

  14. Hi,
    I’m a PhD student in Computer Science working on the energy efficiency of data centres. I need traces that reflect server utilization, power consumption, and the performance metrics of a given task as well. Are there any resources out there that could help me?

  15. Hi Morteza,
    I just added two traces I discovered recently. If you need fine-grained metrics like CPU, memory, network, and disk I/O, the Bitbrains trace should be interesting.

  16. Hi Alex,

    I am a master’s student doing my thesis on traffic-aware VM placement in cloud data centers. I need traffic traces that contain the VM IDs and physical machines, the racks and switches, with the size of the packets between them. Any recommendations?

  17. Hi Alex,

    I am a PhD student working on optimization of Fog/Edge Computing allocation services. Are there any resources that could help me? Many thanks.

  18. Hi Alex
    I am working on cloud security. Can anyone provide data related to hierarchical security?
    Thanks

  19. Hi Alex,
    I am working on cloud resource management.
    I am considering a container-on-virtual-machine scenario (containers with virtualization). I need a trace that provides details like:
    timestamp, application ID, no. of containers, no. of VMs, no. of hosts (physical machines), CPU utilization, memory utilization, bandwidth, no. of requests, no. of users, response time, percentage of SLA violations, scale-up/scale-down.
    Would you please help me by providing this info?

    Good day,
    Hiren

  20. Hi, I am working on trust evaluation in the cloud, based on computation performed in the cloud over given data. Could you help me with any dataset or existing framework to get started? I would appreciate it.

  21. Hi, for my master’s proposal I’m working on a cloud scheduling mechanism that depends on a historical cloud data set containing all tasks with their attributes (CPU, RAM, task length, deadline, …) and their matched VMs with their attributes and resource utilization (such as memory and CPU), so that we can predict the type and number of needed VMs at the first iteration of our scheduling algorithm.

    So I need a real cloud data set with the above attributes, please.

  22. Hello,

    I am looking for OpenStack logs (OpenStack traces) or Google traces for my project. These logs should include CPU usage, memory usage, VM, delay, and packet loss.
    Can you guide me on how I can get them?

    Thank you

  23. Thanks for this awesome collection!
    Another trace that is not listed here: the Chameleon cloud testbed ( https://www.scienceclouds.org/cloud-traces/ ). I don’t know if you want to list it here given that it’s not a production cloud but a scientific cloud (same as CERIT-SC, which is listed here though).

  24. Hi,
    I’m a PhD student in Computer Science working on network traffic prediction.
    Is there any public dataset for traffic in data centers?
