A Testing Ground for Smarter Cloud Scheduling
22 June 2026
Overview
Every cloud application has to answer a deceptively simple question many times a day: which machine should run this piece of work? In systems built on Kubernetes, the industry-standard platform for managing containerised applications, that job falls to a component called the scheduler. Get its decisions right and work flows smoothly while resources are used efficiently. Get them wrong and tasks pile up waiting, while expensive hardware sits idle.
Researchers are increasingly turning to machine learning to make those decisions smarter, letting schedulers learn good strategies rather than follow fixed rules. But there's a practical obstacle: how do you test a new scheduling strategy without disrupting a live system? Experimenting on a real cluster is costly, hard to reproduce, and risks interfering with the actual workloads people depend on.
As part of the HYPER-AI project, researchers from the Cyprus University of Technology (CUT) and the University of Applied Sciences and Arts Western Switzerland (HES-SO) have built an answer: an open-source Kubernetes workload simulator that lets teams try out scheduling strategies in a safe, controlled, and repeatable environment.
A controlled environment for fair comparison
The idea is to recreate the behavior of a real cluster without the cost and unpredictability of running one. The simulator lets researchers define a virtual cluster of machines, with whatever mix of processing power, memory, and storage they want, spanning anything from large cloud servers to small edge and IoT devices. They can then generate synthetic workloads (streams of tasks with realistic arrival patterns and resource demands) and watch how a given scheduling strategy copes.
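To make the idea concrete, here is a minimal Python sketch of what describing a cluster and generating a synthetic workload might look like. It is not the simulator's actual API: the `Node` and `Task` classes, the `make_cluster` and `generate_workload` functions, the machine sizes, and the choice of Poisson arrivals are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """A hypothetical machine description (not the simulator's own type)."""
    name: str
    cpu_cores: float
    memory_gb: float
    storage_gb: float


@dataclass(frozen=True)
class Task:
    """A hypothetical unit of work with an arrival time and resource demands."""
    task_id: int
    arrival_time: float  # seconds since the start of the run
    cpu_cores: float
    memory_gb: float
    duration: float      # seconds of runtime once placed


def make_cluster() -> List[Node]:
    # Illustrative sizes only: one large cloud server, one edge box, one IoT device.
    return [
        Node("cloud-1", cpu_cores=64, memory_gb=256, storage_gb=2000),
        Node("edge-1", cpu_cores=8, memory_gb=16, storage_gb=256),
        Node("iot-1", cpu_cores=1, memory_gb=0.5, storage_gb=8),
    ]


def generate_workload(num_tasks: int, arrival_rate: float, seed: int) -> List[Task]:
    """Generate tasks with exponentially distributed inter-arrival gaps
    (a Poisson arrival process, assumed here as one plausible pattern)."""
    rng = random.Random(seed)  # a private, seeded generator keeps runs repeatable
    now = 0.0
    tasks = []
    for task_id in range(num_tasks):
        now += rng.expovariate(arrival_rate)
        tasks.append(
            Task(
                task_id=task_id,
                arrival_time=now,
                cpu_cores=rng.choice([0.25, 0.5, 1.0, 2.0]),
                memory_gb=rng.choice([0.25, 0.5, 1.0, 4.0]),
                duration=rng.uniform(1.0, 30.0),
            )
        )
    return tasks
```

Calling `generate_workload(100, arrival_rate=2.0, seed=42)` twice returns the same list of tasks, which is the property that makes repeatable experiments possible in a setup like this.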
Because the setup is fully reproducible, it becomes possible to do something that's genuinely hard on live infrastructure: change only the scheduler, keep everything else identical, and compare the results head to head. That makes for fair, like-for-like benchmarking of one strategy against another.
Built for flexibility
The simulator is deliberately modular, separating the cluster, the workload, the scheduling logic, and the evaluation into independent parts that can be swapped out. It offers two ways to run: a lightweight Kubernetes-style mode that emulates real cluster behavior without the overhead of actual containers, and a fast native mode for quick experimentation. Scheduling strategies plug in through a common interface, so anything from a simple rule-based approach to a trained machine-learning model can be tested under identical conditions, including against Kubernetes' own default scheduler.
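As a rough illustration of what a common interface for scheduling strategies can look like, the Python sketch below defines one method that every strategy implements and two simple rule-based strategies that satisfy it. The names `Scheduler`, `select_node`, `FirstFit`, `MostFree`, and `place_all` are hypothetical and do not come from the simulator; a trained model could sit behind the same method in this sketch.

```python
from typing import Dict, List, Optional, Protocol


class Scheduler(Protocol):
    """Hypothetical plug-in interface: pick a node for a CPU request, or None."""

    def select_node(self, cpu_request: float, free_cpu: Dict[str, float]) -> Optional[str]:
        ...


class FirstFit:
    """Pick the first node (in listing order) with enough free CPU."""

    def select_node(self, cpu_request: float, free_cpu: Dict[str, float]) -> Optional[str]:
        for name, free in free_cpu.items():
            if free >= cpu_request:
                return name
        return None


class MostFree:
    """Pick the feasible node with the most free CPU, spreading load out."""

    def select_node(self, cpu_request: float, free_cpu: Dict[str, float]) -> Optional[str]:
        feasible = {name: free for name, free in free_cpu.items() if free >= cpu_request}
        if not feasible:
            return None
        return max(feasible, key=feasible.get)


def place_all(scheduler: Scheduler, requests: List[float],
              capacity: Dict[str, float]) -> List[Optional[str]]:
    """Run one strategy over a fixed list of requests, starting from a fresh
    copy of the capacities so every strategy sees identical conditions."""
    free = dict(capacity)
    placements: List[Optional[str]] = []
    for request in requests:
        node = scheduler.select_node(request, free)
        if node is not None:
            free[node] -= request
        placements.append(node)
    return placements


capacity = {"a": 4.0, "b": 9.0}
requests = [2.0, 2.0, 2.0]
print(place_all(FirstFit(), requests, capacity))  # ['a', 'a', 'b']
print(place_all(MostFree(), requests, capacity))  # ['b', 'b', 'b']
```

Only the strategy object changes between the two calls; the cluster and the requests stay the same, so the different placements are attributable to the strategy alone.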
Throughout each run, the simulator records detailed traces and performance metrics, such as how long tasks wait, how long they take to complete, and how heavily each machine is used. Early experiments confirm the point of the exercise: different scheduling policies produce visibly different patterns of resource use, which is exactly why a systematic, repeatable way to evaluate them matters.
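The three metrics mentioned above can be computed from a trace in a few lines. The Python sketch below assumes a hypothetical trace format (`TraceRecord`) and particular definitions: waiting time is start minus arrival, completion time is finish minus arrival, and utilization is CPU-seconds used divided by CPU-seconds available over the whole run. The simulator's own trace schema and metric definitions may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical per-task trace entry; all times are in seconds."""
    task_id: int
    node: str
    arrival: float
    start: float
    finish: float
    cpu_cores: float


def summarize(trace: List[TraceRecord], node_cpu: Dict[str, float]) -> dict:
    """Compute mean wait, mean completion time, and per-node CPU utilization."""
    count = len(trace)
    mean_wait = sum(r.start - r.arrival for r in trace) / count
    mean_completion = sum(r.finish - r.arrival for r in trace) / count

    # Length of the observed run: first arrival to last finish.
    horizon = max(r.finish for r in trace) - min(r.arrival for r in trace)

    # CPU-seconds consumed on each node.
    used = defaultdict(float)
    for r in trace:
        used[r.node] += r.cpu_cores * (r.finish - r.start)

    utilization = {
        name: used[name] / (cores * horizon) for name, cores in node_cpu.items()
    }
    return {
        "mean_wait": mean_wait,
        "mean_completion": mean_completion,
        "utilization": utilization,
    }


trace = [
    TraceRecord(0, "a", arrival=0.0, start=0.0, finish=10.0, cpu_cores=2.0),
    TraceRecord(1, "a", arrival=0.0, start=5.0, finish=10.0, cpu_cores=2.0),
    TraceRecord(2, "b", arrival=2.0, start=2.0, finish=6.0, cpu_cores=1.0),
]
print(summarize(trace, {"a": 4.0, "b": 2.0}))
```

For this small made-up trace, node "a" comes out at 75% utilization and node "b" at 20%, the kind of per-machine difference that separates one scheduling policy from another.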
Why it matters
The simulator lowers the barrier to developing and validating better scheduling strategies, especially the learning-based ones that are difficult to train and test safely on production systems. By bridging the gap between abstract models and real deployments, it gives researchers and engineers a practical sandbox for iterating quickly and comparing approaches with confidence. The team plans to extend it with richer system modeling, closer integration with real Kubernetes clusters, and a graphical interface to make it more approachable still.
The full paper, presented at CLOUD COMPUTING 2026, is available here.