June 2, 2016
In services like cloud computing or supercomputing, thousands of computing tasks are sent for execution on clusters of servers each second. Coordinating the myriad of incoming requests a cluster receives (e.g., which machine should execute job X, how many machines should be used to run process Y, etc.) is a daunting task, and one that piques the interest of CyLab Ph.D. student Alexey Tumanov.
“Oftentimes it takes many computers to perform a job,” says Tumanov, a Ph.D. student in Electrical and Computer Engineering who works in the Parallel Data Lab. “My work makes sure that those computers work together efficiently to get such jobs done.”
Tumanov’s latest study, “TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters,” just received the Best Student Paper Award at last month’s European Conference on Computer Systems in London, U.K.
To keep job resource requests in line, computing clusters employ “schedulers” that assign resources to complete the requested work. Scheduling is essential to computing and makes it possible for a computer to complete multiple tasks on a shared collection of servers.
Currently, some schedulers assume all of their resources have similar characteristics, Tumanov says, which results in an underutilization of cluster resources. Some schedulers also fail to take advantage of information that can be used to determine the amount of time a task will take on different types of resources.
“These two pieces of information – the types of available resources that can be used to perform the job and the deadlines and estimated run times associated with jobs – are at the core of TetriSched,” Tumanov says. “This scheduling work is all about answering the question of who runs where, and when.”
Tumanov and his collaborators developed an algebraic language called “Space-Time Request Language” (STRL), a simple language construct that substantially reduces the number of mathematical computations needed to find an optimal space and time for any given number of job requests. As a result, TetriSched is able to construct higher quality schedules by assigning the optimal resources to the right jobs, planning ahead which jobs to defer to a later time, and continuously re-evaluating to address new job arrivals and job runtime mis-estimates.
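To make the "who runs where, and when" question concrete, here is a minimal Python sketch of the trade-off between starting a job now on a slower resource and deferring it until a faster one frees up. It is a toy illustration only, not TetriSched's algorithm or STRL syntax; the Option class, the best_option function, the resource names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One hypothetical (resource, start time) placement for a job."""
    resource: str   # assumed resource type, e.g. "gpu" or "cpu"
    start: int      # time slot at which the job could start
    runtime: int    # estimated runtime on this resource type

    @property
    def finish(self) -> int:
        return self.start + self.runtime


def best_option(options, deadline):
    """Return the option that finishes earliest while meeting the deadline.

    Returns None when no option can meet the deadline.
    """
    feasible = [o for o in options if o.finish <= deadline]
    return min(feasible, key=lambda o: o.finish, default=None)


# Assumed scenario: the job takes 2 time units on a GPU node or 6 on a CPU node.
# The CPU node is free now (t=0); the GPU node only frees up at t=3.
options = [
    Option(resource="cpu", start=0, runtime=6),
    Option(resource="gpu", start=3, runtime=2),
]
choice = best_option(options, deadline=8)
print(choice)
```

In this assumed scenario, waiting for the faster resource finishes the job sooner than starting immediately on the slower one, which is the kind of deferral decision a plan-ahead scheduler has to weigh across many jobs at once.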
Tumanov and his collaborators are running TetriSched on a real, 256-node cluster housed at Carnegie Mellon.
Other authors on the paper include Computer Science Ph.D. students Timothy Zhu and Jun Woo Park, computer science professor Mor Harchol-Balter, Jatras Professor of Electrical and Computer Engineering and Computer Science (by courtesy) Greg Ganger, and Intel Labs Principal Engineer Michael Kozuch.