Turbocharging Hadoop Fair Scheduler using Dynamic Job Grouping in Multi-Job Workloads
This study addresses the inefficiency of the Hadoop fair scheduler when handling interdependent multi-job workloads in a single-tenant Hadoop environment. Hadoop has become a leading choice for large-scale data processing by leveraging MapReduce parallelism. However, conventional fair scheduling struggles with interdependent multi-job workloads, in which one job takes the output of another job as its input; the result is underutilized cluster resources and extended job waiting times. To remedy this, this research introduces a dynamic job grouping approach that improves fair scheduler performance and reduces the overall workload completion time. In a simulated healthcare workload of 35 interdependent jobs, the approach reduces average job completion time by 35% to 40% and shows adaptability across diverse scenarios. Recommendations include integrating dynamic job grouping into the fair scheduler, selecting an appropriate parallelism factor, and exploring dynamic job submission mechanisms in the Hadoop ecosystem with diverse datasets. As organizations increasingly adopt Hadoop for multi-job workloads, the proposed approach offers a useful enhancement for dynamic and adaptive job submission, improving overall efficiency.
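To make the idea of grouping interdependent jobs concrete, the following is a minimal sketch and not the method evaluated in this study. It assumes that dependencies are given as a mapping from each job to the jobs whose output it consumes, and that the parallelism factor caps how many jobs are submitted together. The function name `group_jobs`, the job names, and the grouping rule are all hypothetical and chosen only for illustration.

```python
def group_jobs(dependencies, parallelism_factor):
    """Split a job dependency graph into ordered groups of independent jobs.

    dependencies: dict mapping a job name to the set of jobs whose
                  output it reads (hypothetical input format).
    parallelism_factor: assumed cap on the number of jobs per group.
    """
    if parallelism_factor < 1:
        raise ValueError("parallelism_factor must be at least 1")

    # Copy the graph and include jobs that appear only as dependencies.
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    done = set()
    groups = []
    while remaining:
        # A job is ready once every job it depends on has been grouped.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependency among jobs")
        group = ready[:parallelism_factor]
        groups.append(group)
        done.update(group)
        for job in group:
            del remaining[job]
    return groups


# Hypothetical five-job workload, at most two jobs per group.
workload = {
    "ingest": set(),
    "clean": {"ingest"},
    "stats": {"clean"},
    "report": {"clean"},
    "export": {"stats", "report"},
}
print(group_jobs(workload, 2))
# [['ingest'], ['clean'], ['report', 'stats'], ['export']]
```

Under these assumptions, each group contains only jobs whose inputs are already available, so the jobs in one group could be submitted to the fair scheduler together rather than one at a time.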