Prioritize predictable performance in Hadoop

The growth of Apache Hadoop over the past decade has proven that this open source technology's ability to process data at massive scale and give users shared access to cluster resources is not hype. The downside, however, is that Hadoop lacks predictability: it does not let enterprises ensure that the most important jobs complete on time, and it does not effectively use the full capacity of a cluster.

YARN provides the ability to preempt jobs in order to make room for other jobs that are queued up and waiting to be scheduled. Both the capacity scheduler and the fair scheduler can be statically configured to kill jobs that are taking up cluster resources otherwise needed to schedule higher-priority jobs.
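As a sketch of what that static configuration looks like, the capacity scheduler's preemption is driven by the ResourceManager's scheduling monitor, enabled in yarn-site.xml. The property names below follow standard Hadoop configuration; the timeout value is illustrative, not a recommendation:

```xml
<!-- yarn-site.xml: enable the preemption monitor for the capacity scheduler.
     Illustrative sketch only; tune values for your cluster. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<!-- Grace period (ms) an application gets to release over-allocated
     resources before its containers are forcibly killed. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>10000</value>
</property>
```

For the fair scheduler, the analogous switch is `yarn.scheduler.fair.preemption` in yarn-site.xml, with per-queue preemption timeouts set in fair-scheduler.xml. Note that in either case the policy is static: it reasons about allocated capacity, not about what running tasks are actually consuming.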

These tools can be used when queues are getting backed up with jobs waiting for resources. Unfortunately, they do not resolve the real-time contention problems for jobs already in flight. YARN does not monitor the actual resource utilization of tasks when they are running, so if low-priority applications are monopolizing disk I/O or saturating another hardware resource, high-priority applications have to wait.

As organizations become more advanced in their Hadoop usage and begin running business-critical applications in multitenant clusters, they need to ensure that high-priority jobs do not get stomped on by low-priority jobs. This safeguard is a prerequisite for providing quality of service (QoS) for Hadoop, but has not yet been addressed by the open source project.