Using the Shuffle Handler

By default, Hive on MR3 uses an external shuffle service, such as Hadoop/MapReduce shuffle service, in order to send and receive intermediate data between ContainerWorkers. From MR3 0.5, Hive on MR3 can also use the shuffle handler available in the runtime system of MR3. Since the shuffle handler is implemented as a DaemonTask, every ContainerWorker runs its own thread for the shuffler handler. As a result, Hive on MR3 can now run in an environment where an external shuffle service is not available (for example, on Kubernetes).

In order to use the shuffle handler, the user should set three configuration keys in hive-site.xml and tez-site.xml:

  • Set hive.mr3.use.daemon.shufflehandler to true in hive-site.xml. Now Hive on MR3 attaches a DaemonTask for the shuffle handler to ContainerGroups.
  • Set to tez_shuffle. Now the runtime system of MR3 routes intermediate data to the shuffle handler of MR3, not to an external shuffle service.
  • Set tez.shuffle.port to a port number for the shuffle handler. The default value is 15551.

If a ContainerWorker fails to secure a port number specified by tez.shuffle.port, it chooses a random port number instead. Thus several ContainerWorkers can run on the same node without conflicts. If hive.mr3.use.daemon.shufflehandler is set to false but is set to tez_shuffle, ContainerWorkers fail with NullPointerException (from ShuffleUtils.deserializeShuffleProviderMetaData()).

Currently Hive on MR3 can use the shuffle handler only with the all-in-one ContainerGroup scheme.

The following configuration keys in tez-site.xml controls the behavior of the shuffle handler.

Name Default value Description
tez.shuffle.connection-keep-alive.enable false true: keep connections alive for reuse. false: do not reuse
tez.shuffle.max.thread 0 Number of threads for the shuffle handler. 0 sets the number of threads to 2 * the number of cores
tez.shuffle.listen.queue.size 128 Size of the listening queue. Can be set to the value in /proc/sys/net/core/somaxconn.
tez.shuffle.mapoutput-info.meta.cache.size 1000 Size of meta data of the output of mappers