Installing Hive on MR3

In order to install Hive on MR3, download an MR3 release (e.g., hivemr3-0.3-hadoop2.7-hive3.0.0-tez0.9.1.tar.gz) and uncompress it in a directory of your choice. A full release contains everything for running Hive on MR3, including scripts, preset configuration files, and jar files. Alternatively, download a minimal MR3 release (e.g., hivemr3-0.3-hadoop2.7-minimal.tar.gz) and rebuild all the necessary components from the source code; in this case, a few additional steps are required before running Hive on MR3. The following structure shows the important files and directories in the release:

├── env.sh
├── conf
│   ├── cluster
│   ├── local
│   ├── mysql
│   └── tpcds
├── hadoop
├── hive
│   ├── compile-hive.sh
│   ├── gen-tpcds.sh
│   ├── hiveserver2-service.sh
│   ├── metastore-service.sh
│   ├── run-beeline.sh
│   ├── run-hive-cli.sh
│   ├── run-tpcds.sh
│   ├── benchmarks
│   │   └── hive-testbench
│   └── hivejar
│       ├── apache-hive-1.2.2-bin
│       ├── apache-hive-2.3.3-bin
│       └── apache-hive-3.0.0-bin
├── mr3
│   ├── upload-hdfslib-mr3.sh
│   ├── mr3jar
│   ├── mr3lib
│   └── mr3-ui
└── tez
    ├── compile-tez.sh
    ├── upload-hdfslib-tez.sh
    └── tezjar
        ├── tez-0.7.0.mr3.0.1
        └── tez-0.9.1.mr3.0.1

Prerequisites for running Hive on MR3

In order to run Hive on MR3, the following requirements should be met.

  • Basic Hadoop commands such as hadoop, hdfs, and yarn should be available.
  • The user should have access to their home directory and the /tmp directory on HDFS.
    • Ex. A user foo should have access to /user/foo and /tmp on HDFS.
  • If a directory /tmp/$USER already exists, it must have directory permission 733, not 700.
    • Ex. When a user hive starts HiveServer2, either the directory /tmp/hive should already exist with permission 733, or it should not exist at all, in which case HiveServer2 automatically creates it with permission 733.
  • MySQL should be running if the user wants to run Metastore with a MySQL database. The user should also have access to the database with a user name and a password.
  • ss should be available in order to automatically terminate Metastore and HiveServer2. If not available, the user should terminate Metastore and HiveServer2 manually.
  • mvn, gcc, and javac should be available in order to generate TPC-DS datasets.

Then any user (not necessarily an administrator user) can run Hive on MR3.
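The prerequisites above can be checked with a short shell script before proceeding. This is only a sketch: the command names come from the list above, check_cmd is a hypothetical helper, and the HDFS path check follows the /tmp/$USER convention described above.

```shell
#!/bin/sh
# Sketch of a prerequisite check for Hive on MR3, following the list above.

check_cmd() {
  # Report whether a command is available on PATH.
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK:      $1"
  else
    echo "MISSING: $1"
  fi
}

# Basic Hadoop commands, ss (for terminating Metastore and HiveServer2),
# and the build tools needed for generating TPC-DS datasets.
for c in hadoop hdfs yarn ss mvn gcc javac; do
  check_cmd "$c"
done

# If /tmp/$USER already exists on HDFS, its permission should be 733.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls -d "/tmp/$USER"
fi
```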

In a Kerberos-enabled secure cluster

For running Hive on MR3 in a secure cluster with Kerberos, the user should have a principal as well as permission to get Kerberos tickets and create a keytab file. The following commands are commonly used:

kinit <your principal>      # for getting a new Kerberos ticket
ktutil                      # for creating a keytab file

In order to run Metastore and HiveServer2, the user (or the administrator user) should have access to a service keytab file. Typically the service keytab file is associated with user hive. The format of the principal in the service keytab file should be primary/instance@REALM.

  • Ex. hive/node0@MR3.COM where hive is the primary, node0 is the host where Metastore or HiveServer2 runs, and MR3.COM is the realm, which is usually the uppercase form of the domain name.

In comparison, the format of the principal in an ordinary keytab file is usually primary@REALM without an instance field.
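As an illustration, an ordinary keytab file can be created with MIT Kerberos ktutil and then verified with kinit. The principal name, keytab path, and encryption type below are assumptions; adjust them to your realm and KDC configuration.

```shell
# Inside an interactive ktutil session (ktutil prompts for the password):
#   ktutil:  addent -password -p foo@MR3.COM -k 1 -e aes256-cts-hmac-sha1-96
#   ktutil:  wkt /home/foo/foo.keytab
#   ktutil:  quit

PRINCIPAL=foo@MR3.COM          # hypothetical principal
KEYTAB=/home/foo/foo.keytab    # hypothetical keytab path

# Verify that the keytab can be used to obtain a ticket non-interactively.
if command -v kinit >/dev/null 2>&1; then
  if kinit -kt "$KEYTAB" "$PRINCIPAL"; then
    klist   # show the ticket obtained from the keytab
  fi
fi
```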

In order to support impersonation in HiveServer2, Yarn should be configured to allow the user starting Metastore and HiveServer2 to impersonate other users. For example, in order to allow user hive to impersonate other users, the administrator user should add two configuration settings to core-site.xml and restart Yarn:

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,foo,bar</value>
</property>

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>red0</value>
</property> 

In this example, hive in hadoop.proxyuser.hive.groups and hadoop.proxyuser.hive.hosts denotes the user starting Metastore and HiveServer2. hadoop.proxyuser.hive.groups specifies the list of groups whose members can be impersonated by user hive, and hadoop.proxyuser.hive.hosts specifies the list of hosts on which user hive can impersonate.
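Whether a full restart of Yarn is strictly necessary depends on the cluster; the standard Hadoop admin commands below can push updated proxyuser settings to a running NameNode and ResourceManager. This is a sketch, and the user name is hypothetical.

```shell
# Refresh the proxyuser (impersonation) settings after editing core-site.xml.
PROXY_USER=hive   # hypothetical: the user starting Metastore and HiveServer2
echo "Refreshing proxyuser settings for $PROXY_USER"

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfsadmin -refreshSuperUserGroupsConfiguration    # NameNode side
fi
if command -v yarn >/dev/null 2>&1; then
  yarn rmadmin -refreshSuperUserGroupsConfiguration     # ResourceManager side
fi
```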

Setting environment variables for Hive on MR3

The behavior of Hive on MR3 depends on env.sh and four configuration files (hive-site.xml, mr3-site.xml, tez-site.xml, and mapred-site.xml). hive-site.xml configures Hive, mr3-site.xml configures MR3, and tez-site.xml configures the Tez runtime. Hive reads mapred-site.xml when running Hive with the MapReduce execution engine and when generating TPC-DS data.

env.sh is a self-descriptive script located in the root directory of the installation. It contains major environment variables that should be set in every installation environment. The following environment variables should be set according to the configuration of the installation environment:

export HADOOP_HOME=/usr/hdp/2.6.4.0-91/hadoop
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME/lib/native/Linux-amd64-64:$HADOOP_HOME/lib/native
HDFS_LIB_DIR=/user/$USER/lib

SECURE_MODE=false

USER_PRINCIPAL=gitlab-runner@RED
USER_KEYTAB=/home/gitlab-runner/gitlab-runner.keytab
KINIT_RENEWAL_INTERVAL=72000

TOKEN_RENEWAL_HDFS_ENABLED=false
TOKEN_RENEWAL_HIVE_ENABLED=false

LOG_LEVEL=INFO
MR3_REV=0.1

  • HADOOP_HOME_LOCAL specifies the directory for the Hadoop installation to use in local mode in which everything runs on a single machine and does not require Yarn.
  • HDFS_LIB_DIR specifies the directory on HDFS to which MR3 and Tez jar files are uploaded. Hence it is used only in non-local mode.
  • SECURE_MODE specifies whether the cluster is secure with Kerberos or not.
  • USER_PRINCIPAL and USER_KEYTAB specify the principal and keytab file for the user executing HiveCLI and Beeline.
  • LOG_LEVEL specifies the logging level (DEBUG, INFO, WARNING, ERROR).
  • MR3_REV should be set to 0.1 for now.
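After editing env.sh, the directory-valued variables can be sanity-checked from the root of the installation. check_dir is a hypothetical helper, not part of the release.

```shell
#!/bin/sh
# Hypothetical sanity check for env.sh; run from the root of the release.

check_dir() {
  # $1 = directory path, $2 = variable name; report whether $1 exists.
  if [ -d "$1" ]; then
    echo "OK:    $2=$1"
  else
    echo "CHECK: $2=$1 (directory not found)"
  fi
}

if [ -f ./env.sh ]; then
  . ./env.sh   # pick up HADOOP_HOME, HADOOP_HOME_LOCAL, etc.
  check_dir "$HADOOP_HOME" HADOOP_HOME
  check_dir "$HADOOP_HOME_LOCAL" HADOOP_HOME_LOCAL
fi
```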

For those who want to rebuild Hive or the Tez runtime, the script also provides optional environment variables that specify the directories containing the Hive and Tez source code (TEZ1_SRC, TEZ3_SRC, HIVE1_SRC, HIVE2_SRC, HIVE5_SRC).

Preset configuration files

The MR3 release contains four collections of preset configuration files under directories conf/local, conf/cluster, conf/mysql, and conf/tpcds. These configuration directories are intended for the following scenarios:

  • conf/local: running Hive on MR3 in local mode (in which everything runs on a single machine) with a Derby database for Metastore
  • conf/cluster (default): running Hive on MR3 in a cluster with a Derby database for Metastore
  • conf/mysql: running Hive on MR3 in a cluster with a MySQL database for Metastore
  • conf/tpcds: running Hive on MR3 with the TPC-DS benchmark included in the MR3 release with a MySQL database for Metastore

Each configuration directory has the following structure:

├── hive1
│   ├── beeline-log4j.properties
│   ├── hive-log4j.properties
│   └── hive-site.xml
├── hive2
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── hive5
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── mapreduce
│   └── mapred-site.xml
├── mr3
│   └── mr3-site.xml
├── tez1
│   └── tez-site.xml
└── tez3
    └── tez-site.xml

Every script in the MR3 release accepts one of the following options to choose a corresponding configuration directory:

--local             # Run jobs with configurations in conf/local/.
--cluster           # Run jobs with configurations in conf/cluster/ (default).
--mysql             # Run jobs with configurations in conf/mysql/.
--tpcds             # Run jobs with configurations in conf/tpcds/.

A script may also accept additional options to choose corresponding configuration files:

--hivesrc1          # Choose hive1-mr3 (based on Hive 1.2.2) (default).
--hivesrc2          # Choose hive2-mr3 (based on Hive 2.3.3).
--hivesrc5          # Choose hive5-mr3 (based on Hive 3.0.0).
--tezsrc1           # Choose tez1-mr3 (based on Tez 0.7.0) (default).
--tezsrc3           # Choose tez3-mr3 (based on Tez 0.9.1).

For example, --tpcds --hivesrc2 --tezsrc3 chooses:

  • conf/tpcds/hive2/hive-site.xml
  • conf/tpcds/mr3/mr3-site.xml
  • conf/tpcds/tez3/tez-site.xml
  • conf/tpcds/mapreduce/mapred-site.xml

In this way, the user can easily try different combinations of Hive and Tez when running Hive on MR3. The following table shows valid combinations of Hive and Tez.

             --tezsrc1   --tezsrc3
--hivesrc1   OK
--hivesrc2               OK
--hivesrc5               OK

Using custom configuration settings

A script in the MR3 release may accept new configuration settings as command-line options according to the following syntax:

--hiveconf <key>=<value>  # Add a configuration key/value.

The user can append as many instances of --hiveconf as necessary to the command. A configuration value specified with --hiveconf takes the highest precedence and overrides any existing value in hive-site.xml, mr3-site.xml, and tez-site.xml (not just in hive-site.xml). Hence the user can change the behavior of Hive on MR3 without modifying preset configuration files at all. (Note that the user can use --hiveconf to configure not only Hive but also MR3 and Tez.) Alternatively the user can directly modify preset configuration files to make the change permanent.
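For example, the user might start Beeline with the default cluster configuration while overriding two settings for the session only. The invocation below is a sketch: the option syntax follows the release scripts described above, and both configuration keys are ordinary Hive keys chosen for illustration.

```shell
# Hypothetical invocation: use the preset cluster configuration but
# override two Hive settings without modifying any preset file.
CONF_OPT=--cluster   # choose conf/cluster (the default)

if [ -x hive/run-beeline.sh ]; then
  hive/run-beeline.sh "$CONF_OPT" \
    --hiveconf hive.exec.parallel=true \
    --hiveconf hive.exec.reducers.max=256
fi
```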

The user may create hivemetastore-site.xml and hiveserver2-site.xml in a configuration directory for Hive (conf/???/hive1, conf/???/hive2, conf/???/hive5) as configuration files for Metastore and HiveServer2, respectively. Hive automatically reads these files when reading hive-site.xml. The order of precedence of the configuration files is as follows (lower to higher):

hive-site.xml → hivemetastore-site.xml → hiveserver2-site.xml → --hiveconf command-line options

Uploading MR3 and Tez jar files

The last step before running Hive on MR3 is to upload MR3 and Tez jar files to HDFS. In order to run HiveServer2 or HiveCLI, the user should execute the following commands which copy all the MR3 and Tez jar files (under mr3/mr3jar and tez/tezjar) to the directory specified by HDFS_LIB_DIR in env.sh:

mr3/upload-hdfslib-mr3.sh
tez/upload-hdfslib-tez.sh --tezsrc1
tez/upload-hdfslib-tez.sh --tezsrc3

When running Hive on MR3, these jar files are registered as local resources for Hadoop jobs and automatically distributed to slave nodes (where NodeManagers are running). This step is unnecessary for running Hive on MR3 in local mode, or for running Metastore and Beeline.

To run HiveServer2 with doAs enabled (by setting hive.server2.enable.doAs to true in hive-site.xml), the user (typically the administrator user) should make the MR3 and Tez jar files readable to all end users after uploading them to HDFS. This is because every job runs under the end user who actually submits it. If the MR3 and Tez jar files are not readable to the end user, the job immediately fails because no files can be registered as local resources.
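Assuming HDFS_LIB_DIR is /user/hive/lib (a hypothetical value taken from env.sh), the administrator can inspect the uploaded jar files and make them readable to all end users as follows.

```shell
HDFS_LIB_DIR=/user/hive/lib   # hypothetical value of HDFS_LIB_DIR in env.sh

if command -v hdfs >/dev/null 2>&1; then
  # List the uploaded MR3 and Tez jar files.
  hdfs dfs -ls -R "$HDFS_LIB_DIR"
  # Make the directories and jar files world-readable (755 keeps them
  # writable only by the owner; directories need the execute bit for
  # traversal by end users).
  hdfs dfs -chmod -R 755 "$HDFS_LIB_DIR"
fi
```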