Installing Hive on MR3
In order to install Hive on MR3, download an MR3 release (e.g.,
hivemr3-0.8-hive3.1.1.tar.gz) and uncompress it in a directory of your choice (e.g., under the user’s home directory).
It suffices to install Hive on MR3 only on the master node where HiveServer2 or HiveCLI is to run, and the user does not have to install it on slave nodes.
A full release contains everything for running Hive on MR3, including scripts, preset configuration files, and jar files.
Alternatively download a minimal MR3 release (e.g.,
hivemr3-0.8-minimal.tar.gz) and rebuild all necessary components from the source code.
Then the user can try Hive on MR3 after a few additional steps.
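As a minimal sketch of the download-and-uncompress step, the following shell function extracts a release tarball into a directory of the user's choice. The function name install_release and the target directory are hypothetical; only the tarball name is taken from the text above.

```shell
# Hypothetical helper: extract an MR3 release tarball into a target directory.
install_release() {
    tarball=$1
    target=$2
    mkdir -p "$target" && tar -xzf "$tarball" -C "$target"
}

# Example (not run automatically):
#   install_release hivemr3-0.8-hive3.1.1.tar.gz "$HOME"
```

Only the master node needs this step, since the jar files are later distributed through HDFS.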
The following structure shows important files and directories in the release:
├── env.sh
├── conf
│   ├── local
│   ├── cluster
│   └── tpcds
├── hadoop
├── hive
│   ├── compile-hive.sh
│   ├── gen-tpcds.sh
│   ├── hiveserver2-service.sh
│   ├── metastore-service.sh
│   ├── run-beeline.sh
│   ├── run-hive-cli.sh
│   ├── run-tpcds.sh
│   ├── benchmarks
│   │   └── hive-testbench
│   └── hivejar
│       ├── apache-hive-1.2.2-bin
│       ├── apache-hive-2.3.5-bin
│       ├── apache-hive-3.1.1-bin
│       └── apache-hive-4.0.0-SNAPSHOT-bin
├── mr3
│   ├── upload-hdfslib-mr3.sh
│   ├── mr3jar
│   ├── mr3lib
│   └── mr3-ui
└── tez
    ├── compile-tez.sh
    ├── upload-hdfslib-tez.sh
    └── tezjar
        └── tez-0.9.1.mr3.0.1
Prerequisites for running Hive on MR3
In order to run Hive on MR3, the following requirements should be met.
- Java 1.8 or higher should be available.
- Basic Hadoop commands such as yarn should be available.
- The user should have access to their home directory and the /tmp directory on HDFS.
  - Ex. A user foo should have access to foo's home directory and /tmp on HDFS.
  - Hive on MR3 stores MR3 and Tez jar files under the directory on HDFS specified by HDFS_LIB_DIR in env.sh.
- If a directory specified in hive-site.xml already exists on HDFS, it must have directory permission 733, not 700.
  - Ex. if the directory is /tmp/hive, either /tmp/hive should exist with directory permission 733, or no such directory should exist. HiveServer2 automatically creates a new directory with permission 733 if it does not exist.
- MySQL should be running if the user wants to run Metastore with a MySQL database. The user should also have access to the database with a user name and a password.
- ss should be available in order to automatically terminate Metastore and HiveServer2. If it is not available, the user should terminate Metastore and HiveServer2 manually.
- javac should be available in order to generate TPC-DS datasets.
- Depending on the size of the cluster, the kernel configuration parameter SOMAXCONN (net.core.somaxconn) should be set to a sufficiently large value, e.g., 16384, on every node.
Then any user (not necessarily an administrator user) can run Hive on MR3.
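Some of the prerequisites above can be checked with a short script. The following is a sketch only: check_cmd and check_somaxconn are hypothetical helper names, and the default threshold of 16384 is just the example value mentioned above.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Hypothetical helper: compare a somaxconn value with a minimum.
check_somaxconn() {
    current=$1
    minimum=${2:-16384}
    if [ "$current" -ge "$minimum" ]; then
        echo "ok: net.core.somaxconn=$current"
    else
        echo "low: net.core.somaxconn=$current (want >= $minimum)"
        return 1
    fi
}

# Example usage on a Linux node (not run automatically):
#   for c in java yarn ss javac; do check_cmd "$c"; done
#   check_somaxconn "$(cat /proc/sys/net/core/somaxconn)"
```

The HDFS-related requirements (home directory, /tmp, directory permission 733) still have to be verified against the cluster itself, for example with hdfs dfs -ls -d.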
In a Kerberos-enabled secure cluster
For running Hive on MR3 in a secure cluster with Kerberos, the user should have a principal as well as permission to get Kerberos tickets and create a keytab file. The following commands are commonly used:
kinit <your principal>   # for getting a new Kerberos ticket
ktutil                   # for creating a keytab file
In order to run Metastore and HiveServer2, the user (or the administrator user) should have access to a service keytab file.
Typically the service keytab file is associated with a service user such as hive.
The format of the principal in the service keytab file should be primary/instance@REALM.
For example, in hive/node0@MR3.COM, hive is the primary,
node0 is the host where Metastore or HiveServer2 runs, and
MR3.COM is the realm, which is usually the domain name of the machine.
In comparison, the format of the principal in an ordinary keytab file is usually
primary@REALM without an instance field.
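To make the two principal formats concrete, the sketch below splits a principal into its fields with POSIX parameter expansion. parse_principal is a hypothetical helper written for illustration and is not part of the MR3 release.

```shell
# Hypothetical helper: split a Kerberos principal of the form
# primary[/instance]@REALM into its fields.
parse_principal() {
    principal=$1
    realm=${principal##*@}      # text after the last '@'
    name=${principal%@*}        # text before the last '@'
    primary=${name%%/*}         # text before the first '/'
    case $name in
        */*) instance=${name#*/} ;;   # service principal: has an instance
        *)   instance= ;;             # ordinary principal: no instance
    esac
    echo "primary=$primary instance=$instance realm=$realm"
}
```

With this helper, a service principal yields a non-empty instance field, whereas an ordinary principal leaves it empty.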
In order to support impersonation in HiveServer2, Yarn should be configured to allow the user starting Metastore and HiveServer2 to impersonate.
For example, in order to allow user
hive to impersonate,
the administrator user should add two configuration settings to
core-site.xml and restart Yarn:
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,foo,bar</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>red0</value>
</property>
In this example, hive in the two keys denotes the user starting Metastore and HiveServer2.
hadoop.proxyuser.hive.groups is the key for specifying the list of groups whose members can be impersonated by user hive, and
hadoop.proxyuser.hive.hosts is the key for specifying the list of nodes where user hive can impersonate.
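One way to check that the proxy-user settings are in place is to query them with hdfs getconf, which reads the Hadoop configuration files on the node where it runs. The sketch below assumes a hypothetical helper show_proxyuser and an HDFS_CMD variable introduced here only so that the command can be dry-run.

```shell
HDFS_CMD=${HDFS_CMD:-hdfs}   # overridable for dry runs (assumption)

# Hypothetical helper: print the proxy-user settings for a given user
# as seen by the local Hadoop client configuration.
show_proxyuser() {
    user=$1
    for suffix in groups hosts; do
        $HDFS_CMD getconf -confKey "hadoop.proxyuser.$user.$suffix"
    done
}

# Example (not run automatically):
#   show_proxyuser hive
```

Because the values come from local configuration files, this only confirms what was written to core-site.xml, not that the running services have been restarted with it.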
Setting environment variables for Hive on MR3
The behavior of Hive on MR3 depends on env.sh and four configuration files (hive-site.xml, mr3-site.xml, tez-site.xml, and mapred-site.xml).
hive-site.xml configures Hive,
mr3-site.xml configures MR3, and
tez-site.xml configures the Tez runtime.
mapred-site.xml is used when running Hive with the MapReduce execution engine and when generating TPC-DS data.
env.sh is a self-descriptive script located in the root directory of the installation.
It contains major environment variables that should be set in every installation environment.
The following environment variables should be set according to the configuration of the installation environment:
export HADOOP_HOME=/usr/lib/hadoop
HDFS_LIB_DIR=/user/$USER/lib
HADOOP_HOME_LOCAL=$HADOOP_HOME
HADOOP_NATIVE_LIB=$HADOOP_HOME/lib/native
SECURE_MODE=false
USER_PRINCIPAL=hive@HADOOP
USER_KEYTAB=/home/hive/hive.keytab
- HDFS_LIB_DIR specifies the directory on HDFS to which MR3 and Tez jar files are uploaded. Hence it is only for non-local mode.
- HADOOP_HOME_LOCAL specifies the directory for the Hadoop installation to use in local mode, in which everything runs on a single machine and Yarn is not required.
- SECURE_MODE specifies whether the cluster is secure with Kerberos or not.
- USER_PRINCIPAL and USER_KEYTAB specify the principal and keytab file for the user executing HiveCLI and Beeline.
For those who want to rebuild Hive or the Tez runtime,
the script also has optional environment variables that specify the directories for Hive and Tez source code.
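The sketch below illustrates how SECURE_MODE, USER_PRINCIPAL, and USER_KEYTAB relate to each other: a Kerberos ticket is obtained from the keytab only when secure mode is on. It is not taken from the MR3 scripts; login_if_secure and KINIT_CMD are hypothetical names, and the default values simply repeat the example settings above.

```shell
# Values as they might appear in env.sh (copied from the example above).
SECURE_MODE=${SECURE_MODE:-false}
USER_PRINCIPAL=${USER_PRINCIPAL:-hive@HADOOP}
USER_KEYTAB=${USER_KEYTAB:-/home/hive/hive.keytab}
KINIT_CMD=${KINIT_CMD:-kinit}   # overridable for dry runs (assumption)

# Hypothetical helper: obtain a ticket from the keytab only in secure mode.
login_if_secure() {
    if [ "$SECURE_MODE" = "true" ]; then
        $KINIT_CMD -kt "$USER_KEYTAB" "$USER_PRINCIPAL"
    else
        echo "SECURE_MODE=$SECURE_MODE: skipping kinit"
    fi
}
```

In a non-secure cluster the two Kerberos variables are simply ignored by this helper.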
Preset configuration files
The MR3 release contains three collections of preset configuration files under directories conf/local, conf/cluster, and conf/tpcds.
These configuration directories are intended for the following scenarios:
- conf/local (default): running Hive on MR3 in local mode (in which everything runs on a single machine) with a Derby database for Metastore
- conf/cluster: running Hive on MR3 in a cluster with a Derby database for Metastore
- conf/tpcds: running Hive on MR3 in a cluster with a MySQL database for Metastore
Each configuration directory has the following structure:
├── hive1
│   ├── beeline-log4j.properties
│   ├── hive-log4j.properties
│   └── hive-site.xml
├── hive2
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── hive3
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── hive4
│   ├── beeline-log4j2.properties
│   ├── hive-log4j2.properties
│   └── hive-site.xml
├── mapreduce
│   └── mapred-site.xml
├── mr3
│   └── mr3-site.xml
└── tez3
    └── tez-site.xml
For typical use cases on a Hadoop cluster, the user can start with
conf/tpcds and revise its configuration files (such as tez-site.xml) for performance tuning.
Every script in the MR3 release accepts one of the following options to choose a corresponding configuration directory:
--local     # Run jobs with configurations in conf/local/ (default).
--cluster   # Run jobs with configurations in conf/cluster/.
--tpcds     # Run jobs with configurations in conf/tpcds/.
A script may also accept an additional option to choose corresponding configuration files:
--hivesrc1  # Choose hive1-mr3 (based on Hive 1.2.2).
--hivesrc2  # Choose hive2-mr3 (based on Hive 2.3.5).
--hivesrc3  # Choose hive3-mr3 (based on Hive 3.1.1) (default).
--hivesrc4  # Choose hive4-mr3 (based on Hive 4.0.0-SNAPSHOT).
For example, --tpcds --hivesrc2 chooses the configuration files under conf/tpcds, using the Hive 2 files for Hive.
In this way, the user can easily try different combinations of Hive and Tez when running Hive on MR3.
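The option handling described above can be sketched as a small function that maps the options to a Hive configuration directory. This is an illustration only: conf_dir_for is a hypothetical helper, and the mapping from --hivesrcN to a hiveN subdirectory is an assumption based on the directory structure shown earlier, not the actual logic of the MR3 scripts.

```shell
# Hypothetical helper: map script options to a Hive configuration directory,
# assuming the layout conf/<mode>/hive<N>.
conf_dir_for() {
    mode=local      # --local is the default
    hive=hive3      # --hivesrc3 is the default
    for opt in "$@"; do
        case $opt in
            --local)    mode=local ;;
            --cluster)  mode=cluster ;;
            --tpcds)    mode=tpcds ;;
            --hivesrc1) hive=hive1 ;;
            --hivesrc2) hive=hive2 ;;
            --hivesrc3) hive=hive3 ;;
            --hivesrc4) hive=hive4 ;;
        esac
    done
    echo "conf/$mode/$hive"
}
```

Calling conf_dir_for with no options falls back to both defaults, which matches the behavior documented for the scripts.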
Using custom configuration settings
A script in the MR3 release may accept new configuration settings as command-line options according to the following syntax:
--hiveconf <key>=<value> # Add a configuration key/value.
The user can append as many instances of
--hiveconf as necessary to the command.
A configuration value specified with
--hiveconf takes the highest precedence and overrides any existing value in
hive-site.xml, mr3-site.xml, and tez-site.xml (not just in hive-site.xml).
Hence the user can change the behavior of Hive on MR3 without modifying preset configuration files at all.
(Note that the user can use
--hiveconf to configure not only Hive but also MR3 and Tez.)
Alternatively the user can directly modify preset configuration files to make the change permanent.
The user may create hivemetastore-site.xml and
hiveserver2-site.xml in a configuration directory for Hive
as configuration files for Metastore and HiveServer2, respectively.
Hive automatically reads these files when reading hive-site.xml.
The order of precedence of the configuration files is as follows (lower to higher):
hive-site.xml → hivemetastore-site.xml → hiveserver2-site.xml → --hiveconf command-line options
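The precedence rule amounts to "the last source that defines a key wins". The sketch below demonstrates this with plain key=value files standing in for the configuration sources; the real files are XML, and resolve_conf is a hypothetical helper used only to illustrate the ordering.

```shell
# Hypothetical helper: look up KEY in several "key=value" files, given in
# order from lowest to highest precedence. The last file that defines the
# key wins. Missing files are skipped.
resolve_conf() {
    key=$1
    shift
    value=
    for layer in "$@"; do
        [ -f "$layer" ] || continue
        while IFS='=' read -r k v; do
            if [ "$k" = "$key" ]; then
                value=$v
            fi
        done < "$layer"
    done
    echo "$value"
}
```

Passing the sources in the order listed above, with the --hiveconf values last, reproduces the documented behavior: a value given on the command line overrides whatever the files contain.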
Uploading MR3 and Tez jar files
The last step before running Hive on MR3 is to upload MR3 and Tez jar files to HDFS.
In order to run HiveServer2 or HiveCLI,
the user should execute the upload scripts mr3/upload-hdfslib-mr3.sh and tez/upload-hdfslib-tez.sh, which copy all the MR3 and Tez jar files (under mr3/mr3jar and
tez/tezjar) to the directory specified by HDFS_LIB_DIR in env.sh.
When running Hive on MR3, these jar files are registered as local resources for Hadoop jobs and automatically distributed to slave nodes (where NodeManagers are running). This step is unnecessary for running Hive on MR3 in local mode, or for running Metastore and Beeline.
To run HiveServer2 with doAs enabled (by setting
hive.server2.enable.doAs to true in hive-site.xml),
the user (typically the administrator user) should make the MR3 and Tez jar files readable to all end users after uploading to HDFS.
This is because every job runs under an end user who actually submits it.
If the MR3 and Tez jar files are not readable to the end user, the job immediately fails because no files can be registered as local resources.
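A minimal sketch of making the uploaded jar files readable to all end users is shown below. It assumes that the jar files live under a single HDFS directory passed as an argument; make_jars_readable and HDFS_CMD are hypothetical names, the latter introduced only so that the command can be dry-run.

```shell
HDFS_CMD=${HDFS_CMD:-hdfs}   # overridable for dry runs (assumption)

# Hypothetical helper: recursively grant read access (and directory
# traversal) to all users on the HDFS directory holding the jar files.
make_jars_readable() {
    $HDFS_CMD dfs -chmod -R a+rX "$1"
}

# Example (not run automatically), after the jar files have been uploaded:
#   make_jars_readable "$HDFS_LIB_DIR"
```

The parent directories on HDFS must also be traversable by end users, which this helper does not change.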