Running the TPC-DS Benchmark

The Hive-MR3 release includes scripts that help the user test Hive-MR3 on the TPC-DS benchmark, the de facto industry-standard benchmark for measuring the performance of big data systems such as Hive. It provides a script for generating TPC-DS datasets and another for running Hive-MR3. Since the user can specify either Tez or MR3 as the execution engine of Hive, the scripts are also useful for comparing the performance of Tez and MR3.

Below we illustrate how to use the scripts with an example. We assume that Hive-MR3 runs in a secure cluster with Kerberos and that Metastore uses a MySQL database.

  • Metastore should already be running before generating TPC-DS datasets.
  • HiveServer2 should also be running before executing Hive queries with Beeline.
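Both prerequisites can be checked with a quick pre-flight test. The sketch below is a hedged helper (not part of the release scripts) that assumes Metastore and HiveServer2 listen on the standard ports 9083 and 10000 on the local host; adjust the host and ports to match env.sh.

```shell
# check_port is a hypothetical helper: it reports whether anything is
# listening on the given local TCP port, using bash's /dev/tcp so that
# no extra tools such as nc are required.
check_port() {
  local port=$1
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: listening"
  else
    echo "port $port: not reachable"
  fi
}

check_port 9083    # Metastore (standard port)
check_port 10000   # HiveServer2 (standard port)
```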

Generating a TPC-DS dataset

Before generating a TPC-DS dataset, set the following environment variables in env.sh:

HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2
  • HIVE_DS_FORMAT specifies the format in which the TPC-DS dataset is eventually stored in the Hive data warehouse. Three options are available: orc for ORC format (the default), textfile for text format, and rcfile for RCFile format.
  • HIVE_DS_SCALE_FACTOR specifies the scale factor of the TPC-DS dataset in gigabytes. For example, a scale factor of 1000 generates a dataset of 1000 gigabytes.
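For orientation, these two settings determine where the generated data ends up. The sketch below reconstructs the expected locations from the naming used in the examples in this section; the exact naming scheme is an assumption and may differ across releases.

```shell
# Illustrative only: derive the output locations that appear in this
# document's examples from the two env.sh settings.
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=1000

# Raw data generated by the MapReduce job (on HDFS).
RAW_DATA_DIR=/tmp/tpcds-generate/${HIVE_DS_SCALE_FACTOR}
# Database directory created under the Hive warehouse directory.
WAREHOUSE_DB=tpcds_bin_partitioned_${HIVE_DS_FORMAT}_${HIVE_DS_SCALE_FACTOR}.db

echo "raw data directory:     $RAW_DATA_DIR"
echo "warehouse database dir: $WAREHOUSE_DB"
```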

In non-local mode, the user should have write permission on the directory for the Hive warehouse (on HDFS), which is specified by HIVE1_HDFS_WAREHOUSE, HIVE2_HDFS_WAREHOUSE, or HIVE3_HDFS_WAREHOUSE in env.sh:

HIVE1_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE2_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE3_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse 

This is not a problem if the same user who starts Metastore also generates the TPC-DS dataset.

The user should also have write permission on the database for Metastore which is specified by HIVE1_DATABASE_NAME, HIVE2_DATABASE_NAME, or HIVE3_DATABASE_NAME in env.sh:

HIVE1_DATABASE_NAME=hivemr3
HIVE2_DATABASE_NAME=hive2mr3
HIVE3_DATABASE_NAME=hive3mr3

This is also not a problem if the same user generates a TPC-DS dataset after starting Metastore, because the user name and the password are set in the same configuration file hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hivemr3</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
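To double-check which user name Metastore will use, the value can be read back from hive-site.xml. The snippet below is a minimal grep/sed sketch that assumes the one-line <value> layout shown above; it operates on a stand-in copy of the fragment so it can be tried anywhere (in practice, point HIVE_SITE at the real hive-site.xml).

```shell
# Stand-in copy of the hive-site.xml fragment above.
HIVE_SITE=/tmp/hive-site-snippet.xml
cat > "$HIVE_SITE" <<'EOF'
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hivemr3</value>
</property>
EOF

# Extract the <value> on the line following the property name.
grep -A1 'javax.jdo.option.ConnectionUserName' "$HIVE_SITE" \
  | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
```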

hive/gen-tpcds.sh is the script for generating TPC-DS datasets. It accepts the following options:

--local                   # Run jobs with configurations in conf/local/.
--cluster                 # Run jobs with configurations in conf/cluster/ (default).
--mysql                   # Run jobs with configurations in conf/mysql/.
--tpcds                   # Run jobs with configurations in conf/tpcds/.
--hivesrc1                # Choose hive1-mr3 (based on Hive 1.2.2) (default).
--hivesrc2                # Choose hive2-mr3 (based on Hive 2.3.2).
--hivesrc3                # Choose hive3-mr3 (based on Hive 2.1.1).
--tezsrc1                 # Choose tez1-mr3 (based on Tez 0.7.0, default for --hivesrc1).
--tezsrc2                 # Choose tez2-mr3 (based on Tez 0.8.4, default for --hivesrc3).
--tezsrc3                 # Choose tez3-mr3 (based on Tez 0.9.1, default for --hivesrc2).

Note that the user should specify which version of Tez to use because the script uses Tez when populating the database for Metastore. Here is an example of the command for generating a TPC-DS dataset:

hive/gen-tpcds.sh --tpcds --hivesrc2 --tezsrc3

The script first runs a MapReduce job to generate raw data in the directory /tmp/tpcds-generate/[HIVE_DS_SCALE_FACTOR] (such as /tmp/tpcds-generate/1000). Then it runs a Tez job to fill the data warehouse and populate the database for Metastore. After the script completes successfully, the directory for the data warehouse contains a new subdirectory such as tpcds_bin_partitioned_orc_1000.db. The user may then delete the directory for raw data.
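The cleanup step can be scripted. The helper below is hypothetical (not part of the release): it only prints the removal command so it can be reviewed before execution, with the directory layout taken from the example above.

```shell
# tpcds_cleanup_cmd prints the command that deletes the raw TPC-DS data
# for a given scale factor; pipe its output to sh to actually run it.
tpcds_cleanup_cmd() {
  local scale_factor=$1
  echo "hdfs dfs -rm -r /tmp/tpcds-generate/${scale_factor}"
}

tpcds_cleanup_cmd 1000
```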

Running Hive-MR3 on the TPC-DS benchmark

The user can run hive/run-tpcds.sh to test Hive-MR3 on the TPC-DS benchmark in various scenarios. The script runs either Beeline or HiveCLI to execute Hive queries in the TPC-DS benchmark. Here are a few important options for hive/run-tpcds.sh:

-q <num>                  # Run a single TPC-DS query where num=[1..99].
-n <num>                  # Specify the number of times that the query runs. 
--beeline                 # Execute queries using Beeline.
--session                 # Execute all queries in the same MR3 session.
--amprocess               # Run the MR3 DAGAppMaster in LocalProcess mode.
--hiveconf <key>=<value>  # Add a configuration key/value.
  • With --beeline, the script starts a new Beeline connection which attempts to connect to HiveServer2 running at the address specified in env.sh. By default, the script runs HiveCLI.
  • With --session, the script executes a sequence of queries with a single Beeline connection or HiveCLI session. The sequence of queries to be executed is determined by the following suboptions:
      --querystart <num>        # Specify the starting TPC-DS query.
      --queryend <num>          # Specify the last TPC-DS query.
      --queryskip <num>         # Skip a query.
      --repeat <num>            # Repeat the same sequence of queries.
    
  • With --amprocess, if the script runs HiveCLI, the MR3 DAGAppMaster runs in LocalProcess mode. Hence a new process starts on the same machine where the script is run. This option is ignored if --beeline is used, in which case HiveServer2 starts a DAGAppMaster. (Currently --amprocess cannot be used in a secure cluster with Kerberos; see the documentation on LocalProcess mode.)
  • The user can append as many instances of --hiveconf as necessary to the command.

Here are a few examples of running the script:

# run query 12 with a HiveCLI session
hive/run-tpcds.sh --local --hivesrc2 --tezsrc3 -q 12

# run query 12 five times with a new HiveCLI session for each run
hive/run-tpcds.sh --mysql --hivesrc2 --tezsrc3 -q 12 -n 5

# run query 12 five times with a new HiveCLI session for each run 
# where the MR3 DAGAppMaster runs in LocalProcess mode
hive/run-tpcds.sh --tpcds --hivesrc2 --tezsrc3 -q 12 -n 5 --amprocess  

# run queries 12 to 66 with a single HiveCLI session
hive/run-tpcds.sh --tpcds --hivesrc2 --tezsrc3 --session --querystart 12 --queryend 66  

# run queries 12 to 66 with a single HiveCLI session 
# whose MR3 DAGAppMaster runs in LocalProcess mode
hive/run-tpcds.sh --tpcds --hivesrc2 --tezsrc3 --session --querystart 12 --queryend 66 --amprocess  

# run queries 12 to 66, skipping query 64, with a single Beeline connection 
hive/run-tpcds.sh --tpcds --hivesrc2 --tezsrc3 --beeline --session --querystart 12 --queryend 66 --queryskip 64

# repeat five times running queries 12 to 66 with a single Beeline connection
hive/run-tpcds.sh --tpcds --hivesrc2 --tezsrc3 --beeline --session --querystart 12 --queryend 66 --repeat 5

Sample queries in the TPC-DS benchmark

By default, the script executes Hive queries drawn from the directory hive/benchmarks/hive-testbench/sample-queries-tpcds-hive1, which contains 60 TPC-DS queries. The Hive-MR3 release also includes the directory hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2, which contains the complete set of 99 TPC-DS queries. (The current version of Hive-MR3 fails to parse some queries in this directory.) The user can switch directories by editing hive/run-tpcds.sh:

    if [[ $HIVE_SRC_TYPE = 1 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive1
    fi
    if [[ $HIVE_SRC_TYPE = 2 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive1
    fi
    if [[ $HIVE_SRC_TYPE = 3 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive1
    fi
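Since all three branches above point at sample-queries-tpcds-hive1, switching to the full 99-query set amounts to a one-line edit per branch. A sed invocation such as the following would do it; it is shown here on a stand-in file so the effect can be inspected first (in practice, apply it to hive/run-tpcds.sh, or edit only the branch of interest by hand).

```shell
# Stand-in copy of one assignment from run-tpcds.sh (single quotes keep
# the $TPC_DS_DIR reference literal).
SNIPPET=/tmp/run-tpcds-snippet.sh
echo 'TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive1' > "$SNIPPET"

# Point the query directory at the complete 99-query set.
sed -i 's|sample-queries-tpcds-hive1|sample-queries-tpcds-hive2|' "$SNIPPET"

cat "$SNIPPET"
```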

The TPC-DS benchmark in the Hive-MR3 release is originally from https://github.com/hortonworks/hive-testbench.