Running the TPC-DS Benchmark

The MR3 release includes scripts that help the user test Hive on MR3 with the TPC-DS benchmark, the de facto industry-standard benchmark for measuring the performance of big data systems such as Hive. The release contains a script for generating TPC-DS datasets and another for running Hive on MR3. Since the user can choose either MR3 or Tez as the execution engine of Hive, the scripts are also useful for comparing the performance of MR3 and Tez.

Before generating a TPC-DS dataset, check the following:

  • HADOOP_HOME in env.sh should be set. For local mode, HADOOP_HOME_LOCAL should be set.
  • The script for generating TPC-DS datasets stores raw data in a directory under /tmp/tpcds-generate/ (by executing a MapReduce job). Either delete /tmp/tpcds-generate/ or make it accessible to the user before running the script.
  • The script also runs HiveCLI to fill the Hive data warehouse. For local mode, HIVE_CLIENT_HEAPSIZE in env.sh should be set to a sufficiently large value so that HiveCLI can accommodate many ContainerWorkers.
  • The script runs HiveCLI with MR3 as the execution engine. Hence the user should set the following configuration keys in hive-site.xml appropriately (assuming the all-in-one scheme):
<property>
  <name>hive.mr3.map.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.reduce.task.memory.mb</name>
  <value>4096</value>
</property>

<property>
  <name>hive.mr3.all-in-one.containergroup.memory.mb</name>
  <value>4096</value>
</property>

If these keys are set too low (e.g., to 1024), the script may fail with an OutOfMemoryError:

make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
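In local mode, the heap size for HiveCLI is controlled by HIVE_CLIENT_HEAPSIZE in env.sh. A minimal sketch follows; the value below is only an illustration, and the right number depends on how many ContainerWorkers run inside HiveCLI:

```shell
# env.sh sketch -- hypothetical value, tune for your workload:
# in local mode the ContainerWorkers run inside the HiveCLI process,
# so the heap must be large enough to accommodate all of them.
HIVE_CLIENT_HEAPSIZE=16384   # in MB
```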

Now we illustrate how to run the TPC-DS benchmark with an example. We assume that Hive on MR3 runs in a secure cluster with Kerberos and that Metastore uses a MySQL database.

  • Metastore should already be running before generating TPC-DS datasets.
  • HiveServer2 should be running before executing Hive queries with Beeline.

Generating a TPC-DS dataset

Before generating a TPC-DS dataset, set the following environment variables in env.sh:

HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=2
  • HIVE_DS_FORMAT specifies the format of the TPC-DS dataset eventually to be stored in the Hive data warehouse. Three options are available: orc for the ORC format (default), textfile for the text format, and rcfile for the RC format.
  • HIVE_DS_SCALE_FACTOR specifies the scale factor of the TPC-DS dataset in gigabytes. For example, a scale factor of 1000 generates a dataset of 1000 gigabytes.
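The scale factor also determines where the raw data lands: the MapReduce job writes to a subdirectory of /tmp/tpcds-generate/ named after the scale factor. A small sketch:

```shell
# Sketch: the raw-data directory is derived from the scale factor
HIVE_DS_SCALE_FACTOR=2
RAW_DATA_DIR=/tmp/tpcds-generate/${HIVE_DS_SCALE_FACTOR}
echo "$RAW_DATA_DIR"   # /tmp/tpcds-generate/2
```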

In non-local mode, the user should have write permission on the directory for the Hive data warehouse (on HDFS), which is specified by HIVE1_HDFS_WAREHOUSE, HIVE2_HDFS_WAREHOUSE, or HIVE5_HDFS_WAREHOUSE in env.sh:

HIVE1_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE2_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse
HIVE5_HDFS_WAREHOUSE=/tmp/hivemr3/warehouse

This is not a problem if the same user generates a TPC-DS dataset after starting Metastore.
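A quick way to verify write permission on the warehouse directory before running the script (the path is taken from the env.sh excerpt above; adjust it if env.sh points elsewhere):

```shell
# Create and remove a scratch subdirectory to confirm write access
hdfs dfs -mkdir -p /tmp/hivemr3/warehouse/.permission-check \
  && hdfs dfs -rm -r /tmp/hivemr3/warehouse/.permission-check \
  && echo "write permission OK"
```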

The user should also have write permission on the database for Metastore which is specified by HIVE1_DATABASE_NAME, HIVE2_DATABASE_NAME, or HIVE5_DATABASE_NAME in env.sh:

HIVE1_DATABASE_NAME=hivemr3
HIVE2_DATABASE_NAME=hive2mr3
HIVE5_DATABASE_NAME=hive5mr3

This is not a problem if the same user generates a TPC-DS dataset after starting Metastore because the user name and the password are set in the same configuration file hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hivemr3</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
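If the database or the user does not exist yet, an administrator can create them in MySQL to match hive-site.xml. This is only a sketch; the exact statements depend on the MySQL version and site policy:

```shell
# Hypothetical MySQL setup matching the hive-site.xml values above
# (CREATE USER IF NOT EXISTS requires MySQL 5.7 or later)
mysql -u root -p <<'EOF'
CREATE DATABASE IF NOT EXISTS hivemr3;
CREATE USER IF NOT EXISTS 'hivemr3'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON hivemr3.* TO 'hivemr3'@'%';
FLUSH PRIVILEGES;
EOF
```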

hive/gen-tpcds.sh is the script to run for generating TPC-DS datasets:

--local                   # Run jobs with configurations in conf/local/ (default).
--cluster                 # Run jobs with configurations in conf/cluster/.
--mysql                   # Run jobs with configurations in conf/mysql/.
--tpcds                   # Run jobs with configurations in conf/tpcds/.
--hivesrc1                # Choose hive1-mr3 (based on Hive 1.2.2) (default).
--hivesrc2                # Choose hive2-mr3 (based on Hive 2.3.3).
--hivesrc5                # Choose hive5-mr3 (based on Hive 3.1.0).

Note that the user should specify which configuration directory and which Hive source to use. Here is an example of the command for generating a TPC-DS dataset:

hive/gen-tpcds.sh --tpcds --hivesrc2

The script first executes a MapReduce job to generate raw data in the directory /tmp/tpcds-generate/[HIVE_DS_SCALE_FACTOR] (such as /tmp/tpcds-generate/1000). Then it fills the data warehouse and populates the database by executing Hive queries with MR3 as the execution engine. After the script completes successfully, the directory for the data warehouse contains a new subdirectory such as tpcds_bin_partitioned_orc_1000.db. The user may then delete the directory for raw data.
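Assuming the naming convention tpcds_bin_partitioned_[format]_[scale factor].db holds, the name of the new warehouse subdirectory can be predicted from the env.sh settings:

```shell
# Sketch: reconstruct the warehouse subdirectory name from env.sh settings
HIVE_DS_FORMAT=orc
HIVE_DS_SCALE_FACTOR=1000
echo "tpcds_bin_partitioned_${HIVE_DS_FORMAT}_${HIVE_DS_SCALE_FACTOR}.db"
# tpcds_bin_partitioned_orc_1000.db
```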

Running Hive on MR3 using the TPC-DS benchmark

The user can run hive/run-tpcds.sh to test Hive on MR3 using the TPC-DS benchmark in various scenarios. The script runs either Beeline or HiveCLI to execute Hive queries in the TPC-DS benchmark. Here are a few important options for hive/run-tpcds.sh:

-q <num>                  # Run a single TPC-DS query where num=[1..99].
-n <num>                  # Specify the number of times that the query runs. 
--beeline                 # Execute queries using Beeline.
--session                 # Execute all queries in the same MR3 session.
--amprocess               # Run the MR3 DAGAppMaster in LocalProcess mode.
--hiveconf <key>=<value>  # Add a configuration key/value.
  • With --beeline, the script starts a new Beeline connection which attempts to connect to HiveServer2 running at the address specified in env.sh. By default, the script runs HiveCLI.
  • With --session, the script executes a sequence of queries with a single Beeline connection or HiveCLI session. The sequence of queries to be executed is determined by the following suboptions:
      --querystart <num>        # Specify the starting TPC-DS query.
      --queryend <num>          # Specify the last TPC-DS query.
      --queryskip <num>         # Skip a query.
      --repeat <num>            # Repeat the same sequence of queries.
    
  • With --amprocess, if the script runs HiveCLI, the MR3 DAGAppMaster runs in LocalProcess mode. Hence a new process starts on the same machine where the script is run. This option is ignored if --beeline is used, in which case HiveServer2 starts a DAGAppMaster. (Currently --amprocess cannot be used in a secure cluster with Kerberos; see the documentation on LocalProcess mode.)
  • The user can append as many instances of --hiveconf as necessary to the command.

Here are a few examples of running the script:

# run query 12 with a HiveCLI session
hive/run-tpcds.sh --local --hivesrc2 -q 12

# run query 12 five times with a new HiveCLI session for each run
hive/run-tpcds.sh --mysql --hivesrc2 -q 12 -n 5

# run query 12 five times with a new HiveCLI session for each run 
# where the MR3 DAGAppMaster runs in LocalProcess mode
hive/run-tpcds.sh --tpcds --hivesrc2 -q 12 -n 5 --amprocess  

# run queries 12 to 66 with a single HiveCLI session
hive/run-tpcds.sh --tpcds --hivesrc2 --session --querystart 12 --queryend 66  

# run queries 12 to 66 with a single HiveCLI session 
# whose MR3 DAGAppMaster runs in LocalProcess mode
hive/run-tpcds.sh --tpcds --hivesrc2 --session --querystart 12 --queryend 66 --amprocess  

# run queries 12 to 66, skipping query 64, with a single Beeline connection 
hive/run-tpcds.sh --tpcds --hivesrc2 --beeline --session --querystart 12 --queryend 66 --queryskip 64

# repeat five times running queries 12 to 66 with a single Beeline connection
hive/run-tpcds.sh --tpcds --hivesrc2 --beeline --session --querystart 12 --queryend 66 --repeat 5
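The --hiveconf option can be combined with any of the commands above, for example to override at run time the memory settings shown earlier in hive-site.xml (the values here are illustrative):

```shell
# Run query 12 with larger task and ContainerGroup memory for this run only
hive/run-tpcds.sh --tpcds --hivesrc2 -q 12 \
  --hiveconf hive.mr3.map.task.memory.mb=8192 \
  --hiveconf hive.mr3.reduce.task.memory.mb=8192 \
  --hiveconf hive.mr3.all-in-one.containergroup.memory.mb=8192
```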

Sample queries in the TPC-DS benchmark

The script executes Hive queries drawn either from the directory hive/benchmarks/hive-testbench/sample-queries-tpcds-hive1, which contains 60 TPC-DS queries, or from the directory hive/benchmarks/hive-testbench/sample-queries-tpcds-hive2, which contains the complete set of 99 TPC-DS queries. The user can switch directories by editing hive/run-tpcds.sh:

    if [[ $HIVE_SRC_TYPE = 1 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive1
    fi
    if [[ $HIVE_SRC_TYPE = 2 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
    fi
    if [[ $HIVE_SRC_TYPE = 5 ]]; then
        TPC_DS_QUERY_DIR=$TPC_DS_DIR/sample-queries-tpcds-hive2
    fi

The TPC-DS benchmark in the MR3 release is originally from https://github.com/hortonworks/hive-testbench.