Setting up an Apache Hive Data Warehouse

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. In this post, I will set up Apache Hive 3.1.2 on Apache Hadoop 3.3.0. Hive requires a ready-to-go Hadoop framework, so I strongly recommend setting up your Hadoop cluster using the post that I have already published, Running a Multi-Node Hadoop Cluster.

Apache Hive uses a database to store its metadata. I will use a Postgres database as the external metastore, so our first step is to install Postgres. I may use some CentOS (RHEL)-specific commands. We will go through the following steps to run Apache Hive.

  1. Configure Postgres Database
  2. Configure Apache Hive
  3. Resolve Some Errors
  4. Start Apache Hive

Configure Postgres Database

After the Hadoop framework is running, we can start setting up Apache Hive. I will assume that you have set up Hadoop by following my post Running a Multi-Node Hadoop Cluster, so we are on the same page. Select one of the Hadoop nodes as the Hive server and install Postgres on it. Apache Hive uses an execution engine (MapReduce by default, Tez, or Spark) to run queries in a distributed fashion, even though Hive itself is installed on only one server. Now, let's configure the Postgres database.

[root@hiveserver ~]# sudo -u postgres psql
postgres=# create user hive_user with encrypted password 'hive_password';
postgres=# create database hive_database;
postgres=# grant all privileges on database hive_database to hive_user;
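The three statements above can be kept in a small SQL script so the metastore database can be recreated later. A minimal sketch, using the example names from this post (hive_user, hive_database, hive_password); the guard only runs psql where it and passwordless sudo to postgres are actually available:

```shell
# Save the metastore bootstrap statements (example names from this post)
SQL=/tmp/init_hive_metastore.sql
cat > "$SQL" <<'EOF'
create user hive_user with encrypted password 'hive_password';
create database hive_database;
grant all privileges on database hive_database to hive_user;
EOF

# Run it only where psql and sudo access to the postgres user are available
if command -v psql >/dev/null 2>&1 && sudo -n -u postgres true 2>/dev/null; then
  sudo -u postgres psql -f "$SQL"
else
  echo "psql not reachable here; run $SQL on the Hive server"
fi
```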

Configure Apache Hive

Download and extract Hive.

[hduser@hiveserver ~]# wget {Apache Hive LINK (I downloaded apache-hive-3.1.2-bin.tar.gz)}
[hduser@hiveserver ~]# cd /opt
[hduser@hiveserver /opt]# sudo tar xzf /home/hduser/apache-hive-3.1.2-bin.tar.gz
[hduser@hiveserver /opt]# sudo ln -s apache-hive-3.1.2-bin/ hive
[hduser@hiveserver /opt]# sudo chown -R hduser:hadoop hive
[hduser@hiveserver /opt]# sudo chown -R hduser:hadoop apache-hive-3.1.2-bin

All Hive config files are available under the directory /opt/hive/conf. Now, edit the Hive config files.

[hduser@hiveserver ~]# cd /opt/hive/conf/
[hduser@hiveserver /opt/hive/conf]# cp hive-env.sh.template hive-env.sh
[hduser@hiveserver /opt/hive/conf]# vi hive-env.sh
# Add these commands
HADOOP_HOME=/opt/hadoop
export HIVE_HOME=/opt/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
[hduser@hiveserver /opt/hive/conf]# cp hive-default.xml.template hive-site.xml
[hduser@hiveserver /opt/hive/conf]# vi hive-site.xml
# Add these elements inside the <configuration> element
  <property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>

# Find the javax.jdo.option.ConnectionUserName property and change its value to the username of the Postgres database: hive_user

# Find the javax.jdo.option.ConnectionPassword property and change its value to the password of the Postgres user: hive_password

# Find the javax.jdo.option.ConnectionURL property and change its value to the JDBC URL of the Postgres database: jdbc:postgresql://hiveserver:5432/hive_database

# Find the javax.jdo.option.ConnectionDriverName property and change its value to the Postgres JDBC driver class: org.postgresql.Driver

# Find the hive.server2.enable.doAs property and change its value to false

# Hive also needs the Postgres JDBC driver jar on its classpath; download the PostgreSQL JDBC driver (postgresql-<version>.jar) and place it in /opt/hive/lib/.
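Editing these values by hand in a file as large as hive-site.xml is error-prone, so the edits can also be scripted. A hedged sketch, assuming GNU sed and the fact that the template keeps each property's &lt;name&gt; and &lt;value&gt; on adjacent lines; the demo below runs against a small sample file rather than the real /opt/hive/conf/hive-site.xml:

```shell
# Replace the <value> on the line following the given <name>
# (hive-default.xml.template keeps the two on adjacent lines)
set_hive_prop() {
  local file=$1 name=$2 value=$3
  sed -i "/<name>${name}<\/name>/{n;s|<value>.*</value>|<value>${value}</value>|;}" "$file"
}

# Demo on a minimal sample file (a stand-in for /opt/hive/conf/hive-site.xml)
site=/tmp/hive-site-sample.xml
cat > "$site" <<'EOF'
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
</configuration>
EOF

set_hive_prop "$site" javax.jdo.option.ConnectionUserName hive_user
set_hive_prop "$site" hive.server2.enable.doAs false
grep -A1 '<name>javax.jdo.option.ConnectionUserName</name>' "$site"
```

Note that the sed pattern treats the dots in the property name as wildcards, which is harmless for these names but worth knowing if you adapt the helper.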

Hive uses some directories in HDFS for temporary files and for storing data.

[hduser@hiveserver ~]# # Create Hive directories
[hduser@hiveserver ~]# ## Create warehouse and temp directories
[hduser@hiveserver ~]# hdfs dfs -mkdir /tmp
[hduser@hiveserver ~]# hdfs dfs -chmod g+w /tmp
[hduser@hiveserver ~]# hdfs dfs -mkdir -p /user/hive/warehouse
[hduser@hiveserver ~]# hdfs dfs -chmod g+w /user/hive/warehouse
[hduser@hiveserver ~]# ## Check permissions to be drwxrwxr-x
[hduser@hiveserver ~]# hdfs dfs -ls /
[hduser@hiveserver ~]# hdfs dfs -ls /user/hive
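The permission check in the last two commands can be automated by parsing the mode string from the listing output. A small sketch; the sample line below is illustrative, shaped like real `hdfs dfs -ls` output rather than taken from a live cluster:

```shell
# Verify that a mode string from `hdfs dfs -ls` has group write (e.g. drwxrwxr-x);
# the group write bit is the sixth character of the mode string
has_group_write() {
  case "$1" in ?????w*) return 0 ;; *) return 1 ;; esac
}

# Illustrative sample line, shaped like real `hdfs dfs -ls /user/hive` output
line='drwxrwxr-x   - hduser hadoop          0 2021-01-01 00:00 /user/hive/warehouse'
mode=${line%% *}
if has_group_write "$mode"; then
  echo "group-writable: $mode"
else
  echo "missing g+w: $mode"
fi
# → group-writable: drwxrwxr-x
```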

It is time to set environment variables.

[hduser@hiveserver ~]# vi .bashrc
# Add these commands
# Hive
export HIVE_HOME=/opt/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
[hduser@hiveserver ~]# source .bashrc

OK! Take a rest!

Resolve Some Errors

Although we have configured Hive correctly, you may still face some errors. The first one is a peculiar RuntimeException wrapping a com.ctc.wstx.exc.WstxParsingException.

Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character

I don't know exactly why this happens, since we only copied and edited the template config files, but the solution is simple: we need to remove one special character in hive-site.xml.

[hduser@hiveserver /opt/hive/conf]# vi hive-site.xml

# Find the hive.txn.xlock.iow property and delete the illegal character entity between the words "for" and "transactional" in its description.

The other error is more peculiar.

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)

This error happens because of an inconsistency between library versions. Resolving it requires a little version management: check the version of the guava library in both Apache Hive and Apache Hadoop, and copy the newer one over to the other.

[hduser@hiveserver ~]# ls /opt/hive/lib/ | grep guava
guava-19.0.jar
[hduser@hiveserver ~]# ls /opt/hadoop/share/hadoop/hdfs/lib/ | grep guava
guava-27.0-jre.jar

In my case, I need to copy the guava jar file from the Hadoop libraries to the Hive libraries, keeping a backup of the old one.

[hduser@hiveserver ~]# mv /opt/hive/lib/guava-19.0.jar /opt/hive/lib/guava-19.0.jar.bak
[hduser@hiveserver ~]# cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/guava-27.0-jre.jar
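Deciding which of the two jars is newer can also be scripted with a version-aware sort. A small sketch using GNU `sort -V` on the filenames seen above:

```shell
# Compare the two guava jar filenames and pick the newer version
hive_guava='guava-19.0.jar'
hadoop_guava='guava-27.0-jre.jar'
newer=$(printf '%s\n%s\n' "$hive_guava" "$hadoop_guava" | sort -V | tail -n 1)
echo "keep: $newer"
# → keep: guava-27.0-jre.jar
```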

Hopefully, there will be no more errors. Let’s start Hive.

Start Apache Hive

Use these commands to start the metastore server and HiveServer2. Each runs in the foreground, so use separate terminals.

[hduser@hiveserver ~]# hive --service metastore
[hduser@hiveserver ~]# hiveserver2
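Both commands block the terminal, so on a real server you may prefer to run them in the background with their output captured. A hedged sketch, assuming /opt/hive/bin is on PATH as set in .bashrc earlier; here the metastore is started first, since HiveServer2 connects to it, and the log directory is an example choice:

```shell
# Run both daemons in the background and capture their logs
LOG_DIR="${HIVE_LOG_DIR:-/tmp/hive-logs}"   # example location; adjust to taste
mkdir -p "$LOG_DIR"
if command -v hiveserver2 >/dev/null 2>&1; then
  nohup hive --service metastore > "$LOG_DIR/metastore.log" 2>&1 &
  sleep 5   # give the metastore a moment before HiveServer2 connects
  nohup hiveserver2 > "$LOG_DIR/hiveserver2.log" 2>&1 &
else
  echo "hive not on PATH; run these commands on the Hive server"
fi
```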

You can also create systemd services to control HiveServer2 and the metastore server.

[hduser@hiveserver ~]# cd /etc/systemd/system
[hduser@hiveserver /etc/systemd/system]# sudo vi hive.service

# Add these lines
[Unit]
Description=Hive Server 2
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target hdfs.service yarn.service

[Service]
User=hduser
Group=hadoop
Type=simple
ExecStart=/opt/hive/bin/hiveserver2
WorkingDirectory=/home/hduser
TimeoutStartSec=2min
Restart=on-failure

[Install]
WantedBy=multi-user.target

[hduser@hiveserver /etc/systemd/system]# sudo vi hive-metastore.service

# Add these lines
[Unit]
Description=Hive Meta Store Server
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target hdfs.service yarn.service hive.service

[Service]
User=hduser
Group=hadoop
Type=simple
ExecStart=/opt/hive/bin/hive --service metastore
WorkingDirectory=/home/hduser
TimeoutStartSec=2min
Restart=on-failure

[Install]
WantedBy=multi-user.target
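With both unit files in place, they can be enabled and started through systemctl. A guarded sketch; the guard skips the systemctl calls on machines that are not running systemd (or lack sudo rights), so these commands only take effect on the Hive server itself:

```shell
# Reload unit definitions and enable both services at boot
if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ] \
   && sudo -n true 2>/dev/null; then
  sudo systemctl daemon-reload
  sudo systemctl enable --now hive.service hive-metastore.service
  systemctl status hive.service hive-metastore.service --no-pager
else
  echo 'systemd not running here; execute these commands on the Hive server' \
    | tee /tmp/hive-systemd-note.txt
fi
```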

I hope this tutorial was useful. Thank You.

Author: Saeid Dadkhah
