Read Carbondata Table from Apache Hive

Apache Carbondata, an indexed columnar data store, depends heavily on Apache Spark but also supports other Big Data frameworks such as Apache Hive and Presto. This article explains how to read a Carbondata table created in Apache Spark from Apache Hive, in two sections: 1. How to create a Carbondata table in HDFS using Apache Spark, 2. How to read that table from Apache Hive.

Requirements:
  • Oracle JDK 1.8
  • Apache Spark
  • Apache Hadoop (In this article, Apache Hadoop 2.7.7 is used)
  • Apache Hive (Carbondata officially supports Hive 2.x, so it is better to stick to a 2.x version. In this article, Apache Hive 2.3.6 is used to demonstrate the integration)
  • Carbondata libraries
Please follow the Integrate Carbondata with Apache Spark Shell article to compile Carbondata from source and integrate it with Apache Spark. This article assumes that you have already followed all the steps in that article.

Creating Carbondata Table in Hadoop from Apache Spark

Step 1:
Create a file named sample.csv using the following command; it will later be used to populate the Carbondata table.
cat > sample.csv << EOF
id,name,scale,country,salary
1,yuhai,1.77,china,33000.1
2,runlin,1.70,china,33000.2
EOF

Step 2:
Copy sample.csv to HDFS.
$HADOOP_HOME/bin/hdfs dfs -put sample.csv /user/hadoop/
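
Optionally, you can verify that the file landed in HDFS before moving on. This is just a quick sanity check, assuming the same /user/hadoop/ path used above:
$HADOOP_HOME/bin/hdfs dfs -ls /user/hadoop/sample.csv
$HADOOP_HOME/bin/hdfs dfs -cat /user/hadoop/sample.csv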

Step 3:
Edit the $SPARK_HOME/conf/hive-site.xml file as shown below:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- hostname must point to the Hive metastore URI in your cluster -->
    <value>thrift://localhost:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>

  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.carbondata.hive.CarbonHiveMetastoreListener</value>
  </property>
</configuration>

Here we point to the Hive metastore URI and add the Carbondata Hive metastore listener. In this article, I assume that Apache Spark, Apache Hadoop, and Apache Hive are all installed on the same machine and accessible from the same user: hadoop.
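
Note that this configuration assumes a Hive metastore service is already listening on thrift://localhost:9083. If it is not running yet, one typical way to start it from the Hive installation (adjust to your setup) is:
$HIVE_HOME/bin/hive --service metastore &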

Hadoop, Spark, and Hive have to be run as the same user to avoid permission issues. If you installed Hadoop as the root user, you have to run Spark and Hive as the root user too.

Step 4:

Start the Spark Shell with Carbondata assembly JAR and Carbondata Hive JAR as shown below:
$SPARK_HOME/bin/spark-shell --jars ~/carbondata/assembly/target/scala-2.11/apache-carbondata-1.6.0-bin-spark2.3.2-hadoop2.7.2.jar,~/carbondata/integration/hive/target/carbondata-hive-1.6.0.jar

Step 5:
Import the SparkSession and CarbonSession by entering the following statements in the Spark shell.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

Step 6:
Declare the following variables which will be used in the upcoming steps.
val rootPath = "hdfs://localhost:9000/user/hadoop/carbon"
val storeLocation = s"$rootPath/store"
val warehouse = s"$rootPath/warehouse"
val metaStoreDB = s"$rootPath/metastore_db"

Here localhost:9000 is the address of the default Hadoop file system (fs.defaultFS). The address is defined in the $HADOOP_HOME/etc/hadoop/core-site.xml file. Please check this file if you don't know the address.
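
If you are not sure which address your installation uses, you can also read it straight from the Hadoop configuration; on Hadoop 2.x the property is named fs.defaultFS:
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS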

Step 7:
Create a CarbonSession object using the warehouse location, store location and metastore location defined in the previous step.
val carbon = SparkSession.builder().
  enableHiveSupport().
  config("spark.sql.warehouse.dir", warehouse).
  config(org.apache.carbondata.core.constants.CarbonCommonConstants.STORE_LOCATION, storeLocation).
  getOrCreateCarbonSession(storeLocation, metaStoreDB)


Step 8:
Create a Carbondata table with the name hive_carbon using the Carbon session created in Step 7.
carbon.sql("create table hive_carbon(id int, name string, scale decimal, country string, salary double) STORED BY 'carbondata'")

Step 9:
Load the sample.csv file we copied to HDFS in Step 2 into the hive_carbon table using the following command:
carbon.sql("LOAD DATA INPATH 'hdfs://localhost:54310/user/hadoop/sample.csv' INTO TABLE hive_carbon")

Step 10:
If everything went smoothly up to this point, the hive_carbon table should have been created in the HDFS store and you should be able to query the table from Spark using the following command:
carbon.sql("SELECT * FROM hive_carbon").show()


Reading Carbondata Table from Apache Hive

Step 1:
Apache Hive requires some additional libraries to be able to read a Carbondata table from an HDFS location. Some of these libraries come from the Carbondata project and some have to be downloaded separately. If you built Carbondata from source on your machine using Apache Maven as explained in Integrate Carbondata with Apache Spark Shell, you can find the JARs to be downloaded in your local Maven repository.

To copy the JAR files from the Carbondata project to Hive, use the following commands:
cp ~/carbondata/integration/hive/target/carbondata-hive-1.6.0.jar $HIVE_HOME/lib/
cp ~/carbondata/hadoop/target/carbondata-hadoop-1.6.0.jar $HIVE_HOME/lib/
cp ~/carbondata/core/target/carbondata-core-1.6.0.jar $HIVE_HOME/lib/
cp ~/carbondata/common/target/carbondata-common-1.6.0.jar $HIVE_HOME/lib/

Download the following JARs and copy them to the $HIVE_HOME/lib/ folder.
  • carbondata-format-1.6.0.jar
  • zstd-jni-1.3.2-2.jar
As mentioned earlier, if you built the Carbondata source on your machine, you can copy them from the local Maven repository using the following commands:
cp ~/.m2/repository/org/apache/carbondata/carbondata-format/1.6.0/carbondata-format-1.6.0.jar $HIVE_HOME/lib/
cp ~/.m2/repository/com/github/luben/zstd-jni/1.3.2-2/zstd-jni-1.3.2-2.jar $HIVE_HOME/lib/
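
To confirm that all six JARs are now in place, a quick listing like the following should show them (assuming the 1.6.0 versions used above):
ls $HIVE_HOME/lib | grep -E 'carbondata|zstd'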

Step 2:
If hiveserver2 is already running, shut it down and start it again so that it loads the libraries copied into the $HIVE_HOME/lib folder.
hive --service hiveserver2

Step 3:
Start beeline using the following command:
beeline -u jdbc:hive2://localhost:10000 -n hive

Step 4:
If you run SHOW TABLES, beeline will list the available tables, including the hive_carbon table created from Apache Spark. However, if you select from hive_carbon, beeline will print an empty result because Hive will not recursively search for the data files of the hive_carbon table. To enable recursive search, enter the following command in beeline.
set mapreduce.input.fileinputformat.input.dir.recursive=true;

The above property should enable recursive directory traversal in Hive. If it doesn't work, please try the following properties too:
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
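
If you prefer not to set these properties in every beeline session, one option (an assumption about your setup, not a step from the original workflow) is to persist them in $HIVE_HOME/conf/hive-site.xml:
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.input.fileinputformat.input.dir.recursive</name>
  <value>true</value>
</property>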

Step 5:
Now you can read the hive_carbon table from Hive. Enter the following SQL and see if you can read the hive_carbon table.
select * from hive_carbon;
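
If the basic select works, other read queries against the same table should also work from beeline. For example, these illustrative queries are based on the schema created in the Spark section (not part of the original article):
select count(*) from hive_carbon;
select name, salary from hive_carbon where country = 'china';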
If you find this article useful or if you face any problems in integrating Carbondata with Apache Hive, please leave a comment below.