Integrate Carbondata with Apache Spark Shell

Apache Carbondata is an indexed columnar data store for fast analytics on big data platforms such as Apache Hadoop and Apache Spark. This article is a quick start guide on integrating Carbondata with the Apache Spark shell. Why another article when there is already a quick start guide on the official website? Things are not always as smooth as expected: in my experience, integrating Carbondata with Apache Spark using the pre-built binaries didn't work. So here is a quick start tutorial that builds Carbondata from source.

Requirements:
Carbondata requires Java 1.7 or 1.8 to run and Apache Maven to build from source. Please make sure that you have Oracle JDK 1.8, a supported version of Apache Maven, and Git installed before setting up Carbondata. If you don't have Oracle JDK or Apache Maven installed on your system, please install them first.
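You can quickly check which versions are installed on your machine with the following commands:
java -version
mvn -version
git --version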


Step 1:
Clone the Carbondata source code onto your machine. As mentioned earlier, the pre-built binaries didn't work for me, so we are going to build Carbondata from source in the following steps.
git clone https://github.com/apache/carbondata.git

Step 2:
Check out a recent stable release. At the time of writing this article, apache-CarbonData-1.6.1-rc1 had been released, but I wasn't able to build it using the command given in Step 3. Therefore, I checked out the release tag apache-CarbonData-1.6.0-rc3.
git checkout apache-CarbonData-1.6.0-rc3

Step 3:
Run the following Maven command to build the project. Please check the official Carbondata documentation for the latest build command. Carbondata 1.6.0 supports Apache Spark 2.3.2, so I used the following command:
mvn -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 clean package
Wait until the build is successful. If you run into dependency issues, try a lower version of Carbondata.
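Once the build succeeds, the assembly JAR that we will pass to the Spark shell in Step 8 should be available under assembly/target/scala-2.11/ inside the Carbondata source folder. You can locate it with the following command (the exact file name depends on the versions you built against):
ls assembly/target/scala-2.11/*.jar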

Step 4:
Download the Apache Spark version that you used to build Carbondata in Step 3. Here, I am using Apache Spark 2.3.2 built with Hadoop 2.7. To download Apache Spark, go to the Spark release archives and download the tgz file. The wget command to download Apache Spark 2.3.2 is given below; please copy the link from the official website.
wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

Step 5:
Extract the spark-2.3.2-bin-hadoop2.7.tgz file downloaded in the previous step anywhere you like.
tar -xvzf spark-2.3.2-bin-hadoop2.7.tgz


Step 6:
Open a terminal in your home folder (or anywhere you prefer) and create a file named sample.csv using the following command.
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF

Step 7:
Create a new directory named carbonstore anywhere you prefer. In this article, I am creating the directory in my home folder.
mkdir ~/carbonstore

Step 8:
Open a terminal in the extracted Spark folder and run the following command to start the Spark shell with the Carbondata assembly JAR. In my case, I cloned Carbondata into my home directory; replace the JAR path according to your location.
./bin/spark-shell --jars /home/gobinath/carbondata/assembly/target/scala-2.11/apache-carbondata-1.6.0-bin-spark2.3.2-hadoop2.7.2.jar

Step 9:
Import the SparkSession and CarbonSession by entering the following statements in the Spark shell.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

Step 10:
Create a CarbonSession object using the following command. Note that I am providing the path to the carbonstore folder I created in Step 7.
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/home/gobinath/carbonstore")
If you created the carbonstore folder somewhere else, please update the above command with your path.
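To confirm that the CarbonSession is working, you can run an ordinary Spark SQL statement against it, for example listing the existing tables:
carbon.sql("SHOW TABLES").show()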

Step 11:
If everything went fine so far, it is time to play with Carbondata. Let's create a table named test_table using the following command in the Spark shell.
carbon.sql(
           s"""
              | CREATE TABLE IF NOT EXISTS test_table(
              |   id string,
              |   name string,
              |   city string,
              |   age Int)
              | STORED AS carbondata
           """.stripMargin)

Step 12:
Load the data we wrote to sample.csv in Step 6 into this table.
carbon.sql("LOAD DATA INPATH '/home/gobinath/sample.csv' INTO TABLE test_table")

Step 13:
Run the following command to select everything from the test_table.
carbon.sql("SELECT * FROM test_table").show()

Executing the above statement should print all the rows from test_table. If you restart the Spark shell, please follow Steps 9 and 10 with the same carbonstore location to query the existing tables.
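Beyond the simple select, any standard Spark SQL query should work against the Carbondata table. As a small sketch over the sample data, here are a filter and an aggregation:
carbon.sql("SELECT * FROM test_table WHERE city = 'shenzhen'").show()
carbon.sql("SELECT city, avg(age) AS avg_age FROM test_table GROUP BY city").show()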

I hope you have successfully set up Carbondata and integrated it with the Spark shell. If you found this article useful or if you have any issues setting up Carbondata, please comment below.