Read and Write ORC Files in Core Java

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome the limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. Many computing engines, from Hive to Presto, can read and write ORC files. When it comes to reading or writing ORC files using core Java, however, there is not much help available apart from the official documentation. This article is for you if you want to write your own code to read or write ORC files.

In this article, we will create a simple ORC writer and a simple ORC reader. Later, both will be enhanced to support any common ORC type, with some minor optimizations.

Requirements:
Read More

Install PyCharm on Linux

Considering the popularity of the Install IntelliJ IDEA on Linux post, I decided to write another post about how to install PyCharm, the famous Python IDE, on Linux. IntelliJ IDEA and PyCharm are from the same company and are built on the same code base. However, the final executable file and some configurations are different in PyCharm.

Read More

Presto SQL for Newbies

In the series of Presto SQL articles, this article explains to newcomers what Presto SQL is and how to use it. Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. One can even query data from multiple data sources within a single query.

Let's begin with what Presto is. Presto is a massively parallel processing engine that allows users to run queries against a wide range of databases. If you define a database as software that stores data and processes it, Presto does not fall under the database category. Rather, I prefer to call it a data or computing engine because Presto itself does not provide a storage solution. Instead, Presto focuses on how to query different data sources such as MySQL, SQL Server, Hive, Cassandra, and possibly even CSV files. Presto achieves this flexibility through its plugin architecture, as shown below:

If you later find a new database that should be supported by Presto, you only need to write a new connector to connect that database with Presto. Though it may look like connectors are doing the heavy lifting here, they only provide a simple API to connect to the database. For example, connectors tell Presto which tables are available in the underlying database and how to read raw data from them. Given that information, Presto decides how to process the data and respond to a user's request. The coolest thing here is that you can join a table from one database with a table in another database. For example, if a bank keeps account details in a MySQL database and transaction history in Hive, it does not need to migrate data from one database to the other to join them. Presto supports SQL like the following query out of the box:

SELECT acc.account_no as account_no, trans.amount
FROM mysql.bank.accounts acc LEFT JOIN hive.bank.transactions trans
    ON acc.account_no = trans.account_no
WHERE trans.amount > 1000;
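
To make the division of labor between connectors and the engine more concrete, here is a minimal sketch in Java. It is not Presto's real connector API: the Connector interface, the InMemoryConnector class, and the largeTransactions method are hypothetical names invented for this illustration, and the naive nested-loop join is used only because it is short, not because Presto executes the query this way. The sketch returns only matching rows, which is also what the query above returns, because its WHERE clause on trans.amount discards accounts without a matching transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConnectorSketch {

    /** Hypothetical, simplified connector: it only lists tables and returns raw rows. */
    interface Connector {
        List<String> listTables();

        List<Map<String, Object>> readRows(String table);
    }

    /** Toy in-memory connector standing in for a real source such as MySQL or Hive. */
    static class InMemoryConnector implements Connector {
        private final Map<String, List<Map<String, Object>>> tables = new HashMap<>();

        void addRow(String table, Map<String, Object> row) {
            tables.computeIfAbsent(table, t -> new ArrayList<>()).add(row);
        }

        @Override
        public List<String> listTables() {
            return new ArrayList<>(tables.keySet());
        }

        @Override
        public List<Map<String, Object>> readRows(String table) {
            return tables.getOrDefault(table, List.of());
        }
    }

    /** The "engine" side: joins rows from two connectors and filters by amount. */
    static List<String> largeTransactions(Connector accounts, Connector transactions, long minAmount) {
        List<String> result = new ArrayList<>();
        for (Map<String, Object> acc : accounts.readRows("accounts")) {
            for (Map<String, Object> trans : transactions.readRows("transactions")) {
                boolean sameAccount = acc.get("account_no").equals(trans.get("account_no"));
                long amount = (Long) trans.get("amount");
                if (sameAccount && amount > minAmount) {
                    result.add(acc.get("account_no") + ":" + amount);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the MySQL catalog that holds the accounts table.
        InMemoryConnector mysql = new InMemoryConnector();
        mysql.addRow("accounts", Map.of("account_no", "A-1"));
        mysql.addRow("accounts", Map.of("account_no", "A-2"));

        // Stand-in for the Hive catalog that holds the transactions table.
        InMemoryConnector hive = new InMemoryConnector();
        hive.addRow("transactions", Map.of("account_no", "A-1", "amount", 2500L));
        hive.addRow("transactions", Map.of("account_no", "A-2", "amount", 300L));

        System.out.println(largeTransactions(mysql, hive, 1000L)); // prints [A-1:2500]
    }
}
```

The point of the sketch is that the connectors know nothing about joins or filters; they only expose tables and rows, and everything else happens on the engine side.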


Read More

Install Oracle JDK 14 on Linux


Even though OpenJDK is available in Linux repositories, some applications strictly require the Oracle Java Development Kit. This article shows you how to manually install Oracle JDK 14 on your Linux system. In the provided commands, replace the version-specific paths and file names according to your downloaded version; <version> below stands for the version number in the name of the file you downloaded.
Oracle provides deb and rpm installers
If your Linux distribution uses the DEB package format, like Debian, you can download the jdk-<version>_linux-x64_bin.deb file and install it using the following command:
sudo dpkg -i jdk-<version>_linux-x64_bin.deb
If your Linux distribution uses the RPM package format, like CentOS, you can download the jdk-<version>_linux-x64_bin.rpm file and install it using the following command:
sudo rpm -ivh jdk-<version>_linux-x64_bin.rpm

However, this article explains the manual installation method which is applicable for all Linux distributions out there. Personally, I prefer the manual installation because I have more control over the changes made in the system.

Read More

Setup Presto SQL Development Environment

Presto SQL, a massively parallel processing big-data engine, has caught the attention of many big-data developers. This article is for those who would like to set up a development environment for the Presto SQL community edition. The steps below apply to any Presto variant, including Presto DB, once the class names and file names are replaced with their equivalents.

Requirements:

Read More

Install the latest Oracle JDK on Linux


Even though OpenJDK is available in Linux repositories, some applications strictly require the Oracle Java Development Kit. This article shows you how to manually install the latest Oracle JDK on your Linux system. This article uses JDK 14 to demonstrate the installation. In the provided commands, replace the version-specific paths and file names according to your downloaded version; <version> below stands for the version number in the name of the file you downloaded.
Oracle provides deb and rpm installers
If your Linux distribution uses the DEB package format, like Debian, you can download the jdk-<version>_linux-x64_bin.deb file and install it using the following command:
sudo dpkg -i jdk-<version>_linux-x64_bin.deb
If your Linux distribution uses the RPM package format, like CentOS, you can download the jdk-<version>_linux-x64_bin.rpm file and install it using the following command:
sudo rpm -ivh jdk-<version>_linux-x64_bin.rpm

However, this article explains the manual installation method which is applicable for all Linux distributions out there. Personally, I prefer the manual installation because I have more control over the changes made in the system.

Read More

Smelly instanceof Operator

The instanceof operator in Java is used to check whether a given reference is an instance of (that is, an object of) a given class. Though it is useful in some situations, using the instanceof operator is generally a bad practice. Whenever I see instanceof in my students' projects or in a code review, I raise the alarm. This article explains why the instanceof operator is considered a bad practice and how to avoid it.
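
As a rough illustration of the smell, the sketch below contrasts an instanceof chain with plain polymorphism. The Shape, Circle, and Rectangle types and both methods are hypothetical names made up for this example, and polymorphism is only one common way to avoid instanceof, not necessarily the approach the full article takes.

```java
import java.util.List;

public class InstanceofSketch {

    /** Hypothetical type used only for this illustration. */
    interface Shape {
        double area();
    }

    static class Circle implements Shape {
        private final double radius;

        Circle(double radius) {
            this.radius = radius;
        }

        @Override
        public double area() {
            return Math.PI * radius * radius;
        }
    }

    static class Rectangle implements Shape {
        private final double width;
        private final double height;

        Rectangle(double width, double height) {
            this.width = width;
            this.height = height;
        }

        @Override
        public double area() {
            return width * height;
        }
    }

    /** Smelly version: this method must be edited every time a new shape is added. */
    static double areaWithInstanceof(Object shape) {
        if (shape instanceof Circle) {
            Circle c = (Circle) shape;
            return Math.PI * c.radius * c.radius;
        } else if (shape instanceof Rectangle) {
            Rectangle r = (Rectangle) shape;
            return r.width * r.height;
        }
        throw new IllegalArgumentException("Unknown shape: " + shape);
    }

    /** Polymorphic version: each shape computes its own area, so no type checks are needed. */
    static double totalArea(List<Shape> shapes) {
        double total = 0;
        for (Shape shape : shapes) {
            total += shape.area();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(areaWithInstanceof(new Rectangle(2, 3))); // prints 6.0
        System.out.println(totalArea(List.of(new Rectangle(2, 3), new Rectangle(1, 4)))); // prints 10.0
    }
}
```

Adding a new shape to the polymorphic version means adding one class that implements Shape, while the instanceof version needs another branch in every method that inspects the type.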
Read More

Resume Tips for Software Engineers

Years ago, as an international student preparing my resume to get my first job in Canada, I had a lot of questions, and I searched a lot for how to make a resume that fits Canadian employers' requirements and style. Things have changed. Eventually, I got offers from some high-tech companies and landed a good job, and now I am interviewing candidates who are where I was a couple of years ago. After looking at more and more resumes, I decided to share my experience here in the hope that it will make someone's life better.


Who is this article for? Anyone looking for a new job in Canada. If you are a new immigrant who has no idea what Canadian employers are looking for, this article is tailored to your needs. Though my experience is limited to Canada, the companies I applied to are mostly US-based, so I hope it applies anywhere in North America. This article targets only those who are in the software industry; I don't know how much of it carries over to other industries. The rest of the article is divided into two topics: 1. Resume Sections, 2. Resume Format. The first topic covers what to include and what to leave out of your resume, and the second provides formatting tips that help your resume get you a call from the recruiter.
Read More

Presto SQL: Join Algorithms

Presto is a distributed big data SQL engine initially developed by Facebook, later open-sourced, and now led by the community. The previous article, Presto SQL: Types of Joins, covers the fundamentals of the join operators available in Presto and how they can be used in SQL queries. With that knowledge, you can now look at the internals of Presto. This article presents how Presto executes join operations and the algorithms it uses to join tables.


Read More

Presto SQL: Types of Joins

SQL join is one of the most important and most expensive SQL operations, and writing efficient SQL queries requires database engineers to understand it deeply. Understanding how the join operation works helps database engineers optimize their queries for efficient execution. This article explains the join operations supported in the open-source distributed computing engine Presto SQL. It is based on the now archived prestodb.rocks blog, which I referred to while learning the join algorithms of Presto.



Read More

Read Carbondata Table from Apache Hive

Apache Carbondata, an indexed columnar data store, depends heavily on Apache Spark but also supports other big data frameworks such as Apache Hive and Presto. This article explains how to read a Carbondata table created in Apache Spark from Apache Hive in two sections: 1. How to create a table in HDFS using Apache Spark, 2. How to read the Carbondata table from Apache Hive.

Requirements:
  • Oracle JDK 1.8
  • Apache Spark
  • Apache Hadoop (in this article, Apache Hadoop 2.7.7 is used)
  • Apache Hive (Carbondata officially supports Hive 2.x, so it is better to stick to a 2.x version. In this article, Apache Hive 2.3.6 is used to demonstrate the integration)
  • Carbondata libraries
Please follow the Integrate Carbondata with Apache Spark Shell article to compile Carbondata from source and integrate it with Apache Spark. This article is written based on the assumption that you have already followed all the steps from the above-mentioned article.

Read More

Integrate Carbondata with Apache Spark Shell

Apache Carbondata is an indexed columnar data store solution for fast analytics on big data platforms such as Apache Hadoop and Apache Spark. This article provides a quick start guide on how to integrate Carbondata with Apache Spark Shell. Why another article when there is a quick start guide on the official website? Things are not always as smooth as expected. In my experience, integrating Carbondata with Apache Spark using the pre-built binaries didn't work as expected. So here is the quick start tutorial.

Requirements:
Carbondata requires Java 1.7 or 1.8 to run and Apache Maven to build from source. Please make sure that you have Oracle JDK 1.8, Apache Maven, and Git installed to set up Carbondata. If you don't have Oracle JDK or Apache Maven installed on your system, please follow the links given below to install them first.

Read More

Install MySQL 8 on Ubuntu/Linux Mint

Ubuntu's official software repository provides MySQL 5.x, which can be installed by following the article Install MySQL with phpMyAdmin on Ubuntu. However, the latest release of MySQL, 8.x, requires you to manually add the software repository to your system, which makes the installation process a little tricky. This article walks you through the end-to-end installation process of MySQL 8.


Read More
