Apache Maven for Beginners

Apache Maven is a build tool widely used by Java developers to manage project dependencies, control the build process, and automate tests. Apache Maven makes our lives easier, especially when building a complex Java project. However, beginners often stay away from Apache Maven, as I did years ago, simply because they find it complex to learn and use. This article simplifies the concepts behind Apache Maven and introduces it to beginners gently. You will see how to use Apache Maven to manage your project dependencies, using a simple Java project as an example. The article is structured into two main topics: Apache Maven in Eclipse and Apache Maven in IntelliJ IDEA. Of course, you can use Apache Maven without any IDE; however, I stick with IDEs to keep things simple for beginners. Other applications of Apache Maven, such as build management and test automation, will be covered in another article.
Let's begin with manual dependency management using a simple calculator application. Suppose you want to develop a calculator that receives a simple arithmetic expression such as "2 + 3 * 5" as input and prints the result to the console. Evaluating such a String input and calculating the result ourselves is a complex task. Fortunately, there is a library, exp4j, which can evaluate a String expression and return the result.
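As a preview of what Maven dependency management looks like, a library such as exp4j is pulled in by declaring it in the project's pom.xml. The coordinates below are the ones commonly published for exp4j, but treat the version as an assumption and check Maven Central for the latest release:

```xml
<dependency>
    <groupId>net.objecthunter</groupId>
    <artifactId>exp4j</artifactId>
    <version>0.4.8</version>
</dependency>
```

Once this snippet is added inside the `<dependencies>` element, Maven downloads the jar automatically; there is no need to hunt for it and add it to the classpath by hand.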

Spark 05: List Action Movies with Spark flatMap

Welcome to the fifth article in this series of Apache Spark tutorials. In this article, you will learn about the flatMap transform operation. After the introduction to flatMap, a sample Spark application is developed to list all action movies in the MovieLens dataset.


In the previous articles, we used the map transform operation, which transforms one entity into another entity in a one-to-one manner. For example, if you have a String RDD named lines, applying lines.map(x => x.toUpperCase) creates a new String RDD with the same number of records, but with uppercase string literals.
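The one-to-one versus one-to-many distinction can be sketched with plain Scala collections, which share the map and flatMap API with RDDs; the sample strings here are made up for illustration:

```scala
val lines = List("the dark knight", "mad max")

// map is one-to-one: the result has exactly as many elements as the input
val upper = lines.map(x => x.toUpperCase)
// List("THE DARK KNIGHT", "MAD MAX")

// flatMap is one-to-many: each line expands into several words,
// and the nested results are flattened into a single collection
val words = lines.flatMap(x => x.split(" "))
// List("the", "dark", "knight", "mad", "max")
```

The same two calls behave identically on a Spark RDD, except that the work is distributed across the cluster.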

Install Ballerina on Linux


Ballerina is a new open-source, JVM-based language specially designed for integration purposes by WSO2, the world's #1 open-source integration vendor. In this article, you will see how to manually install Ballerina on Linux systems. Visit the official website and download the installer for your system. There are installers for Windows, Mac, Debian-based Linux, and Fedora-based Linux. I prefer to install Ballerina manually because the manual approach works on any Linux distribution.



Spark 04: Key-Value RDD and Average Movie Ratings

In the first article of this series, Spark 01: Movie Rating Counter, we created three RDDs (data, filteredData and ratingData), each containing a single datatype. For example, data and filteredData were String RDDs, and ratingData was a Float RDD. However, depending on the requirement, it is common to use an RDD that stores complex datatypes, especially Key-Value pairs. In this article, we will use a Key-Value RDD to calculate the average rating of each movie in our MovieLens dataset. If you don't have the MovieLens dataset yet, please visit the Spark 01: Movie Rating Counter article to set up your environment.


As you already know, the ratings.csv file has the fields movieId and rating. A given movie may get different ratings from different users. To get the average rating of each movie, we need to add up all the ratings of that movie individually and divide the sum by the number of ratings.
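The aggregation described above can be sketched with a plain Scala collection of (movieId, rating) pairs standing in for the Key-Value RDD; the grouping here plays the role that a key-based reduction plays in Spark, and the sample data is made up:

```scala
// (movieId, rating) pairs, standing in for a Key-Value RDD
val ratings = List((1, 4.0f), (1, 5.0f), (2, 3.0f))

// Group the ratings by movieId, then sum each movie's ratings
// and divide by the number of ratings to get the average
val averages = ratings
  .groupBy { case (movieId, _) => movieId }
  .map { case (movieId, rs) =>
    val sum = rs.map { case (_, rating) => rating }.sum
    (movieId, sum / rs.size)
  }
// averages(1) == 4.5f, averages(2) == 3.0f
```

In the actual Spark application this per-key sum-and-count is done in a distributed fashion, but the arithmetic is exactly the same.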


Spark 03: Understanding Resilient Distributed Dataset

You are not qualified as an Apache Spark developer until you know what a Resilient Distributed Dataset (RDD) is. It is the fundamental technique for representing data in Spark memory. There are more advanced data representations, such as DataFrame, built on top of RDD. However, it is always better to start with the most basic dataset: the RDD. An RDD is nothing other than a data structure with some special properties or features.


We all know that Apache Spark is a distributed, general-purpose cluster-computing framework. Some common problems faced in a distributed environment include, but are not limited to:
  1. Remote access of data is expensive
  2. High chance of failure
  3. Runtime errors are expensive and hard to track
  4. Wasting computing power is way too expensive
RDD is designed to address the problems listed above. In the following section, you will see the properties of RDD and how they solve these problems.
