Big Data: Alternative to INTERSECT operations in Hive

In my last post, we have seen the alternative of MINUS/EXCEPT operations in Apache Hive.

One more such operation which Hive does not support is INTERSECT, and in this post, we will see the alternate for the same.

Before we proceed, let’s understand what is Intersect operation in SQL? So, as the name itself explains; Intersect operation returns the intersection of two datasets, In simple words, the INTERSECT operation will return only those rows which will be common to both of the SELECT statements.

There are 2 alternatives for Intersect operation in Hive:

  1. Using EXISTS
  2. INNER JOIN

EXISTS:

Let’s

#To get the ID’s available in both T2 and T1 using EXISTS

SELECT T2.ID
FROM T2
WHERE EXISTS
(SELECT 1 FROM T1
WHERE T1.ID=T2.ID)

Demonstration of EXISTS as an alternative of INTERSECT operator

As we can see in the image, the output of the query ran using INTERSECT, and the output of the query ran using EXISTS is the same.

JOIN:

Let’s try to achieve the same result using INNER JOIN

#To get the ID’s available in both T2 and T1 using INNER JOIN

SELECT T2.T_id
FROM T2 INNER JOIN T1
ON T2.T_id = T1.T_id;

Demonstration of INNER JOIN as an alternative of INTERSECT operator

As we can see in the image, the output of all the queries ran using INNER JOIN is the same as that of INTERSECT query output.

Thank You for reading, hope you like it..!! Please like and share.

Would love to connect on LinkedIn: Nandkishor Manjrekar

Hello There, I am an IT Professional and BigData enthusiast, love to play with data, currently working as Technical Lead in MNC.