The problem addressed in this article is how to de-duplicate the result of a Hive SQL `SELECT` using certain columns as the key. The following is a step-by-step discussion.
DISTINCT
When it comes to de-duplication, `DISTINCT` naturally comes to mind. But in Hive SQL it has two problems:

1. `DISTINCT` uses all the columns in the `SELECT` list as the de-duplication key. That is, as long as any one column differs, `DISTINCT` considers the rows different and keeps them both.
2. `DISTINCT` puts all the data on a single reducer, which causes serious data skew and takes a long time.
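As a concrete illustration of problem (1), consider a hypothetical table `t` (the table and column names here are illustrative, not from the original article):

```sql
-- Hypothetical table t(key1, key2, col). DISTINCT keys on ALL
-- selected columns, so two rows that share key1 and key2 but
-- differ in col are both kept:
SELECT DISTINCT key1, key2, col
FROM t;
```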
ROW_NUMBER() OVER
Both problems with `DISTINCT` can be solved with `ROW_NUMBER() OVER`. For example, if we want to de-duplicate by the `key1` and `key2` columns, we would write code like this.
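A sketch of what such a query might look like, assuming a source table `t` with one extra column `col` that also serves as the sort key (these names are illustrative):

```sql
SELECT key1, key2, col
FROM (
    SELECT key1, key2, col,
           -- one group per (key1, key2); rows sorted ascending within each group
           ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY col ASC) AS rn
    FROM t
) tmp
WHERE rn = 1;  -- keep only the first row of each group
```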
In this way, Hive distributes the data to different reducers with `key1` and `key2` as the key, then sorts each group sharing the same `key1` and `key2` in ascending order by the `ORDER BY` column, and finally keeps the first row of each sorted group. This completes the task of de-duplicating by `key1` and `key2`.
Note the role of `PARTITION BY` here: first, it breaks the data up by `key1` and `key2`, which solves problem (2) above; second, combined with `ORDER BY` and the `rn = 1` condition, it groups and de-duplicates the data by `key1` and `key2`, which solves problem (1) above.
But this is obviously inelegant and, not surprisingly, inefficient.
GROUP BY and COLLECT_SET / COLLECT_LIST
A core part of the `ROW_NUMBER() OVER` solution is using `PARTITION BY` to group the data by key, and the same can be done with `GROUP BY`. However, `GROUP BY` must be used together with an aggregation function. In the previous solution, the combination of `ORDER BY` and the `rn = 1` condition provides the "keep the first row" behavior, so we need to ask: what kind of aggregation function can achieve, or indirectly achieve, the same behavior? `COLLECT_SET` and `COLLECT_LIST` easily come to mind.
This leads to the following code.
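A sketch of the idea, again assuming a hypothetical table `t(key1, key2, col1, col2)`:

```sql
SELECT key1, key2,
       -- collect each non-key column and output its first element;
       -- COLLECT_LIST preserves arrival order within each group
       COLLECT_LIST(col1)[0] AS col1,
       COLLECT_LIST(col2)[0] AS col2
FROM t
GROUP BY key1, key2;
```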
For the columns other than `key1` and `key2`, we use `COLLECT_LIST` to collect them and output the first collected element. The reason for using `COLLECT_LIST` instead of `COLLECT_SET` is that a set is unordered, so you cannot guarantee that the output columns all come from the same row. If there is no such requirement, you can use `COLLECT_SET` instead, which uses fewer resources.
Compared to the previous approach, this is much more efficient, because the sorting and the (possible) spilling to disk are omitted. However, precisely because nothing is spilled to disk, the data in `COLLECT_LIST` is held in memory. If the number of duplicates is particularly large, this approach may trigger an OOM; in that case, consider breaking the data up further and then merging the results, or simply fall back to the previous approach.
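One way to "break the data up further" is a two-stage aggregation with an artificial salt column, so that no single group has to hold all of its duplicates in memory at once. This is a sketch under assumed names (`t`, `col`, the bucket count 16), not the article's original code:

```sql
SELECT key1, key2, COLLECT_LIST(col)[0] AS col
FROM (
    -- stage 1: de-duplicate within smaller (key1, key2, salt) buckets
    SELECT key1, key2, salt, COLLECT_LIST(col)[0] AS col
    FROM (
        SELECT key1, key2, col,
               pmod(hash(col), 16) AS salt  -- spread each key's duplicates over 16 buckets
        FROM t
    ) salted
    GROUP BY key1, key2, salt
) partial
-- stage 2: at most 16 candidate rows per (key1, key2) remain
GROUP BY key1, key2;
```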