The problem addressed in this article is how to de-duplicate the rows returned by a Hive SQL SELECT, using certain columns as the key. The following is a step-by-step discussion.
DISTINCT
When it comes to de-duplication, DISTINCT naturally comes to mind. But in Hive SQL, it has two problems.
(1) DISTINCT uses all the columns from SELECT as keys for de-duplication. That is, as long as one column differs, DISTINCT considers the rows different and keeps them both.
(2) DISTINCT puts all the data on a single reducer, which causes severe data skew and takes a lot of time.
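To make problem (1) concrete, here is a minimal sketch (the table name t and the column value1 are assumptions for illustration):

```sql
-- DISTINCT keys on every selected column, not just key1 and key2.
-- Two rows with the same key1/key2 but different value1 are BOTH kept.
SELECT DISTINCT key1, key2, value1
FROM t;
```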
ROW_NUMBER() OVER
Both problems with DISTINCT can be solved with ROW_NUMBER() OVER. For example, if we want to de-duplicate by the key1 and key2 columns, we would write code like this.
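A sketch of this approach (the table name t, the extra column value1, and sorting by value1 are assumptions; key1, key2, and the rn = 1 filter are from the text):

```sql
-- Number the rows within each (key1, key2) group, then keep only the first.
SELECT key1, key2, value1
FROM (
    SELECT key1, key2, value1,
           ROW_NUMBER() OVER (PARTITION BY key1, key2
                              ORDER BY value1) AS rn
    FROM t
) tmp
WHERE rn = 1;
```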
In this way, Hive distributes the data across reducers with key1 and key2 as the key; then, within each group sharing the same key1 and key2, it sorts the rows in ascending order by the sort column, and finally keeps only the first row of each sorted group. This completes the task of de-duplicating by key1 and key2.
Note the role of PARTITION BY here: first, it spreads the data out by key1 and key2, solving problem (2) above; second, combined with ORDER BY and the rn = 1 filter, it groups and de-duplicates the data by key1 and key2, solving problem (1) above.
But this is clearly inelegant, and, unsurprisingly, inefficient.
GROUP BY and COLLECT_SET / COLLECT_LIST
A core part of the ROW_NUMBER() OVER solution is using PARTITION BY to group data by key, and GROUP BY can do the same. However, GROUP BY must be used together with an aggregation function, whereas in the previous solution it is the combination of ORDER BY and the rn = 1 condition that implements "keep the first row". So which aggregation function can achieve, or at least approximate, that behavior? COLLECT_SET and COLLECT_LIST come to mind.
That leads to code like this.
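A sketch (the table name t and the non-key columns value1 and value2 are assumptions; key1 and key2 are from the text):

```sql
-- Group by the key columns; for every other column, collect its values
-- and emit the first collected element.
SELECT key1, key2,
       COLLECT_LIST(value1)[0] AS value1,
       COLLECT_LIST(value2)[0] AS value2
FROM t
GROUP BY key1, key2;
```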
For columns other than key1 and key2, we collect them with COLLECT_LIST and output the first collected element. The reason for using COLLECT_LIST instead of COLLECT_SET is that a SET is unordered, so you cannot guarantee that the output columns all come from the same row. If that is not a requirement, you can use COLLECT_SET instead, which is more resource-efficient.
Compared with the previous approach, this is much more efficient because the sorting and the (possible) spilling to disk are omitted. However, precisely because nothing is (necessarily) spilled to disk, the data in COLLECT_LIST is held in memory. If the number of duplicates is particularly large, this approach may trigger an OOM; in that case, consider splitting the data further and then merging it, or simply fall back to the previous approach.
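One way to "split further and then merge" is a two-stage aggregation with a salted key, so that no single reducer has to buffer an oversized group. This is only a sketch under assumptions not in the original: the table name t, the column value1, and the salt width of 16.

```sql
-- Stage 1: add a salt to the key so each huge (key1, key2) group is split
-- across up to 16 reducers, each buffering only a fraction of the group.
-- Stage 2: the salted result has at most 16 rows per (key1, key2), so the
-- final GROUP BY is cheap and cannot OOM.
SELECT key1, key2,
       COLLECT_LIST(value1)[0] AS value1
FROM (
    SELECT key1, key2,
           COLLECT_LIST(value1)[0] AS value1
    FROM t
    GROUP BY key1, key2, pmod(hash(value1), 16)  -- hypothetical salt
) salted
GROUP BY key1, key2;
```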