The datediff function in Hive SQL returns the number of days between 2 dates. One of the more interesting pitfalls found during use is that
1
2
3
4
5
6
7
8
|
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
, datediff(MAX(createdate), MIN(createdate)) AS recency
, datediff(CURRENT_DATE, MIN(createdate)) AS T
, CASE
WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
END AS monetary_value
FROM orderdb.orderdetail
|
Where createdate is a datetime type, the execution of SQL finds that there is data with recency>T. The following code is executed and the result also has problems.
1
2
3
4
5
6
7
8
|
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
, datediff(CURRENT_DATE, to_date(MIN(createdate))) AS T
, CASE
WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
END AS monetary_value
FROM orderdb.orderdetail
|
The possible cause is CURRENT_DATE, where the value of CURRENT_DATE is equal to TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()), use TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()) and do the test again, the result is normal again.
1
2
3
4
5
6
7
8
|
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
, datediff(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), to_date(MIN(createdate))) AS T
, CASE
WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
END AS monetary_value
FROM orderdb.orderdetail
|
The main difference is that current_date returns content in date format, while to_date returns a string (before version 2.1 it was a string, after version 2.1 it returned date). Do a further test as follows.
1
2
3
4
5
6
7
8
9
|
SELECT datediff('2018-09-16 00:01:55', '2018-09-16 00:00:52') AS t1
, datediff(current_date, '2018-09-16 00:00:52') AS t2
, datediff(current_date, '2018-09-17 00:00:52') AS t3
, datediff(current_date, '2018-09-16') AS t4
, datediff(current_date, '2018-09-17') AS t5
, datediff(TO_DATE(current_date), '2018-09-16 00:00:52') AS t6
, datediff(TO_DATE(current_date), '2018-09-17 00:00:52') AS t7
, datediff(TO_DATE(current_date), '2018-09-16') AS t8
, datediff(TO_DATE(current_date), '2018-09-17') AS t9
|
The result after execution is: 0 0 0 0 1 0 1 0. Looking further, it was found that a bug had already been filed by the person.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
|
[ https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824
]
Hengyu Dai commented on HIVE-18304:
-----------------------------------
SimpleDateFormat.parse(String source) method will convert String type(UTC) to java.util.Date
type(use current JVM timezone), this may lead deviations in time when JVM timezone is not
UTC, my environment is GMT+8, 8 hours is added comparing to the UTC time.
while for a date type argument, the default JVM timezone is used.
The patch uploaded treats String type and Date type at the same way to remove the deviations.
> datediff() UDF returns a wrong result when dealing with a (date, string) input
> ------------------------------------------------------------------------------
>
> Key: HIVE-18304
> URL: https://issues.apache.org/jira/browse/HIVE-18304
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Reporter: Hengyu Dai
> Assignee: Hengyu Dai
> Priority: Minor
> Attachments: 0001.patch
>
>
> for date type argument, datediff() use DateConverter to convert input to a java Date
object,
> for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800
> for string type argument, datediff() use TextConverter to convert a string to date,
> for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800
> now, datediff() will return a number less than the real date diff
> we should use TextConverter to deal with date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}
|