Conclusion

Let’s start with the general conclusion (full conclusion at the end of the article)

  • With the same semantics and indexes: group by and distinct can both use indexes with the same efficiency.
  • With the same semantics and no indexes: distinct is more efficient than group by. The reason is that both distinct and group by perform grouping operations, but group by may sort and trigger filesort, resulting in inefficient sql execution.

Based on this conclusion, you may ask.

  • Why are group by and distinct equally efficient when the semantics are the same and indexes are available?
  • Under what circumstances does group by perform a sorting operation?

With these two questions in mind, let’s find the answers. Next, let’s look at the basic use of distinct and group by.

Usage of distinct

1
SELECT DISTINCT columns FROM table_name WHERE where_conditions;

For example.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
mysql> select distinct age from student;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.01 sec)

The DISTINCT keyword is used to return a uniquely different value. It is used before the first field in a query statement and works on all columns in the main clause.

If a column has a NULL value and the DISTINCT clause is used for that column, MySQL will keep one NULL value and delete the others because the DISTINCT clause treats all NULL values as the same.

distinct multi-column de-duplication

The distinct multi-column de-duplication is done based on the specified de-duplicated column information, i.e. only if all the specified column information is the same, it will be considered as duplicate information.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
SELECT DISTINCT column1,column2 FROM table_name WHERE where_conditions;
mysql> select distinct sex,age from student;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.02 sec)

Use of group by

For basic de-duplication, the use of group by is similar to that of distinct.

Single column de-duplication

Syntax.

1
SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;

Example.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
mysql> select age from student group by age;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.02 sec)

Multi-column de-duplication

Syntax.

1
SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;

Example.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
mysql> select sex,age from student group by sex,age;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.03 sec)

Difference example

The difference between the two syntaxes is that group by can perform single column de-duplication, while group by works by sorting the results in groups and then returning the first data in each group. And the de-duplication is done according to the fields that follow group by.

Example.

1
2
3
4
5
6
7
8
mysql> select sex,age from student group by sex;
+--------+-----+
| sex    | age |
+--------+-----+
| male   |  10 |
| female |  12 |
+--------+-----+
2 rows in set (0.03 sec)

Principles of distinct and group by

In most examples, DISTINCT can be regarded as special GROUP BY, their implementation is based on grouping operation, and both can be implemented by loose index scan, compact index scan (about index scan will be described in detail in other articles, so I won’t introduce it in detail here).

Both DISTINCT and GROUP BY can be scanned and searched using indexes. For example, in the following two sql (just look at the last extra of the table alone), we analyze these two sql and we can see that in the extra, both sql use a compact index scan Using index for group-by.

So, in general, for the same semantic DISTINCT and GROUP BY statements, we can optimize them using the same indexing optimizations.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
mysql> explain select int1_index from test_distinct_groupby group by int1_index;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

mysql> explain select distinct int1_index from test_distinct_groupby;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

However, for GROUP BY, before MYSQL 8.0, GROUP Y by default does an implicit sort based on fields.

As you can see, the following sql statement uses a temporary table along with a filesort.

1
2
3
4
5
6
7
mysql> explain select int6_bigger_random from test_distinct_groupby GROUP BY int6_bigger_random;
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra                           |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 97402 |   100.00 | Using temporary; Using filesort |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
1 row in set (0.04 sec)

Implicit sorting

For implicit sorting, we can refer to the official explanation of Mysql.

GROUP BY implicitly sorts by default (that is, in the absence of ASC or DESC designators for GROUP BY columns). However, relying on implicit GROUP BY sorting (that is, sorting in the absence of ASC or DESC designators) or explicit sorting for GROUP BY (that is, by using explicit ASC or DESC designators for GROUP BY columns) is deprecated. To produce a given sort order, provide an ORDER BY clause.

So, prior to Mysql 8.0, Group by would sort the results by default based on the role field (the field that follows Group by). When indexes are available, Group by does not require additional sorting; however, when indexes are not available, the Mysql optimizer has to choose to implement GROUP BY by using a temporary table and then sorting.

And when the size of the result set exceeds the size of the temporary table set by the system, Mysql will copy the temporary table data to the disk and then perform the operation, and the execution efficiency of the statement will become extremely low. This is the reason why Mysql chooses to deprecate this operation (implicit sorting).

For these reasons, Mysql has been optimized and updated for this in 8.0.

Previously (MySQL 5.7 and lower), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0, that no longer occurs, so specifying ORDER BY NULL at the end to suppress implicit sorting (as was done previously) is no longer necessary. However, query results may differ from previous MySQL versions. To produce a given sort order, provide an ORDER BY clause.

https://dev.mysql.com/doc/refman/8.0/en/order-by-optimization.html

Thus, our conclusion also comes out:

  • In case of the same semantics and indexing:

    Both group by and distinct can use indexes with the same efficiency. Because group by and distinct are nearly equivalent, distinct can be seen as a special group by.

  • In the case of identical semantics and no indexes:

    distinct is more efficient than group by. The reason is that both distinct and group by perform grouping operation, but group by performs implicit sorting before Mysql 8.0, which leads to triggering filesort and inefficient sql execution.

    But since Mysql 8.0, Mysql has removed implicit sorting, so at this point, with the same semantics and no indexes, the execution efficiency of group by and distinct is also nearly equivalent.

  1. the semantics of group by are clearer
  2. group by can do some more complex processing of the data

The semantics of group by is clearer than that of distinct. And since the distinguish keyword will take effect for all fields, group by is more flexible when it comes to compound business processing. group by can perform more complex processing on data according to grouping, such as filtering data by having, or performing operations on data by aggregation functions.