Conclusion
Let’s start with the general conclusion (full conclusion at the end of the article)
- With the same semantics and indexes:
group by
and distinct can both use indexes with the same efficiency. - With the same semantics and no indexes: distinct is more efficient than
group by
. The reason is that both distinct andgroup by
perform grouping operations, butgroup by
may sort and trigger filesort, resulting in inefficient sql execution.
Based on this conclusion, you may ask.
- Why are
group by
and distinct equally efficient when the semantics are the same and indexes are available? - Under what circumstances does
group by
perform a sorting operation?
With these two questions in mind, let’s find the answers. Next, let’s look at the basic use of distinct and group by
.
Usage of distinct
|
|
For example.
The DISTINCT
keyword is used to return a uniquely different value. It is used before the first field in a query statement and works on all columns in the main clause.
If a column has a NULL value and the DISTINCT
clause is used for that column, MySQL will keep one NULL value and delete the others because the DISTINCT
clause treats all NULL values as the same.
distinct multi-column de-duplication
The distinct multi-column de-duplication is done based on the specified de-duplicated column information, i.e. only if all the specified column information is the same, it will be considered as duplicate information.
|
|
Use of group by
For basic de-duplication, the use of group by
is similar to that of distinct.
Single column de-duplication
Syntax.
|
|
Example.
Multi-column de-duplication
Syntax.
|
|
Example.
Difference example
The difference between the two syntaxes is that group by
can perform single column de-duplication, while group by
works by sorting the results in groups and then returning the first data in each group. And the de-duplication is done according to the fields that follow group by
.
Example.
Principles of distinct and group by
In most examples, DISTINCT
can be regarded as special GROUP BY
, their implementation is based on grouping operation, and both can be implemented by loose index scan, compact index scan (about index scan will be described in detail in other articles, so I won’t introduce it in detail here).
Both DISTINCT
and GROUP BY
can be scanned and searched using indexes. For example, in the following two sql (just look at the last extra of the table alone), we analyze these two sql and we can see that in the extra, both sql use a compact index scan Using index for group-by
.
So, in general, for the same semantic DISTINCT
and GROUP BY
statements, we can optimize them using the same indexing optimizations.
|
|
However, for GROUP BY
, before MYSQL 8.0, GROUP Y
by default does an implicit sort based on fields.
As you can see, the following sql statement uses a temporary table along with a filesort.
|
|
Implicit sorting
For implicit sorting, we can refer to the official explanation of Mysql.
GROUP BY implicitly sorts by default (that is, in the absence of ASC or DESC designators for GROUP BY columns). However, relying on implicit GROUP BY sorting (that is, sorting in the absence of ASC or DESC designators) or explicit sorting for GROUP BY (that is, by using explicit ASC or DESC designators for GROUP BY columns) is deprecated. To produce a given sort order, provide an ORDER BY clause.
So, prior to Mysql 8.0, Group by
would sort the results by default based on the role field (the field that follows Group by
). When indexes are available, Group by
does not require additional sorting; however, when indexes are not available, the Mysql optimizer has to choose to implement GROUP BY
by using a temporary table and then sorting.
And when the size of the result set exceeds the size of the temporary table set by the system, Mysql will copy the temporary table data to the disk and then perform the operation, and the execution efficiency of the statement will become extremely low. This is the reason why Mysql chooses to deprecate this operation (implicit sorting).
For these reasons, Mysql has been optimized and updated for this in 8.0.
Previously (MySQL 5.7 and lower), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0, that no longer occurs, so specifying ORDER BY NULL at the end to suppress implicit sorting (as was done previously) is no longer necessary. However, query results may differ from previous MySQL versions. To produce a given sort order, provide an ORDER BY clause.
https://dev.mysql.com/doc/refman/8.0/en/order-by-optimization.html
Thus, our conclusion also comes out:
-
In case of the same semantics and indexing:
Both
group by
and distinct can use indexes with the same efficiency. Becausegroup by
and distinct are nearly equivalent, distinct can be seen as a specialgroup by
. -
In the case of identical semantics and no indexes:
distinct is more efficient than
group by
. The reason is that both distinct andgroup by
perform grouping operation, butgroup by
performs implicit sorting before Mysql 8.0, which leads to triggering filesort and inefficient sql execution.But since Mysql 8.0, Mysql has removed implicit sorting, so at this point, with the same semantics and no indexes, the execution efficiency of
group by
and distinct is also nearly equivalent.
Reasons why group by is recommended
- the semantics of
group by
are clearer group by
can do some more complex processing of the data
The semantics of group by
is clearer than that of distinct. And since the distinguish keyword will take effect for all fields, group by
is more flexible when it comes to compound business processing. group by
can perform more complex processing on data according to grouping, such as filtering data by having, or performing operations on data by aggregation functions.